OCR Reader (Result Post Processor): Difference between revisions

Latest revision as of 13:37, 2 January 2024

Redirect to:

2.90:OCR Reader (Result Post Processor)

@@ Line 1: / Line 1: @@
-[[image:ocr reader.png|frame|The OCR Reader post processor selected on a Data Type's property panel.]]
+#REDIRECT [[2.90:OCR Reader (Result Post Processor)]]
-<blockquote style="font-size:14pt">
-The '''OCR Reader''' post processor allows you to run additional OCR on a region near a label (the label being the result returned by a Data Type extractor) and return the reprocessed text.
-</blockquote>
-This is especially useful on documents where data is printed in a special font, or where the label and the value are printed in entirely different fonts.
-This region can be relative to the initial Data Type extractor's result, or be configured to automatically "snap" to a bounding box.
-<br clear="all">
-== Version Differences ==
-The OCR Reader post processor is a configurable property available to Data Types as of version 2.72. Prior to version 2.72, reprocessing OCR on a region of a document required configuring a Data Element Profile in a Document Type object. Furthermore, because the result is returned to a Data Type, the reprocessed text can be used anywhere an extractor is used in Grooper, not just to populate a field in a data model.
-== Example ==
-We want to extract key values from a form.  The problem is that this form contains mixed fonts, and the values we want are displayed in the OCR-A font, which is sometimes troublesome for standard OCR engines.
-[[image:1555438762302-330.png|center|900px]]
-The idea for this scenario is to first extract the label for the value we want, <code>2. First Name</code>, using a Data Type extractor, and then set the "Post Processing" property of that Data Type to run a special OCR Profile designed to read the value associated with that label, <code>Benjamin</code>.
-=== Steps ===
-<tabs>
-<tab name="Step 1">
-===== Create the Data Type =====
-First, we create the Data Type that will extract the label, using the following "Pattern":
-<code>(\d\.\s)?first name</code>
-[[image:ocrpp001.png|center|900px]]
-</tab>
-<tab name="Step 2">
-===== Enable the Post Processor =====
-In the "Output" section of the Data Type, we'll set the "Post Processing" property to "OCR Reader".
-[[image:ocrpp002.png|center|900px]]
-</tab>
-<tab name="Step 3">
-===== Configure the Post Processor =====
-Once we've chosen "OCR Reader", we can expand it to reveal its configurable properties.
-* Set the '''OCR Profile''' to a profile designed specifically to deal with OCR-A fonts.
-* Set '''Auto Snap Distance''' to "1.5in".
-* Set '''Auto Snap Margin''' to "1pt, 12pt, 2pt, 1pt".
-* Set '''Output Full Region''' to "True".
-[[image:ocrpp003.png|center|900px]]
-</tab>
-<tab name="Step 4">
-===== Test Extraction =====
-Once we Save and Run Extraction, the "Post Processor" runs after the initial pattern (the one that finds the label), OCRs the region we defined, and outputs our value: <code>Benjamin</code>.
-[[image:ocrpp004.png|center|900px]]
-</tab>
-</tabs>
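The label pattern from Step 1 can be checked outside Grooper with a short Python sketch. This is only an illustration of what the regular expression matches, not of how Grooper evaluates it. The helper name `find_label` is hypothetical, and the case-insensitive flag is an assumption made here so that the lowercase pattern matches the capitalized label on the form.

```python
import re

# Pattern copied from Step 1. IGNORECASE is an assumption of this sketch:
# it lets the lowercase pattern match the capitalized label "First Name".
LABEL_PATTERN = re.compile(r"(\d\.\s)?first name", re.IGNORECASE)


def find_label(text):
    """Return the matched label text, or None when no label is present.

    Hypothetical helper for illustration; it is not part of Grooper.
    """
    match = LABEL_PATTERN.search(text)
    return match.group(0) if match else None


if __name__ == "__main__":
    # The optional group (\d\.\s)? covers labels with or without a number prefix.
    for sample in ("2. First Name", "First Name", "Last Name"):
        print(repr(sample), "->", repr(find_label(sample)))
```

Because the numbered prefix is optional, the same pattern finds the label whether or not the form prints the leading "2. ".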
-== Properties ==
-{|cellpadding="10" cellspacing="5"
-|-style="background-color:#ddf5f5"
-|OCR Profile||The OCR Profile to use for character extraction.
-|-style="background-color:#ddf5f5"
-|Region||Specifies a region, relative to each output instance, where OCR should be performed.
-|-style="background-color:#ddf5f5"
-|Auto Snap Distance||Specifies the maximum distance for an auto snap operation, which automatically aligns the edges of the zone to lines on the document.
-|-style="background-color:#ddf5f5"
-|Auto Snap Margin||When the auto snap feature is in use, specifies an additional amount to shrink the zone on each edge.
-|-style="background-color:#ddf5f5"
-|Value Extractor||An optional extractor to be executed against the OCR content.
-|-style="background-color:#ddf5f5"
-|Line Separator||When capturing multiple lines of text, specifies how line breaks will be represented in the output.
-|-style="background-color:#ddf5f5"
-|Output Full Region||Specifies whether the highlight region of each output instance will reflect the full OCR area, or only the area containing text.
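To make the shrinking behavior of '''Auto Snap Margin''' concrete, here is a minimal Python sketch that applies the Step 3 value "1pt, 12pt, 2pt, 1pt" to a rectangular zone. Several things are assumptions of this sketch rather than documented Grooper behavior: the edge order (left, top, right, bottom), the coordinate system (points, with y increasing downward), and the names `Zone`, `to_points`, and `shrink_zone`. The conversion of 72 points per inch is the standard typographic definition.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # standard typographic conversion


def to_points(value):
    """Convert a string such as '12pt' or '1.5in' to a number of points."""
    value = value.strip()
    if value.endswith("pt"):
        return float(value[:-2])
    if value.endswith("in"):
        return float(value[:-2]) * POINTS_PER_INCH
    raise ValueError(f"unsupported unit in {value!r}")


@dataclass
class Zone:
    """A rectangle in points; y grows downward (assumption of this sketch)."""
    left: float
    top: float
    right: float
    bottom: float


def shrink_zone(zone, margin):
    """Shrink each edge of the zone inward by the matching margin.

    The order left, top, right, bottom is assumed, not confirmed by the source.
    """
    left, top, right, bottom = (to_points(part) for part in margin.split(","))
    return Zone(zone.left + left, zone.top + top,
                zone.right - right, zone.bottom - bottom)


if __name__ == "__main__":
    snapped = Zone(0.0, 0.0, 100.0, 50.0)
    print(shrink_zone(snapped, "1pt, 12pt, 2pt, 1pt"))
    print(to_points("1.5in"))  # the Auto Snap Distance from Step 3, in points
```

A larger margin on one edge, such as the 12pt value here, is one way to keep a box border or a printed caption along that edge out of the area that gets reprocessed.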