OCR Reader (Result Post Processor): Difference between revisions


Latest revision as of 12:37, 2 January 2024

Redirect to:

2.90:OCR Reader (Result Post Processor)

@@ Line 1: / Line 1: @@
-[[image:ocr reader2.png|frame|The OCR Reader post processor selected on a Data Type's property panel.]]
+#REDIRECT [[2.90:OCR Reader (Result Post Processor)]]
-<blockquote style="font-size:14pt">
-The '''OCR Reader''' post processor allows you to run additional OCR on a region near a label (which has been returned as the result of a Data Type extractor) and return the reprocessed text.
-</blockquote>
-This is especially useful on documents where data is printed in a special font, or where the label and the value use entirely different fonts.
-This region can be relative to the initial Data Type extractor's result, or be configured to automatically "snap" to a bounding box.
-== Example ==
-We want to extract key values from a form.  The problem is that this form contains mixed fonts, and the values we want are displayed in the OCR-A font, which is sometimes troublesome for standard OCR engines.
-[[image:1555438762302-330.png|center|900px]]
-The idea for this scenario is to first extract the label for the value we want (<code>2. First Name</code>) using a Data Type extractor, and then set the "Post Processing" property of that Data Type to run a special OCR Profile designed to read the value associated with that label (<code>Benjamin</code>).
-=== Steps ===
-<tabs>
-<tab name="Step 1">
-===== Create the Data Type =====
-First, we want to create the "Data Type" that will extract the label with the following "Pattern".
-<code>(\d\.\s)?first name</code>
-[[image:ocrpp001.png|center|900px]]
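As a rough illustration of what the pattern above matches, here is a minimal Python sketch. It is not Grooper code: the <code>find_label</code> helper is hypothetical, and case-insensitive matching is an assumption made so that the lowercase pattern matches the mixed-case label <code>2. First Name</code> on the form.

```python
import re

# Pattern taken verbatim from Step 1. IGNORECASE is an assumption made so
# the lowercase pattern matches the mixed-case label printed on the form.
LABEL_PATTERN = re.compile(r"(\d\.\s)?first name", re.IGNORECASE)


def find_label(text):
    """Return the matched label text, or None when no label is found."""
    match = LABEL_PATTERN.search(text)
    return match.group(0) if match else None


print(find_label("2. First Name"))  # label with its leading item number
print(find_label("First Name"))     # the numeric prefix is optional
print(find_label("Last Name"))      # no label found
```

The optional group <code>(\d\.\s)?</code> lets the same pattern match the label with or without its leading item number.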
-</tab>
-<tab name="Step 2">
-===== Enable the Post Processor =====
-In the "Output" section of the Data Type, we'll set the "Post Processing" property to "OCR Reader".
-[[image:ocrpp002.png|center|900px]]
-</tab>
-<tab name="Step 3">
-===== Configure the Post Processor =====
-Once we've chosen "OCR Reader", we can expand it to reveal its configurable properties.
-* Set the '''OCR Profile''' to a profile designed specifically to deal with OCR-A fonts.
-* Set '''Auto Snap Distance''' to "1.5in".
-* Set '''Auto Snap Margin''' to "1pt, 12pt, 2pt, 1pt".
-* Set '''Output Full Region''' to "True".
-[[image:ocrpp003.png|center|900px]]
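To show what shrinking a zone by a margin means in practice, here is a minimal Python sketch. It is an illustration only, not Grooper's implementation: the <code>Zone</code> class, the <code>shrink_zone</code> helper, the example coordinates, and the left, top, right, bottom ordering of the four margin values are all assumptions that the article does not confirm.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # one typographic point is 1/72 of an inch


@dataclass
class Zone:
    """A rectangle in inches, measured from the page's top-left corner."""
    left: float
    top: float
    right: float
    bottom: float


def shrink_zone(zone, margins_pt):
    """Shrink a zone inward by four margins given in points.

    The (left, top, right, bottom) ordering is assumed for illustration.
    """
    m_left, m_top, m_right, m_bottom = (m / POINTS_PER_INCH for m in margins_pt)
    return Zone(
        left=zone.left + m_left,
        top=zone.top + m_top,
        right=zone.right - m_right,
        bottom=zone.bottom - m_bottom,
    )


# A hypothetical 2in x 0.5in box that the zone has snapped to.
snapped = Zone(left=1.0, top=3.0, right=3.0, bottom=3.5)
inner = shrink_zone(snapped, (1, 12, 2, 1))
print(inner)
```

Under these assumptions, each edge of the snapped box moves inward by its own margin, so the OCR area ends up slightly smaller than the box itself.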
-</tab>
-<tab name="Step 4">
-===== Test Extraction =====
-Once we Save and Run Extraction, the "Post Processor" runs after the initial pattern (the one that finds the label), OCRs the region we defined, and outputs our value: <code>Benjamin</code>.
-[[image:ocrpp004.png|center|900px]]
-</tab>
-</tabs>
-=== The Value Extractor Property: A Word of Caution ===
-In many ways, the ''Read Zone'' option for a '''Data Field's''' '''''Value Extractor''''' and the ''OCR Reader'' '''''Post Processing''''' option for '''Data Types''' are very similar. Both can use text anchors and extraction zones to return data inside the drawn boundary on the page.
-Both also have a '''''Value Extractor''''' property to parse data within the zone's data instance. However, they differ sharply in how the results are collated and returned:
-* Read Zone returns only the first match.
-* OCR Reader returns all matches concatenated together.
-Neither behavior is inherently better; which one to use depends on your needs.
-For an in-depth explanation of the differences, visit the [[Read Zone#Data Instancing: Using Read Zone's Value Extractor Property|Read Zone article's section on the Value Extractor property]].
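The first-match versus all-matches behavior described in this section can be sketched in Python as follows. This is a simplified illustration, not Grooper code: both function names, the sample zone text, and the empty-string separator are assumptions, since the article says only that matches are "concatenated together".

```python
import re


def read_zone_style(zone_text, pattern):
    """Return only the first match, or None if nothing matches."""
    match = re.search(pattern, zone_text)
    return match.group(0) if match else None


def ocr_reader_style(zone_text, pattern, separator=""):
    """Return every match joined into one string.

    The separator is an assumption; the article does not say how
    the matches are joined.
    """
    return separator.join(m.group(0) for m in re.finditer(pattern, zone_text))


zone_text = "Acct 1234 Ref 5678"  # hypothetical OCR text from a zone
digits = r"\d+"                   # hypothetical Value Extractor pattern

print(read_zone_style(zone_text, digits))   # first match only
print(ocr_reader_style(zone_text, digits))  # all matches, joined
```

With a pattern that matches more than once inside the zone, the two approaches return different strings, so the choice matters most when the zone can contain several candidate values.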
-== Version Differences ==
-The OCR Reader result processor is a configurable property available to Data Types as of version 2.72.  Prior to version 2.72, the capability of reprocessing OCR on a region of a document was available by configuring a Data Element Profile in a Document Type object.  Furthermore, since the result is returned to a Data Type, this data can be used anywhere an extractor is used in Grooper, not just to populate a field in a data model.
-== Properties ==
-{|cellpadding="10" cellspacing="5"
-|-style="background-color:#ddf5f5"
-|OCR Profile||The OCR Profile to use for character extraction.
-|-style="background-color:#ddf5f5"
-|Region||Specifies a region, relative to each output instance, where OCR should be performed.
-|-style="background-color:#ddf5f5"
-|Auto Snap Distance||Specifies the maximum distance for an auto snap operation, which automatically aligns the edges of the zone to lines on the document.
-|-style="background-color:#ddf5f5"
-|Auto Snap Margin||When the auto snap feature is in use, specifies an additional amount to shrink the zone on each edge.
-|-style="background-color:#ddf5f5"
-|Value Extractor||An optional extractor to be executed against the OCR content.
-|-style="background-color:#ddf5f5"
-|Line Separator||When capturing multiple lines of text, specifies how line breaks will be represented in the output.
-|-style="background-color:#ddf5f5"
-|Output Full Region||Specifies whether the highlight region of each output instance will reflect the full OCR area, or only the area containing text.