|
Tag: Redirect target changed |
| (5 intermediate revisions by 2 users not shown) |
| Line 1: |
Line 1: |
| [[image:ocr reader.png|frame|The OCR Reader post processor selected on a Data Type's property panel.]] | | #REDIRECT [[2.90:OCR Reader (Result Post Processor)]] |
| | |
| <blockquote style="font-size:14pt">
| |
| The '''OCR Reader''' post processor lets you run additional OCR on a region near a label (one returned as the result of a Data Type extractor) and returns the reprocessed text.
| |
| </blockquote>
| |
| | |
| This is especially useful on documents where data is printed in a special font, or where the label and the value are printed in entirely different fonts.
| |
| | |
| This region can be relative to the initial Data Type extractor's result, or be configured to automatically "snap" to a bounding box.
| |
| | |
| <br clear="all">
| |
| | |
| == Version Differences ==
| |
| | |
| The OCR Reader result processor is a configurable property available to Data Types as of version 2.72. Prior to version 2.72, reprocessing OCR on a region of a document was available by configuring a Data Element Profile in a Document Type object. Because the result is now returned to a Data Type, it can be used anywhere an extractor is used in Grooper, not just to populate a field in a data model.
| |
| | |
| == Example ==
| |
| | |
| We want to extract key values from a form. The problem is that this form contains mixed fonts, and the values we want are displayed in the OCR-A font, which is sometimes troublesome for standard OCR engines.
| |
| | |
| [[image:1555438762302-330.png|center|900px]]
| |
| | |
| The idea for this scenario is to first extract the label for the value we want (<code>2. First Name</code>) using a Data Type extractor, and then set the "Post Processing" property of that Data Type to run a special OCR Profile designed to read the value associated with that label (<code>Benjamin</code>).
| |
| | |
| === Steps ===
| |
| | |
| <tabs>
| |
| <tab name="Step 1">
| |
| ===== Create the Data Type =====
| |
| First, we create the Data Type that will extract the label, using the following "Pattern":
| |
| | |
| <code>(\d\.\s)?first name</code>
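To see what this pattern accepts, here is a minimal sketch using Python's <code>re</code> module rather than Grooper itself. The case-insensitive flag is an assumption about how the extractor is configured, and <code>find_label</code> is a hypothetical helper introduced only for illustration.

```python
import re

# Pattern copied from the step above. IGNORECASE is an assumption about the
# extractor's configuration; the page does not state it.
LABEL_PATTERN = re.compile(r"(\d\.\s)?first name", re.IGNORECASE)


def find_label(text):
    """Return the matched label text, or None if the label is absent."""
    match = LABEL_PATTERN.search(text)
    return match.group(0) if match else None


# The optional group lets the pattern match with or without the "2. " prefix.
samples = ["2. First Name", "First Name", "2.First Name", "Last Name"]
results = {sample: find_label(sample) for sample in samples}
```

Note that the numbered prefix is captured only when the digit, period, and whitespace all appear; otherwise the match falls back to the bare label text.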
| |
| | |
| | |
| [[image:ocrpp001.png|center|900px]]
| |
| | |
| | |
| </tab>
| |
| <tab name="Step 2">
| |
| ===== Enable the Post Processor =====
| |
| In the "Output" section of the Data Type, we'll set the "Post Processing" property to "OCR Reader".
| |
| | |
| | |
| [[image:ocrpp002.png|center|900px]]
| |
| | |
| | |
| </tab>
| |
| <tab name="Step 3">
| |
| ===== Configure the Post Processor =====
| |
| Once we've chosen "OCR Reader", we can expand it to reveal its configurable properties.
| |
| * Set the '''OCR Profile''' to a profile designed specifically to deal with OCR-A fonts.
| |
| * Set '''Auto Snap Distance''' to "1.5in".
| |
| * Set '''Auto Snap Margin''' to "1pt, 12pt, 2pt, 1pt".
| |
| * Set '''Output Full Region''' to "True".
| |
| | |
| | |
| [[image:ocrpp003.png|center|900px]]
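The effect of the margin setting can be illustrated with a rough sketch of shrinking a snapped zone by a per-edge margin. This is not Grooper code: the <code>Rect</code> type, the helper names, the sample box, and especially the left, top, right, bottom ordering of the four margin values are assumptions made for illustration.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72.0  # standard typographic point


@dataclass(frozen=True)
class Rect:
    """A zone in inches (hypothetical type, not a Grooper object)."""
    left: float
    top: float
    right: float
    bottom: float


def parse_length(text):
    """Convert a string such as '12pt' or '1.5in' to inches."""
    text = text.strip().lower()
    if text.endswith("pt"):
        return float(text[:-2]) / POINTS_PER_INCH
    if text.endswith("in"):
        return float(text[:-2])
    raise ValueError(f"unsupported unit in {text!r}")


def shrink(rect, margin):
    """Shrink rect on each edge; margin order is ASSUMED to be left, top, right, bottom."""
    left, top, right, bottom = (parse_length(part) for part in margin.split(","))
    return Rect(rect.left + left, rect.top + top,
                rect.right - right, rect.bottom - bottom)


snapped = Rect(1.0, 2.0, 4.0, 2.5)  # hypothetical box found by auto snap
zone = shrink(snapped, "1pt, 12pt, 2pt, 1pt")
```

Under that assumed ordering, the larger 12pt value trims the top edge the most, which would keep text printed along the top of the box out of the OCR zone.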
| |
| | |
| | |
| </tab>
| |
| <tab name="Step 4">
| |
| ===== Test Extraction =====
| |
| Once we Save and Run Extraction, the post processor runs after the initial pattern (the one that finds the label), performs OCR on the region we defined, and outputs our value: <code>Benjamin</code>.
| |
| | |
| | |
| [[image:ocrpp004.png|center|900px]]
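The two-stage flow described in this step (find the label, then OCR a region tied to it) can be sketched in a few lines. Everything here is hypothetical: the <code>Box</code> tuple, the placement of the region to the right of the label, and the <code>fake_ocr</code> stand-in are assumptions for illustration and do not reflect Grooper's actual API or region logic.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # left, top, right, bottom, in inches


def region_right_of(label: Box, width: float) -> Box:
    """Hypothetical region: same vertical extent as the label, `width` inches to its right."""
    left, top, right, bottom = label
    return (right, top, right + width, bottom)


def run_post_processor(label_boxes: List[Box],
                       ocr: Callable[[Box], str],
                       width: float = 1.5) -> List[str]:
    """For each label hit, run OCR on its derived region and return the trimmed text."""
    return [ocr(region_right_of(box, width)).strip() for box in label_boxes]


def fake_ocr(region: Box) -> str:
    """Stand-in for an OCR engine: returns canned text for regions starting at x >= 2.0."""
    return " Benjamin " if region[0] >= 2.0 else ""


# One label hit whose right edge sits at x = 2.0 inches.
values = run_post_processor([(0.5, 1.0, 2.0, 1.2)], fake_ocr)
```

The point of the sketch is the ordering: the label match supplies only a location, and the value comes entirely from the second OCR pass over the derived region.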
| |
| </tab>
| |
| </tabs>
| |
| | |
| == Properties ==
| |
| | |
| {|cellpadding="10" cellspacing="5"
| |
| |-style="background-color:#ddf5f5"
| |
| |OCR Profile||The OCR Profile to use for character extraction.
| |
| |-style="background-color:#ddf5f5"
| |
| |Region||Specifies a region, relative to each output instance, where OCR should be performed.
| |
| |-style="background-color:#ddf5f5"
| |
| |Auto Snap Distance||Specifies the maximum distance for an auto snap operation, which automatically aligns the edges of the zone to lines on the document.
| |
| |-style="background-color:#ddf5f5"
| |
| |Auto Snap Margin||When the auto snap feature is in use, specifies an additional amount to shrink the zone on each edge.
| |
| |-style="background-color:#ddf5f5"
| |
| |Value Extractor||An optional extractor to be executed against the OCR content.
| |
| |-style="background-color:#ddf5f5"
| |
| |Line Separator||When capturing multiple lines of text, specifies how line breaks will be represented in the output.
| |
| |-style="background-color:#ddf5f5"
| |
| |Output Full Region||Specifies whether the highlight region of each output instance will reflect the full OCR area, or only the area containing text.
| |