2.90:OCR Reader (Result Post Processor)

Revision as of 13:42, 23 October 2020

The OCR Reader post processor selected on a Data Type's property panel.

OCR Reader is a Post Processing option for Data Type extractors. This allows you to define a rectangular region (called an "extraction zone" or just "zone") on a page relative to the Data Type's extraction result. Instead of just the original result, all text falling within the zone (obtained from the Recognize activity) will be returned as the result.

OCR Reader has some additional functionality as well. It can exclude the original result using the Exclude Anchor property, returning everything else in the zone. If lines are present on the document, it can use Grooper's "Auto Snap" functionality to draw the zone without any configuration. The text can also be re-processed with a different OCR Profile for highly targeted OCR results.

About

Highly structured documents organize information into a series of data fields. These fields will have a label identifying what the field contains, such as "Name", and a corresponding value, such as "John Doe". While the values for these fields will change from document to document, their position on the document will remain constant.

The OCR Reader result post-processor extracts data using this feature of document layouts.

As long as you can be reasonably assured the data you want to find will be in the same spot from document to document and you can use a Data Type to get close enough to where the field value is (extracting, for example, the field label), the OCR Reader can draw a rectangular region around the text you want to extract, returning the value inside.
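The idea of "return all text falling inside the zone" can be sketched in a few lines of Python. This is only an illustration, not Grooper's implementation: the `Rect` and `Word` types, the `text_in_zone` helper, the coordinates, and the rule that a word counts as inside when its center falls in the zone are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page units (inches here); hypothetical type."""
    left: float
    top: float
    width: float
    height: float

    def contains_center(self, other: "Rect") -> bool:
        # Assumption: a word belongs to the zone if its center point is inside.
        cx = other.left + other.width / 2
        cy = other.top + other.height / 2
        return (self.left <= cx <= self.left + self.width
                and self.top <= cy <= self.top + self.height)


@dataclass(frozen=True)
class Word:
    """One recognized word and its position on the page; hypothetical type."""
    text: str
    box: Rect


def text_in_zone(words, zone):
    """Join the words inside the zone in a simple top-to-bottom, left-to-right order."""
    inside = [w for w in words if zone.contains_center(w.box)]
    inside.sort(key=lambda w: (w.box.top, w.box.left))
    return " ".join(w.text for w in inside)


# Made-up OCR output for two neighboring fields on a form.
words = [
    Word("1.", Rect(0.60, 1.00, 0.15, 0.10)),
    Word("Last", Rect(0.80, 1.00, 0.30, 0.10)),
    Word("Name", Rect(1.15, 1.00, 0.40, 0.10)),
    Word("Cleugh", Rect(0.60, 1.25, 0.50, 0.15)),
    Word("2.", Rect(2.70, 1.00, 0.15, 0.10)),
    Word("First", Rect(2.90, 1.00, 0.30, 0.10)),
]

# A zone covering only the first field's box.
zone = Rect(left=0.5, top=0.9, width=2.0, height=0.6)
zone_text = text_in_zone(words, zone)
```

In this sketch the label and the value of the first field both land in `zone_text`, while the words of the neighboring field are left out because their centers fall outside the rectangle.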

Auto Snap - Snapping to Lines

The OCR Reader was designed with structured forms in mind, specifically forms that use lines to distinguish one field from another, such as the three fields shown here.

Lines make it easy to distinguish the last name, first name, and middle initial fields.

The basic idea behind the OCR Reader post-processor is to first find something that identifies the field value you want to extract.

For example, the field label, "1. Last Name".

Grooper's "Auto Snap" functionality will then expand the green extraction zone to the nearest detected lines. By default, OCR Reader will snap to lines if they are present, requiring no further configuration of the OCR Reader to draw the extraction zone.

All text falling inside the extraction zone will be returned by the Data Type.

Note: You can exclude the Data Type's original result used to draw the zone using the Exclude Anchor property. This would exclude the field label "1. Last Name", returning only the last name "Cleugh" in this case.

⚠	The lines must be detected from a Line Detection or Line Removal command during an Image Processing or Recognize activity before extracting the document. If Grooper doesn't know the lines are there, it won't be able to snap to the lines.
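One way to picture snapping is as growing each edge of the anchor's rectangle outward until it meets the nearest detected line. The sketch below is a simplified model under stated assumptions, not Grooper's actual algorithm: the `snap_to_lines` function is hypothetical, lines are reduced to a single coordinate (horizontal lines to a y value, vertical lines to an x value), and their lengths are ignored.

```python
def snap_to_lines(anchor, h_lines, v_lines):
    """Expand an anchor rectangle to the nearest surrounding lines.

    anchor  -- (left, top, right, bottom) of the anchor result
    h_lines -- y coordinates of detected horizontal lines
    v_lines -- x coordinates of detected vertical lines

    An edge with no line beyond it stays where it is.
    """
    left, top, right, bottom = anchor
    new_top = max((y for y in h_lines if y <= top), default=top)
    new_bottom = min((y for y in h_lines if y >= bottom), default=bottom)
    new_left = max((x for x in v_lines if x <= left), default=left)
    new_right = min((x for x in v_lines if x >= right), default=right)
    return (new_left, new_top, new_right, new_bottom)


# Made-up positions: the label "1. Last Name" sits inside a ruled box.
label_box = (0.6, 1.0, 1.55, 1.1)
horizontal = [0.9, 1.5, 2.1]   # y positions of horizontal lines
vertical = [0.5, 2.5, 4.0]     # x positions of vertical lines

snapped = snap_to_lines(label_box, horizontal, vertical)
# No detected lines at all: the zone stays the size of the anchor.
unsnapped = snap_to_lines(label_box, [], [])
```

The second call mirrors the warning above: with no line data available there is nothing to snap to, so the zone never grows beyond the anchor.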

FYI

Auto Snap makes configuring OCR Reader very simple as long as the anchor value is encapsulated in a box whose lines can be detected. It makes it... a snap!

However, you can also manually define the extraction zone using the Region property. You can find more info on setting this up in the How To section of this article.

OCR Reprocessing

Text inside the extraction zone can be reprocessed by a second OCR Profile. This is extremely useful on documents where the labels are easily extracted by one OCR Profile, but the values themselves are more accurately read by a different one. For example, one OCR engine may perform better on the font used for labels, while a second may do better on the font used for values. Grooper 2.80 and later come installed with the Transym and Tesseract OCR engines. Transym does a great job recognizing most fonts. However, it can do a poor job recognizing the OCRA font. Tesseract has unique functionality to handle the OCRA font.

In the example below, the text reading "Wyatt" inside the extraction zone could be reprocessed by an OCR Profile using the Tesseract engine to accurately extract the name "Wyatt".

For more information, visit the Re-OCRing the Zone section of the How To tutorials in this article.

How To


Prereqs - Layout Data Collection

If you're going to take advantage of Auto Snap, you must first find and save that line location information. This can be done with a Line Detection or Line Removal IP command in an IP Profile. After applying that IP Profile during an Image Processing or Recognize activity, that data will be saved to the page's "LayoutData.json" file in Grooper.

Establish the Anchor Result

The very first thing you need to do is use the Data Type to return a result. This will be the starting point or "anchor" for the extraction zone. There are a variety of ways to produce an extraction result, using the Pattern property or child Data Format and Data Type extractors.

For this tutorial, we are going to use a simple regular expression to locate the field label "1. Last Name", using the Pattern property of the Data Type. This result will be the anchor result for the OCR Reader post processor.

Create or select a Data Type object.
Select the Pattern property.
Press the ellipsis button to bring up the "Pattern Editor".
Write a regular expression pattern to get close to the zone you want to extract.
- In this case, 1\. Last Name
This puts us inside the box where the last name value "Cleugh" is located.
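The anchor pattern used in this section, 1\. Last Name, can be tried outside Grooper to see why the period is escaped. The snippet below uses Python's `re` module purely for illustration; Grooper's Pattern Editor has its own regex engine and options, and the sample page text is made up.

```python
import re

# The anchor pattern from this tutorial. The backslash makes the period literal.
anchor_pattern = re.compile(r"1\. Last Name")

# Made-up page text standing in for the output of the Recognize activity.
page_text = "1. Last Name\nCleugh\n2. First Name\nAnissa"

match = anchor_pattern.search(page_text)
anchor_text = match.group()   # the matched label
anchor_span = match.span()    # character offsets of the label in page_text

# Without the escape, "." matches any character, so look-alike text would
# also be accepted as an anchor.
loose_hit = re.search(r"1. Last Name", "10 Last Name") is not None
strict_hit = re.search(r"1\. Last Name", "10 Last Name") is not None
```

Here `loose_hit` is true and `strict_hit` is false, which is the reason for escaping the period when the label itself contains one.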

Enable the OCR Reader

On the parent Data Type, select the Post Processing property.
Using the dropdown list, select OCR Reader.

With detected lines present, Auto Snap needs no further configuration: all text falling within the extraction zone is returned, including the anchor's result.
- Here, the value is "1. Last NameCleugh"
Note: This result may look somewhat odd, because the result's highlight on the document does not extend to the full boundaries of the box around the anchor label.

When configuring the OCR Reader, you may find the Output Full Region property helpful. Enabling this property will show the full extraction zone, giving you a better idea about what could be extracted from the zone.

Expand the OCR Reader sub-properties.
Select the Output Full Region property and change it to True.
This will display the full extraction zone, which you can clearly see extends to the full borders of the box.

Excluding the Anchor Result

Often, whatever text you used to home in on the zone's location on the document is not what you actually want to extract. It's just the context you used to find the value you do want. In this case, the Exclude Anchor property can simply delete the anchor's text from the final result.

For example, what we really want to return is the last name "Cleugh" and not "1. Last NameCleugh".

Select the Exclude Anchor property and change it to True.
This will remove the anchor's result "1. Last Name", leaving us with just "Cleugh" as the result.
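The effect of Exclude Anchor on this example can be approximated with plain string handling. This is a sketch only: the `exclude_anchor` helper is hypothetical, and it works on text alone, whereas Grooper presumably also knows where the anchor sits on the page.

```python
def exclude_anchor(zone_text: str, anchor_text: str) -> str:
    """Drop the first occurrence of the anchor from the zone text, then trim."""
    return zone_text.replace(anchor_text, "", 1).strip()


# Values taken from this tutorial.
full_result = "1. Last NameCleugh"
value_only = exclude_anchor(full_result, "1. Last Name")
```

Only the first occurrence is removed, so a value that happens to repeat the label text would not be stripped a second time.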

Defining the Region without Auto Snap

What if you don't have a document with lines? Without lines, Auto Snap won't have any lines to snap to!

Here, we have a simple pattern to match the label "2. First Name" as the anchor, but only the label is returned. If we want to use OCR Reader to establish the extraction zone without lines present, we will need to define the zone using the Region properties.

Expand the Region sub-properties.
The Left and Top properties will move the zone's location. The default starting position for the zone is the top-left corner of the anchor's result (however, you can change this using the Relative To property).
- Here, we've set the Top property to 0.15in.
This moves the extraction zone 0.15 inches down the page, starting at the top-left corner of the anchor's result.
- For the Top property, positive values will move the zone down the page. Negative values will move the zone up the page.
- For the Left property, positive values will move the zone to the right. Negative values will move the zone to the left.

This leaves us with a really tiny extraction zone. As well as defining the region's location, we must also define its size.

The Width and Height properties will control the zone's size.
- Here, we have set the zone to 1 inch wide by 0.3 inch high.
Notice that, since we moved the zone using the Top and Left properties, the anchor "2. First Name" no longer falls inside the extraction zone. Only the name "Anissa" is extracted and returned.
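The arithmetic behind the Region properties can be sketched as follows, assuming the default behavior described above (offsets measured from the top-left corner of the anchor's result). The `Region` class and `zone_from_anchor` function are hypothetical names for this example, and the anchor position of 3.0 by 1.0 inches is invented; only the 0.15, 1 and 0.3 inch values come from the tutorial.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Offsets and size in inches, mirroring the Left/Top/Width/Height properties."""
    left: float = 0.0
    top: float = 0.0
    width: float = 0.0
    height: float = 0.0


def zone_from_anchor(anchor_left, anchor_top, region):
    """Return the zone as (left, top, right, bottom) in inches.

    Positive Left moves the zone right and positive Top moves it down the page;
    negative values move it left and up.
    """
    zone_left = anchor_left + region.left
    zone_top = anchor_top + region.top
    return (zone_left, zone_top,
            zone_left + region.width, zone_top + region.height)


# Hypothetical anchor: the label "2. First Name" starts 3.0 in from the left
# edge and 1.0 in from the top of the page.
region = Region(top=0.15, width=1.0, height=0.3)
zone = zone_from_anchor(3.0, 1.0, region)
```

With these values the zone starts 0.15 inches below the top of the label and extends 1 inch to the right and 0.3 inches down, so a short label above it is excluded while the value beneath it is covered.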
