2.90:OCR Reader (Result Post Processor)
‼ LEGACY TECHNOLOGY DETECTED!! The OCR Reader post processor is still configurable for Data Type extractors in Grooper. However, this is a largely outdated way of doing things as of version 2021. Now, it is more likely you would use the Read Zone extractor type to accomplish the same end goal.

OCR Reader is a Post Processing option for Data Type extractors. This allows you to define a rectangular region (called an "extraction zone" or just "zone") on a page relative to the Data Type's extraction result. Instead of just the original result, all text falling within the zone (obtained from the Recognize activity) will be returned as the result.
OCR Reader has some additional functionality as well. It can exclude the original result using the Exclude Anchor property, returning everything else in the zone. If lines are present on the document, it can use Grooper's "Auto Snap" functionality to draw the zone without any configuration. The text can also be re-processed with a different OCR Profile for highly targeted OCR results.
About
Highly structured documents organize information into a series of data fields. These fields will have a label identifying what the field contains, such as "Name", and a corresponding value, such as "John Doe". While the values for these fields will change from document to document, their position on the document will remain constant.
The OCR Reader result post-processor extracts data using this feature of document layouts. As long as you can be reasonably assured the data you want to find will be in the same spot from document to document and you can use a Data Type to get close enough to where the field value is (extracting, for example, the field label), the OCR Reader can draw a rectangular region around the text you want to extract, returning the value inside.
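To make the idea of "returning the value inside" concrete, here is a minimal Python sketch. It is not Grooper code. The Rect and Word classes, the text_in_zone function, and all coordinates are hypothetical, and it assumes a word counts as "in the zone" only when its whole bounding box falls inside the rectangle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Rect") -> bool:
        """True when the other rectangle lies entirely inside this one."""
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


@dataclass(frozen=True)
class Word:
    text: str
    box: Rect


def text_in_zone(words, zone):
    """Join, in reading order, the words whose boxes lie fully inside the zone."""
    inside = [w for w in words if zone.contains(w.box)]
    inside.sort(key=lambda w: (w.box.top, w.box.left))
    return " ".join(w.text for w in inside)


# Made-up OCR words with made-up coordinates (in inches).
words = [
    Word("1.", Rect(0.50, 1.00, 0.62, 1.12)),
    Word("Last", Rect(0.66, 1.00, 0.90, 1.12)),
    Word("Name", Rect(0.94, 1.00, 1.25, 1.12)),
    Word("Cleugh", Rect(0.60, 1.20, 1.10, 1.40)),
    Word("2.", Rect(3.10, 1.00, 3.22, 1.12)),  # belongs to the next field
]
zone = Rect(0.45, 0.95, 3.00, 1.45)
result = text_in_zone(words, zone)
print(result)
```

The word "2." sits outside the zone, so it is left out of the result even though it is on the same text line as the label.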
Auto Snap - Snapping to Lines
The OCR Reader was designed with structured forms in mind, specifically forms that use lines to separate one field from another. On such a form, lines make it easy to distinguish the last name, first name, and middle initial fields.
The basic idea behind the OCR Reader post-processor is to first find something that identifies the field value you want to extract, for example, the field label "1. Last Name".
Grooper's "Auto Snap" functionality will then expand the green extraction zone to the nearest detected lines. By default, OCR Reader will snap to lines if they are present, requiring no further configuration of the OCR Reader to draw the extraction zone. All text falling inside the extraction zone will be returned by the Data Type.
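The snapping idea can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Grooper's actual algorithm: lines are reduced to a single coordinate each (y for horizontal lines, x for vertical lines), their lengths are ignored, and the snap_to_lines function and all coordinates are hypothetical.

```python
def snap_to_lines(anchor, h_lines, v_lines):
    """Expand an anchor rectangle outward to the nearest line on each side.

    anchor  -- (left, top, right, bottom) of the anchor result
    h_lines -- y positions of detected horizontal lines
    v_lines -- x positions of detected vertical lines
    An edge with no line beyond it stays where it is.
    """
    left, top, right, bottom = anchor
    new_top = max((y for y in h_lines if y <= top), default=top)
    new_bottom = min((y for y in h_lines if y >= bottom), default=bottom)
    new_left = max((x for x in v_lines if x <= left), default=left)
    new_right = min((x for x in v_lines if x >= right), default=right)
    return (new_left, new_top, new_right, new_bottom)


# Made-up coordinates: the anchor is the label "1. Last Name".
anchor = (0.50, 1.00, 1.25, 1.12)
h_lines = [0.90, 1.50, 2.10]
v_lines = [0.40, 3.00, 5.20]
zone = snap_to_lines(anchor, h_lines, v_lines)
print(zone)
```

With no lines at all, the function returns the anchor rectangle unchanged, which mirrors why a document without detected lines needs a manually defined region.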
⚠ The lines must be detected by a Line Detection or Line Removal command during an Image Processing or Recognize activity before extracting the document.
If Grooper doesn't know the lines are there, it won't be able to snap to the lines.
FYI: Auto Snap makes configuring OCR Reader very simple as long as the anchor value is encapsulated in a box whose lines can be detected. It makes it... a snap!
However, you can also manually define the extraction zone using the Region property. You can find more info on setting this up in the How To section of this article.
OCR Reprocessing
Text inside the extraction zone can be reprocessed by a second OCR Profile. This is extremely useful on documents where the labels are easily extracted by one OCR Profile, but the values themselves are more accurately read by a different one. For example, one OCR engine may perform better on the font used for labels, while a second may do better on the font used for values. Grooper 2.80 and later comes installed with the Transym and Tesseract OCR engines. Transym does a great job recognizing most fonts. However, it can do a poor job recognizing the OCR-A font. Tesseract has unique functionality to handle the OCR-A font.
In the example below, the text reading "Wyatt" inside the extraction zone could be reprocessed by an OCR Profile using the Tesseract engine to accurately extract the name "Wyatt".

For more information, visit the Re-OCRing the Zone section of the How To tutorials in this article.
How To
If you would like to follow along with this tutorial, you may download the zip file below and import it into your Grooper Repository. This file contains a batch with the documents used in this tutorial and configured Grooper assets.
Prereqs - Layout Data Collection
If you're going to take advantage of Auto Snap, you must first find and save that line location information. This can be done with a Line Detection or Line Removal IP command in an IP Profile. After applying that IP Profile during an Image Processing or Recognize activity, that data will be saved to the page's "LayoutData.json" file in Grooper.
Establish the Anchor Result
- Write a regular expression pattern to get close to the zone you want to extract.
- In this case, the pattern is: 1\. Last Name
- This puts us inside the box where the last name value "Cleugh" is located.
The very first thing you need to do is use the Data Type to return a result. This will be the starting point or "anchor" for the extraction zone. There are a variety of ways to produce an extraction result, using the Pattern property or child Data Format and Data Type extractors. For this tutorial we are going to use a simple regular expression to locate the field label "1. Last Name", using the Pattern property of the Data Type. This result will be the anchor result for the OCR Reader post processor.
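If you want to check the anchor pattern outside of Grooper, Python's re module behaves the same way for a pattern this simple. The page text below is made up for illustration. The point of the sketch is the escaped period: without the backslash, the period matches any character.

```python
import re

# Made-up page text standing in for the OCR results of the form.
page_text = "1. Last Name\nCleugh\n2. First Name"

# The escaped period matches only a literal "." after the "1".
anchor_pattern = re.compile(r"1\. Last Name")
match = anchor_pattern.search(page_text)
anchor_text = match.group(0)
anchor_span = match.span()
print(anchor_text, anchor_span)

# An unescaped period is a wildcard, so it also accepts unwanted text.
loose_pattern = re.compile(r"1. Last Name")
loose_hits_wrong_text = loose_pattern.search("12 Last Name") is not None
strict_hits_wrong_text = anchor_pattern.search("12 Last Name") is not None
```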
Enable the OCR Reader
When configuring the OCR Reader, you may find the Output Full Region property helpful. Enabling this property will show the full extraction zone, giving you a better idea about what could be extracted from the zone.
Excluding the Anchor Result
Often, whatever text you used to home in on the zone's location on the document is not what you actually want to extract. It's just the context you used to find the value you do want. In this case, the Exclude Anchor property can simply delete the anchor's text from the final result. For example, what we really want to return is the last name "Cleugh" and not "1. Last NameCleugh".
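The effect of excluding the anchor can be sketched as a simple text operation. This is an illustration only, assuming the anchor is removed by matching its pattern against the zone text. The exclude_anchor function is hypothetical and is not how Grooper is implemented.

```python
import re


def exclude_anchor(zone_text, anchor_pattern):
    """Remove the first anchor match from the zone text and trim whitespace."""
    return re.sub(anchor_pattern, "", zone_text, count=1).strip()


zone_text = "1. Last NameCleugh"  # full zone result, anchor included
value = exclude_anchor(zone_text, r"1\. Last Name")
print(value)
```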
Defining the Region without Auto Snap
What if you don't have a document with lines? Without lines, Auto Snap won't have any lines to snap to! Here, we have a simple pattern to match the label "2. First Name" as the anchor, but only the label is returned. If we want to use OCR Reader to establish the extraction zone without lines present, we will need to define the zone using the Region properties.
This leaves us with a really tiny extraction zone. As well as defining the region's location, we must also define its size.
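One way to picture a manually defined region is as an offset from the anchor plus a width and height. The sketch below assumes exactly that model. The names (Rect, zone_from_anchor, dx, dy) and the measurements are hypothetical and are not the names of Grooper's Region properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    width: float
    height: float


def zone_from_anchor(anchor, dx, dy, width, height):
    """Place a zone relative to the anchor's top-left corner.

    dx, dy        -- location: offset from the anchor's top-left corner
    width, height -- size of the zone
    """
    return Rect(anchor.left + dx, anchor.top + dy, width, height)


# Made-up coordinates (in inches) for the label "2. First Name".
anchor = Rect(left=3.10, top=1.00, width=1.00, height=0.12)

# Start a quarter inch below the label and make the zone large enough
# to hold the value printed underneath it.
zone = zone_from_anchor(anchor, dx=0.0, dy=0.25, width=2.5, height=0.5)
print(zone)
```

Both halves matter: the offsets decide where the zone starts, and the width and height decide how much text it can capture.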
Re-OCRing the Zone
Prereqs - Create a Secondary OCR Profile
The document used in this tutorial uses a specialized font for the field values, the OCR-A font. While this font was originally created for OCR, modern OCR engines often have a hard time recognizing it. However, the Tesseract engine can be trained on fonts, allowing you to improve the OCR accuracy of non-standard fonts. Training data for the OCR-A font ships with all Grooper installs (post version 2.72). Here we have a very simple secondary OCR Profile, using Tesseract OCR for the OCR Engine, with the OCRA font checked as a Special Fonts option.
Configure the OCR Reader
Here, we have a Data Type using the "Docket Number" label as the anchor result. The extraction zone is drawn correctly, snapping to the lines around the label. However, the result is wrong. The labels on this document use a typical font but the values are all in that OCR-A font. This should return the docket number "055-761349". However, the main OCR Profile, using Transym, did not recognize the number well, returning "055 - 7613 L 9".
Assign the Secondary OCR Profile
To reprocess this portion of the document with a different OCR Profile, use the OCR Profile property to assign a secondary OCR Profile.
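The control flow can be sketched as follows. This is a hedged illustration, not Grooper code and not a real OCR call: zone_result and stub_secondary_ocr are hypothetical names, and the stub simply returns the value this tutorial expects rather than reading an image.

```python
from typing import Callable, Optional

# A stand-in type for "an OCR profile": takes a zone image reference
# and returns the text it reads there.
OcrFn = Callable[[str], str]


def zone_result(zone_image: str, default_text: str,
                secondary_ocr: Optional[OcrFn] = None) -> str:
    """Return the zone text, re-reading the zone if a secondary OCR is set.

    With no secondary OCR assigned, the text already obtained for the
    zone is returned unchanged.
    """
    if secondary_ocr is None:
        return default_text
    return secondary_ocr(zone_image)


def stub_secondary_ocr(zone_image: str) -> str:
    """Stub only: pretends a second engine read the zone correctly."""
    return "055-761349"


first_pass = "055 - 7613 L 9"  # what the main OCR Profile returned
without_profile = zone_result("zone.png", first_pass)
with_profile = zone_result("zone.png", first_pass, stub_secondary_ocr)
print(without_profile, "->", with_profile)
```

Only the zone is read a second time, so the rest of the page keeps the text from the main OCR Profile.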
Data Instancing: The Value Extractor Property
By returning an extraction zone, we've also returned a new data instance of the document. We can use the Value Extractor property of the OCR Reader to parse information inside this data instance. The extractor or pattern set here will only execute against the text inside the extraction zone. Here, we have used the pattern
However, be aware the Value Extractor here will return all matches concatenated together as a single result. The caret symbol (^) anchors the pattern to the beginning of the string, so it can match only once. Without that beginning-of-string character, the pattern matches twice, and the two results are combined, returning a single result "cfears5sitemeter.com".
It may be the case you do want to return all values as a concatenated string, in which case this functionality is perfect for you. You may also want to insert some kind of character between each result, like a space or a comma or another character (or combination of characters).
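The concatenation behavior is easy to reproduce with Python's re module. The zone text and patterns below are assumptions chosen to produce the result quoted above. They are not taken from the tutorial's configuration, and concat_matches is a hypothetical helper.

```python
import re


def concat_matches(pattern, zone_text, separator=""):
    """Join every match of the pattern in the zone text into one string."""
    return separator.join(m.group(0) for m in re.finditer(pattern, zone_text))


zone_text = "cfears5@sitemeter.com"  # assumed zone text

# With the caret, the pattern can only match at the start of the string.
anchored = concat_matches(r"^[^@]+", zone_text)

# Without it, the pattern matches on both sides of the "@" and the two
# matches are glued together.
unanchored = concat_matches(r"[^@]+", zone_text)

# A separator between matches keeps them readable.
spaced = concat_matches(r"[^@]+", zone_text, separator=" ")

print(anchored, unanchored, spaced, sep="\n")
```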
A Word of Caution
In a lot of ways, the Read Zone option for a Data Field's Value Extractor and the OCR Reader Post Processing option for Data Types are very similar to each other. Both can use text anchors and extraction zones to return data inside the drawn boundary on the page.
Both have a Value Extractor property to parse data within the zone's data instance as well. However, the results are collated and returned in one drastically different way.
- Read Zone will return only the first match returned.
- OCR Reader will return all matches concatenated together.
The differences are not necessarily good or bad. It just depends on your needs which one you will want to use.
For an in-depth explanation of the differences, visit the Read Zone article's section on the Value Extractor property.
Version Differences
The OCR Reader result processor is a new configurable property available to Data Types as of version 2.72. Prior to version 2.72, the capability of reprocessing OCR on a region of a document was available by configuring a Data Element Profile in a Document Type object. OCR Reader provides a much simpler configuration to obtain the same result. Furthermore, since the result is returned to a Data Type, the result can be used any time a Data Type extractor is used in Grooper, not just to populate a Data Field in a Data Model.
There are no major differences to report from version 2.72 to 2.90.