2.90:OCR Reader (Result Post Processor)

From Grooper Wiki

LEGACY TECHNOLOGY DETECTED!!

The OCR Reader post processor is still configurable for Data Type extractors in Grooper. However, this is a largely outdated way of doing things as of version 2021.

Now, it is more likely you would use the Read Zone extractor type to accomplish the same end goal.


The OCR Reader post processor selected on a Data Type's property panel.

OCR Reader is a Post Processing option for Data Type extractors. This allows you to define a rectangular region (called an "extraction zone" or just "zone") on a page relative to the Data Type's extraction result. Instead of just the original result, all text falling within the zone (obtained from the Recognize activity) will be returned as the result.

OCR Reader has some additional functionality as well. It can exclude the original result using the Exclude Anchor property, returning everything else in the zone. If lines are present on the document, it can use Grooper's "Auto Snap" functionality to draw the zone without any configuration. The text can also be reprocessed with a different OCR Profile for highly targeted OCR results.

About

Highly structured documents organize information into a series of data fields. These fields will have a label identifying what the field contains, such as "Name", and a corresponding value, such as "John Doe". While the values for these fields will change from document to document, their position on the document will remain constant.

The OCR Reader result post-processor extracts data using this feature of document layouts.

As long as you can be reasonably assured the data you want to find will be in the same spot from document to document and you can use a Data Type to get close enough to where the field value is (extracting, for example, the field label), the OCR Reader can draw a rectangular region around the text you want to extract, returning the value inside.
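
The idea can be sketched in a few lines of Python. This is only an illustration, not Grooper's implementation: the Box and Word types, the text_in_zone function, and the sample coordinates are all hypothetical, and the rule used here (a word counts as inside the zone when its center falls in the zone) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in inches, origin at the page's top-left corner."""
    left: float
    top: float
    width: float
    height: float

    def contains_point(self, x: float, y: float) -> bool:
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)


@dataclass(frozen=True)
class Word:
    text: str
    box: Box


def text_in_zone(words, zone):
    """Return the text of every word whose center lies inside the zone."""
    inside = [
        w for w in words
        if zone.contains_point(w.box.left + w.box.width / 2,
                               w.box.top + w.box.height / 2)
    ]
    # Sort top-to-bottom, then left-to-right, as a rough reading order.
    inside.sort(key=lambda w: (round(w.box.top, 1), w.box.left))
    return " ".join(w.text for w in inside)


# Hypothetical OCR words: a label, its value, and a neighboring field's label.
words = [
    Word("Name", Box(1.0, 1.00, 0.5, 0.15)),
    Word("John", Box(1.0, 1.20, 0.4, 0.15)),
    Word("Doe", Box(1.5, 1.20, 0.3, 0.15)),
    Word("Address", Box(4.0, 1.00, 0.7, 0.15)),
]
zone = Box(0.9, 0.95, 2.0, 0.5)
zone_text = text_in_zone(words, zone)  # "Name John Doe"
```

The neighboring "Address" label is left out because it sits outside the zone, which is the whole point of drawing one.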

Auto Snap - Snapping to Lines

The OCR Reader was designed with structured forms in mind, specifically forms that use lines to distinguish one field from another, such as the three fields shown here.

Lines make it easy to distinguish the last name, first name, and middle initial fields.

The basic idea behind the OCR Reader post-processor is to first find something that identifies the field value you want to extract.

For example, the field label, "1. Last Name".

Grooper's "Auto Snap" functionality will then expand the green extraction zone to the nearest detected lines. By default, OCR Reader will snap to lines if they are present, so no further configuration is required to draw the extraction zone.

All text falling inside the extraction zone will be returned by the Data Type.

  • Note: You can exclude the Data Type's original result used to draw the zone using the Exclude Anchor property. This would exclude the field label "1. Last Name", returning only the last name "Cleugh" in this case.

The lines must be detected from a Line Detection or Line Removal command during an Image Processing or Recognize activity before extracting the document.

If Grooper doesn't know the lines are there, it won't be able to snap to the lines.
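
To make the snapping idea concrete, here is a minimal Python sketch. Grooper's actual Auto Snap logic is not described in this article, so the sketch assumes the simplest possible behavior: each edge of the anchor's box is pushed outward to the nearest detected line. The snap_to_lines function and all coordinates are hypothetical.

```python
def snap_to_lines(anchor, h_lines, v_lines):
    """Expand an anchor box outward to the nearest detected lines.

    anchor  -- (left, top, right, bottom) in inches
    h_lines -- y positions of detected horizontal lines
    v_lines -- x positions of detected vertical lines
    An edge with no line beyond it stays where it is.
    """
    left, top, right, bottom = anchor
    new_left = max((x for x in v_lines if x <= left), default=left)
    new_right = min((x for x in v_lines if x >= right), default=right)
    new_top = max((y for y in h_lines if y <= top), default=top)
    new_bottom = min((y for y in h_lines if y >= bottom), default=bottom)
    return (new_left, new_top, new_right, new_bottom)


# Hypothetical box around the label "1. Last Name".
label_box = (0.60, 2.05, 1.40, 2.20)
# Hypothetical line positions, standing in for saved line detection results.
horizontal = [2.00, 2.60]
vertical = [0.50, 3.00, 5.50]

zone = snap_to_lines(label_box, horizontal, vertical)
# With no detected lines there is nothing to snap to, so the box is unchanged.
unsnapped = snap_to_lines(label_box, [], [])
```

The second call mirrors the point made above: without line information, the zone cannot grow beyond the anchor itself.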


FYI: Auto Snap makes configuring OCR Reader very simple as long as the anchor value is enclosed in a box whose lines can be detected. It makes it... a snap!

However, you can also manually define the extraction zone using the Region property. You can find more info on setting this up in the How To section of this article.

OCR Reprocessing

Text inside the extraction zone can be reprocessed by a second OCR Profile. This is extremely useful on documents where the labels are easily extracted by one OCR Profile, but the values themselves are more accurately read by a different one. For example, one OCR engine may perform better on the font used for labels, while a second may do better on the font used for values. Grooper 2.80 and later versions come installed with the Transym and Tesseract OCR engines. Transym does a great job recognizing most fonts. However, it can do a poor job of recognizing the OCR-A font. Tesseract has unique functionality to handle the OCR-A font.

In the example below, the text reading "Wyatt" inside the extraction zone could be reprocessed by an OCR Profile using the Tesseract engine to accurately extract the name "Wyatt".

For more information, visit the Re-OCRing the Zone section of the How To tutorials in this article.


How To

If you would like to follow along with this tutorial, you may download the zip file below and import it into your Grooper Repository. This file contains a batch with the documents used in this tutorial and configured Grooper assets.

Prereqs - Layout Data Collection

If you're going to take advantage of Auto Snap, you must first find and save that line location information. This can be done with a Line Detection or Line Removal IP command in an IP Profile. After applying that IP Profile during an Image Processing or Recognize activity, that data will be saved to the page's "LayoutData.json" file in Grooper.

Establish the Anchor Result

  1. Write a regular expression pattern to get close to the zone you want to extract.
    • In this case, 1\. Last Name
  2. This puts us inside the box where the last name value "Cleugh" is located.

The very first thing you need to do is use the Data Type to return a result. This will be the starting point or "anchor" for the extraction zone. There are a variety of ways to produce an extraction result, using the Pattern property or child Data Format and Data Type extractors.

For this tutorial we are going to use a simple regular expression to locate the field label "1. Last Name", using the Pattern property of the Data Type. This result will be the anchor result for the OCR Reader post processor.

  1. Create or select a Data Type object.
  2. Select the Pattern property.
  3. Press the ellipsis button to bring up the "Pattern Editor".
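
If you want to try the pattern 1\. Last Name outside Grooper, the small check below uses Python's re module. Grooper's Pattern Editor uses its own regular expression engine, so treat this only as a demonstration of why the period is escaped. The sample text is made up.

```python
import re

# Hypothetical OCR text for one line of the form.
page_text = "1. Last Name Cleugh  2. First Name Anissa"

# Escaped period: matches the literal label only.
anchor = re.search(r"1\. Last Name", page_text)
anchor_text = anchor.group(0)   # "1. Last Name"
anchor_span = anchor.span()     # (0, 12)

# Unescaped period: "." matches any character, so look-alike text also matches.
loose = re.search(r"1. Last Name", "1x Last Name")
strict = re.search(r"1\. Last Name", "1x Last Name")
```

Here loose finds a match in "1x Last Name" while strict does not, which is why the tutorial's pattern escapes the period.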

Enable the OCR Reader

  1. On the parent Data Type, select the Post Processing property.
  2. Using the dropdown list, select OCR Reader.

  1. With lines detected, Auto Snap needs no further configuration: all text falling within the extraction zone is returned, including the anchor's result.
    • Here, the value is "1. Last NameCleugh".
  2. Note: You may find this result somewhat odd in that the result's highlight on the document does not extend to the full boundaries of the box around the anchor label.

When configuring the OCR Reader, you may find the Output Full Region property helpful. Enabling this property will show the full extraction zone, giving you a better idea about what could be extracted from the zone.

  1. Expand the OCR Reader sub-properties.
  2. Select the Output Full Region property and change it to True.
  3. This will display the full extraction zone, which you can clearly see extends to the full borders of the box.

Excluding the Anchor Result

Often, whatever text you used to home in on the zone's location on the document is not what you actually want to extract. It's just the context you used to find the value you do want. In this case, the Exclude Anchor property can simply delete the anchor's text from the final result.

For example, what we really want to return is the last name "Cleugh" and not "1. Last NameCleugh".

  1. Select the Exclude Anchor property and change it to True.
  2. This will remove the anchor's result "1. Last Name", leaving us with just "Cleugh" as the result.
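
In effect, Exclude Anchor subtracts the anchor's text from the zone's text. A minimal Python sketch of that idea follows. The exclude_anchor helper is hypothetical, and removing the first occurrence of the anchor string is an assumption about the behavior, not a description of Grooper's internals.

```python
def exclude_anchor(zone_text: str, anchor_text: str) -> str:
    """Remove the first occurrence of the anchor's text from the zone's text."""
    return zone_text.replace(anchor_text, "", 1).strip()


# Values from the tutorial above.
result = exclude_anchor("1. Last NameCleugh", "1. Last Name")  # "Cleugh"
```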

Defining the Region without Auto Snap

What if you don't have a document with lines? Without lines, Auto Snap won't have any lines to snap to!

Here, we have a simple pattern to match the label "2. First Name" as the anchor, but only the label is returned. If we want to use OCR Reader to establish the extraction zone without lines present, we will need to define the zone using the Region properties.

  1. Expand the Region sub-properties.
  2. The Left and Top properties will move the zone's location. The default starting position for the zone is the top-left corner of the anchor's result (however, you can change this using the Relative To property).
    • Here, we've set the Top property to 0.15in.
  3. This moves the extraction zone 0.15 inches down the page, starting at the top-left corner of the anchor's result.
    • For the Top property, positive values will move the zone down the page. Negative values will move the zone up the page.
    • For the Left property, positive values will move the zone to the right. Negative values will move the zone to the left.

This leaves us with a really tiny extraction zone. As well as defining the region's location, we must also define its size.

  1. The Width and Height properties will control the zone's size.
    • Here, we have set this to a 1 inch wide by 0.3 inch high zone.
  2. Notice since we moved the zone using the Top and Left properties, the anchor "2. First Name" no longer falls inside the extraction zone. Only the name "Anissa" is extracted and returned.
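
The arithmetic behind the Region properties can be sketched as follows. The function and argument names are hypothetical and the anchor's page position is invented. Only the settings (Top of 0.15in, a zone 1in wide by 0.3in high) come from this tutorial, and the sketch assumes the default of measuring from the top-left corner of the anchor's result.

```python
def region_zone(anchor_left, anchor_top, left=0.0, top=0.0, width=0.0, height=0.0):
    """Return (left, top, right, bottom) of a zone placed relative to the
    top-left corner of the anchor's result. All values are in inches.

    Positive `left` moves the zone right; positive `top` moves it down the page.
    """
    zone_left = anchor_left + left
    zone_top = anchor_top + top
    return (zone_left, zone_top, zone_left + width, zone_top + height)


# Hypothetical position of the anchor "2. First Name" on the page.
anchor_left, anchor_top = 3.0, 2.0

# Settings from the tutorial: Top = 0.15in, Width = 1in, Height = 0.3in.
zone = region_zone(anchor_left, anchor_top, top=0.15, width=1.0, height=0.3)

# A negative Top would move the zone up the page instead.
zone_above = region_zone(anchor_left, anchor_top, top=-0.15, width=1.0, height=0.3)
```

Because the zone starts 0.15 inches below the anchor's top edge, a label shorter than that falls outside the zone, which matches the result described above.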

Re-OCRing the Zone

Prereqs - Create a Secondary OCR Profile

The document used in this tutorial uses a specialized font for the field values, the OCR-A font. While this font was originally created for OCR, modern OCR engines often have a hard time recognizing it. However, the Tesseract engine can be trained on fonts, allowing you to improve the OCR accuracy of non-standard fonts.

Training data for the OCR-A font ships with all Grooper installs (post version 2.72). Here we have a very simple secondary OCR Profile, using Tesseract OCR for the OCR Engine, and the OCRA font checked as a Special Fonts option.

  • The main OCR results were supplied by an OCR Profile using the Transym OCR 4.0 engine.

Configure the OCR Reader

Here, we have a Data Type using the "Docket Number" label as the anchor result. The extraction zone is drawn correctly, snapping to the lines around the label. However, the result is wrong. The labels on this document use a typical font but the values are all in that OCR-A font.

This should return the docket number "055-761349". However, the main OCR Profile, using Transym, did not recognize the number well, returning "055 - 7613 L 9".

Assign the Secondary OCR Profile

To reprocess this portion of the document with a different OCR Profile, use the OCR Profile property to assign a secondary OCR Profile.

  1. Here, we have set the OCR Profile to the profile using the Tesseract OCR engine.
  2. OCR runs on the portion of the document covered by the extraction zone.
  3. The new OCR results are returned by the extractor.
    • In this case, the secondary OCR profile provides the accurate result.
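
The flow of those three steps can be sketched in Python. Nothing here calls a real OCR engine: reocr_zone and zone_to_pixels are hypothetical helpers, the page is modeled as a list of pixel rows, and fake_engine is a stand-in for whatever secondary engine is configured. The sketch only shows that the second pass sees the zone's pixels and nothing else.

```python
def zone_to_pixels(zone_in, dpi):
    """Convert a (left, top, right, bottom) zone in inches to pixel coordinates."""
    return tuple(round(v * dpi) for v in zone_in)


def reocr_zone(page_rows, zone_in, dpi, ocr_engine):
    """Crop the page to the zone and run the given OCR callable on the crop only."""
    left, top, right, bottom = zone_to_pixels(zone_in, dpi)
    crop = [row[left:right] for row in page_rows[top:bottom]]
    return ocr_engine(crop)


# A blank hypothetical page: 300 rows of 600 pixels at 100 DPI (6in x 3in).
page = [[0] * 600 for _ in range(300)]

seen_sizes = []


def fake_engine(crop):
    """Stand-in for a secondary OCR engine; records the crop size it was given."""
    seen_sizes.append((len(crop[0]), len(crop)))
    return "055-761349"  # placeholder text: the docket number from the tutorial


text = reocr_zone(page, (0.5, 1.0, 2.5, 1.5), dpi=100, ocr_engine=fake_engine)
```

The stand-in engine receives a 200 by 50 pixel crop rather than the full 600 by 300 page, so the second engine's work is limited to the extraction zone.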

Data Instancing: The Value Extractor Property

By returning an extraction zone, we've also returned a new data instance of the document. We can use the Value Extractor property of the OCR Reader to parse information inside this data instance. The extractor or pattern set here will only execute against the text inside the extraction zone.

Here, we have used the pattern ^[^@]+ to return just the local part of the email address.

  1. Use the Value Extractor property to assign an Internal or External extractor.
  2. The extractor runs only against the data instance. It executes against the text returned by the OCR Reader.
    • Here this returns the local part of the email address "cfears5" instead of the full address "cfears5@sitemeter.com"

However, be aware the Value Extractor here will return all matches concatenated together as a single result.

The caret symbol (^) is a special character in regular expressions that matches the beginning of a string. We included it at the beginning of our pattern to anchor our [^@]+ pattern so that it only returns results on the left side of the "at" symbol. The full expression ^[^@]+ will return any number of characters that are not "at" symbols only if they are located at the beginning of the string (which in this case is the data instance returned by the OCR Reader).

What if we didn't have that beginning-of-string character and just used [^@]+? That pattern would match both "cfears5" and "sitemeter.com".

The two results are combined, returning a single result "cfears5sitemeter.com".
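
The difference between the two patterns is easy to reproduce with Python's re module. Grooper uses its own regular expression engine, but these two patterns behave the same way in most engines. The final join step is only an imitation of the concatenation described above.

```python
import re

zone_text = "cfears5@sitemeter.com"

anchored = re.findall(r"^[^@]+", zone_text)    # ["cfears5"]
unanchored = re.findall(r"[^@]+", zone_text)   # ["cfears5", "sitemeter.com"]

# Imitate the behavior described above: all matches are joined into one result.
anchored_result = "".join(anchored)            # "cfears5"
unanchored_result = "".join(unanchored)        # "cfears5sitemeter.com"
```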

It may be the case you do want to return all values as a concatenated string, in which case this functionality is perfect for you.

You may also want to insert some kind of character between each result, like a space or a comma or another character (or combination of characters).

  1. You can use the Value Separator property to insert a character between each result.
    • For example, entering the pipe character (|) here will create a pipe-delimited list of results.
  2. You can see here, this changes the output to "cfears5|sitemeter.com"
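
The effect of a separator can be imitated in Python with a single join. The collate helper and its separator argument are hypothetical names for this sketch, not Grooper properties.

```python
import re


def collate(pattern, zone_text, separator=""):
    """Join every match of `pattern` in the zone text with the given separator."""
    return separator.join(re.findall(pattern, zone_text))


piped = collate(r"[^@]+", "cfears5@sitemeter.com", separator="|")  # "cfears5|sitemeter.com"
plain = collate(r"[^@]+", "cfears5@sitemeter.com")                 # "cfears5sitemeter.com"
```

With an empty separator the sketch falls back to plain concatenation, the default behavior described earlier in this section.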

A Word of Caution

In a lot of ways, the Read Zone option for a Data Field's Value Extractor and the OCR Reader Post Processing option for Data Types are very similar to each other. Both can use text anchors and extraction zones to return data inside the drawn boundary on the page.

Both have a Value Extractor property to parse data within the zone's data instance as well. However, the results are collated and returned in one drastically different way.

  • Read Zone will return only the first match returned.
  • OCR Reader will return all matches concatenated together.

The differences are not necessarily good or bad. It just depends on your needs which one you will want to use.

For an in-depth explanation of the differences, visit the Read Zone article's section on the Value Extractor property.

Version Differences

The OCR Reader result processor is a new configurable property available to Data Types as of version 2.72. Prior to version 2.72, the capability of reprocessing OCR on a region of a document was available by configuring a Data Element Profile in a Document Type object. OCR Reader provides a much simpler configuration to obtain the same result. Furthermore, since the result is returned to a Data Type, the result can be used any time a Data Type extractor is used in Grooper, not just to populate a Data Field in a Data Model.

There are no major differences to report from version 2.72 to 2.90.