2.90:OCR Reader (Result Post Processor)

Revision as of 11:43, 23 October 2020

The OCR Reader post processor selected on a Data Type's property panel.

OCR Reader is a Post Processing option for Data Type extractors. This allows you to define a rectangular region (called an "extraction zone" or just "zone") on a page relative to the Data Type's extraction result. Instead of just the original result, all text falling within the zone (obtained from the Recognize activity) will be returned as the result.

OCR Reader has some additional functionality as well. It can exclude the original result using the Exclude Anchor property, returning everything else in the zone. If lines are present on the document, it can take advantage of Grooper's "Auto Snap" functionality to draw the zone without any configuration. The text can optionally be re-processed with a different OCR Profile for highly targeted OCR results.

About

Highly structured documents organize information into a series of data fields. These fields will have a label identifying what the field contains, such as "Name", and a corresponding value, such as "John Doe". While the values for these fields will change from document to document, their position on the document will remain constant.

The OCR Reader result post-processor extracts data using this feature of document layouts.

As long as you can be reasonably assured the data you want to find will be in the same spot from document to document and you can use a Data Type to get close enough to where the field value is (extracting, for example, the field label), the OCR Reader can draw a rectangular region around the text you want to extract, returning the value inside.
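The idea of a zone drawn relative to an anchor can be sketched in a few lines of Python. This is only an illustration of the concept, not Grooper's implementation or API: the `Box` and `Word` types, the `zone_from_anchor` and `read_zone` functions, and all coordinates are hypothetical, and the rule that a word must lie fully inside the zone is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Box") -> bool:
        # True when `other` lies entirely inside this box.
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


@dataclass(frozen=True)
class Word:
    """One recognized word and its position on the page (hypothetical type)."""
    text: str
    box: Box


def zone_from_anchor(anchor: Box, left: float, top: float,
                     right: float, bottom: float) -> Box:
    """Grow the anchor's box outward by the given amounts to form the zone."""
    return Box(anchor.left - left, anchor.top - top,
               anchor.right + right, anchor.bottom + bottom)


def read_zone(words: List[Word], zone: Box,
              anchor: Optional[Box] = None,
              exclude_anchor: bool = False) -> str:
    """Return the text of every word inside the zone, in reading order."""
    kept = []
    for word in words:
        if not zone.contains(word.box):
            continue  # outside the zone
        if exclude_anchor and anchor is not None and anchor.contains(word.box):
            continue  # part of the original result (e.g. the field label)
        kept.append(word)
    kept.sort(key=lambda w: (w.box.top, w.box.left))
    return " ".join(w.text for w in kept)


# Made-up coordinates: a label, the value beneath it, and a neighboring label.
words = [
    Word("1.", Box(10, 10, 20, 18)),
    Word("Last", Box(22, 10, 40, 18)),
    Word("Name", Box(42, 10, 65, 18)),
    Word("Cleaugh", Box(12, 22, 60, 32)),
    Word("2.", Box(210, 10, 220, 18)),
]
anchor = Box(10, 10, 65, 18)  # where the label "1. Last Name" was found
zone = zone_from_anchor(anchor, left=5, top=5, right=135, bottom=20)

print(read_zone(words, zone))                               # label and value
print(read_zone(words, zone, anchor, exclude_anchor=True))  # value only
```

In this sketch, the first call returns the label together with the value, and the second drops every word that falls inside the anchor, which mirrors what the Exclude Anchor property is described as doing.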

Auto Snap - Snapping to Lines

The OCR Reader was designed with structured forms in mind, specifically forms that use lines to distinguish one field from another, such as the three fields shown here.

Lines make it easy to distinguish the last name, first name, and middle initial fields.

The basic idea behind the OCR Reader post-processor is to first find something that identifies the field value you want to extract.

For example, the field label, "1. Last Name".

Grooper's "Auto Snap" functionality will then expand the green extraction zone to the nearest detected lines. By default, OCR Reader snaps to lines if they are present, so no further configuration is required to draw the extraction zone.

All text falling inside the extraction zone will be returned by the Data Type.

Note: You can exclude the Data Type's original result used to draw the zone using the Exclude Anchor property. This would exclude the field label "1. Last Name", returning only the last name "Cleaugh" in this case.

⚠	The lines must be detected by a Line Detection or Line Removal command during an Image Processing or Recognize activity before extracting the document. If Grooper doesn't know the lines are there, it won't be able to snap to them.
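The snapping behavior described above can be approximated with a small Python sketch. It is not Grooper's algorithm: `Box` and `snap_to_lines` are hypothetical names, each detected line is simplified to a single coordinate that spans the whole page, and falling back to the anchor's own edge when no line is found on a side is an assumption made only for this example. The detected lines are passed in as input, which reflects the requirement that line detection has to run first.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in page coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float


def snap_to_lines(anchor: Box,
                  horizontal_lines: List[float],
                  vertical_lines: List[float]) -> Box:
    """Expand the anchor to the nearest detected line on each side.

    horizontal_lines holds the y-coordinates of detected horizontal lines,
    vertical_lines holds the x-coordinates of detected vertical lines.
    A side with no line beyond it keeps the anchor's edge (an assumption).
    """
    top = max((y for y in horizontal_lines if y <= anchor.top),
              default=anchor.top)
    bottom = min((y for y in horizontal_lines if y >= anchor.bottom),
                 default=anchor.bottom)
    left = max((x for x in vertical_lines if x <= anchor.left),
               default=anchor.left)
    right = min((x for x in vertical_lines if x >= anchor.right),
                default=anchor.right)
    return Box(left, top, right, bottom)


# Made-up coordinates: the label sits in the first of several ruled boxes.
label = Box(12, 12, 80, 20)
zone = snap_to_lines(label,
                     horizontal_lines=[5, 40, 75],
                     vertical_lines=[8, 200, 400])
print(zone)

# With no detected lines there is nothing to snap to.
print(snap_to_lines(label, [], []))
```

A real implementation would also need to consider where each line starts and ends, since a short line elsewhere on the page should not bound a box it does not touch.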

FYI

Auto Snap makes configuring OCR Reader very simple as long as the anchor value is encapsulated in a box whose lines can be detected. It makes it... a snap!

However, you can also manually define the extraction zone using the Region property. You can find more info on setting this up in the How To section of this article.

OCR Reprocessing

Text inside the extraction zone can be reprocessed by a second OCR Profile. This is extremely useful on documents where the labels are easily extracted by one OCR Profile, but the values themselves are more accurately read by a different one. For example, one OCR engine may perform better on the font used to identify labels, but a second may do better at the one used for values. Grooper 2.80 and later comes installed with Transym and Tesseract OCR engines. Transym does a great job recognizing most fonts. However, it can do a poor job at recognizing the OCRA font. Tesseract has unique functionality to handle the OCRA font.

In the example below, the text reading "Wyatt" inside the extraction zone could be reprocessed by an OCR Profile using the Tesseract engine to accurately extract the name "Wyatt".

For more information, visit the Re-OCRing the Zone section of the How To tutorials in this article.
