2.80:Layered OCR (OCR Engine): Difference between revisions

Revision as of 15:18, 9 July 2020

Layered OCR enables you to run secondary OCR Profiles on a single page. The OCR results from these secondary OCR Profiles are merged with (or layered on top of) the primary OCR Profile's results.

About

You can use Layered OCR by selecting it as your OCR Engine in an OCR Profile. Layered OCR is not itself an OCR engine in the way Transym or Tesseract is. Instead, it lets you obtain OCR text from multiple OCR Profiles, each using its own OCR engine.

Certain OCR engines have advantages over others in specific cases. Transym performs well in most cases. However, it does not do well with certain specialized fonts and print types, such as MICR or handwriting, where another engine may perform better. Microsoft's Azure Computer Vision does better than most OCR engines at recognizing handwriting (but requires a license key from Microsoft). Google's Tesseract can be trained to recognize atypical fonts and print types.

FYI

Grooper ships with both Transym and Tesseract as selectable OCR engines. Furthermore, training files for the MICR, OCR-A, and OCR-B fonts are included.

For the check below, an OCR Profile using Transym performed well generally, but failed to read the MICR line at the bottom (the routing, account, and check numbers). Tesseract got the MICR line, but had issues recognizing other parts of the check.

Transym accurately reads the check's text except for the MICR line.
Tesseract has issues with other parts of the check.

With Layered OCR, you can use an OCR Profile using Transym to produce your primary, or baseline, OCR results (seen in teal), and target the MICR line with an extractor to pull in the results from an OCR Profile using Tesseract (seen in orange).

This can greatly improve your OCR results. The secondary layers can target segments of text better recognized by different OCR Profiles and merge the results with your main OCR Profile.

How It Works

Layered OCR has three basic steps.

The Main OCR Profile property establishes the primary OCR Profile. Here, you will point to a configured OCR Profile you want to use as your baseline OCR.
The Layers property allows you to use secondary OCR Profiles. Here, you will add one or more Layers, each pointing to a secondary configured OCR Profile and an Extractor.
The Extractor returns segments of text recognized by the secondary (or layer) OCR Profile. These segments replace the corresponding results from the Main OCR Profile.
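
The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Grooper's implementation: the function `layered_ocr`, the line-by-line data model, the sample strings, and the regular expression standing in for the Layer Extractor are all assumptions made for this sketch. In particular, it replaces a whole line where the real feature replaces only the matched segment.

```python
import re


def layered_ocr(main_lines, layers):
    """Merge layer results into the main OCR results (simplified sketch).

    main_lines: text lines produced by the main OCR profile.
    layers: (layer_lines, extractor_pattern) pairs, where layer_lines is the
        same page read by a secondary profile and extractor_pattern is a
        regex standing in for the Layer Extractor.
    Assumption: line N of every result is the same physical line on the page.
    """
    merged = list(main_lines)  # step 1: start from the baseline results
    for layer_lines, extractor_pattern in layers:  # step 2: each layer
        for index, layer_line in enumerate(layer_lines):
            match = re.search(extractor_pattern, layer_line)
            if match and index < len(merged):
                # Step 3: the extracted text replaces the main result.
                merged[index] = match.group(0)
    return merged


# Made-up sample data: the main profile garbles the MICR line,
# the layer profile garbles everything else.
main_result = [
    "PAY TO THE ORDER OF Jane Doe",
    "MEMO rent",
    "l:l2E4S6?8gl: OOOl2E4S6 lOOl",
]
layer_result = [
    "PAY T0 THE 0RDER 0F Jane D0e",
    "MEM0 rent",
    "A123456789A 000123456C 1001",
]
micr_pattern = r"A\d{9}A \d+C \d+"  # hypothetical pattern for a MICR line

merged = layered_ocr(main_result, [(layer_result, micr_pattern)])
```

Only the third line is taken from the layer result, because it is the only line the stand-in extractor matches. The first two lines keep the main profile's text.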

Layer Extractor Requirements and Considerations

!	There are some specific requirements for what results from a Layer Extractor can be merged with the main OCR results. The extractor MUST meet these requirements, or it will not replace the results from the Main OCR Profile.

The extractor must return a contiguous sequence of characters on a single line of text.

This is the most important consideration to keep in mind. This means you cannot merge OCR results if the extractor returns results on multiple lines. You cannot, for example, create an OCR Layer that merges the results of a full paragraph.
Data Type Collation Providers that produce results on multiple lines, such as Arrays and Ordered Arrays in Vertical Mode, are not suitable for use.
Collation Providers that can produce non-contiguous output values, such as Arrays and Ordered Arrays, can also cause problems. Results will not merge properly unless the array elements are contiguous, meaning no characters are skipped between one array element and the next. (At that point, a single regex pattern matching the text on the line is probably what you want.)
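
The two conditions above (a single line, no skipped characters) can be expressed as a small check. The sketch below is illustrative only and does not use any Grooper API: the function `is_mergeable` and the representation of extractor output as character-offset spans are assumptions. It also assumes that whitespace between adjacent array elements does not count as skipped text, which the article does not state.

```python
import re


def is_mergeable(page_text, spans):
    """Return True if extractor output could be merged (sketch).

    spans: (start, end) character offsets into page_text, in reading order,
        one per array element (or a single span for a plain match).
    """
    if not spans:
        return False
    first_start, last_end = spans[0][0], spans[-1][1]
    covered = page_text[first_start:last_end]
    # Requirement 1: everything must sit on a single line of text.
    if "\r" in covered or "\n" in covered:
        return False
    # Requirement 2: no characters skipped between consecutive elements.
    # Assumption: whitespace separating elements is tolerated.
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        if next_start < prev_end:
            return False  # overlapping or out-of-order elements
        if page_text[prev_end:next_start].strip():
            return False
    return True


text = "Routing 123456789 Account 000123456\r\nCheck 1001"
numbers = [m.span() for m in re.finditer(r"\d+", text)]
first_line_words = [m.span() for m in re.finditer(r"\S+", text.split("\r\n")[0])]

single_match = is_mergeable(text, numbers[:1])           # one value, one line
skips_text = is_mergeable(text, numbers[:2])             # skips "Account"
crosses_lines = is_mergeable(text, numbers[1:])          # spans a line break
adjacent_words = is_mergeable(text, first_line_words[:2])  # "Routing", "123456789"
```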

FuzzyRegEx and FuzzyList match modes CAN be used to correct output. This can be a great way to repair labels recognized with minor OCR errors.

Keep in mind, however, the \r\n characters at the end of a line can be swapped in a fuzzy match. Be sure to list these characters as immutable in your Fuzzy Match Weightings, using the syntax Immutable=\r\n. This will prevent unintentional matches across line breaks.
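
To see why marking the line-break characters as immutable matters, consider a plain edit-distance calculation. The sketch below is a generic Levenshtein distance in which any insertion, deletion, or substitution touching an immutable character is treated as impossible. It is not Grooper's Fuzzy Match Weightings implementation, and the function name `fuzzy_distance` is made up for this example.

```python
import math


def fuzzy_distance(pattern, text, immutable=frozenset("\r\n")):
    """Edit distance where edits touching an immutable character are forbidden.

    Returns math.inf when the only way to align the two strings would be to
    insert, delete, or substitute an immutable character.
    """
    def edit_cost(ch):
        # Cost of inserting or deleting a single character.
        return math.inf if ch in immutable else 1

    # Row for the empty pattern: cost of inserting each prefix of text.
    prev = [0.0]
    for ch in text:
        prev.append(prev[-1] + edit_cost(ch))

    for p in pattern:
        cur = [prev[0] + edit_cost(p)]
        for j, t in enumerate(text, start=1):
            if p == t:
                substitute = prev[j - 1]
            elif p in immutable or t in immutable:
                substitute = math.inf
            else:
                substitute = prev[j - 1] + 1
            delete_p = prev[j] + edit_cost(p)
            insert_t = cur[j - 1] + edit_cost(t)
            cur.append(min(substitute, delete_p, insert_t))
        prev = cur
    return prev[-1]


same_line = fuzzy_distance("Order Date", "Order Dat3")
across_break = fuzzy_distance("Order Date", "Order\r\nDate")
across_break_unprotected = fuzzy_distance(
    "Order Date", "Order\r\nDate", immutable=frozenset()
)
```

With the line-break characters protected, `across_break` is infinite, so "Order" at the end of one line and "Date" at the start of the next can never be treated as the label "Order Date". Without protection, the same pair is only two edits apart and could pass a loose fuzzy threshold.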

DO NOT use fuzzy lexicon lookups, output formats, or lexicon translation to modify output values.

The result will not replace the main OCR results if you do. The FuzzyRegEx and FuzzyList match modes are the only ways Layered OCR can modify the results of the secondary OCR Profile before merging with the main OCR Profile's results.

Use Cases

Mixed Print Types

Layered OCR shines where you need to extract text from documents that use vastly different print types. One OCR Profile may read one print type well, and another OCR Profile may read a second print type well, but neither may read both. Layered OCR gives you the best of both worlds, combining the results from both OCR Profiles into a single OCR output that is more accurate than either profile produces individually.

As we saw in the check above, Tesseract found the MICR line (using the MICR font) well, but Transym did a better job at recognizing the rest of the fonts on the document. Layered OCR allowed us to get highly accurate results using both engines. See the How To section of this article for a step-by-step guide to how this was accomplished.

Label Repair

Because the FuzzyRegEx and FuzzyList match modes are supported for Layer Extractors, Layered OCR can be used to correct OCR results before the document gets handed off to a data extraction step. For example, the label "Order Date" could be fuzzy matched against the Main OCR Profile's result "Order Dat3", swapping the "3" for an "e". This would yield the correct label "Order Date" in the layered OCR results. This makes data extraction simpler and quicker, and provides a document with more accurate searchable text data upon export.

!	Keep in mind, however, the `\r\n` characters at the end of a line can be swapped in a fuzzy match. Be sure to list these characters as immutable in your Fuzzy Match Weightings, using the syntax `Immutable=\r\n`. This will prevent unintentional matches across line breaks.
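
The "Order Dat3" example can be imitated with Python's standard `difflib` module. This is a rough sketch of the idea rather than how FuzzyRegEx or FuzzyList work internally: the function `repair_label`, the fixed-width sliding window, and the 0.85 similarity threshold are assumptions chosen for illustration. Working one line at a time means a repair can never reach across a line break.

```python
from difflib import SequenceMatcher


def repair_label(text, label, min_ratio=0.85):
    """Replace the closest near-match of label on each line with label itself.

    Simplification: only windows exactly as long as the label are compared,
    so this handles swapped characters but not missing or extra ones.
    """
    repaired_lines = []
    for line in text.split("\r\n"):  # never match across line breaks
        best_ratio, best_start = 0.0, None
        for start in range(len(line) - len(label) + 1):
            window = line[start:start + len(label)]
            ratio = SequenceMatcher(None, label, window).ratio()
            if ratio > best_ratio:
                best_ratio, best_start = ratio, start
        if best_start is not None and best_ratio >= min_ratio:
            end = best_start + len(label)
            line = line[:best_start] + label + line[end:]
        repaired_lines.append(line)
    return "\r\n".join(repaired_lines)


ocr_text = "Order Dat3: 07/09/2020\r\nTotal: 12.00"
repaired = repair_label(ocr_text, "Order Date")
```

Here "Order Dat3" shares nine of ten characters with the label, which clears the threshold, so the first line is repaired while the second line is left untouched.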

Custom Label Image Processing

One of the best tools in Grooper's toolbox for getting accurate OCR results is the ability to perform highly configurable temporary image processing prior to OCR by setting an IP Profile on an OCR Profile. We can take this further by creating an OCR layer that uses a different set of image cleanup commands to produce better results on portions of a document than the Main OCR Profile's temporary IP Profile does. See the How To section of this article for an example.

How To

Configure Layered OCR - Using two OCR Profiles with different OCR engines

Prereqs

Before you begin

For this tutorial, we will demonstrate how we merged the OCR data for a series of checks using results from two different OCR Profiles. One OCR Profile will return most of the document's text. A second will accurately return the MICR line (the routing, account, and check number). This tutorial assumes you are familiar with creating and configuring OCR Profiles. It will use two OCR Profiles as its starting point. Both are fairly basic: one uses the Transym OCR engine, and the other uses Tesseract.
Both also use a basic IP Profile to perform temporary image cleanup. It has only three IP Steps. All of them use the default property settings.
1. Binarize
2. Negative Region Removal
3. Line Removal
We've named this IP Profile "Checks - Temp".
The OCR Profile using Transym will be our Main OCR Profile, providing our baseline OCR results. We've named it "Layered OCR - Checks - Main". It simply has the OCR Engine property set to Transym OCR 4 and the IP Profile property set to our "Checks - Temp" IP Profile.
The OCR Profile using Tesseract will be our OCR Layer profile. We will target the results from the MICR line recognized by this OCR Profile and merge them with the results from the "Layered OCR - Checks - Main" OCR Profile.
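
For reference, the starting configuration described above can be summarized as plain data. The sketch below is only a notational summary written as Python dataclasses. It is not a Grooper API or export format, the class and field names are made up, and the layer profile's name is hypothetical because the tutorial does not give one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IPProfile:
    name: str
    steps: List[str]  # IP Steps, all left at default property settings


@dataclass
class OCRProfile:
    name: str
    ocr_engine: str
    ip_profile: Optional[IPProfile] = None


# Shared temporary image cleanup used by both OCR Profiles.
checks_temp = IPProfile(
    name="Checks - Temp",
    steps=["Binarize", "Negative Region Removal", "Line Removal"],
)

# Main OCR Profile: provides the baseline OCR results.
main_profile = OCRProfile(
    name="Layered OCR - Checks - Main",
    ocr_engine="Transym OCR 4",
    ip_profile=checks_temp,
)

# OCR Layer profile: reads the MICR line.
layer_profile = OCRProfile(
    name="Layered OCR - Checks - MICR",  # hypothetical name, not from the tutorial
    ocr_engine="Tesseract",
    ip_profile=checks_temp,
)
```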

Grooper Help Documentation

Version Differences

Layered OCR is a new feature added in Grooper version 2.80. Prior to this point, this functionality did not exist.

@@ Line 85: / Line 85: @@
 == How To ==
+=== Configure Layered OCR - Using two OCR Profiles with different OCR engines ===
+<tabs style="margin:20px">
+<tab name="Prereqs" style="margin:20px">
+==== Before you begin ====
+{|cellpadding=10 cellspacing=5
+|-
+|style="valign:top; width:35%"|For this tutorial, we will demonstrate how we merged the OCR data for a series of checks using results from two different '''OCR Profiles'''.  One '''OCR Profile''' will return ''most'' of the document's text.  A second will accurately return the MICR line (the routing, account, and check number).
+This tutorial assumes you are familiar with creating and configuring [[OCR Profile]]s.  It will use two '''OCR Profiles''' as its starting point.  However they are fairly basic.  One uses the Transym OCR engine.  One uses Tesseract.
+|
+[[File:Layered OCR Visual 01.png]]
+|-
+|style="width:35%"|Both also use a basic '''IP Profile''' to perform temporary image cleanup.  It has only three IP Steps.  All of them use the default property settings.
+# Binarize
+# Negative Region Removal
+# Line Removal
+We've named this '''IP Profile''' "Checks - Temp"
+|
+[[File:Layered OCR Checks 05.png]]
+|-
+|style="width:35%"|The '''OCR Profile''' using Transym will be our '''''Main OCR Profile''''', providing our baseline OCR results.
+We've named it "Layered OCR - Checks - Main".
+It simply has the '''''OCR Engine''''' property set to ''Transym OCR 4'' and the '''''IP Profile''''' property set to our "Checks - Temp" '''IP Profile'''
+|
+[[File:Layered OCR Checks 06.png]]
+|-
+|style="width:35%"|The '''OCR Profile''' using Tesseract will be our '''''OCR Layer''''' profile.  We will target the results from the MICR line recognized by this '''OCR Profile''' and merge them with the results from the "Layered - Checks - Main" '''OCR Profile'''.
+|}
+</tab>
+</tabs>
@@ Line 95: / Line 133: @@
 == Version Differences ==
-''Layered OCR'' is a new feature added in Grooper version 2.72.  Prior to this point, this functionality did not exist.
+''Layered OCR'' is a new feature added in Grooper version 2.80.  Prior to this point, this functionality did not exist.