2023:Layered OCR (OCR Engine)

Latest revision as of 16:39, 21 November 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Layered OCR allows you to perform highly targeted OCR on portions of documents. An extractor matches text in a secondary OCR Profile's text data, and the matched results are merged with the main OCR text data. This allows Grooper users to combine multiple OCR Profiles, using different OCR engines or different IP Profiles, to form a single, more accurate text flow.

You may download and import the file(s) below into your own Grooper environment (version 2023). There is a Batch with the example document(s) discussed in this tutorial, as well as a Project configured according to its instructions.

About

You can use Layered OCR by selecting it as your OCR Engine in an OCR Profile. While not itself an OCR Engine, such as Transym or Tesseract, it allows you to obtain OCR text with multiple OCR Profiles, each using their own OCR engines.

For example, certain OCR engines have advantages over others in specific cases. Transym performs well in most cases. However, it does not do well with certain specialized fonts and print types, such as MICR or handwriting. Another engine may perform better in these cases. Microsoft's Azure Computer Vision does better than most OCR engines at recognizing handwriting (but requires a licence key from Microsoft). Google's Tesseract can be trained to recognize atypical fonts and print types.

FYI

Grooper ships with both Transym and Tesseract as selectable OCR engines. Furthermore, Tesseract training files for the MICR, OCR-A, and OCR-B fonts are included.

For the check below, an OCR Profile using Transym performed well generally, but failed to read the MICR line at the bottom (the routing, account, and check numbers). Tesseract got the MICR line, but had issues recognizing other parts of the check.

Transym accurately reads the check's text except for the MICR line.
Tesseract has issues with other parts of the check.

With Layered OCR you can use an OCR Profile using Transym as your primary or baseline OCR results (seen in teal), and target the MICR line with an extractor to pull the results from an OCR Profile using Tesseract (seen in orange).

This can greatly improve your OCR results. The secondary layers can target segments of text better recognized by different OCR Profiles and merge the results with your main OCR Profile.

How It Works

Layered OCR has three basic steps.

The Main OCR Profile property establishes the primary OCR Profile. Here, you will point to a configured OCR Profile you want to use as your baseline OCR.
The Layers property allows you to use secondary OCR Profiles. Here, you will add one or more Layers, each pointing to a second configured OCR Profile and an Extractor.
The Extractor returns segments of text recognized by the secondary (or layer) OCR Profile. These segments replace the results from the Main OCR Profile at the same location.
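The three steps above can be sketched in a few lines of Python. This is only a conceptual model, not Grooper's implementation: the `Word` class, the pixel coordinates, and the `layered_ocr` function are all hypothetical names invented for illustration, and we assume (for the sketch only) that layer text replaces any main-profile words that overlap it horizontally on the same line.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    line: int   # index of the text line on the page
    left: int   # left edge of the word, in pixels
    right: int  # right edge of the word, in pixels


def words_on_line(words, line):
    """Return one line's words ordered left to right."""
    return sorted((w for w in words if w.line == line), key=lambda w: w.left)


def layered_ocr(main_words, layer_words, pattern):
    """Replace main-profile words with layer words wherever the extractor matches."""
    merged = list(main_words)
    for line in sorted({w.line for w in layer_words}):
        ordered = words_on_line(layer_words, line)
        match = re.search(pattern, " ".join(w.text for w in ordered))
        if match is None:
            continue  # no extractor result on this line: keep the main results
        # Map the match's character span back to the layer words it covers.
        hits, pos = [], 0
        for word in ordered:
            start, end = pos, pos + len(word.text)
            if start < match.end() and end > match.start():
                hits.append(word)
            pos = end + 1  # skip the joining space
        if not hits:
            continue
        left, right = hits[0].left, hits[-1].right
        # Drop main words occupying the same region, then add the layer words.
        merged = [w for w in merged
                  if not (w.line == line and w.left < right and w.right > left)]
        merged.extend(hits)
    return sorted(merged, key=lambda w: (w.line, w.left))


# Made-up results: the main profile misreads line 1, the layer misreads line 0.
main = [Word("PAY", 0, 10, 60), Word("TO", 0, 70, 100),
        Word("?88888", 1, 10, 200), Word("84?", 1, 210, 300)]
layer = [Word("P4Y", 0, 10, 60), Word("T0", 0, 70, 100),
         Word("A888888888A", 1, 10, 150), Word("8475309", 1, 160, 240),
         Word("C1958", 1, 250, 300)]

merged = layered_ocr(main, layer, r"A\d{9}A \d{7,9} C\d{4}")
merged_text = [w.text for w in merged]
```

In this sketch only the line matched by the pattern is taken from the layer. The layer's misreads on the other line never reach the merged output, and if the pattern matches nothing the main results come back unchanged.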

Layer Extractor Requirements and Considerations

⚠	There are some specific requirements for what results from a Layer Extractor can be merged with the main OCR results. The extractor MUST meet these requirements, or it will not replace the results from the Main OCR Profile.

The extractor must return a contiguous sequence of characters on a single line of text.

This is the most important consideration to keep in mind. This means you cannot merge OCR results if the extractor returns results on multiple lines. You cannot, for example, create an OCR Layer that merges the results of a full paragraph.
Data Type Collation Providers that produce results on multiple lines are not suitable for use, such as Arrays and Ordered Arrays in Vertical Mode.
Collation Providers that can produce non-contiguous output values, such as Arrays and Ordered Arrays, can also cause problems. Results will not merge properly unless the array elements are contiguous, meaning no characters are skipped between the end of one array element and the start of the next. (At that point, a single regex pattern matching the text on the line is probably what you want.)
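As a rough illustration of these two requirements, the checks below test whether a candidate result stays on one line and whether a set of array elements is contiguous. Both helper names are hypothetical, and representing array elements as (start, end) character offsets is an assumption made for the sketch; Grooper performs its own validation internally.

```python
def is_single_line(text: str) -> bool:
    """A mergeable result must not contain any line-break characters."""
    return "\r" not in text and "\n" not in text


def spans_are_contiguous(spans) -> bool:
    """Each element must start exactly where the previous one ended.

    `spans` is a list of (start, end) character offsets, end exclusive,
    ordered left to right on the line.
    """
    return all(prev_end == next_start
               for (_, prev_end), (next_start, _) in zip(spans, spans[1:]))


single_line_ok = is_single_line("A888888888A 8475309 C1958")
paragraph_ok = is_single_line("Pay to the\r\norder of")
adjacent_ok = spans_are_contiguous([(0, 11), (11, 20)])
gapped_ok = spans_are_contiguous([(0, 11), (14, 20)])
```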

FuzzyRegEx and FuzzyList match modes CAN be used to correct output. This can be a great way to repair labels recognized with minor OCR errors.

Keep in mind, however, the \r\n characters at the end of a line can be swapped in a fuzzy match. Be sure to list these characters as immutable in your Fuzzy Match Weightings, using the syntax Immutable=\r\n. This will prevent unintentional matches across line breaks.

DO NOT use fuzzy lexicon lookups, output formats, or lexicon translation to modify output values.

The result will not replace the main OCR results if you do. The FuzzyRegEx and FuzzyList match modes are the only ways Layered OCR can modify the results of the secondary OCR Profile before merging with the main OCR Profile's results.

Use Cases

Mixed Print Types

Layered OCR shines where you need to extract text from documents that use vastly different print types. One OCR Profile may recognize one print type, and another OCR Profile may recognize a second print type, but neither may recognize both. Layered OCR allows you to get the best of both worlds, combining the results from both OCR Profiles to get a single OCR output that is more accurate than either profile produces on its own.

As we saw in the check above, Tesseract found the MICR line (using the MICR font) well, but Transym did a better job at recognizing the rest of the fonts on the document. Layered OCR allowed us to get highly accurate results using both engines. See the How To section of this article for a step-by-step guide to how this was accomplished.

Label Repair

Because the FuzzyRegEx and FuzzyList match modes are supported for Layer Extractors, Layered OCR can be used to correct OCR results before the document gets handed off to a data extraction step. For example, the label "Order Date" could be fuzzy matched against the OCR result "Order Dat3", swapping the "3" for an "e". This would yield the correct label "Order Date" in the layered OCR results. This makes data extraction simpler and quicker, and provides a document with more accurate searchable text data upon export.

⚠	Keep in mind, however, the `\r\n` characters at the end of a line can be swapped in a fuzzy match. Be sure to list these characters as immutable in your Fuzzy Match Weightings, using the syntax `Immutable=\r\n`. This will prevent unintentional matches across line breaks.
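The idea behind label repair can be sketched with Python's standard `difflib` module. This is not Grooper's FuzzyRegEx algorithm; the `repair_label` function and the 0.85 similarity cutoff are assumptions chosen for illustration. Processing the text one line at a time plays the role of marking `\r\n` as immutable: a match can never span a line break.

```python
from difflib import SequenceMatcher


def repair_label(ocr_text, label, min_similarity=0.85):
    """Replace the closest near-match of `label` on each line with `label`."""
    repaired = []
    for line in ocr_text.split("\r\n"):  # never match across a line break
        best_score, best_start = 0.0, None
        # Slide a window the size of the label across the line.
        for start in range(len(line) - len(label) + 1):
            window = line[start:start + len(label)]
            score = SequenceMatcher(None, label, window).ratio()
            if score > best_score:
                best_score, best_start = score, start
        if best_start is not None and best_score >= min_similarity:
            line = line[:best_start] + label + line[best_start + len(label):]
        repaired.append(line)
    return "\r\n".join(repaired)


fixed = repair_label("Order Dat3: 01/02/2024\r\nTotal: 10.00", "Order Date")
split_label = repair_label("Order\r\nDat3 x", "Order Date")
```

Here "Order Dat3" differs from the label by one character, so it is replaced, while a label split over two lines is left alone.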

Custom Label Image Processing

One of the best tools in Grooper's toolbox to get accurate OCR results is the ability to perform highly configurable temporary image processing prior to OCR by setting an IP Profile on an OCR Profile. We can leverage this even further by creating an OCR layer that uses a different set of image cleanup commands to produce better results on portions of a document than the Main OCR Profile's temporary IP Profile. See the How To section of this article for an example.

How To

Configure Layered OCR - Using two OCR Profiles with different OCR engines

Before you begin

For this tutorial, we will demonstrate how we merged the OCR data for a series of checks using results from two different OCR Profiles. One OCR Profile will return most of the document's text. A second will accurately return the MICR line (the routing, account, and check number).

This tutorial assumes you are familiar with creating and configuring OCR Profiles, as well as temporary image processing for OCR cleanup. It will use two OCR Profiles as its starting point. However, they are fairly basic. One uses the Transym OCR engine. The other uses Tesseract.

Both also use a basic IP Profile to perform temporary image cleanup. It has only three IP Steps. All of them use the default property settings.

Binarize
Negative Region Removal
Line Removal

We've named this IP Profile "Checks - Temp"

The OCR Profile using Transym will be our Main OCR Profile, providing our baseline OCR results.

We've named it "Layered OCR - Checks - Main".

It simply has the OCR Engine property set to Transym OCR 4 and the IP Profile property set to our "Checks - Temp" IP Profile.

The OCR Profile using Tesseract will be our OCR Layer profile. We will target the results from the MICR line recognized by this OCR Profile and merge them with the results from the "Layered OCR - Checks - Main" OCR Profile.

We've named it "Layered OCR - Checks - MICR Line".

The following properties are configured:

The OCR Engine property is set to Tesseract OCR.
The IP Profile property is set to the "Checks - Temp" IP Profile we created.
The Maximum Variance property is set to 15% (We will note why later).
The Special Fonts property allows you to select any fonts you've trained Tesseract to recognize. Here, we've selected MICR. This is a trained font that ships with Grooper (Version 2.80 and later).

Set the Main OCR Profile

Create a new OCR Profile. We've named ours "Layered OCR - Checks".
Using the left property window (the Grooper OCR Settings), set the OCR Engine property to Layered OCR.
Using the right property window (the OCR Engine Settings), set the Main OCR Profile to the OCR Profile you wish to use for the baseline OCR results. We named ours "Layered OCR - Checks - Main".

Set the OCR Layer(s)

Select the Layers property and click the ellipsis button at the end.
The "OCR Layer Collection Editor" window will pop up. Press the "Add" button.
Using the OCR Profile property, select the OCR Profile you wish to use as your OCR layer, merging its results with the results of the main OCR Profile. We named ours "Layered OCR - Checks - MICR Line".

Set the Layer's Extractor

Next we need an extractor to tell the Layered OCR Profile what text data to merge from our OCR Layer with our main OCR results. This can be a simple Internal extractor or a Reference to a Data Type in the Node Tree. For our use case, a simple internal pattern will work. However, you can build out the extractor in the Node Tree and reference it, if you prefer.

Here, you can see our pattern. For illustrative purposes, we've run the OCR Layer Profile "Layered OCR - Checks - MICR Line" on this document so you can see the pattern matching the data we want to merge.

For these checks, the value pattern A\d{9}A \d{7,9} C\d{4} matches the MICR lines. In the OCR text, the MICR symbol that looks like |: is returned as an "A" and the symbol that looks like ||' is returned as a "C".
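If you want to sanity-check a value pattern like this outside of Grooper, a quick test in Python works for simple patterns. Grooper uses its own regular expression engine, so treat this only as an approximation; this pattern uses only syntax common to most engines. The `find_micr_line` helper is a hypothetical name for the sketch.

```python
import re

# Routing number between two "A" symbols, account number, then "C" + check number.
MICR_PATTERN = re.compile(r"A\d{9}A \d{7,9} C\d{4}")


def find_micr_line(ocr_text):
    """Return the first MICR-style match found on any single line, else None."""
    for line in ocr_text.splitlines():
        match = MICR_PATTERN.search(line)
        if match:
            return match.group(0)
    return None


found = find_micr_line("PAY TO THE ORDER OF\r\nA888888888A 8475309 C1958\r\n")
missing = find_micr_line("A88888888A 8475309 C1958")  # routing number one digit short
```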

Now that we know it matches the results from the OCR Layer's OCR Profile, we know this result will replace whatever text the main OCR Profile produces.

Press the "OK" button to finish setting up the OCR Layer.

And that's pretty much it as far as setup goes! We'll verify our results and take care of some housekeeping issues next.

Verify the Results

Switch over to the "OCR Testing Tab" to verify our results.
Select a page in the batch to test.
Press the "OCR Page" button.

Now we have a single document using results from both OCR Profiles. The MICR line "A888888888A 8475309 C1958" was recognized by the OCR Layer's OCR Profile, returned by the OCR Layer's Extractor, and layered on top of the Main OCR Profile's results, replacing the text in that location.

What happens if the OCR Layer's Extractor does not match the results from the OCR Layer's text results?

If the extractor does not return a result, there won't be any results to layer. Layered OCR will only merge OCR results if that extractor returns a result.

For the document seen here, the OCR Layer's OCR Profile did not accurately recognize the MICR line. As a result, its Extractor didn't match the text, and nothing was merged with the Main OCR Profile's results. The Main OCR Profile's results are retained, seen here.

Other Considerations

If you are eagle-eyed, you may have noticed the spacing between the characters in the MICR line is off. The end of the line should read "C1958" but it has some extra spaces in it, so it reads "C 19 58". This could potentially cause some problems when it comes time to extract data. However, if we compare the final layered OCR results to the OCR Layer's OCR results (the source of the MICR line's text data), we can see that the results were spaced correctly before merging.

The results before merging are spaced correctly.
The results after merging are not.

This spacing issue is even more prevalent on other documents.

So what gives? Why does our OCR Layer's source text not match up with the final merged text?

This has to do with Grooper's OCR Synthesis properties. These are a unique set of properties used to pre-process and re-process OCR results to increase their accuracy.

You may recall that, for our OCR Layer's OCR Profile (named "Layered OCR - Checks - MICR Line"), we changed the Maximum Variance property to 15%.

This is one of Grooper's Synthesis properties. This property controls the allowable horizontal size difference between characters in a fixed width font. By increasing it, we effectively told the OCR to allow differences between characters in fixed width fonts to be a little bigger than normal. The outcome is the MICR line was spaced out correctly.

The OCR Profile using Layered OCR also has Synthesis properties, just like any other OCR Profile.

By default Synthesis is enabled on all OCR Profiles, and the Maximum Variance defaults to 12%.

Essentially, what has happened is our Layered OCR Profile has re-synthesized the synthesized OCR results from our OCR Layer.

When using Layered OCR be mindful of its own Synthesis and other profile settings.

You may want to disable these settings entirely in order to retain the synthesized results from your Main OCR Profile and OCR Layer Profiles.

Configure Layered OCR - Using two OCR Profiles with different temporary IP Profiles

Before you begin

This tutorial assumes you are familiar with creating and configuring OCR Profiles, as well as temporary image processing for OCR cleanup, and have reviewed the previous tutorial.

The problem we are trying to address has to do with particularly bad results from some labels on one of these checks.

Notice, for this document, the "Date", "Check No." and "Amount" labels did not come through at all.

The issue has to do with the fact this particular font is inside a negative region (white text on a black background).

Since OCR needs to see black pixels in order to recognize characters, we inverted this region, giving us black text. However, we didn't get quite enough of the text after the transformation.

The Original Image	After Temporary IP Performed

Layered OCR allows us to use a different temporary IP Profile to perform image cleanup on these labels, giving us accurate OCR results.

A different temporary IP Profile applied	Accurate results for the labels

The temporary IP Profile used here is a modification of the one used for the rest of these checks. The main difference is its Binarize step. Instead of using the Auto Thresholding Method, this one uses Simple with the Threshold manually set to 38. This low threshold will preserve more of the white text on the document. For most of the text, this is going to be a bad thing, as the light pixels around the black text are going to degrade those characters. However, since these labels ("Date", "Check No." and "Amount") are white text inside a black border, they will actually be preserved better once we invert them with a Negative Region Removal step. And, since we are only targeting these labels in our OCR Layer, it doesn't matter what happens to the rest of the text because it won't be merged with the main text data.

The steps in this IP Profile (which we've named "Checks - Inverted Labels") are as follows.

Binarize
- Thresholding Method set to Simple
- Threshold set to 38
Negative Region Removal
Line Removal
If you want, you may add Speck Removal
- For some additional artifact cleanup around the inverted labels.
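To see why a low threshold helps with white-on-black labels, here is a small Python sketch of simple (fixed) thresholding followed by inversion. It is not Grooper's Binarize or Negative Region Removal code. The 0-255 grayscale scale, the rule that pixels at or above the threshold become white, the comparison threshold of 128, and the sample pixel values are all assumptions made for illustration.

```python
def binarize_simple(gray_rows, threshold=38):
    """Pixels at or above the threshold become white (255); the rest black (0)."""
    return [[255 if p >= threshold else 0 for p in row] for row in gray_rows]


def invert(binary_rows):
    """Swap black and white, standing in for inverting a negative region."""
    return [[255 - p for p in row] for row in binary_rows]


# One row across a white-on-black label: dark background (12, 15) and a
# stroke whose faint edges (60, 120) are dimmer than its core (200).
row = [12, 60, 120, 200, 15]

low = invert(binarize_simple([row], threshold=38))
high = invert(binarize_simple([row], threshold=128))
```

With the threshold at 38, all three stroke pixels end up black after inversion. With a mid-range threshold of 128, only the brightest pixel survives and the stroke is thinned, which is the kind of loss described above.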

Create the OCR Layer's OCR Profile

For this OCR Layer, we're just going to copy the "Layered OCR - Checks - Main" OCR Profile and change the IP Profile property.

Here we've copied the "Layered OCR - Checks - Main" OCR Profile and named it "Layered OCR - Checks - Inverted Labels".
Select the IP Profile property. Using the dropdown menu, select the "Checks - Inverted Labels" IP Profile described in the previous step.

Add a New OCR Layer

In the Node Tree, select the OCR Profile using Layered OCR
- The one we've been using is named "Layered OCR - Checks"
Select the Layers property in the right-hand property window
Press the ellipsis button at the end. This will bring up the "OCR Layer Collection Editor".

Press the "Add" button.
Select the OCR Profile property. Click the drop-down menu.
Select the "Layered OCR - Checks - Inverted Labels" OCR Profile.

Configure the OCR Layer's Extractor

Next, we need to determine which text will be merged. The OCR Layer's Extractor property allows us to use an Internal or Reference extractor to return text from the OCR Layer's OCR results and merge it with the main OCR Profile's results.

For this example a simple pattern looking for these labels will work just fine. We will use date check no\. amount for the value pattern.

We've set that pattern as an Internal extractor for this layer.
Press the "OK" button to finish configuring the OCR Layer.
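The label pattern used for this layer can be tried out the same way. In this Python sketch we assume the match is case-insensitive (the labels on the check may be printed in capitals), which is expressed here with `re.IGNORECASE`; how case is handled in your Grooper extractor depends on its own settings. Note the escaped period: `\.` matches only a literal ".", whereas an unescaped "." would match any character.

```python
import re

# Case-insensitive matching is an assumption for this sketch.
LABEL_PATTERN = re.compile(r"date check no\. amount", re.IGNORECASE)

labels_found = LABEL_PATTERN.search("DATE CHECK NO. AMOUNT") is not None
misread_found = LABEL_PATTERN.search("DATE CHECK NO, AMOUNT") is not None
```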

Verify the Results

Switch over to the "OCR Testing Tab" to verify our results.
Select a page in the batch to test.
Press the "OCR Page" button.

Now we have a single document using results from our main OCR Profile merged with the results from both OCR Layers. The MICR line "A987654321A 987654321 C1492" was merged from the results of the first OCR Layer (configured in the previous tutorial), and the "Date", "Check No." and "Amount" labels were merged from the second OCR Layer, which also uses Transym as its OCR engine, but uses a different temporary IP Profile to better target those labels.