2023:Layered OCR (OCR Engine): Difference between revisions

From Grooper Wiki
 
(18 intermediate revisions by 2 users not shown)
Line 1: Line 1:
{{AutoVersion}}
{{AutoVersion}}
[[File:Layered_OCR_Header_Image.png|right|700px]]
[[File:Layered_OCR_Header_Image.png|right|700px]]


<blockquote style="font-size:14pt">
<blockquote>{{#lst:Glossary|Layered OCR}}</blockquote>
''Layered OCR'' enables you to run secondary '''[[OCR Profile]]s''' on a single page.  The [[OCR]] results from these secondary '''OCR Profiles''' are merged with (or ''layered'' on top of) the primary '''OCR Profile's''' results.
</blockquote>


This allows you to perform highly targeted OCR on portions of documents by merging results matched by an extractor on the secondary '''OCR Profile's''' text data with the main OCR text data results.  This allows Grooper users to utilize multiple '''OCR Profiles''' using different OCR engines or different '''IP Profiles''' to form a single, more accurate, text flow.
This allows you to perform highly targeted OCR on portions of documents by merging results matched by an extractor on the secondary '''OCR Profile's''' text data with the main OCR text data results.  This allows Grooper users to utilize multiple '''OCR Profiles''' using different OCR engines or different '''IP Profiles''' to form a single, more accurate, text flow.
<!--#region About-->
<br clear=all>
== About ==
{|class="download-box"
{|cellpadding="10" cellspacing="5"  
|
|-
[[File:Asset 22@4x.png]]
|style="font-size:14pt; color:#f89420; border: 2px solid #f89420; width:40px"|[[File:Asset 22@4x.png]]
|
|style="border: 2px solid #f89420"|
You may download and import the file(s) below into your own Grooper environment (version 2023).  There is a '''Batch''' with the example document(s) discussed in this tutorial, as well as a '''Project''' configured according to its instructions.
You may download and import the file(s) below into your own Grooper environment (version 2023).  There is a '''Batch''' with the example document(s) discussed in this tutorial, as well as a '''Project''' configured according to its instructions.
* [[Media:Layered_OCR_-_Project_(v2023).zip]]
* [[Media:2023_Wiki_Layered-OCR_Project.zip]]
* [[Media:Layered_OCR_-_Batch_(v2023).zip]]
* [[Media:2023_Wiki_Layered-OCR_Batch.zip]]
|}
|}


 
<!--#region About-->
== About ==
You can use ''Layered OCR'' by selecting it as your '''''OCR Engine''''' in an '''OCR Profile'''.  While not itself an [[OCR Engine]], such as Transym or Tesseract, it allows you to obtain OCR text with multiple '''OCR Profiles''', each using their own OCR engines.
You can use ''Layered OCR'' by selecting it as your '''''OCR Engine''''' in an '''OCR Profile'''.  While not itself an [[OCR Engine]], such as Transym or Tesseract, it allows you to obtain OCR text with multiple '''OCR Profiles''', each using their own OCR engines.


 
[[File:2023_Layered-OCR_01_About_01.png]]
[[File:2023_Layered-OCR_01_About_01.png|center|1000px]]




For example, certain OCR engines have advantages over others in specific cases.  Transym performs well in most cases.  However, it does not do well with certain specialized fonts and print types, such as [https://en.wikipedia.org/wiki/Magnetic_ink_character_recognition MICR] or handwriting.  Another engine may perform better in these cases.  Microsoft's [https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/ Azure Computer Vision] does better than most OCR engines at recognizing handwriting (but requires a licence key from Microsoft).  Google's [https://en.wikipedia.org/wiki/Tesseract_(software) Tesseract] has the capability to train fonts to target atypical fonts and print types.
For example, certain OCR engines have advantages over others in specific cases.  Transym performs well in most cases.  However, it does not do well with certain specialized fonts and print types, such as [https://en.wikipedia.org/wiki/Magnetic_ink_character_recognition MICR] or handwriting.  Another engine may perform better in these cases.  Microsoft's [https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/ Azure Computer Vision] does better than most OCR engines at recognizing handwriting (but requires a licence key from Microsoft).  Google's [https://en.wikipedia.org/wiki/Tesseract_(software) Tesseract] has the capability to train fonts to target atypical fonts and print types.


{|cellpadding="10" cellspacing="5"
{|class="fyi-box"
|-style="background-color:#36b0a7; color:white"
|
|style="font-size:14pt"|'''FYI'''
'''FYI'''
|Grooper ships with both Transym and Tesseract as selectable OCR engines.  Furthermore, training files for the MICR, OCR-A, and OCR-B fonts are included.
|
Grooper ships with both Transym and Tesseract as selectable OCR engines.  Furthermore, Tesseract training files for the MICR, OCR-A, and OCR-B fonts are included.
|}
|}


Line 62: Line 61:




{|cellpadding="10" cellspacing="5"
{|class="attn-box"
|-style="background-color:#f89420; color:white"
|
|style="font-size:14pt"|'''!'''
&#9888;
|There are some specific requirements for what results from a '''''Layer Extractor''''' can be merged with the main OCR results.  The extractor MUST meet these requirements, or it will not replace the results from the '''''Main OCR Profile'''''.
|
There are some specific requirements for what results from a '''''Layer Extractor''''' can be merged with the main OCR results.  The extractor MUST meet these requirements, or it will not replace the results from the '''''Main OCR Profile'''''.
|}
|}


Line 80: Line 80:
<!--#endregion-->
<!--#endregion-->
<!--#region Use Cases-->
<!--#region Use Cases-->
== Use Cases ==
== Use Cases ==


Line 92: Line 93:
Because ''FuzzyRegEx'' and ''Fuzzy List'' match modes are supported for Layer Extractors, Layered OCR can be used to correct OCR results ''before'' the document gets handed off to a data extraction step.  For example, the label "Order Date" could be fuzzy matched against the '''''Main OCR Profile's''''' result "Order Dat3" and swap the "3" for an "e".  This would yield the correct label "Order Date" in the layered OCR results.  This will make data extraction simpler and quicker, as well as providing a document with more accurate searchable text data upon export.
Because ''FuzzyRegEx'' and ''Fuzzy List'' match modes are supported for Layer Extractors, Layered OCR can be used to correct OCR results ''before'' the document gets handed off to a data extraction step.  For example, the label "Order Date" could be fuzzy matched against the '''''Main OCR Profile's''''' result "Order Dat3" and swap the "3" for an "e".  This would yield the correct label "Order Date" in the layered OCR results.  This will make data extraction simpler and quicker, as well as providing a document with more accurate searchable text data upon export.
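The fuzzy correction described above can be illustrated outside Grooper.  The following is a minimal Python sketch, not Grooper's implementation: the function name <code>fuzzy_correct</code>, the 0.85 similarity cutoff, and the sample text are all assumptions made for illustration.  It slides a window over the main OCR text and swaps in the expected label when the window is similar enough.

```python
from difflib import SequenceMatcher


def fuzzy_correct(text: str, label: str, min_similarity: float = 0.85) -> str:
    """Replace the window of `text` most similar to `label`, if similar enough."""
    width = len(label)
    best_ratio, best_start = 0.0, -1
    # Slide a window the same length as the label across the OCR text.
    for start in range(len(text) - width + 1):
        window = text[start:start + width]
        ratio = SequenceMatcher(None, window.lower(), label.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_start = ratio, start
    if best_start < 0 or best_ratio < min_similarity:
        return text  # no close match: leave the main OCR text untouched
    return text[:best_start] + label + text[best_start + width:]


main_ocr = "Order Dat3: 01/15/2023"  # hypothetical main OCR output
corrected = fuzzy_correct(main_ocr, "Order Date")
print(corrected)
```

Text that is not close to the label is returned unchanged, which mirrors how an unmatched Layer Extractor leaves the main results alone.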


{|cellpadding="10" cellspacing="5"
{|class="attn-box"
|-style="background-color:#f89420; color:white"
|
|style="font-size:14pt"|'''!'''
&#9888;
|Keep in mind, however, the <code>\r\n</code> characters at the end of a line can be swapped in a fuzzy match.  Be sure to list these characters as immutable in your Fuzzy Match Weightings, using the syntax <code>Immutable=\r\n</code>.  This will prevent unintentional matches across line breaks.
|
Keep in mind, however, the <code>\r\n</code> characters at the end of a line can be swapped in a fuzzy match.  Be sure to list these characters as immutable in your Fuzzy Match Weightings, using the syntax <code>Immutable=\r\n</code>.  This will prevent unintentional matches across line breaks.
|}
|}
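To see why protecting <code>\r\n</code> matters, here is a small Python sketch of an edit distance in which some characters may never be inserted, deleted, or substituted.  It is an illustration of the idea only; the function <code>edit_distance</code> and its <code>immutable</code> parameter are hypothetical and do not describe how Grooper computes Fuzzy Match Weightings.

```python
def edit_distance(a: str, b: str, immutable: str = "") -> float:
    """Levenshtein distance where characters in `immutable` cannot be edited."""
    inf = float("inf")
    # prev[j] is the cost of turning the processed prefix of `a` into b[:j].
    prev = [0.0] * (len(b) + 1)
    for j in range(1, len(b) + 1):
        prev[j] = inf if b[j - 1] in immutable else prev[j - 1] + 1
    for i in range(1, len(a) + 1):
        ca = a[i - 1]
        cur = [inf if ca in immutable else prev[0] + 1]
        for j in range(1, len(b) + 1):
            cb = b[j - 1]
            if ca == cb:
                substitute = prev[j - 1]      # characters agree: no cost
            elif ca in immutable or cb in immutable:
                substitute = inf              # protected character: swap forbidden
            else:
                substitute = prev[j - 1] + 1
            delete = inf if ca in immutable else prev[j] + 1
            insert = inf if cb in immutable else cur[j - 1] + 1
            cur.append(min(substitute, delete, insert))
        prev = cur
    return prev[len(b)]


ocr_text = "Order\r\nDate"  # hypothetical text where the label is split across two lines
unprotected = edit_distance(ocr_text, "Order Date")
protected = edit_distance(ocr_text, "Order Date", immutable="\r\n")
print(unprotected, protected)
```

Without protection the two strings are only two edits apart, so a fuzzy match could span the line break.  With <code>\r\n</code> marked immutable, the distance becomes infinite and no match is possible.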


Line 104: Line 106:
<!--#region How To-->
<!--#region How To-->
== How To ==
== How To ==
You can download and import the zip file below into your Grooper (Version 2.80) environment to follow along with these tutorials.
* [[Media:Layered OCR (2.80).zip|Layered OCR (2.80).zip]]
=== Configure Layered OCR - Using two OCR Profiles with different OCR engines ===
=== Configure Layered OCR - Using two OCR Profiles with different OCR engines ===
==== Before you begin ====
For this tutorial, we will demonstrate how we merged the OCR data for a series of checks using results from two different '''OCR Profiles'''.  One '''OCR Profile''' will return ''most'' of the document's text.  A second will accurately return the MICR line (the routing, account, and check number).


{|cellpadding="10" cellspacing="5"
This tutorial assumes you are familiar with creating and configuring [[OCR Profile]]s, as well as temporary [[Image Processing (Concept)|image processing]] for OCR cleanup.  It will use two '''OCR Profiles''' as its starting point.  However they are fairly basic.  One uses the Transym OCR engine.  One uses Tesseract.
|-style="background-color:#f89420; color:white"
|style="font-size:14pt"|'''!'''
|Some of the tabs in this tutorial are longer than the others.  Please scroll to the bottom of each step's tab before going to the next step.
|}


<tabs style="margin:20px">
[[File:Layered OCR Visual 01.png]]
<tab name="Prereqs" style="margin:20px">


==== Before you begin ====


{|cellpadding=10 cellspacing=5
Both also use a basic '''IP Profile''' to perform temporary image cleanup.  It has only three IP Steps.  All of them use the default property settings.
|-
|style="valign:top; width:35%"|For this tutorial, we will demonstrate how we merged the OCR data for a series of checks using results from two different '''OCR Profiles'''.  One '''OCR Profile''' will return ''most'' of the document's text.  A second will accurately return the MICR line (the routing, account, and check number).
 
This tutorial assumes you are familiar with creating and configuring [[OCR Profile]]s, as well as temporary [[Image Processing (Concept)|image processing]] for OCR cleanup.  It will use two '''OCR Profiles''' as its starting point.  However they are fairly basic.  One uses the Transym OCR engine.  One uses Tesseract.
|
[[File:Layered OCR Visual 01.png]]
|-
|style="width:40%"|Both also use a basic '''IP Profile''' to perform temporary image cleanup.  It has only three IP Steps.  All of them use the default property settings.


# Binarize
# Binarize
Line 136: Line 122:


We've named this '''IP Profile''' "Checks - Temp".
We've named this '''IP Profile''' "Checks - Temp".
|
 
[[File:2023_Layered-OCR_02_HowTo_01.png]]
[[File:2023_Layered-OCR_02_HowTo_01.png]]
|-
 
|style="width:40%"|The '''OCR Profile''' using Transym will be our '''''Main OCR Profile''''', providing our baseline OCR results.
 
The '''OCR Profile''' using Transym will be our '''''Main OCR Profile''''', providing our baseline OCR results.


We've named it "Layered OCR - Checks - Main".
We've named it "Layered OCR - Checks - Main".


It simply has the '''''OCR Engine''''' property set to ''Transym OCR 4'' and the '''''IP Profile''''' property set to our "Checks - Temp" '''IP Profile'''.
It simply has the '''''OCR Engine''''' property set to ''Transym OCR 4'' and the '''''IP Profile''''' property set to our "Checks - Temp" '''IP Profile'''.
|
 
[[File:2023_Layered-OCR_02_HowTo_02.png]]
[[File:2023_Layered-OCR_02_HowTo_02.png]]
|-
 
|style="width:40%"|The '''OCR Profile''' using Tesseract will be our '''''OCR Layer''''' profile.  We will target the results from the MICR line recognized by this '''OCR Profile''' and merge them with the results from the "Layered OCR - Checks - Main" '''OCR Profile'''.
 
The '''OCR Profile''' using Tesseract will be our '''''OCR Layer''''' profile.  We will target the results from the MICR line recognized by this '''OCR Profile''' and merge them with the results from the "Layered OCR - Checks - Main" '''OCR Profile'''.


We've named it "Layered OCR - Checks - MICR Line"
We've named it "Layered OCR - Checks - MICR Line"
Line 157: Line 145:
# The '''''Maximum Variance''''' property is set to ''15%'' (We will note why later).
# The '''''Maximum Variance''''' property is set to ''15%'' (We will note why later).
# The '''''Special Fonts''''' property allows you to select any fonts you've trained Tesseract to recognize.  Here, we've selected ''MICR''.  This is a trained font that ships with Grooper (Version 2.80 and later)
# The '''''Special Fonts''''' property allows you to select any fonts you've trained Tesseract to recognize.  Here, we've selected ''MICR''.  This is a trained font that ships with Grooper (Version 2.80 and later)
|
 
[[File:2023_Layered-OCR_02_HowTo_03.png]]
[[File:2023_Layered-OCR_02_HowTo_03.png]]
|}
[[#Configure Layered OCR - Using two OCR Profiles with different OCR engines|Click here to go back to the top]]
</tab>
<tab name="Step 1" style="margin:20px">


==== Set the Main OCR Profile ====
==== Set the Main OCR Profile ====
{|cellpadding=10 cellspacing=5
|style="width:40%"|
# Create a new '''OCR Profile'''.  We've named ours "Layered OCR - Checks".
# Create a new '''OCR Profile'''.  We've named ours "Layered OCR - Checks".
# Using the left property window (the Grooper OCR Settings), set the '''''OCR Engine''''' property to ''Layered OCR''.
# Using the left property window (the Grooper OCR Settings), set the '''''OCR Engine''''' property to ''Layered OCR''.
# Using the right property window (the OCR Engine Settings), set the '''''Main OCR Profile''''' to the '''OCR Profile''' you wish to use for the baseline OCR results.  We named ours "Layered OCR - Checks - Main".
# Using the right property window (the OCR Engine Settings), set the '''''Main OCR Profile''''' to the '''OCR Profile''' you wish to use for the baseline OCR results.  We named ours "Layered OCR - Checks - Main".
|
 
[[File:2023_Layered-OCR_02_HowTo_04.png]]
[[File:2023_Layered-OCR_02_HowTo_04.png]]
|}
[[#Configure Layered OCR - Using two OCR Profiles with different OCR engines|Click here to go back to the top]]
</tab>
<tab name="Step 2" style="margin:20px">


==== Set the OCR Layer(s) ====
==== Set the OCR Layer(s) ====
{|cellpadding=10 cellspacing=5
|style="width:40%"|
# Select the '''''Layers''''' property and click the ellipsis button at the end.
# Select the '''''Layers''''' property and click the ellipsis button at the end.
# The "OCR Layer Collection Editor" window will pop up.  Press the "Add" button.
# The "OCR Layer Collection Editor" window will pop up.  Press the "Add" button.
# Using the '''''OCR Profile''''' property, select the '''OCR Profile''' you wish to use as your OCR layer, merging its results with the results of the main '''OCR Profile'''.  We named ours "Layered OCR - Checks - MICR Line".
# Using the '''''OCR Profile''''' property, select the '''OCR Profile''' you wish to use as your OCR layer, merging its results with the results of the main '''OCR Profile'''.  We named ours "Layered OCR - Checks - MICR Line".
|
 
[[File:2023_Layered-OCR_02_HowTo_05.png]]
[[File:2023_Layered-OCR_02_HowTo_05.png]]


|}
==== Set the Layer's Extractor ====
Next we need an extractor to tell the Layered OCR Profile what text data to merge from our OCR Layer with our main OCR results.  This can be a simple ''Internal'' extractor or a ''Reference'' to a '''Data Type''' in the Node Tree.  For our use case, a simple internal pattern will work.  However, you can build out the extractor in the Node Tree and reference it, if you prefer.


[[#Configure Layered OCR - Using two OCR Profiles with different OCR engines|Click here to go back to the top]]
[[File:2023_Layered-OCR_02_HowTo_06.png]]
</tab>
<tab name="Step 3" style="margin:20px">


==== Set the Layer's Extractor ====


{|cellpadding=10 cellspacing=5
|style="width:40%"|
Next we need an extractor to tell the Layered OCR Profile what text data to merge from our OCR Layer with our main OCR results.  This can be a simple ''Internal'' extractor or a ''Reference'' to a '''Data Type''' in the Node Tree.  For our use case, a simple internal pattern will work.  However, you can build out the extractor in the Node Tree and reference it, if you prefer.
|
[[File:2023_Layered-OCR_02_HowTo_06.png]]
|-
|
Here, you can see our pattern.  For illustrative purposes, we've run the OCR Layer Profile "Layered OCR - Checks - MICR Line" on this document so you can see the pattern matching the data we want to merge.
Here, you can see our pattern.  For illustrative purposes, we've run the OCR Layer Profile "Layered OCR - Checks - MICR Line" on this document so you can see the pattern matching the data we want to merge.


Line 211: Line 173:


Now that we know it matches the results from the OCR Layer's '''OCR Profile''', we know this result will replace whatever text the main '''OCR Profile''' produces.
Now that we know it matches the results from the OCR Layer's '''OCR Profile''', we know this result will replace whatever text the main '''OCR Profile''' produces.
|
 
[[File:2023_Layered-OCR_02_HowTo_07.png]]
[[File:2023_Layered-OCR_02_HowTo_07.png]]
|-
 
|
 
Press the "OK" button to finish setting up the OCR Layer.
Press the "OK" button to finish setting up the OCR Layer.
|
 
[[File:2023_Layered-OCR_02_HowTo_08.png]]
[[File:2023_Layered-OCR_02_HowTo_08.png]]
|}
 


And that's pretty much it as far as setup goes!  We'll verify our results and take care of some housekeeping issues next.
And that's pretty much it as far as setup goes!  We'll verify our results and take care of some housekeeping issues next.


[[#Configure Layered OCR - Using two OCR Profiles with different OCR engines|Click here to go back to the top]]
</tab>
<tab name="Step 4" style="margin:20px">
==== Verify the Results ====
==== Verify the Results ====
{|cellpadding=10 cellspacing=5
|style="width:40%"|
# Switch over to the "OCR Testing Tab" to verify our results.
# Switch over to the "OCR Testing Tab" to verify our results.
# Select a page in the batch to test.
# Select a page in the batch to test.
Line 234: Line 190:


Now we have a single document using results from both OCR Profiles.  The MICR line "A888888888A 8475309 C1958" was recognized by the OCR Layer's '''''OCR Profile''''', returned by the OCR Layer's '''''Extractor''''', and layered on top of the '''''Main OCR Profile's''''' results, replacing the text in that location.
Now we have a single document using results from both OCR Profiles.  The MICR line "A888888888A 8475309 C1958" was recognized by the OCR Layer's '''''OCR Profile''''', returned by the OCR Layer's '''''Extractor''''', and layered on top of the '''''Main OCR Profile's''''' results, replacing the text in that location.
|
 
[[File:2023_Layered-OCR_02_HowTo_09.png]]
[[File:2023_Layered-OCR_02_HowTo_09.png]]
|-
 
|
 
What happens if the OCR Layer's '''''Extractor''''' does not match the results from the OCR Layer's text results?
What happens if the OCR Layer's '''''Extractor''''' does not match the results from the OCR Layer's text results?


Line 243: Line 199:


For the document seen here, the OCR Layer's OCR Profile did not accurately recognize the MICR line, so its Extractor didn't match the text and nothing was merged.  The Main OCR Profile's results are retained, as seen here.
For the document seen here, the OCR Layer's OCR Profile did not accurately recognize the MICR line, so its Extractor didn't match the text and nothing was merged.  The Main OCR Profile's results are retained, as seen here.
|
 
[[File:2023_Layered-OCR_02_HowTo_10.png]]
[[File:2023_Layered-OCR_02_HowTo_10.png]]
|}
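The merge behavior verified above (a matched layer result replaces the main text at the same location, and an unmatched layer changes nothing) can be sketched in a few lines of Python.  This is a simplified model under stated assumptions, not Grooper's implementation: the <code>Word</code> class, the bounding-box overlap test, the coordinates, and the misread main OCR value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    left: int
    top: int
    right: int
    bottom: int


def overlaps(a: Word, b: Word) -> bool:
    """True when the two bounding boxes share any area."""
    return a.left < b.right and b.left < a.right and a.top < b.bottom and b.top < a.bottom


def merge_layer(main: list[Word], layer_matches: list[Word]) -> list[Word]:
    """Layer extractor matches replace main words found at the same location."""
    kept = [w for w in main if not any(overlaps(w, m) for m in layer_matches)]
    merged = kept + layer_matches
    # Restore a rough reading order: top to bottom, then left to right.
    return sorted(merged, key=lambda w: (w.top, w.left))


main_words = [
    Word("PAY", 10, 10, 60, 30),
    Word("A8888?8888A", 10, 200, 300, 220),  # hypothetical misread of the MICR line
]
layer_matches = [Word("A888888888A", 12, 201, 298, 219)]  # returned by the layer's extractor

merged = merge_layer(main_words, layer_matches)
print([w.text for w in merged])
```

Passing an empty list of layer matches returns the main words unchanged, which corresponds to the second document above, where the extractor found nothing to merge.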


==== Other Considerations ====
==== Other Considerations ====
If you are eagle-eyed, you may have noticed the spacing between the characters in the MICR line is off.  The end of the line should read "C1958", but it has some extra spaces in it, so it reads "C 19 58".  This could cause problems when it comes time to extract data.  However, if we compare the final layered OCR results to the OCR Layer's own OCR results (the source of the MICR line's text data), we can see the text was spaced correctly before the results were merged.
If you are eagle-eyed, you may have noticed the spacing between the characters in the MICR line is off.  The end of the line should read "C1958", but it has some extra spaces in it, so it reads "C 19 58".  This could cause problems when it comes time to extract data.  However, if we compare the final layered OCR results to the OCR Layer's own OCR results (the source of the MICR line's text data), we can see the text was spaced correctly before the results were merged.


Line 262: Line 216:




{|cellpadding=10 cellspacing=5
|style="width:40%"|
So what gives?  Why does our OCR Layer's source text not match up with the final merged text?
So what gives?  Why does our OCR Layer's source text not match up with the final merged text?


Line 273: Line 225:


This is one of Grooper's Synthesis properties.  This property controls the allowable horizontal size difference between characters in a fixed width font.  By increasing it, we effectively told the OCR to allow differences between characters in fixed width fonts to be a little bigger than normal.  The outcome is the MICR line was spaced out correctly.
This is one of Grooper's Synthesis properties.  This property controls the allowable horizontal size difference between characters in a fixed width font.  By increasing it, we effectively told the OCR to allow differences between characters in fixed width fonts to be a little bigger than normal.  The outcome is the MICR line was spaced out correctly.
|
 
[[File:2023_Layered-OCR_02_HowTo_11.png]]
[[File:2023_Layered-OCR_02_HowTo_11.png]]
|-
 
|
 
The '''OCR Profile''' using ''Layered OCR'' also has '''''Synthesis''''' properties, just like any other '''OCR Profile'''.
The '''OCR Profile''' using ''Layered OCR'' also has '''''Synthesis''''' properties, just like any other '''OCR Profile'''.


Line 282: Line 234:


Essentially, what has happened is our Layered OCR Profile has ''re-synthesized'' the synthesized OCR results from our OCR Layer.
Essentially, what has happened is our Layered OCR Profile has ''re-synthesized'' the synthesized OCR results from our OCR Layer.
|
 
[[File:2023_Layered-OCR_02_HowTo_12.png]]
[[File:2023_Layered-OCR_02_HowTo_12.png]]
|-
 
|When using ''Layered OCR'' be mindful of its own '''''Synthesis''''' and other profile settings.
 
When using ''Layered OCR'' be mindful of its own '''''Synthesis''''' and other profile settings.


You may want to disable these settings entirely in order to retain the synthesized results from your Main OCR Profile and OCR Layer Profiles.
You may want to disable these settings entirely in order to retain the synthesized results from your Main OCR Profile and OCR Layer Profiles.
|
 
[[File:2023_Layered-OCR_02_HowTo_13.png]]
[[File:2023_Layered-OCR_02_HowTo_13.png]]
|}
</tab>
</tabs>


=== Configure Layered OCR - Using two OCR Profiles with different temporary IP Profiles ===
=== Configure Layered OCR - Using two OCR Profiles with different temporary IP Profiles ===


{|cellpadding="10" cellspacing="5"
{|class="attn-box"
|-style="background-color:#f89420; color:white"
|
|style="font-size:14pt"|'''!'''
&#9888;
|Some of the tabs in this tutorial are longer than the others.  Please scroll to the bottom of each step's tab before going to the next step.
|
Some of the tabs in this tutorial are longer than the others.  Please scroll to the bottom of each step's tab before going to the next step.
|}
|}


<tabs style="margin:20px">
<tab name="Prereqs" style="margin:20px">
==== Before you begin ====
==== Before you begin ====
{|cellspacing=10 cellpadding=5
|style="width:40%"|
This tutorial assumes you are familiar with creating and configuring [[OCR Profile]]s, as well as temporary [[Image Processing (Concept)|image processing]] for OCR cleanup, and have reviewed the previous tutorial.
This tutorial assumes you are familiar with creating and configuring [[OCR Profile]]s, as well as temporary [[Image Processing (Concept)|image processing]] for OCR cleanup, and have reviewed the previous tutorial.




Line 318: Line 260:


Notice, for this document, the "Date", "Check No.", and "Amount" labels did not come through at all.
Notice, for this document, the "Date", "Check No.", and "Amount" labels did not come through at all.
|
 
[[File:2023_Layered-OCR_02_HowTo_14.png]]
[[File:2023_Layered-OCR_02_HowTo_14.png]]
|-
 
|
 
The issue has to do with the fact this particular font is inside a negative region (white text on a black background).
The issue has to do with the fact this particular font is inside a negative region (white text on a black background).


Line 340: Line 282:
|[[File:Layered OCR Label IP 03.png]]||[[File:Layered OCR Label IP 04.png]]
|[[File:Layered OCR Label IP 03.png]]||[[File:Layered OCR Label IP 04.png]]
|}
|}
|-
 
|style="width:40%"|
 
The temporary '''IP Profile''' used here is a modification of the one used for the rest of these checks.  The main difference is its '''Binarize''' step.  Instead of using the ''Auto'' '''''Thresholding Method''''', this one uses ''Simple'' with the '''''Threshold''''' manually set to ''38''.  This low threshold will preserve more of the white text on the document.  For most of the text, this is going to be a bad thing, as the light pixels around the black text are going to degrade those characters.  However, since these labels ("Date", "Check No.", and "Amount") are white text inside a black border, they will actually be preserved ''better'' once we invert them with a '''Negative Region Removal''' step.  And, since we are only targeting these labels in our OCR Layer, it doesn't matter what happens to the rest of the text because it ''won't'' be merged with the main text data.
The temporary '''IP Profile''' used here is a modification of the one used for the rest of these checks.  The main difference is its '''Binarize''' step.  Instead of using the ''Auto'' '''''Thresholding Method''''', this one uses ''Simple'' with the '''''Threshold''''' manually set to ''38''.  This low threshold will preserve more of the white text on the document.  For most of the text, this is going to be a bad thing, as the light pixels around the black text are going to degrade those characters.  However, since these labels ("Date", "Check No.", and "Amount") are white text inside a black border, they will actually be preserved ''better'' once we invert them with a '''Negative Region Removal''' step.  And, since we are only targeting these labels in our OCR Layer, it doesn't matter what happens to the rest of the text because it ''won't'' be merged with the main text data.


Line 353: Line 295:
# If you want, you may add Speck Removal
# If you want, you may add Speck Removal
#* For some additional artifact cleanup around the inverted labels.
#* For some additional artifact cleanup around the inverted labels.
|
 
[[File:2023_Layered-OCR_02_HowTo_15.png]]
[[File:2023_Layered-OCR_02_HowTo_15.png]]
|}
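The effect of the low ''Simple'' threshold followed by inversion can be approximated with a toy example.  The Python sketch below is an assumption-laden illustration, not Grooper's '''Binarize''' or '''Negative Region Removal''' code: it assumes 8-bit grayscale pixels, treats any pixel at or above the threshold as white, and uses made-up pixel values.

```python
def binarize(gray: list[list[int]], threshold: int) -> list[list[int]]:
    """Simple thresholding: pixels at or above the threshold become white (255)."""
    return [[255 if px >= threshold else 0 for px in row] for row in gray]


def invert(binary: list[list[int]]) -> list[list[int]]:
    """Swap black and white, as would happen inside a detected negative region."""
    return [[255 - px for px in row] for row in binary]


# Hypothetical grayscale strip: faint light strokes (60) on a dark box (20).
label_region = [
    [20, 20, 60, 60, 20],
    [20, 60, 20, 60, 20],
]

low = invert(binarize(label_region, 38))    # faint strokes survive as black text
high = invert(binarize(label_region, 128))  # strokes fall below the threshold and vanish
print(low)
print(high)
```

With the threshold at 38, the faint strokes are kept and end up as black text on white after inversion.  With a mid-range threshold such as 128, the same strokes are lost entirely, which is the failure seen in the main '''OCR Profile's''' results for these labels.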


[[#Configure Layered OCR - Using two OCR Profiles with different temporary IP Profiles|Click here to go back to the top]]
</tab>
<tab name="Step 1" style="margin:20px">
==== Create the OCR Layer's OCR Profile ====
==== Create the OCR Layer's OCR Profile ====
{|cellpadding=10 cellspacing=5
|style="width:40%"|
For this OCR Layer, we're just going to copy the "Layered OCR - Checks - Main" '''OCR Profile''' and change the '''''IP Profile''''' property.
For this OCR Layer, we're just going to copy the "Layered OCR - Checks - Main" '''OCR Profile''' and change the '''''IP Profile''''' property.


# Here we've copied the "Layered OCR - Checks - Main" '''OCR Profile''' and named it "Layered OCR - Checks - Inverted Labels".
# Here we've copied the "Layered OCR - Checks - Main" '''OCR Profile''' and named it "Layered OCR - Checks - Inverted Labels".
# Select the '''''IP Profile''''' property.  Using the dropdown menu, select the '''IP Profile''' referred to in the previous step named "Checks - Inverted Labels"
# Select the '''''IP Profile''''' property.  Using the dropdown menu, select the '''IP Profile''' referred to in the previous step named "Checks - Inverted Labels"
|
 
[[File:2023_Layered-OCR_02_HowTo_16.png]]
[[File:2023_Layered-OCR_02_HowTo_16.png]]
|}


[[#Configure Layered OCR - Using two OCR Profiles with different temporary IP Profiles|Click here to go back to the top]]
</tab>
<tab name="Step 2" style="margin:20px">
==== Add a New OCR Layer ====
==== Add a New OCR Layer ====
{|cellpadding=10 cellspacing=5
|style="width:40%"|
# In the Node Tree, select the '''OCR Profile''' using ''Layered OCR''
# In the Node Tree, select the '''OCR Profile''' using ''Layered OCR''
#* The one we've been using is named "Layered OCR - Checks"
#* The one we've been using is named "Layered OCR - Checks"
# Select the '''''Layers''''' property in the right-hand property window
# Select the '''''Layers''''' property in the right-hand property window
# Press the ellipsis button at the end.  This will bring up the "OCR Layer Collection Editor".
# Press the ellipsis button at the end.  This will bring up the "OCR Layer Collection Editor".
|
 
[[File:2023_Layered-OCR_02_HowTo_17.png]]
[[File:2023_Layered-OCR_02_HowTo_17.png]]
|-
 
|
 
# Press the "Add" button.
# Press the "Add" button.
# Select the '''''OCR Profile''''' property.  Click the drop-down menu.
# Select the '''''OCR Profile''''' property.  Click the drop-down menu.
# Select the "Layered OCR - Checks - Inverted Labels" OCR Profile.
# Select the "Layered OCR - Checks - Inverted Labels" OCR Profile.
|
 
[[File:2023_Layered-OCR_02_HowTo_18.png]]
[[File:2023_Layered-OCR_02_HowTo_18.png]]
|}


[[#Configure Layered OCR - Using two OCR Profiles with different temporary IP Profiles|Click here to go back to the top]]
</tab>
<tab name="Step 3" style="margin:20px">
==== Configure the OCR Layer's Extractor ====
==== Configure the OCR Layer's Extractor ====
Next, we need to determine which text will be merged.  The OCR Layer's '''''Extractor''''' property allows us to use an ''Internal'' or ''Reference'' extractor to return text from the OCR Layer's OCR results and merge it with the main '''OCR Profile's''' results.


{|cellpadding=10 cellspacing=5
|
Next, we need to determine which text will be merged.  The OCR Layer's '''''Extractor''''' property allows us to use an ''Internal'' or ''Reference'' extractor to return text from the OCR Layer's OCR results and merge it with the main '''OCR Profile's''' results.


For this example a simple pattern looking for these labels will work just fine.  We will use <code>date check no\. amount</code> for the value pattern.
For this example a simple pattern looking for these labels will work just fine.  We will use <code>date check no\. amount</code> for the value pattern.
|
 
[[File:2023_Layered-OCR_02_HowTo_19.png]]
[[File:2023_Layered-OCR_02_HowTo_19.png]]
|-
 
|
 
# We've set that pattern as an ''Internal'' extractor for this layer.
# We've set that pattern as an ''Internal'' extractor for this layer.
# Press the "OK" button to finish configuring the OCR Layer.
# Press the "OK" button to finish configuring the OCR Layer.
|
 
[[File:2023_Layered-OCR_02_HowTo_20.png]]
[[File:2023_Layered-OCR_02_HowTo_20.png]]
|}
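The value pattern used in this step can be tried outside Grooper with any regular expression tool.  The Python sketch below is only an illustration: the sample text is hypothetical, and the <code>re.IGNORECASE</code> flag is an assumption made here so the lowercase pattern can match the capitalized labels.

```python
import re

# The value pattern from this step. IGNORECASE is an assumption made for this sketch.
LABELS = re.compile(r"date check no\. amount", re.IGNORECASE)

# Hypothetical OCR text from the "Inverted Labels" layer.
layer_text = "Date Check No. Amount\n01/15/2023 1958 $100.00"

match = LABELS.search(layer_text)
if match:
    print("Text to merge:", match.group(0))

# The escaped period must be present, so a label read without it does not match.
missing_period = LABELS.search("Date Check No Amount")
print(missing_period)
```

Because <code>\.</code> matches only a literal period, OCR text that drops the period after "No" produces no match, and in that case nothing from this layer would be merged.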


[[#Configure Layered OCR - Using two OCR Profiles with different temporary IP Profiles|Click here to go back to the top]]
</tab>
<tab name="Step 4" style="margin:20px">
==== Verify the Results ====
==== Verify the Results ====
{|cellpadding=10 cellspacing=5
|style="width:40%"|
# Switch over to the "OCR Testing Tab" to verify our results.
# Switch over to the "OCR Testing Tab" to verify our results.
# Select a page in the batch to test.
# Select a page in the batch to test.
Line 426: Line 341:


Now we have a single document using results from our main '''OCR Profile''' merged with the results from ''both'' OCR Layers.  The MICR line "A987654321A 987654321 C1492" was merged from the results of the first OCR Layer (configured in the previous tutorial), and the "Date", "Check No.", and "Amount" labels were merged from the second OCR Layer, which also uses Transym as its OCR engine but uses a different temporary '''IP Profile''' to better target those labels.
Now we have a single document using results from our main '''OCR Profile''' merged with the results from ''both'' OCR Layers.  The MICR line "A987654321A 987654321 C1492" was merged from the results of the first OCR Layer (configured in the previous tutorial), and the "Date", "Check No.", and "Amount" labels were merged from the second OCR Layer, which also uses Transym as its OCR engine but uses a different temporary '''IP Profile''' to better target those labels.
|
 
[[File:2023_Layered-OCR_02_HowTo_21.png]]
|}
</tab>
</tabs>
<!--#endregion-->
== Grooper Help Documentation ==
* [http://grooper.bisok.com/Documentation/2.80/Main/HTML5/index.htm#t=Object_Reference%2FOCR_Engine%2FLayered_OCR.htm Layered OCR]
* [http://grooper.bisok.com/Documentation/2.80/Main/HTML5/index.htm#t=Object_Reference%2FOther%2FOCR_Layer.htm OCR Layer]
[[Category:Articles]]
[[Category:Version 2023]]

Latest revision as of 16:39, 21 November 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


This allows you to perform highly targeted OCR on portions of documents by merging results matched by an extractor on the secondary OCR Profile's text data with the main OCR text data results. This allows Grooper users to utilize multiple OCR Profiles using different OCR engines or different IP Profiles to form a single, more accurate, text flow.

You may download and import the file(s) below into your own Grooper environment (version 2023). There is a Batch with the example document(s) discussed in this tutorial, as well as a Project configured according to its instructions.

About

You can use Layered OCR by selecting it as your OCR Engine in an OCR Profile. While not itself an OCR Engine, such as Transym or Tesseract, it allows you to obtain OCR text with multiple OCR Profiles, each using their own OCR engines.


For example, certain OCR engines have advantages over others in specific cases. Transym performs well in most cases. However, it does not do well with certain specialized fonts and print types, such as MICR or handwriting. Another engine may perform better in these cases. Microsoft's Azure Computer Vision does better than most OCR engines at recognizing handwriting (but requires a license key from Microsoft). Google's Tesseract can be trained to recognize atypical fonts and print types.

FYI

Grooper ships with both Transym and Tesseract as selectable OCR engines. Furthermore, Tesseract training files for the MICR, OCR-A, and OCR-B fonts are included.


For the check below, an OCR Profile using Transym performed well generally, but failed to read the MICR line at the bottom (the routing, account, and check numbers). Tesseract got the MICR line, but had issues recognizing other parts of the check.
Transym accurately reads the check's text except for the MICR line.
Tesseract has issues with other parts of the check.
With Layered OCR you can use an OCR Profile using Transym as your primary or baseline OCR results (seen in teal), and target the MICR line with an extractor to pull the results from an OCR Profile using Tesseract (seen in orange).


This can greatly improve your OCR results. The secondary layers can target segments of text better recognized by different OCR Profiles and merge the results with your main OCR Profile.

How It Works

Layered OCR has three basic steps.

  1. The Main OCR Profile property establishes the primary OCR Profile. Here, you will point to a configured OCR Profile you want to use as your baseline OCR.
  2. The Layers property allows you to use secondary OCR Profiles. Here, you will add one or more Layers, each pointing to a second configured OCR Profile and an Extractor.
  3. The Extractor returns segments of text recognized by the secondary (or layer) OCR Profiles, and replaces the results from the Main OCR Profile.
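
The three steps above can be modeled roughly in Python. This is only an illustrative sketch, not Grooper's implementation: the `OcrLine` type, the `merge_layer` function, the `tolerance` parameter, and the idea of pairing lines by vertical position are all assumptions made for the example. It also replaces a whole line rather than just the matched region, which is a simplification.

```python
import re
from dataclasses import dataclass


@dataclass
class OcrLine:
    top: int   # hypothetical vertical position of the line on the page
    text: str  # text recognized on that line


def merge_layer(main, layer, pattern, tolerance=5):
    """Layer extractor matches from `layer` on top of the `main` results."""
    regex = re.compile(pattern)
    # Work on a copy so the main profile's results are left untouched.
    merged = [OcrLine(line.top, line.text) for line in main]
    for layer_line in layer:
        match = regex.search(layer_line.text)
        if match is None:
            continue  # no extractor result: nothing to layer for this line
        for line in merged:
            if abs(line.top - layer_line.top) <= tolerance:
                line.text = match.group(0)
    return merged


# Made-up OCR output: the main profile garbles the MICR line, the layer reads it.
main = [OcrLine(100, "Pay to the order of"), OcrLine(400, "|: 8B8B8 8A ??")]
layer = [OcrLine(101, "Pav t0 the ordcr"), OcrLine(402, "A888888888A 8475309 C1958")]
result = merge_layer(main, layer, r"A\d{9}A \d{7,9} C\d{4}")
```

Only the line the extractor matched is taken from the layer. The layer's poorer reading of the first line is ignored, because the pattern does not match it.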

Layer Extractor Requirements and Considerations

There are some specific requirements for what results from a Layer Extractor can be merged with the main OCR results. The extractor MUST meet these requirements, or it will not replace the results from the Main OCR Profile.

The extractor must return a contiguous sequence of characters on a single line of text.

  • This is the most important consideration to keep in mind. This means you cannot merge OCR results if the extractor returns results on multiple lines. You cannot, for example, create an OCR Layer that merges the results of a full paragraph.
  • Data Type Collation Providers that produce results on multiple lines are not suitable for use, such as Arrays and Ordered Arrays in Vertical Mode.
  • Collation Providers that can produce non-contiguous output values, such as Arrays and Ordered Arrays, can also cause problems. Results will not merge properly unless the array elements are contiguous, meaning no characters are skipped between one array element and the next. (At that point, a single regex pattern matching the text on the line is probably what you want.)

FuzzyRegEx and FuzzyList match modes CAN be used to correct output. This can be a great way to repair labels recognized with minor OCR errors.

  • Keep in mind, however, the \r\n characters at the end of a line can be swapped in a fuzzy match. Be sure to list these characters as immutable in your Fuzzy Match Weightings, using the syntax Immutable=\r\n. This will prevent unintentional matches across line breaks.

DO NOT use fuzzy lexicon lookups, output formats, or lexicon translation to modify output values.

  • The result will not replace the main OCR results if you do. The FuzzyRegEx and FuzzyList match modes are the only ways Layered OCR can modify the results of the secondary OCR Profile before merging with the main OCR Profile's results.
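
The single-line requirement can be checked with a few lines of Python. This is a hedged sketch: `mergeable` is a hypothetical helper, the sample page text is made up, and the check covers only the "one line" rule, not contiguity between array elements.

```python
import re


def mergeable(match_text: str) -> bool:
    """A result qualifies only if it contains no line-break characters."""
    return "\r" not in match_text and "\n" not in match_text


page = "Order Date\r\n01/02/2023\r\nTotal Due 15.00"

# Stays on one line, so it could be merged.
single = re.search(r"Total Due \d+\.\d{2}", page).group(0)

# \s+ also matches \r\n, so this result spans two lines and could not be merged.
multi = re.search(r"Order Date\s+\d{2}/\d{2}/\d{4}", page).group(0)
```

Here `mergeable(single)` is true and `mergeable(multi)` is false, which mirrors why a pattern that crosses a line break is unsuitable for an OCR Layer.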

Use Cases

Mixed Print Types

Layered OCR shines where you need to extract text from documents that use vastly different print types. One OCR Profile may find one print type, and another OCR Profile may find a second print type, but neither may find both. Layered OCR allows you to get the best out of both worlds, combining the results from both OCR Profiles to get a single OCR output that is more accurate than either of the profiles individually.

As we saw in the check above, Tesseract found the MICR line (using the MICR font) well, but Transym did a better job at recognizing the rest of the fonts on the document. Layered OCR allowed us to get highly accurate results using both engines. See the How To section of this article for a step-by-step guide to how this was accomplished.

Label Repair

Because FuzzyRegEx and FuzzyList match modes are supported for Layer Extractors, Layered OCR can be used to correct OCR results before the document gets handed off to a data extraction step. For example, the label "Order Date" could be fuzzy matched against the Main OCR Profile's result "Order Dat3", swapping the "3" for an "e". This would yield the correct label "Order Date" in the layered OCR results. This makes data extraction simpler and quicker, and provides a document with more accurate searchable text data upon export.

Keep in mind, however, the \r\n characters at the end of a line can be swapped in a fuzzy match. Be sure to list these characters as immutable in your Fuzzy Match Weightings, using the syntax Immutable=\r\n. This will prevent unintentional matches across line breaks.
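
The idea behind label repair can be illustrated with Python's standard `difflib` module. This sketch is not Grooper's FuzzyRegEx: `repair_label`, the `min_similarity` threshold, and the sliding-window approach are assumptions chosen for the example, and skipping windows that contain line breaks only loosely imitates the effect of `Immutable=\r\n`.

```python
from difflib import SequenceMatcher


def repair_label(line: str, label: str, min_similarity: float = 0.85) -> str:
    """Replace the window of `line` most similar to `label`, if similar enough."""
    width = len(label)
    best_start, best_score = -1, 0.0
    for start in range(0, max(len(line) - width, 0) + 1):
        window = line[start:start + width]
        if "\r" in window or "\n" in window:
            continue  # never let a repair swallow a line break
        score = SequenceMatcher(None, window.lower(), label.lower()).ratio()
        if score > best_score:
            best_start, best_score = start, score
    if best_score >= min_similarity:
        return line[:best_start] + label + line[best_start + width:]
    return line  # nothing close enough: leave the text alone


repaired = repair_label("Order Dat3: 01/02/2023", "Order Date")
untouched = repair_label("Invoice Total", "Order Date")
```

The first call corrects the misread "3", while the second leaves unrelated text unchanged because no window is similar enough to the label.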

Custom Label Image Processing

One of the best tools in Grooper's toolbox to get accurate OCR results is the ability to perform highly configurable temporary image processing prior to OCR by setting an IP Profile on an OCR Profile. We can leverage this even further by creating an OCR layer that uses a different set of image cleanup commands to produce better results on portions of a document than the Main OCR Profile's temporary IP Profile. See the How To section of this article for an example.

How To

Configure Layered OCR - Using two OCR Profiles with different OCR engines

Before you begin

For this tutorial, we will demonstrate how we merged the OCR data for a series of checks using results from two different OCR Profiles. One OCR Profile will return most of the document's text. A second will accurately return the MICR line (the routing, account, and check number).

This tutorial assumes you are familiar with creating and configuring OCR Profiles, as well as temporary image processing for OCR cleanup. It will use two OCR Profiles as its starting point. However, they are fairly basic. One uses the Transym OCR engine. The other uses Tesseract.


Both also use a basic IP Profile to perform temporary image cleanup. It has only three IP Steps. All of them use the default property settings.

  1. Binarize
  2. Negative Region Removal
  3. Line Removal

We've named this IP Profile "Checks - Temp"


The OCR Profile using Transym will be our Main OCR Profile, providing our baseline OCR results.

We've named it "Layered OCR - Checks - Main".

It simply has the OCR Engine property set to Transym OCR 4 and the IP Profile property set to our "Checks - Temp" IP Profile.


The OCR Profile using Tesseract will be our OCR Layer profile. We will target the results from the MICR line recognized by this OCR Profile and merge them with the results from the "Layered OCR - Checks - Main" OCR Profile.

We've named it "Layered OCR - Checks - MICR Line"

The following properties are configured:

  1. The OCR Engine property is set to Tesseract OCR.
  2. The IP Profile property is set to the "Checks - Temp" IP Profile we created.
  3. The Maximum Variance property is set to 15% (We will note why later).
  4. The Special Fonts property allows you to select any fonts you've trained Tesseract to recognize. Here, we've selected MICR. This is a trained font that ships with Grooper (Version 2.80 and later).

Set the Main OCR Profile

  1. Create a new OCR Profile. We've named ours "Layered OCR - Checks".
  2. Using the left property window (the Grooper OCR Settings), set the OCR Engine property to Layered OCR.
  3. Using the right property window (the OCR Engine Settings), set the Main OCR Profile to the OCR Profile you wish to use for the baseline OCR results. We named ours "Layered OCR - Checks - Main".

Set the OCR Layer(s)

  1. Select the Layers property and click the ellipsis button at the end.
  2. The "OCR Layer Collection Editor" window will pop up. Press the "Add" button.
  3. Using the OCR Profile property, select the OCR Profile you wish to use as your OCR layer, merging its results with the results of the main OCR Profile. We named ours "Layered OCR - Checks - MICR Line".

Set the Layer's Extractor

Next we need an extractor to tell the Layered OCR Profile what text data to merge from our OCR Layer with our main OCR results. This can be a simple Internal extractor or a Reference to a Data Type in the Node Tree. For our use case, a simple internal pattern will work. However, you can build out the extractor in the Node Tree and reference it, if you prefer.


Here, you can see our pattern. For illustrative purposes, we've run the OCR Layer Profile "Layered OCR - Checks - MICR Line" on this document so you can see the pattern matching the data we want to merge.

For these "checks" the value pattern A\d{9}A \d{7,9} C\d{4} matches these MICR lines. The weird |: character translates to an "A" and the ||' character translates to a "C".

Now that we know it matches the results from the OCR Layer's OCR Profile, we know this result will replace whatever text the main OCR Profile produces.
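
The value pattern can be sanity-checked outside Grooper. The sketch below assumes the pattern behaves the same in Python's `re` module as it does in Grooper's pattern editor. It uses the two MICR lines quoted in this article plus the badly spaced variant discussed under Other Considerations.

```python
import re

MICR_PATTERN = re.compile(r"A\d{9}A \d{7,9} C\d{4}")

samples = [
    "A888888888A 8475309 C1958",    # MICR line from this tutorial
    "A987654321A 987654321 C1492",  # MICR line from the second tutorial
    "A888888888A 8475309 C 19 58",  # extra spaces in the check number
]

# True where the whole line is a valid match for the pattern.
matches = [bool(MICR_PATTERN.fullmatch(sample)) for sample in samples]
```

The first two samples match and the third does not, which shows how stray spaces in a MICR line would defeat this pattern.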


Press the "OK" button to finish setting up the OCR Layer.


And that's pretty much it as far as setup goes! We'll verify our results and take care of some housekeeping issues next.

Verify the Results

  1. Switch over to the "OCR Testing Tab" to verify our results.
  2. Select a page in the batch to test.
  3. Press the "OCR Page" button.

Now we have a single document using results from both OCR Profiles. The MICR line "A888888888A 8475309 C1958" was recognized by the OCR Layer's OCR Profile, returned by the OCR Layer's Extractor, and layered on top of the Main OCR Profile's results, replacing the text in that location.


What happens if the OCR Layer's Extractor does not match the results from the OCR Layer's text results?

If the extractor does not return a result, there won't be any results to layer. Layered OCR will only merge OCR results if that extractor returns a result.

For the document seen here, the OCR Layer's OCR Profile did not accurately recognize the MICR line, so its Extractor didn't match the text and nothing was merged. The Main OCR Profile's results are retained, as seen here.

Other Considerations

If you are eagle-eyed, you may have noticed the spacing between the characters in the MICR line is off. The end of the line should read "C1958" but it has some extra spaces in it so it reads "C 19 58". This could potentially cause some problems when it comes time to extract data. However, if we compare the final layered OCR results to the OCR Layer's OCR results (the source of the MICR line's text data), we can see that the text was spaced correctly before the merge.

The results before merging are spaced correctly. The results after merging are not.
This spacing issue is even more prevalent on other documents.


So what gives? Why does our OCR Layer's source text not match up with the final merged text?


This has to do with Grooper's OCR Synthesis properties. These are a unique set of properties used to pre-process and re-process OCR results to increase their accuracy.

You may recall that on our OCR Layer's OCR Profile (named "Layered OCR - Checks - MICR Line"), we changed the Maximum Variance property to 15%.

This is one of Grooper's Synthesis properties. This property controls the allowable horizontal size difference between characters in a fixed width font. By increasing it, we effectively told the OCR to allow differences between characters in fixed width fonts to be a little bigger than normal. The outcome is the MICR line was spaced out correctly.


The OCR Profile using Layered OCR also has Synthesis properties, just like any other OCR Profile.

By default Synthesis is enabled on all OCR Profiles, and the Maximum Variance defaults to 12%.

Essentially, what has happened is our Layered OCR Profile has re-synthesized the synthesized OCR results from our OCR Layer.


When using Layered OCR be mindful of its own Synthesis and other profile settings.

You may want to disable these settings entirely in order to retain the synthesized results from your Main OCR Profile and OCR Layer Profiles.

Configure Layered OCR - Using two OCR Profiles with different temporary IP Profiles

Some of the tabs in this tutorial are longer than the others. Please scroll to the bottom of each step's tab before going to the next step.

Before you begin

This tutorial assumes you are familiar with creating and configuring OCR Profiles, as well as temporary image processing for OCR cleanup, and have reviewed the previous tutorial.


The problem we are trying to address has to do with particularly bad results from some labels on one of these checks.

Notice, for this document, the "Date", "Check No.", and "Amount" labels did not come through at all.


The issue has to do with the fact this particular font is inside a negative region (white text on a black background).

Since OCR needs to see black pixels in order to recognize characters, we inverted this region, giving us black text. However, we didn't get quite enough of the text after the transformation.

The Original Image
After Temporary IP Performed

What Layered OCR allows us to do is use a different temporary IP Profile to perform image cleanup on these labels, giving us accurate OCR results.

A different temporary IP Profile applied
Accurate results for the labels


The temporary IP Profile used here is a modification of the one used for the rest of these checks. The main difference is its Binarize step. Instead of using the Auto Thresholding Method, this one uses Simple with the Threshold manually set to 38. This low threshold will preserve more of the white text on the document. For most of the text, this is going to be a bad thing, as the light pixels around the black text are going to degrade those characters. However, since these labels ("Date", "Check No.", and "Amount") are white text inside a black border, they will actually be preserved better once we invert them with a Negative Region Removal step. And, since we are only targeting these labels in our OCR Layer, it doesn't matter what happens to the rest of the text because it won't be merged with the main text data.

The steps in this IP Profile (which we've named "Checks - Inverted Labels") are as follows.

  1. Binarize
    • Thresholding Method set to Simple
    • Threshold set to 38
  2. Negative Region Removal
  3. Line Removal
  4. If you want, you may add Speck Removal
    • For some additional artifact cleanup around the inverted labels.
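
To see why a low threshold helps white-on-black text, here is a small Python sketch of simple thresholding followed by inversion. It is not Grooper's Binarize or Negative Region Removal code: the `binarize` and `invert` helpers, the "below the threshold becomes black" convention, and the pixel values are assumptions made for illustration.

```python
def binarize(gray_row, threshold):
    """Simple thresholding: values below the threshold become black (0), the rest white (255)."""
    return [0 if value < threshold else 255 for value in gray_row]


def invert(binary_row):
    """Swap black and white, as happens when a negative region is inverted."""
    return [255 - value for value in binary_row]


# One scan line crossing a faint white stroke on a black background.
row = [5, 20, 60, 180, 60, 20, 5]

low = invert(binarize(row, 38))    # threshold from this tutorial
high = invert(binarize(row, 128))  # a mid-range threshold for comparison
```

With the threshold at 38, the dimmer edge pixels of the stroke survive and the inverted stroke is three pixels wide. At 128, only the brightest pixel survives, leaving a thin, broken character for OCR to read.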

Create the OCR Layer's OCR Profile

For this OCR Layer, we're just going to copy the "Layered OCR - Checks - Main" OCR Profile and change the IP Profile property.

  1. Here we've copied the "Layered OCR - Checks - Main" OCR Profile and named it "Layered OCR - Checks - Inverted Labels".
  2. Select the IP Profile property. Using the dropdown menu, select the IP Profile referred to in the previous step, named "Checks - Inverted Labels".

Add a New OCR Layer

  1. In the Node Tree, select the OCR Profile using Layered OCR
    • The one we've been using is named "Layered OCR - Checks"
  2. Select the Layers property in the right-hand property window
  3. Press the ellipsis button at the end. This will bring up the "OCR Layer Collection Editor".


  1. Press the "Add" button.
  2. Select the OCR Profile property. Click the drop-down menu.
  3. Select the "Layered OCR - Checks - Inverted Labels" OCR Profile.

Configure the OCR Layer's Extractor

Next, we need to determine which text will be merged. The OCR Layer's Extractor property allows us to use an Internal or Reference extractor to return text from the OCR Layer's OCR results and merge it with the main OCR Profile's results.


For this example a simple pattern looking for these labels will work just fine. We will use date check no\. amount for the value pattern.
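
As a quick check of the label pattern, the sketch below runs it against made-up layer text in Python. The sample text and the case-insensitive flag are assumptions: whether the labels are upper case on the real check, and how Grooper handles case by default, is not stated here.

```python
import re

# re.IGNORECASE is an assumption so the pattern matches upper-case labels.
LABEL_PATTERN = re.compile(r"date check no\. amount", re.IGNORECASE)

# Hypothetical OCR text from the layer: a label row followed by a value row.
layer_text = "DATE CHECK NO. AMOUNT\r\n01/02/2023 1492 $150.00"

match = LABEL_PATTERN.search(layer_text)
label_row = match.group(0) if match else None
```

The match covers the three labels on a single line and stops before the line break, so it satisfies the single-line requirement for Layer Extractors.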


  1. We've set that pattern as an Internal extractor for this layer.
  2. Press the "OK" button to finish configuring the OCR Layer.

Verify the Results

  1. Switch over to the "OCR Testing Tab" to verify our results.
  2. Select a page in the batch to test.
  3. Press the "OCR Page" button.

Now we have a single document using results from our main OCR Profile merged with the results from both OCR Layers. The MICR line "A987654321A 987654321 C1492" was merged from the results of the first OCR Layer (configured in the previous tutorial), and the "Date", "Check No.", and "Amount" labels were merged from the second OCR Layer, which also uses Transym as its OCR engine but uses a different temporary IP Profile to better target those labels.