2.80:Layered OCR (OCR Engine): Difference between revisions

Revision as of 13:29, 9 July 2020

Layered OCR enables you to run secondary OCR Profiles on a single page. The OCR results from these secondary OCR Profiles are merged with (or layered on top of) the primary OCR Profile's results.

About

You can use Layered OCR by selecting it as the OCR Engine in an OCR Profile. Layered OCR is not itself an OCR engine like Transym or Tesseract. Instead, it lets you obtain OCR text from multiple OCR Profiles, each using its own OCR engine.

For example, certain OCR engines have advantages over others in specific cases. Transym performs well in most cases. However, it does not do well with certain specialized fonts and print types, such as MICR or handwriting. Another engine may perform better in these cases. Microsoft's Azure Computer Vision does better than most OCR engines at recognizing handwriting (but requires a license key from Microsoft). Google's Tesseract can be trained to recognize atypical fonts and print types.

FYI

Grooper ships with both Transym and Tesseract as selectable OCR engines. Furthermore, training files for the MICR, OCR-A, and OCR-B fonts are included.

For the check below, an OCR Profile using Transym performed well generally, but failed to read the MICR line at the bottom (the routing, account, and check numbers). Tesseract got the MICR line, but had issues recognizing other parts of the check.

Transym accurately reads the check's text except for the MICR line.
Tesseract has issues with other parts of the check.

With Layered OCR you can use an OCR Profile using Transym as your primary or baseline OCR results (seen in teal), and target the MICR line with an extractor to pull the results from an OCR Profile using Tesseract (seen in orange).

This can greatly improve your OCR results. The secondary layers can target segments of text better recognized by different OCR Profiles and merge the results with your main OCR Profile.

How It Works

Layered OCR has three basic steps.

1. The Main OCR Profile property establishes the primary OCR Profile. Here, you will point to a configured OCR Profile you want to use as your baseline OCR.
2. The Layers property allows you to use secondary OCR Profiles. Here, you will add one or more Layers, each pointing to a second configured OCR Profile and an Extractor.
3. The Extractor returns segments of text recognized by the secondary (or layer) OCR Profiles. These segments replace the corresponding results from the Main OCR Profile.
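
The three steps above can be pictured with a small Python sketch. This is a conceptual model only, not Grooper's implementation: the `Word` record, the `layer_ocr` function, the page coordinates, and the regex standing in for the Layer's Extractor are all hypothetical names chosen for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    line: int   # zero-based line index on the page (assumed layout model)
    x: int      # left edge of the word, in arbitrary page units
    text: str


def line_text(words, line):
    """Join the words of one line, left to right, into a string."""
    row = sorted((w for w in words if w.line == line), key=lambda w: w.x)
    return " ".join(w.text for w in row)


def layer_ocr(main_words, layer_words, pattern):
    """Replace main-profile words with the layer words matched by `pattern`."""
    merged = list(main_words)
    for line in sorted({w.line for w in layer_words}):
        row = sorted((w for w in layer_words if w.line == line), key=lambda w: w.x)
        match = re.search(pattern, " ".join(w.text for w in row))
        if match is None:
            continue
        # Map the matched character span back to the layer words it covers.
        hits, pos = [], 0
        for w in row:
            start, end = pos, pos + len(w.text)
            if start < match.end() and end > match.start():
                hits.append(w)
            pos = end + 1  # skip the joining space
        if not hits:
            continue
        left, right = hits[0].x, hits[-1].x
        # Drop main words in the matched region, then add the layer's words.
        merged = [w for w in merged
                  if not (w.line == line and left <= w.x <= right)]
        merged.extend(hits)
    return merged


# Step 1: baseline results (good on normal text, poor on the MICR-style line).
main_words = [Word(0, 0, "Pay"), Word(0, 40, "to"),
              Word(1, 0, "l2EA5G7B9"), Word(1, 110, "OOOl2EA5"), Word(1, 200, "lOOl")]
# Step 2: a secondary profile (poor on normal text, good on the MICR-style line).
layer_words = [Word(0, 0, "Poy"), Word(0, 40, "lo"),
               Word(1, 0, "123456789"), Word(1, 110, "00012345"), Word(1, 200, "1001")]
# Step 3: the extractor targets only the numeric line; its match replaces the baseline.
merged = layer_ocr(main_words, layer_words, r"\d{9} \d{6,12} \d{4}")
print(line_text(merged, 0))
print(line_text(merged, 1))
```

In this model only the region the extractor matched is taken from the layer; everything else on the page still comes from the main results.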

Layer Extractor Requirements and Considerations

!	There are some specific requirements for what results from a Layer Extractor can be merged with the main OCR results. The extractor MUST meet these requirements, or it will not replace the results from the Main OCR Profile.

The extractor must return a contiguous sequence of characters on a single line of text.

This is the most important consideration to keep in mind. This means you cannot merge OCR results if the extractor returns results on multiple lines. You cannot, for example, create an OCR Layer that merges the results of a full paragraph.
Data Type Collation Providers that produce results on multiple lines, such as Arrays and Ordered Arrays in Vertical Mode, are not suitable for use.
Collation Providers that can produce non-contiguous output values, such as Arrays and Ordered Arrays, can also cause problems. Results will not merge properly unless the array elements are contiguous, meaning no characters are skipped between one array element and the next. (At that point, a single regex pattern matching the text on the line is probably what you want.)
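
As a rough illustration of the requirement above, the following Python sketch checks whether a set of extracted spans forms one unbroken run on a single line. The `is_mergeable` helper is hypothetical, and treating whitespace between elements as "not skipping characters" is an assumption made for this example, not documented Grooper behavior.

```python
import re


def is_mergeable(text, spans):
    """Return True if `spans` cover one unbroken run of text on a single line.

    `spans` is a list of (start, end) character offsets into `text`, one per
    extracted element (a lone regex match is a list with one span).
    """
    if not spans:
        return False
    spans = sorted(spans)
    covered = text[spans[0][0]:spans[-1][1]]
    # Requirement 1: everything must sit on a single line.
    if "\r" in covered or "\n" in covered:
        return False
    # Requirement 2: nothing but whitespace may be skipped between elements.
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        if text[prev_end:next_start].strip():
            return False
    return True


page = "Invoice No: 12345 Date: 07/09/2020\r\nTotal Due: $100.00"

single = [m.span() for m in re.finditer(r"Invoice No: \d+", page)]
adjacent = [m.span() for m in re.finditer(r"Invoice|No:", page)]
skipping = [m.span() for m in re.finditer(r"Invoice|Date:", page)]
two_lines = [m.span() for m in re.finditer(r"Date: \S+|Total Due:", page)]

print(is_mergeable(page, single))     # one match on one line
print(is_mergeable(page, adjacent))   # array elements side by side
print(is_mergeable(page, skipping))   # elements skip "No: 12345"
print(is_mergeable(page, two_lines))  # elements span a line break
```

The first two cases pass the check and the last two fail it, which mirrors the single-line and contiguity rules described above.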

FuzzyRegEx and FuzzyList match modes CAN be used to correct output. This can be a great way to repair labels recognized with minor OCR errors.

Keep in mind, however, the `\r\n` characters at the end of a line can be swapped in a fuzzy match. Be sure to list these characters as immutable in your Fuzzy Match Weightings, using the syntax `Immutable=\r\n`. This will prevent unintentional matches across line breaks.
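
To see why marking `\r\n` as immutable matters, here is a minimal Python sketch of an edit distance in which certain characters can never be inserted, deleted, or substituted. It is a generic Levenshtein variant written for illustration; the `edit_cost` function and its `immutable` parameter are hypothetical and do not describe how Grooper's Fuzzy Match Weightings are implemented.

```python
import math


def edit_cost(source, target, immutable="\r\n"):
    """Levenshtein distance in which immutable characters can never be edited."""
    def cost(*chars):
        # Any edit that touches an immutable character is forbidden.
        return math.inf if any(c in immutable for c in chars) else 1

    prev = [0.0]
    for ch in target:
        prev.append(prev[-1] + cost(ch))          # insert target characters
    for s in source:
        curr = [prev[0] + cost(s)]                # delete source characters
        for j, t in enumerate(target, start=1):
            substitute = prev[j - 1] + (0 if s == t else cost(s, t))
            curr.append(min(substitute, prev[j] + cost(s), curr[j - 1] + cost(t)))
        prev = curr
    return prev[-1]


# A one-character OCR error is a cheap repair.
print(edit_cost("Order Dat3", "Order Date"))
# With the line break immutable, text split across two lines cannot match.
print(edit_cost("Order\r\nDate", "Order Date"))
# Without that protection, the line break is swapped away for a small cost.
print(edit_cost("Order\r\nDate", "Order Date", immutable=""))
```

With the line break protected, the cross-line candidate has an infinite cost and is never accepted, while ordinary character swaps on one line remain cheap.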

DO NOT use fuzzy lexicon lookups, output formats, or lexicon translation to modify output values.

The result will not replace the main OCR results if you do. The FuzzyRegEx and FuzzyList match modes are the only ways Layered OCR can modify the results of the secondary OCR Profile before merging with the main OCR Profile's results.

Use Cases

Mixed Print Types

Layered OCR shines where you need to extract text from documents that use vastly different print types. One OCR Profile may find one print type, and another OCR Profile may find a second print type, but neither may find both. Layered OCR allows you to get the best of both worlds, combining the results from both OCR Profiles into a single OCR output that is more accurate than either profile produces on its own.

As we saw in the check above, Tesseract found the MICR line (using the MICR font) well, but Transym did a better job at recognizing the rest of the fonts on the document. Layered OCR allowed us to get highly accurate results using both engines. See the How To section of this article for a step-by-step guide to how this was accomplished.

Label Repair

Because the FuzzyRegEx and FuzzyList match modes are supported for Layer Extractors, Layered OCR can be used to correct OCR results before the document gets handed off to a data extraction step. For example, the label "Order Date" could be fuzzy matched against the Main OCR Profile's result "Order Dat3", swapping the "3" for an "e". This would yield the correct label "Order Date" in the layered OCR results. This makes data extraction simpler and quicker, and provides a document with more accurate searchable text data upon export.
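
The "Order Dat3" repair can be sketched in a few lines of Python. This uses the standard library's `difflib.SequenceMatcher` as a stand-in for fuzzy matching; it is not Grooper's FuzzyRegEx or FuzzyList implementation, and the `repair_label` function and its `min_similarity` threshold are hypothetical. Because the window has the same width as the label, this sketch only handles substituted characters, not inserted or dropped ones.

```python
from difflib import SequenceMatcher


def repair_label(line, label, min_similarity=0.85):
    """Replace the window of `line` most similar to `label`, if similar enough."""
    best_score, best_start = 0.0, None
    for start in range(len(line) - len(label) + 1):
        window = line[start:start + len(label)]
        score = SequenceMatcher(None, window, label, autojunk=False).ratio()
        if score > best_score:
            best_score, best_start = score, start
    if best_start is None or best_score < min_similarity:
        return line  # nothing close enough; leave the OCR text alone
    return line[:best_start] + label + line[best_start + len(label):]


print(repair_label("Order Dat3: 07/09/2020", "Order Date"))
print(repair_label("Ship To: Acme Corp", "Order Date"))
```

The first line is repaired to read "Order Date: 07/09/2020", while the second is returned unchanged because no part of it resembles the label closely enough.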

!	Keep in mind, however, the `\r\n` characters at the end of a line can be swapped in a fuzzy match. Be sure to list these characters as immutable in your Fuzzy Match Weightings, using the syntax `Immutable=\r\n`. This will prevent unintentional matches across line breaks.

Custom Label Image Processing

One of the best tools in Grooper's toolbox for getting accurate OCR results is the ability to perform highly configurable temporary image processing prior to OCR by setting an IP Profile on an OCR Profile. We can take this further by creating an OCR layer that uses a different set of image cleanup commands, producing better results on portions of a document than the Main OCR Profile's temporary IP Profile does. See the How To section of this article for an example.
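
As a toy illustration of why a layer might want its own cleanup, the Python sketch below binarizes the same faint region with two different thresholds. The `binarize` function, the pixel values, and both threshold settings are invented for this example; they do not correspond to actual Grooper IP Profile commands.

```python
def binarize(pixels, threshold):
    """Map grayscale pixels (0 = black, 255 = white) to 1 for ink, 0 for paper."""
    return [[1 if value < threshold else 0 for value in row] for row in pixels]


# A faint vertical stroke (values around 180) on a near-white background.
region = [
    [250, 180, 250],
    [250, 170, 250],
    [250, 190, 250],
]

main_cleanup = binarize(region, threshold=128)   # tuned for dark, crisp print
layer_cleanup = binarize(region, threshold=200)  # tuned for this faint label

print(main_cleanup)
print(layer_cleanup)
```

With the main setting the stroke is erased before OCR ever sees it, while the layer's setting keeps it, so only the layer's OCR has a chance to read that portion of the page.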

How To

Grooper Help Documentation

Version Differences

Layered OCR was added in Grooper version 2.72. It is not available in earlier versions.

@@ Line 11: / Line 11: @@
-For example, certain OCR engines have advantages over others in specific cases.  Transym performs well in most cases.  However, it does not do well with certain specialized print types, such as [https://en.wikipedia.org/wiki/Magnetic_ink_character_recognition MICR] or handwriting.  Another engine may perform better in these cases.  Microsoft's [https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/ Azure Computer Vision] does better than most OCR engines at recognizing handwriting (but requires a licence key from Microsoft).  Google's [https://en.wikipedia.org/wiki/Tesseract_(software) Tesseract] has the capability to train fonts.  Grooper ships with both Transym and Tesseract as selectable OCR engines.  Furthermore, training files for the MICR, OCR-A, and OCR-B fonts are included.
+For example, certain OCR engines have advantages over others in specific cases.  Transym performs well in most cases.  However, it does not do well with certain specialized fonts and print types, such as [https://en.wikipedia.org/wiki/Magnetic_ink_character_recognition MICR] or handwriting.  Another engine may perform better in these cases.  Microsoft's [https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/ Azure Computer Vision] does better than most OCR engines at recognizing handwriting (but requires a licence key from Microsoft).  Google's [https://en.wikipedia.org/wiki/Tesseract_(software) Tesseract] has the capability to train fonts to target atypical fonts and print types.
+{|cellpadding="10" cellspacing="5"
+|-style="background-color:#36b0a7; color:white"
+|style="font-size:14pt"|'''FYI'''||Grooper ships with both Transym and Tesseract as selectable OCR engines.  Furthermore, training files for the MICR, OCR-A, and OCR-B fonts are included.
+|}
 {|style="margin:auto; text-align:center" cellpadding=10 cellspacing=5
-|colspan=2|For the check below, an OCR Profile using Transym performed well generally, but failed to read the MICR line at the bottom.  Tesseract got the MICR line, but had issues recognizing other parts of the check.
+|colspan=2|For the check below, an '''OCR Profile''' using Transym performed well generally, but failed to read the MICR line at the bottom (the routing, account, and check numbers).  Tesseract got the MICR line, but had issues recognizing other parts of the check.
 |-
 |colspan=2|[[file:Layered OCR Checks 01.png]]
@@ Line 23: / Line 28: @@
 |[[file:Layered OCR Checks 02.png|border]]||[[file:Layered OCR Checks 03.png|border]]
 |-
-|colspan=2|With ''Layered OCR'' you can use an OCR Profile using Transym as your primary or baseline OCR results (seen in teal), and target the MICR line with an extractor to pull the results from an OCR Profile using Tesseract (seen in orange).
+|colspan=2|With ''Layered OCR'' you can use an '''OCR Profile''' using Transym as your primary or baseline OCR results (seen in teal), and target the MICR line with an extractor to pull the results from an '''OCR Profile''' using Tesseract (seen in orange).
 |-
 |colspan=2|[[file:Layered OCR Checks 04.png|border]]
@@ Line 29: / Line 34: @@
-This can greatly improve your OCR results.  The secondary layers can target segments of text better recognized by different OCR Profiles and merge the results with your main OCR Profile.
+This can greatly improve your OCR results.  The secondary layers can target segments of text better recognized by different '''OCR Profiles''' and merge the results with your main '''OCR Profile'''.
 === How It Works ===
@@ Line 35: / Line 40: @@
 Layered OCR has three basic steps.
-# The '''''Main OCR Profile''''' property establishes the primary OCR Profile.  Here, you will point to a configured OCR Profile you want to use as your baseline OCR.
+# The '''''Main OCR Profile''''' property establishes the primary OCR Profile.  Here, you will point to a configured '''OCR Profile''' you want to use as your baseline OCR.
-# The '''''Layers''''' property allows you to use secondary OCR Profiles.  Here, you will add one more more Layers pointing to a second configured OCR Profile and an Extractor.
+# The '''''Layers''''' property allows you to use secondary OCR Profiles.  Here, you will add one more more '''''Layers''''' pointing to a second configured '''OCR Profile''''' and an '''Extractor'''.
-# The Extractor returns segments of text recognized by the secondary (or layer) OCR Profiles, and replaces the results from the Main OCR Profile.
+# The '''''Extractor''''' returns segments of text recognized by the secondary (or layer) '''OCR Profiles''', and replaces the results from the '''''Main OCR Profile'''''.
 === Layer Extractor Requirements and Considerations ===
@@ Line 44: / Line 49: @@
 {|cellpadding="10" cellspacing="5"
 |-style="background-color:#f89420; color:white"
-|style="font-size:14pt"|'''!'''||There are some specific requirements for what results from a Layer Extractor can be merged with the main OCR results.  The extractor MUST meet these requirements, or it will not replace the results from the Main OCR Profile.
+|style="font-size:14pt"|'''!'''||There are some specific requirements for what results from a '''''Layer Extractor''''' can be merged with the main OCR results.  The extractor MUST meet these requirements, or it will not replace the results from the '''''Main OCR Profile'''''.
 |}
 The extractor must return a '''contiguous sequence of characters''' on a '''single line of text.'''
-* This is the most important consideration to keep in mind.  This means you '''cannot''' merge OCR Results if the extractor returns results on multiple lines.  You cannot, for example, create an OCR Layer that merges the results of a full paragraph.
+* This is the most important consideration to keep in mind.  This means you '''cannot''' merge OCR results if the extractor returns results on multiple lines.  You cannot, for example, create an OCR Layer that merges the results of a full paragraph.
-* Data Type Collation Providers that produce results on multiple lines are not suitable for use, such as Arrays and Ordered Arrays in Vertical Mode.
+* '''Data Type''' '''''Collation Providers''''' that produce results on multiple lines are not suitable for use, such as Arrays and Ordered Arrays in Vertical Mode.
-* Collation Providers that can produce non-contiguous output values, such as Arrays and Ordered Arrays (potentially), can also cause some problems.  Results will not merge properly unless the array elements are '''contiguous''', meaning the text in the first array element should not skip characters between the next element (At that point a single regex pattern matching the text on the line is probably what you want).
+* '''''Collation Providers''''' that can produce non-contiguous output values, such as ''Arrays'' and ''Ordered Arrays'', can potentially also cause some problems.  Results will not merge properly unless the array elements are '''contiguous''', meaning the text in the first array element should not skip characters between the next element (At that point a single regex pattern matching the text on the line is probably what you want).
-FuzzyRegEx and FuzzyList modes '''CAN''' be used to correct output.  This can be a great way to repair labels recognized with minor OCR errors.
+''FuzzyRegEx'' and ''FuzzyList'' match modes '''CAN''' be used to correct output.  This can be a great way to repair labels recognized with minor OCR errors.
 * Keep in mind, however, the <code>\r\n</code> characters at the end of a line can be swapped in a fuzzy match.  Be sure to list these characters as immutable in your Fuzzy Match Weightings, using the syntax <code>Immutable=\r\n</code>.  This will prevent unintentional matches across line breaks.
 '''DO NOT''' use fuzzy lexicon lookups, output formats, or lexicon translation to modify output values.
-* The result will not replace the main OCR results if you do.  The FuzzyRegEx and FuzzyList match modes are the only ways Layered OCR can modify the results of the secondary OCR Profile before merging with the main OCR Profile's results.
+* The result will not replace the main OCR results if you do.  The ''FuzzyRegEx'' and ''FuzzyList'' match modes are the only ways ''Layered OCR'' can modify the results of the secondary OCR Profile before merging with the main '''OCR Profile's''' results.
+== Use Cases ==
+=== Mixed Print Types ===
+''Layered OCR'' shines where you need to extract text from documents that use vastly different print types.  One '''OCR Profile''' may find one print type, and another OCR Profile may find a second print type, but neither may find both.  ''Layered OCR'' allows you to get the best out of both worlds, combining the results from both '''OCR Profiles''' to get a single OCR output that is more accurate than either of the profiles individually.
+As we saw in the check above, Tesseract found the MICR line (using the MICR font) well, but Transym did a better job at recognizing the rest of the fonts on the document.  Layered OCR allowed us to get highly accurate results using both engines.  See the [[#How To|How To]] section of this article for a step-by-step guide to how this was accomplished.
+=== Label Repair ===
+Because ''FuzzyRegEx'' and ''Fuzzy List'' match modes are supported for Layer Extractors, Layered OCR can be used to correct OCR results ''before'' the document gets handed off to a data extraction step.  For example, the label "Order Date" could be fuzzy matched against the '''''Main OCR Profile's''''' result "Order Dat3" and swap the "3" for an "e".  This would yield the correct label "Order Date" in the layered OCR results.  This will make data extraction simpler and quicker, as well as providing a document with more accurate searchable text data upon export.
+{|cellpadding="10" cellspacing="5"
+|-style="background-color:#f89420; color:white"
+|style="font-size:14pt"|'''!'''||Keep in mind, however, the <code>\r\n</code> characters at the end of a line can be swapped in a fuzzy match.  Be sure to list these characters as immutable in your Fuzzy Match Weightings, using the syntax <code>Immutable=\r\n</code>.  This will prevent unintentional matches across line breaks.
+|}
+=== Custom Label Image Processing ===
+One of the best tools in Grooper's toolbox to get accurate OCR results is the ability to perform highly configurable temporary image processing prior to OCR by setting an '''IP Profile''' on an '''OCR Profile'''.  We can leverage this even further to our advantage by creating an OCR layer than uses a different set of image cleanup commands to produce better results on portions of a document than the '''''Main OCR Profile's''''' temporary '''IP Profile'''.  See the [[#How To|How To]] section of this article for an example.
+== How To ==
+== Grooper Help Documentation ==
+* [http://grooper.bisok.com/Documentation/2.80/Main/HTML5/index.htm#t=Object_Reference%2FOCR_Engine%2FLayered_OCR.htm Layered OCR]
+* [http://grooper.bisok.com/Documentation/2.80/Main/HTML5/index.htm#t=Object_Reference%2FOther%2FOCR_Layer.htm OCR Layer]
+== Version Differences ==
+''Layered OCR'' is a new feature added in Grooper version 2.72.  Prior to this point, this functionality did not exist.