OCR (Concept): Difference between revisions

From Grooper Wiki
<blockquote style="font-size:14pt">
{{Migrated}}
OCR stands for Optical Character Recognition.  It allows text from paper documents to be digitized to be searched or edited by other software applications.  OCR converts typed or printed text from digital images of physical documents into machine readable, encoded text.  This conversion allows Grooper to search text characters from the image, providing the capability to separate images into documents, classify them and extract data from them.
{{2023:{{PAGENAME}}}}
</blockquote>
 
== About ==
 
The quick explanation of OCR is it analyzes pixels on an image and translates those pixels into text.  Most importantly, it translates pixels into ''machine readable'' text.  Grooper can be described as a document modeling platform.  You use the platform to model how pages are separated out into documents, how one document gets put into one category or another, and how extractable data is structured on the document.  Once you have this model of what a document is, how it fits into a larger document set, and where the data is on it, you can use it to programmatically process any document that fits the model.
 
In order to do any of that, you have to be able to read the text on the page.  How do you know an invoice is an invoice?  A simple way could be locating the word "invoice" (or other text associated with the invoice).  You, as a human, do this by looking at the ink on a page (or pixels for a digital document) and reading the word "invoice".  Grooper does this by using a Data Extractor (and regular expression) to read the machine readable text for the page.  OCR is how each page gets that machine readable text in order to model the document set and process it.
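As a rough illustration of why machine readable text matters, the "find the word invoice" idea can be sketched in a few lines of Python.  This is not how a Grooper Data Extractor is configured; the sample text, the pattern and the function name are all assumptions made up for this sketch.

```python
import re

# Hypothetical OCR output for one page (machine readable text).
page_text = "ACME Corp\nINVOICE No. 10045\nTotal Due: $1,250.00"

# Case-insensitive, whole-word match for "invoice".
INVOICE_PATTERN = re.compile(r"\binvoice\b", re.IGNORECASE)


def looks_like_invoice(text: str) -> bool:
    """Return True when the whole word 'invoice' appears in the text."""
    return INVOICE_PATTERN.search(text) is not None


is_invoice = looks_like_invoice(page_text)
```

Without OCR (or native text) there is no string to run the pattern against, only pixels.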
 
=== The General Process ===
 
In Grooper, OCR is performed by the [[Recognize]] activity, referencing an OCR Profile which contains all the settings to get the OCR results, including which [[OCR Engine]] is used.  The OCR Profile also has settings to optionally process those results to increase the accuracy of the [[OCR Engine]] used.  The general process of OCR'ing a document is as follows in Grooper:
 
1) The document image is handed to the [[Recognize]] activity, which references an OCR Profile, containing the settings to perform the OCR operation.
 
2) The [[OCR Engine]] (set on the OCR Profile) converts the pixels on the image into machine readable text for the full page.
 
3) Grooper reprocesses the [[OCR Engine]]'s results and runs additional OCR passes using the OCR Profile's Synthesis properties.
 
4) The raw OCR results from the [[OCR Engine]] and Grooper's Synthesis results are combined into a single text flow.
 
5) Undesirable results can be filtered out using Grooper's Results Filtering options.
 
{|style="margin:auto" cellpadding=10
|rowspan=2|
{|
|+The [[Recognize]] activity is handed the document image and performs OCR.
|[[File:Ocr results 1.png|border|500px]]
|}
|valign=top|
{|
|+The results are seen here in a text flow.
|[[File:Ocr results 2.png|border|500px]]
|}
|
{|
|+The results are seen here in a "Layout View", using the character positions and font sizes obtained to overlay where they are on the document.
|[[File:Ocr results 3.png|border|500px]]
|}
|}
 
=== OCR vs. Native Text ===
 
OCR gets text specifically from images, whether they were printed and scanned or imported from a digital source.  However, what if the document was created digitally and imported in its original digital form?  Wouldn't it have been created on a computer, using machine readable text?  Most likely, yes!  If a form was created using a product like Adobe Acrobat and filled in using a computer, the text comprising the document and the filled fields is encoded within the document itself.  This is called "Native Text".  This text is already machine readable, so there is no reason to OCR the document.  Instead, the native text is extracted via Grooper's native text extraction.  Native text has a number of advantages over OCR.  OCR is not perfect.  As you will see, OCR is a fairly complicated process with a number of opportunities to misread a document.  Grooper has plenty of advancements to get around these errors and produce a better result, but OCR will rarely be as accurate as the original native text from a digital document.
 
However, be careful.  Just because a PDF document has machine readable text behind it does not mean that text is native text.  If the document was OCR'd by a different platform, the text may have been inserted into the PDF (Grooper also has this capability upon exporting a document).  In these cases, we still recommend OCR'ing the document to take advantage of Grooper's superior OCR capabilities and get a more accurate result.
 
Regardless of whether you get machine readable text through OCR or Native Text Extraction, both are done via the [[Recognize]] activity.  In the case of OCR, you will need to create an OCR Profile containing all the settings to perform OCR and reference it during the [[Recognize]] activity.  Native Text Extraction is enabled by default, but can be disabled if you wish to use OCR instead.
 
=== What is an OCR Engine? ===
 
[[OCR Engine]]s are software applications that perform the actual recognition of characters on images, analyzing the pixels on the image and figuring out what text characters they match.
 
[[OCR Engine]]s themselves have three phases:
 
1) '''Pre-Processing''':  In this phase, the [[OCR Engine]] prepares the image to be read by turning color and grayscale images to black and white and potentially removing artifacts getting in the way of OCR, such as specks and lines.  Text is also segmented into lines, then words, and finally characters in this phase.
 
2) '''Character Recognition''':  Here, the [[OCR Engine]] takes those pixel character segments, compares them to examples of character glyphs, and makes a decision about which machine readable text character matches the segment.
 
3) '''Post-Processing''':  Commercial [[OCR Engine]]s also analyze the OCR results and attempt to correct inaccurate results, such as performing basic spellchecking.
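The three phases can be illustrated with a deliberately tiny Python sketch.  It is a toy under stated assumptions, not how any commercial engine works: the 3x3 glyph templates, the pixel-difference comparison and the one-character correction rule are all invented for illustration.

```python
# Toy 3x3 glyph templates (1 = ink, 0 = background); illustrative only.
GLYPHS = {
    "I": ((0, 1, 0), (0, 1, 0), (0, 1, 0)),
    "L": ((1, 0, 0), (1, 0, 0), (1, 1, 1)),
    "T": ((1, 1, 1), (0, 1, 0), (0, 1, 0)),
}


def binarize(gray, threshold=128):
    """Pre-processing: pixels darker than the threshold become ink (1)."""
    return tuple(tuple(1 if px < threshold else 0 for px in row) for row in gray)


def recognize(segment):
    """Character recognition: pick the template with the fewest differing pixels."""
    def distance(template):
        return sum(
            a != b
            for seg_row, tpl_row in zip(segment, template)
            for a, b in zip(seg_row, tpl_row)
        )
    return min(GLYPHS, key=lambda ch: distance(GLYPHS[ch]))


def post_process(text, vocabulary):
    """Post-processing: swap in a known word that differs by exactly one character."""
    for word in vocabulary:
        if len(word) == len(text) and sum(a != b for a, b in zip(word, text)) == 1:
            return word
    return text


# A grayscale character segment (0 = black, 255 = white) shaped like a "T".
gray_t = ((10, 20, 30), (250, 15, 240), (200, 90, 255))
character = recognize(binarize(gray_t))
corrected = post_process("TIIL", ["TILL"])
```

Real engines use far richer glyph models and language checks, but the flow is the same: clean the pixels, match segments to characters, then sanity-check the text.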
 
For more in depth information on how [[OCR Engine]]s work, visit the [[OCR Engine]] article.
 
The Transym 4 and Transym 5 OCR engines are included in Grooper's licensing.  Transym OCR 4 provides highly accurate English-only OCR, while Transym OCR 5 provides multi-language OCR for 28 different languages.  Google's open source Tesseract engine is available in Grooper version 2.72 and beyond.  ABBYY FineReader, Prime OCR, and Azure OCR are also supported but require separate installations and separate licensing.
 
== Image Processing and OCR ==
 
Regardless of how good an [[OCR Engine]] is, OCR is very rarely perfect.  Characters can be segmented out wrong.  Artifacts such as table lines, check boxes or even just specks from image noise can interfere with breaking out character segments from words and lines.  Even when they are segmented out correctly, the [[OCR Engine]]'s character recognition can make the wrong decision about what the character is.
 
Image Processing (often abbreviated as "IP") can assist the OCR operation by providing a "cleaner" image to the [[OCR Engine]].  The general idea is to give the OCR engine just the text pixels, so that is all the engine needs to process.
 
{|style="margin:auto" cellpadding="10"
|-style="text-align:center"
|This image is much easier for OCR to process...||...than this image.
|-
|[[File:Ocr ip 1.png|border|400px]]||[[File:Ocr ip 2.png|400px]]
|}
 
 
Images are altered using an [[IP Profile]], which contains a step by step list of IP Commands, each of which performs a specific alteration to the image.  IP Profiles are highly configurable.  There are multiple different IP Commands, each of which has its own configurable properties as well.  In the example above, the image was altered using an IP Profile with six steps, each step containing a different IP Command.
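Conceptually, an IP Profile behaves like an ordered list of functions, each one receiving the image produced by the step before it.  The Python sketch below illustrates that idea only.  The function names are hypothetical and the two steps are simplified stand-ins, not Grooper's actual Auto Border Crop or Contrast Stretch commands.

```python
from functools import reduce


def crop_border(image):
    """Simplified stand-in for a border crop: drop the outermost ring of pixels."""
    return [row[1:-1] for row in image[1:-1]]


def contrast_stretch(image):
    """Simplified stand-in for a contrast stretch: rescale values to 0..255."""
    flat = [px for row in image for px in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [row[:] for row in image]
    return [[(px - lo) * 255 // (hi - lo) for px in row] for row in image]


def run_profile(image, steps):
    """Apply each step to the output of the previous step, first to last."""
    return reduce(lambda current, step: step(current), steps, image)


# A 4x4 grayscale image with a black border around a 2x2 interior.
image = [
    [0, 0, 0, 0],
    [0, 100, 200, 0],
    [0, 150, 120, 0],
    [0, 0, 0, 0],
]

crop_first = run_profile(image, [crop_border, contrast_stretch])
stretch_first = run_profile(image, [contrast_stretch, crop_border])
```

The two orderings give different pixels for the same interior, because the stretch sees the black border in one case and not in the other.  Each step only ever works on what the previous step handed it.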
 
<tabs style="margin:20px">
<tab name="IP Steps" style="margin:20px">
This is the list of steps in this IP Profile, each one named for the IP Command used:  Auto Border Crop, Binarize, Shape Removal, Line Removal, Speck Removal, and Blob Removal.
 
[[file:ip profile 1.png|center]]
 
{|cellpadding="10" cellspacing="5"
|-style="background-color:#f89420; color:white"
|style="font-size:14pt"|'''!'''||Order of operation matters!  The image is altered step by step, from the first to the last.  The first step hands the second step the results of its IP Command.  The second step runs using the mutated image ''not the original image''.  The second step then hands its result to the third step and so on and so on.
|}
</tab>
<tab name="Auto Border Crop" style="margin:20px">
For this IP Profile, the first step runs the "Auto Border Crop" command.  It crops the image, removing its border.
 
In this case, the border was actually part of the document.  Usually, borders appear around documents because of how they were scanned.  However, the goal is still the same.  Remove superfluous, non-text pixels interfering with the OCR operation.
 
{|cellpadding="10" style="margin:auto"
|-style="text-align:center"
|Before||After
|-
|[[file:ip profile 3.png|400px]]||[[file:ip profile 2.png|400px]]
|}
</tab>
<tab name="Binarize" style="margin:20px">
The second step runs the "Binarize" command.  The Binarize command turns the image black and white.
 
OCR requires a black and white image to analyze pixels and segment them out into characters.  It needs a binary representation of the image:  "Is text" or "is not text".  While [[OCR Engine]]s will binarize an image as part of their pre-processing phase, doing so in an IP Profile gives you control over how that operation is done.  There is more than one way to binarize an image, and you have no control over it if you let the OCR engine do it for you.
 
{|cellpadding="10" style="margin:auto"
|-style="text-align:center"
|Before||After
|-
|[[file:ip profile 2.png|400px]]||[[file:ip profile 4.png|border|400px]]
|}
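To make "more than one way to binarize" concrete, here is a minimal Python sketch comparing two generic thresholding strategies.  Neither is claimed to be the method used by Grooper's Binarize command or by any OCR engine; the function names and the sample pixel values are assumptions for illustration.

```python
def binarize_fixed(gray, threshold=128):
    """Fixed threshold: anything darker than the threshold is text (0), else background (1)."""
    return [[0 if px < threshold else 1 for px in row] for row in gray]


def binarize_mean(gray):
    """Mean threshold: anything darker than the image's average brightness is text (0)."""
    flat = [px for row in gray for px in row]
    mean = sum(flat) / len(flat)
    return [[0 if px < mean else 1 for px in row] for row in gray]


# A faded scan: the "text" column is light gray (150) on a lighter background (220).
faded = [
    [220, 150, 220],
    [220, 150, 220],
]

fixed_result = binarize_fixed(faded)
mean_result = binarize_mean(faded)
```

On this faded sample the fixed threshold turns the whole image white and loses the text, while the mean-based threshold keeps it.  Which strategy works best depends on the image, which is why controlling the choice can matter.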
</tab>
<tab name="Shape Removal" style="margin:auto">
The next step runs the "Shape Removal" command.  Shape Removal takes trained examples of shapes, in this case the company logo, locates them on a document and removes them.
 
While probably not strictly necessary for this example, removing the shape gives the OCR engine one less thing to look at when segmenting out characters and figuring out which characters the pixels represent.
 
{|cellpadding="10" style="margin:auto"
|-style="text-align:center"
|Before||A dropout mask is created for<br />the detected sample shape.||After
|-
|[[file:ip profile 4.png|border|300px]]||[[file:ip profile 5.png|border|300px]]||[[file:ip profile 6.png|border|300px]]
|}
</tab>
<tab name="Line Removal, Speck Removal, and Blob Removal" style="margin:20px">
The last three commands are Line Removal, Speck Removal, and Blob Removal.  These three commands find three different types of pixel artifacts (lines, specks and blobs) and remove them, further isolating pixels that are only text characters.
 
{|cellpadding="10" style="margin:auto"
|-style="text-align:center"
|colspan=3|'''Line Removal'''
|-style="text-align:center"
|Before||A dropout mask is created<br />for detected lines.||After
|-
|[[file:ip profile 6.png|border|300px]]||[[file:ip profile 7.png|border|300px]]||[[file:ip profile 8.png|border|300px]]
|}
 
{|cellpadding="10" style="margin:auto"
|-style="text-align:center"
|colspan=3|'''Speck Removal'''
|-style="text-align:center"
|Before||A dropout mask is created for small specks.<br />Here, mostly getting rid of dotted lines.||After
|-
|[[file:ip profile 8.png|border|300px]]||[[file:ip profile 9.png|border|300px]]||[[file:ip profile 10.png|border|300px]]
|}
 
 
{|cellpadding="10" style="margin:auto"
|-style="text-align:center"
|colspan=3|'''Blob Removal'''
|-style="text-align:center"
|Before||A dropout mask is created for<br />detected blobs of a defined size.||After
|-
|[[file:ip profile 10.png|border|300px]]||[[file:ip profile 11.png|border|300px]]||[[file:ip profile 12.png|border|300px]]
|}
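The idea behind speck removal can be sketched as "find small connected groups of ink pixels and erase them".  The Python below is a generic connected-component sketch, not Grooper's Speck Removal implementation; the function name and the `max_size` parameter are assumptions.

```python
def remove_specks(image, max_size=1):
    """Clear 4-connected groups of ink pixels (1s) containing at most max_size pixels."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood fill to collect one connected component of ink pixels.
            stack, component = [(y, x)], []
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            # Small components are treated as specks and erased.
            if len(component) <= max_size:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


noisy = [
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
]
cleaned = remove_specks(noisy, max_size=1)
```

The two isolated pixels are erased while the 2x2 block survives.  A blob removal step could use the same approach with a size range chosen for larger artifacts.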
</tab>
<tab name="Results" style="margin:20px">
You are left with an image that the [[OCR Engine]] can much more easily break up into line, word, and finally character segments, vastly improving the accuracy of character recognition.
 
{|style="margin:auto" cellpadding="10"
|-style="text-align:center"
|A portion of OCR results without applying the IP Profile||The same results with the IP Profile applied.
|-
|[[file:ip profile 13.png|border|450px]]||[[file:ip profile 14.png|border|450px]]
|}
</tab>
</tabs>
 
 
However, for the example above, the IP Profile's result is drastically different from the original image.  While it certainly helps the OCR result, it's likely, at the end of the process, you want to export a document that looks more like the "before" picture than the "after".  Luckily, Image Processing can be performed in two ways:
 
# '''Permanent''' for archival purposes.
# '''Temporary''' for OCR cleanup.
 
=== Permanent ===
 
Permanent Image Processing is done via the Image Processing activity.  It is, as the name implies, a permanent alteration of the document's image.  The Image Processing activity will reference an IP Profile and permanently apply its IP Commands to the document images.  Once that image is changed, it is changed for the remainder of the Batch Process.  There is no going back!  IP Profiles used by the Image Processing activity should only use commands acceptable for final export.
 
The three categories of most commonly used IP Commands for permanent cleanup are:
 
* Border Cleanup - These commands clean up border artifacts around an image by cropping the image or filling in the border with a given color.
* Color Adjustment - These commands adjust the color values of the image, including brightness, color saturation, and contrast.
* Image Transforms - These commands change the image's size and orientation.
 
Of the commands in those categories, there are one or two that are particularly common.
 
 
==== Border Cleanup ====
 
Auto Border Crop, Border Fill
 
==== Color Adjustment ====
 
Brightness Contrast, Contrast Stretch
 
==== Image Transforms ====
 
Auto Deskew, Auto Orient
 
 
=== Temporary ===
 
Temp text
 
== OCR Synthesis ==
 
Synthesis is Grooper's unique approach to getting better results from an OCR Engine.  Using Synthesis, portions of the document can be OCR'd independently from the full text OCR, portions of the image dropped out from the first OCR pass can be re-run, and certain results can be reprocessed.  The results from the Synthesis operation then get combined with the full text OCR results from the OCR Engine into a single text flow.
 
Synthesis is a collection of five separate OCR processing operations:
 
* Bound Region Processing
* Iterative Processing
* Cell Validation
* Segment Reprocessing
* Font Pitch Detection
 
As separate operations, the user can choose to enable all five operations, choose to use only one, or any combination.  Synthesis is enabled on OCR Profiles, using the "Synthesis" property.  This property is enabled by default on OCR Profiles (and can be disabled if you so choose).  However, each Synthesis operation needs to be configured independently in order to function.
 
 
[[File:Ocr synthesis 1.png|center|900px]]
 
 
[[File:Ocr synthesis 2.png|center|900px]]
 
 
The general idea behind each of these operations is to increase the accuracy of OCR results by narrowing the OCR Engine's "field of vision".  In general, the less the OCR Engine has to look at, the better the results will be.  Rather than expecting the OCR Engine to get highly specific character accuracy by looking at the whole image, each operation breaks the image up in some way, allowing the OCR Engine to focus on only a portion of it.  The accuracy for that portion is then increased and the results are "synthesized" into a final, more accurate, result.
 
=== Bound Region Processing ===
 
[[file:bound region 1.png|frame|Bound Region Processing performs OCR on bound regions independently, such as text inside cells in a table]]
 
Bound Region Processing performs OCR independently on text inside regions fully enclosed within lines.  In other words, it processes text inside a box separately from the full page OCR.  This vastly improves the OCR results for text inside tables or a complex line structure.  By limiting OCR to just what is inside the box, the rest of the content on the page is not competing for the OCR Engine's attention, ultimately improving the result.
 
It does change how OCR runs quite a bit.  Bound Region Processing actually runs ''before'' full page OCR.  The order of operations is as follows:
 
1) Bound Region Detection - First, boxes are identified on the page.
* Box size can be configured using Bound Region Processing's properties.  There are also options to merge boxes of the same height and to ignore boxes that span across the entire width of the page.  Since each box is OCR'd independently, this can reduce the number of total OCR operations, which will reduce the time it takes for Bound Region Processing to run.
* Bound Region Detection works from the original image, not an IP image (if created using the OCR Profile's "IP Profile" property).  So, it will ignore any Line Removal command applied during the temporary image pre-processing.
 
2) Bound Region OCR - After bound regions are identified, text within each bound region is OCR'd.
* Each region is OCR'd independently.  If there are ten boxes, there will be ten OCR operations, one for each box.
 
3) Bound Region Dropout - Since the contents of these regions have been OCR'd, these pixels are removed from the image used for full page OCR.  Grooper already has the text for these regions.
* Bound Region Processing is a one-two punch of OCR accuracy.  Not only does it improve the accuracy of text inside bound regions, it can also increase the accuracy of text ''outside'' bound regions.  Just like the rest of the image can interfere with the accuracy of OCR'd text inside the boxes, the boxes and the text inside them can interfere with OCR'ing the other text on the page.  Dropping the bound regions can give a bonus accuracy boost to the rest of the document.
 
4) Full Page OCR - The OCR Engine then runs on the resulting image, grabbing the rest of the text from the image.
 
5) Synthesis - Finally, the two results (the results inside bound regions + the results outside the bound regions) are merged together into a single text flow.
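The order of operations above can be sketched in Python with a stand-in for the OCR engine.  Everything here is hypothetical: the `Box` and `Word` types, the `fake_ocr` function and the top-to-bottom, left-to-right merge order are assumptions used to show the flow, not Grooper's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x, y):
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class Word:
    text: str
    x: float
    y: float


def fake_ocr(words):
    """Stand-in for one OCR engine call: returns the words it was shown."""
    return list(words)


def bound_region_ocr(page_words, boxes):
    results, remaining = [], list(page_words)
    for box in boxes:
        # Bound Region OCR: one independent OCR operation per detected box.
        inside = [w for w in remaining if box.contains(w.x, w.y)]
        results.extend(fake_ocr(inside))
        # Bound Region Dropout: remove what was just read from the page.
        remaining = [w for w in remaining if w not in inside]
    # Full Page OCR runs on whatever is left after dropout.
    results.extend(fake_ocr(remaining))
    # Synthesis: merge both result sets into one flow (assumed reading order).
    return sorted(results, key=lambda w: (w.y, w.x))


words = [Word("Total", 10, 100), Word("42", 60, 100), Word("Invoice", 10, 10)]
boxes = [Box(50, 90, 80, 110)]
text_flow = [w.text for w in bound_region_ocr(words, boxes)]
```

The word inside the box is read on its own, dropped from the page, and then rejoined with the full page results in a single flow.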
 
==== Configuring Bound Region Processing ====
 
<tabs style="margin:20px">
<tab name="Enable Bound Region Processing" style="margin:20px">
With Synthesis enabled, Bound Region Processing is a configurable property on OCR Profiles. 
 
Selecting an OCR Profile, navigate to the "Bound Region Processing" property and change it from "Disabled" to "Enabled".
 
 
[[File:Bound region 2.png|900px]]
 
</tab>
<tab name="Verify It's Working" style="margin:20px">
For the majority of cases, Bound Region Processing will successfully detect bound regions using the default properties.  You can verify this using the "OCR Testing Tab".
 
 
[[File:Bound region 3.png|900px]]
 
 
Select a page from the Test Batch and press the "OCR Page" button.  This will perform OCR on only the selected page to test out the OCR profile.
 
 
[[File:Bound region 4.png|900px]]
 
 
After you press the "OCR Page" button, a new tab will appear underneath the image, the "Diagnostics" tab.  This tab has several images related to the OCR operation.  If Bound Region Processing was successful, you will see a "Bounded Regions" image.  Select that image.  All bounded regions will be highlighted in green and outlined in blue.
 
 
[[File:Bound region 5.png|900px]]
 
 
Also, FYI, the "Main OCR Input" image shows the original image with the text OCR'd from bound region processing dropped out.  This is what will be handed to the OCR Engine for full page OCR.  (Furthermore, if we ran a temporary IP Profile on this OCR Profile, we could easily get rid of those table lines as well, further increasing the efficacy of the OCR operation.)
 
 
[[File:Bound region 6.png|900px]]
 
</tab>
<tab name="Configure Properties (If Necessary)" style="margin:20px">
Bound Region Processing has several properties you can configure if necessary.  You can reveal its properties by double clicking "Bound Region Processing" in the OCR Profile, or pressing the caret button to the left of the property.
 
 
[[File:Bound region 7.png|900px]]
 
 
==== Properties affecting box size and detection ====
 
{|cellpadding=10 cellspacing=5 style="margin:auto"
|-style="background-color:#36B0A7; color:white"
|style="width:17%"|'''Property'''
|style="width:17%"|'''Default Value'''
|'''Information'''
|-style="background-color:#ddf5f5"
|Minimum Size||6pt||This setting controls the minimum width or height of a box.  With the default value, the smallest box that can be detected is 6 pt wide by 1 pt high or 1 pt wide by 6 pt high.  That is a fairly small box.  If Bound Region Processing is detecting bound regions that aren't boxes, you may find it useful to increase the size of this property.
|-style="background-color:#ddf5f5"
|Minimum Area||12pt||This setting controls the minimum area of a box.  This works in combination with the "Minimum Size" property to control which boxes are detected.  So, even though the Minimum Size default is 6 pt, Bound Region Processing won't actually detect a 6 pt wide by 1 pt high box, because its area is only 6 pt (6 x 1 = 6 and 6 < 12).  Similarly, this property is helpful to narrow down which bound regions should be included in the Bound Region Processing operation.
|-style="background-color:#ddf5f5"
|Maximum Width Ratio||75%||This property controls the maximum width of a single box based on its size relative to the whole page.  At 75%, a single box will not be detected if it is wider than three quarters of the width of the whole page.  If you want to detect boxes of any width, even if they span the full width of the page, set this property to 100%.
|-style="background-color:#ddf5f5"
|Maximum Height||1in||Here, you can limit the maximum height of a box.  If you wish to detect boxes of any height, change this property to "0".  This property also interacts with the "Always Allow Landscape" property.  See below for important information on how they interact.
|-style="background-color:#ddf5f5"
|Always Allow Landscape||True||By default, boxes that are longer than they are high (having a "landscape" instead of "portrait" orientation) are ''exempted'' from exclusion if they are higher than the "Maximum Height" value.  Only boxes that are narrower than they are high ("portrait" oriented boxes) will be excluded from Bound Region Processing.  If you are attempting to remove boxes that are longer than they are high from processing using the "Maximum Height" value, set this property to "False".
|-style="background-color:#ddf5f5"
|Maximum Count||0||With this property set to "0" there will be no limit to the number of boxes detected.  If you do enter a maximum count value, bound region detection will stop once it finds one less than the maximum value (for example, if you enter a Maximum Count of "10" and there are 11 boxes on the page, only 9 bound regions will be detected).
|}
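The way the size-related properties combine can be sketched as a single filter function in Python.  This is an interpretation of the table above, not Grooper's code: the class and function names are hypothetical, "Minimum Size" is read as "at least one side must reach this length", and Maximum Count is left out.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72


@dataclass(frozen=True)
class DetectionSettings:
    """Defaults mirror the table above; names are hypothetical."""
    minimum_size: float = 6.0                  # pt, longest side must reach this
    minimum_area: float = 12.0                 # width x height, in points
    maximum_width_ratio: float = 0.75          # fraction of the page width
    maximum_height: float = 1.0 * POINTS_PER_INCH  # 1in; 0 means no limit
    always_allow_landscape: bool = True


def is_candidate(width, height, page_width, settings=None):
    """Return True if a box of the given size (in points) passes every size check."""
    s = settings or DetectionSettings()
    if max(width, height) < s.minimum_size:
        return False
    if width * height < s.minimum_area:
        return False
    if width > s.maximum_width_ratio * page_width:
        return False
    if s.maximum_height and height > s.maximum_height:
        landscape = width > height
        # Landscape boxes are exempt from the height limit when allowed.
        if not (landscape and s.always_allow_landscape):
            return False
    return True


PAGE_WIDTH = 8.5 * POINTS_PER_INCH  # a letter-size page, 612 pt wide

tiny_box = is_candidate(6, 1, PAGE_WIDTH)        # area 6 is below 12
tall_landscape = is_candidate(200, 100, PAGE_WIDTH)
strict = DetectionSettings(always_allow_landscape=False)
tall_landscape_strict = is_candidate(200, 100, PAGE_WIDTH, strict)
```

With the defaults, the 6 pt by 1 pt box is rejected on area alone, and the 200 pt by 100 pt box is kept only because landscape boxes are exempt from the height limit.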
 
==== The Merge Regions property ====
 
The Merge Regions property does not have to do with how regions are detected, but instead how those regions are processed.  When enabled, it will merge adjacent boxes next to each other on a horizontal line as long as they are the same height.  Furthermore, they must themselves meet a height requirement in order to be merged, set by the "Maximum Merge Height" property.
 
This can speed up the time it takes Bound Region Processing to run by lowering the number of total OCR operations.  However, this does have the potential of negatively impacting the accuracy of the results in each cell.  Whether or not you choose to use this property will mostly depend on if you need to value the speed of the OCR operation over its accuracy.  This property is enabled by default.  You will need to disable it in order to see if it impacts the accuracy of Bound Region Processing.
 
[[File:Bound region 8.png|center]]
 
{|cellpadding=10 cellspacing=5 style="margin:auto"
|-style="background-color:#36B0A7; color:white"
|style="width:17%"|'''Property'''
|style="width:17%"|'''Default Value'''
|'''Information'''
|-style="background-color:#ddf5f5"
|Merge Regions||True||This setting controls whether or not adjacent boxes of the same height are merged together.
|-style="background-color:#ddf5f5"
|Maximum Merge Height||14pt||Adjacent boxes of the same height ''smaller'' than this value will be merged together.  If you wish to ignore a maximum merge height, merging all boxes of the same height on the same line regardless of size, enter "0" here.
|}
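A minimal Python sketch of the merge rule described above, assuming boxes are `(left, top, right, bottom)` tuples in points.  The function name and the `gap_tolerance` parameter are assumptions, and "adjacent" is interpreted as "the next box starts where the previous one ends".

```python
def merge_regions(boxes, maximum_merge_height=14.0, gap_tolerance=0.5):
    """Merge side-by-side boxes that share the same top and bottom edges."""
    merged = []
    # Walk the boxes row by row, left to right.
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        if merged:
            last = merged[-1]
            same_row = last[1] == box[1] and last[3] == box[3]
            height = box[3] - box[1]
            # A maximum merge height of 0 means "no height limit".
            short_enough = maximum_merge_height == 0 or height < maximum_merge_height
            adjacent = abs(box[0] - last[2]) <= gap_tolerance
            if same_row and short_enough and adjacent:
                merged[-1] = (last[0], last[1], box[2], last[3])
                continue
        merged.append(box)
    return merged


table_cells = [
    (0, 0, 50, 12), (50, 0, 100, 12), (100, 0, 150, 12),  # 12 pt high row
    (0, 20, 50, 40), (50, 20, 100, 40),                   # 20 pt high row
]
default_merge = merge_regions(table_cells)
unlimited_merge = merge_regions(table_cells, maximum_merge_height=0)
```

With the default limit, five boxes become three OCR operations (the short row is merged, the tall row is not).  With the limit set to 0, they become two.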
 
</tab>
</tabs>
 
=== Iterative Processing ===
 
Iterative Processing improves the OCR operation by performing a second pass at OCR.  After the OCR Engine performs full page OCR, characters recognized from the first pass are digitally dropped out.  Then, a second OCR pass is run on the resulting image.  This way, characters that were skipped in the first pass can be isolated and recognized separately.  The results are then merged with the OCR results from the first pass.
 
{|style="margin:auto" cellpadding="10"
|-style="text-align:center"
|First, OCR runs on the full page.
|style="width:360px"|Recognized characters are digitally removed. A second OCR pass runs on the remaining portion of the image.
|-
|[[file:iterative ocr 1.png|border]]||[[file:iterative ocr 2.png|border]]
|}
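The two-pass flow can be sketched in Python with a toy engine.  The engine here is purely hypothetical: it is assumed to read only the largest glyphs it is shown, which is just a convenient way to make a first pass skip something.

```python
def toy_engine(glyphs):
    """Hypothetical engine: only recognizes glyphs of the largest size it sees."""
    largest = max(size for _, size in glyphs.values())
    return {pos: char for pos, (char, size) in glyphs.items() if size == largest}


def iterative_ocr(glyphs, engine, iterations=2):
    """Run OCR, drop out what was recognized, and run again on the remainder."""
    recognized = {}
    remaining = dict(glyphs)
    for _ in range(iterations):
        if not remaining:
            break  # nothing left on the image, so the next pass is skipped
        found = engine(remaining)
        recognized.update(found)
        # Dropout: remove recognized characters before the next pass.
        remaining = {pos: g for pos, g in remaining.items() if pos not in found}
    # Merge all passes into one text flow, ordered by position.
    return "".join(recognized[pos] for pos in sorted(recognized))


# Position -> (character, font size); the small "b" is skipped on the first pass.
page = {0: ("A", 12), 1: ("b", 6), 2: ("C", 12)}
one_pass = iterative_ocr(page, toy_engine, iterations=1)
two_pass = iterative_ocr(page, toy_engine, iterations=2)
```

One iteration misses the small character, while the second pass picks it up once the larger characters have been dropped out.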
 
==== Configuring Iterative Processing ====
 
<tabs style="margin:20px">
<tab name="Enable Iterative Processing" style="margin:20px">
With Synthesis enabled, "OCR Iterations" is a configurable property on OCR Profiles. 
 
Selecting an OCR Profile, navigate to the "OCR Iterations" property and change it from "1" to "2".  There can be a maximum of two OCR iterations.  Changing this property to "2" enables the second pass if the first pass skips over any characters.
 
Also note "OCR Iterations" does not have any configurable properties of its own.  It is either enabled or disabled with no further configuration necessary.
 
 
[[File:iterative ocr 3.png|900px]]
 
</tab>
<tab name="Verify It's Working" style="margin:20px">
You can verify if a second pass was run using the "OCR Testing Tab".
 
 
[[File:iterative ocr 4.png|900px]]
 
 
Select a page from the Test Batch and press the "OCR Page" button.  This will perform OCR on only the selected page to test out the OCR profile.
 
 
[[File:iterative ocr 5.png|900px]]
 
 
After you press the "OCR Page" button, a new tab will appear underneath the image, the "Diagnostics" tab.  This tab has several images related to the OCR operation.  If Iterative Processing was successful, you will see an "IP Image", which is the first full text OCR iteration, and a "Second Iteration" image, which is the image used for the second pass, with all the previously recognized characters digitally dropped out.
 
 
[[File:iterative ocr 6.png|900px]]
 
 
[[File:iterative ocr 7.png|900px]]
 
 
{|cellpadding="10" cellspacing="5"
|-style="background-color:#f89420; color:white"
|style="font-size:14pt"|'''!'''||A second OCR pass is only done if portions of the image are not assigned a text character.  If all characters are recognized in the first pass, the second pass will just be given a blank image, and the second pass won't run.  In these cases, even with "OCR Iterations" enabled, you will not see a "Second Iteration" diagnostic image.
|}
 
</tab>
</tabs>
 
=== Cell Validation ===
 
[[File:Cell val 1.png|frame|Documents with columns greatly benefit from cell validation.  It allows each column to be OCR'd independently, greatly improving the results.]]
 
Cell Validation is particularly helpful for documents that have a columnar structure.

Latest revision as of 11:41, 28 August 2024

This article was migrated from an older version and has not been updated for the current version of Grooper.

This tag will be removed upon article review and update.

This article is about the current version of Grooper.

Note that some content may still need to be updated.


OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized, in order to be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine readable, encoded text.

This conversion allows Grooper to search text characters from the image, providing the capability to process these documents and the information they contain.

The Grooper activity that performs OCR is Recognize. Including a Recognize step in your Batch Process will allow you to OCR image-based content.

About

The quick explanation of OCR is it analyzes pixels on an image and translates those pixels into text. Most importantly, it translates pixels into machine readable text. Grooper can be described as a document modeling platform. You use the platform to model how pages are separated out into documents, how one document gets put into one category or another, and how extractable data is structured on the document. Once you have this model of what a document is, how it fits into a larger document set, and where the data is on it, you can use it to programmatically process any document that fits the model.

In order to do any of that, you have to be able to read the text on the page.

In a general sense, documents exist to communicate information to the reader. As human beings, we understand this information through the simple act of reading them, understanding the language they are written in. And we can separate loose pages into documents, differentiate between different types of documents, and parse the information on them. Grooper is ultimately going to do something very similar, using language as the fundamental unit of information, and regular expression as a way to parse that information. However, before we can get to that point, Grooper (or any other software) doesn't know the difference between a bunch of pixels and a bunch of text characters.

Once OCR is performed, Grooper will have a set of machine readable characters it can work with, instead of just a bunch of pixels. How do you know an invoice document is an invoice? A simple way could be locating the word "invoice" (or other text associated with the invoice). You, as a human, do this by looking at the ink on a page (or pixels for a digital document) and reading the word "invoice". Grooper does this by using a Data Extractor (and regular expression) to read the machine readable text for the page. OCR is how each page gets that machine readable text in order to model the document set and process it.

The General Process: How does Grooper OCR documents?

In Grooper, OCR is performed by the Recognize activity, referencing an OCR Profile which contains all the settings to get the OCR results, including which OCR Engine is used. The OCR Profile also has settings to optionally process those results to increase the accuracy of the OCR Engine used. The general process of OCR'ing a document is as follows in Grooper:

1) The document image is handed to the Recognize activity, which references an OCR Profile, containing the settings to perform the OCR operation.

2) The OCR Engine (set on the OCR Profile) converts the pixels on the image into machine readable text for the full page.

3) Grooper reprocesses the OCR Engine's results and runs additional OCR passes using the OCR Profile's Synthesis properties.

4) The raw OCR results from the OCR Engine and Grooper's Synthesis results are combined into a single text flow.

5) Undesirable results can be filtered out using Grooper's Results Filtering options.

The Recognize activity is handed the document image and performs OCR.
The results are seen here in a text flow.
The results are seen here in a "Layout View". Using the character positions and font sizes obtained during OCR, the results are overlaid where they are on the document.

OCR vs. Native Text

OCR gets text specifically from images, whether they were printed and scanned or imported from a digital source. However, what if the document was created digitally and imported in its original digital form? Wouldn't it have been created on a computer, using machine readable text? Most likely, yes! If a form was created using a product like Adobe Acrobat and filled in using a computer, the text comprising the document and the filled fields is encoded within the document itself. This is called "Native Text". This text is already machine readable. So there is no reason to OCR the document. Instead, the native text is extracted via Grooper's native text extraction.

Native text has a number of advantages over OCR. OCR is not perfect. As you will see, OCR is a fairly complicated process with a number of opportunities to misread a document. Grooper has plenty of advancements to get around these errors and produce a better result, but OCR will rarely be as accurate as the original native text from a digital document.

However, be careful. Just because a PDF document has machine readable text behind it does not mean that text is native text. If the document was OCR'd by a different platform, the text may have been inserted into the PDF afterward (Grooper also has this capability when exporting documents). In these cases, we still recommend OCR'ing the document to take advantage of Grooper's superior OCR capabilities and get a more accurate result.

Regardless of whether machine readable text is obtained through OCR or Native Text Extraction, both are done via the Recognize activity. In the case of OCR, you will need to create an OCR Profile containing all the settings to perform OCR and reference it during the Recognize activity. Native Text Extraction is enabled by default, but can be disabled if you wish to use OCR instead.
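The choice between the two text sources can be sketched as a simple decision. This Python example is a rough illustration, not how the Recognize activity is implemented. The function `get_page_text` and its parameters are hypothetical, and falling back to OCR when a page has no native text is an assumption of this sketch.

```python
from typing import Callable, Optional


def get_page_text(
    native_text: Optional[str],
    run_ocr: Callable[[], str],
    use_native_text: bool = True,
) -> str:
    """Prefer native text when it is enabled and present; otherwise run OCR."""
    if use_native_text and native_text:
        return native_text
    # Assumed fallback: no usable native text, or native text was disabled.
    return run_ocr()


def fake_ocr() -> str:
    # Hypothetical OCR result with a typical misread ("l" for "I").
    return "lnvoice 42"
```

Setting `use_native_text=False` mirrors the case described above, where a PDF carries text from another platform's OCR and you would rather re-run OCR than trust it.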


What is an OCR Engine?

OCR Engines are software applications that perform the actual recognition of characters on images, analyzing the pixels on the image and figuring out what text characters they match.

OCR Engines themselves have four phases:

1) Pre-Processing: In this phase, the OCR engine prepares the image to be read by turning color and grayscale images to black and white and potentially removing artifacts getting in the way of OCR, such as specks and lines.

2) Segmenting: Next, text pixels are broken up (or "segmented") into lines, then individual words and finally characters.

3) Character Recognition: Here, the OCR Engine takes those pixel character segments, compares them to examples of character glyphs, and makes a decision about which machine readable text character matches the segment.

4) Post-Processing: Commercial OCR Engines also analyze the OCR results and attempt to correct inaccurate results, such as performing basic spellchecking.
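The four phases can be illustrated with a toy example. This Python sketch is a deliberately simplified model and does not represent how Transym, Tesseract, or any other real engine works. All function names, the tiny glyph templates, and the one-character lexicon correction are assumptions chosen to keep the example small.

```python
from typing import Dict, List

Bitmap = List[List[int]]  # rows of pixel values


def binarize(gray: Bitmap, threshold: int = 128) -> Bitmap:
    """Phase 1: dark pixels become 1 (ink), light pixels become 0."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def segment_columns(bw: Bitmap) -> List[Bitmap]:
    """Phase 2: split one text line into character segments at blank columns."""
    width = len(bw[0])
    segments: List[Bitmap] = []
    start = None
    for x in range(width + 1):
        has_ink = x < width and any(row[x] for row in bw)
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            segments.append([row[start:x] for row in bw])
            start = None
    return segments


def recognize_glyph(segment: Bitmap, templates: Dict[str, Bitmap]) -> str:
    """Phase 3: pick the template with the fewest mismatched pixels."""
    def distance(a: Bitmap, b: Bitmap) -> float:
        if len(a) != len(b) or len(a[0]) != len(b[0]):
            return float("inf")  # different size: treat as no match
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return min(templates, key=lambda ch: distance(segment, templates[ch]))


def post_process(word: str, lexicon: List[str]) -> str:
    """Phase 4: snap to a same-length lexicon word differing by at most one character."""
    for candidate in lexicon:
        if len(candidate) == len(word):
            if sum(a != b for a, b in zip(word, candidate)) <= 1:
                return candidate
    return word


# Toy glyph templates, 3 pixels tall.
TEMPLATES: Dict[str, Bitmap] = {
    "I": [[1], [1], [1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}

# A grayscale image of "IT": 0 is black ink, 255 is white paper.
gray_image: Bitmap = [
    [0, 255, 0, 0, 0],
    [0, 255, 255, 0, 255],
    [0, 255, 255, 0, 255],
]

segments = segment_columns(binarize(gray_image))
word = "".join(recognize_glyph(s, TEMPLATES) for s in segments)
```

Real engines deal with skew, touching characters, and thousands of glyph shapes, which is where the misreads discussed later in this article come from.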

For more in-depth information on how OCR Engines work, visit the OCR Engine article.

The Transym 4 and Transym 5 OCR engines are typically included in Grooper's licensing. Transym OCR 4 provides highly accurate English-only OCR, while Transym OCR 5 provides multi-language OCR for 38 different languages. Google's open source Tesseract engine is available in Grooper version 2.72 and beyond. Microsoft's Azure OCR is also supported but requires separate licensing.

Image Processing and OCR

Regardless of how good an OCR Engine is, OCR is very rarely perfect. Characters can be segmented out of words incorrectly. Artifacts such as table lines, check boxes or even just specks from image noise can interfere with character segmenting and character recognition. Even when characters are segmented correctly, the OCR Engine's character recognition can make the wrong decision about what the character is.

Image Processing (often abbreviated as "IP") can assist the OCR operation by providing a "cleaner" image to the OCR Engine. The general idea is to give the OCR engine just the text pixels, so that is all the engine needs to process.

This image is much easier for OCR to process... ...than this image.


Images are altered using an IP Profile, which contains a step by step list of IP Commands, each of which performs a specific alteration to the image. IP Profiles are highly configurable. There are multiple different IP Commands, each of which has its own configurable properties as well. In the example above, the image was altered using an IP Profile with six steps, each step containing a different IP Command.

However, for the example above, the IP Profile's result is drastically different from the original image. While it certainly helps the OCR result, it's likely, at the end of the process, you want to export a document that looks more like the "before" picture than the "after". Luckily, Image Processing can be performed in two ways:

  1. Permanent for archival purposes.
  2. Temporary for non-destructive OCR cleanup.
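The difference between the two modes can be sketched in a few lines of Python. This is only an illustration of the idea, not Grooper's implementation. The functions `apply_ip_profile`, `ocr_with_temporary_cleanup`, and `permanent_cleanup`, the `threshold` command, and the dictionary used as a "page" are all hypothetical.

```python
from typing import Callable, Dict, List

Image = List[List[int]]                 # rows of grayscale pixel values
IpCommand = Callable[[Image], Image]    # one image alteration step


def apply_ip_profile(image: Image, commands: List[IpCommand]) -> Image:
    """Run each command in order on a copy, leaving the input image alone."""
    result = [row[:] for row in image]
    for command in commands:
        result = command(result)
    return result


def ocr_with_temporary_cleanup(
    page: Dict[str, Image],
    commands: List[IpCommand],
    engine: Callable[[Image], str],
) -> str:
    """Temporary: the engine sees the cleaned image; the stored image is unchanged."""
    cleaned = apply_ip_profile(page["image"], commands)
    return engine(cleaned)


def permanent_cleanup(page: Dict[str, Image], commands: List[IpCommand]) -> None:
    """Permanent: the stored image itself is replaced by the cleaned version."""
    page["image"] = apply_ip_profile(page["image"], commands)


def threshold(image: Image) -> Image:
    # Example command: force every pixel to pure black (0) or pure white (255).
    return [[0 if px < 128 else 255 for px in row] for row in image]


def fake_engine(image: Image) -> str:
    # Stand-in for an OCR engine: just reports how many ink pixels it saw.
    ink = sum(px == 0 for row in image for px in row)
    return f"{ink} ink pixels"


page = {"image": [[10, 200], [130, 90]]}
temporary_result = ocr_with_temporary_cleanup(page, [threshold], fake_engine)
image_after_temporary = [row[:] for row in page["image"]]
permanent_cleanup(page, [threshold])
```

After the temporary pass the page still holds its original pixels. Only the permanent call changes what would later be exported.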

For more information on Image Processing, visit the Image Processing article.

OCR Synthesis

The Synthesis functionality is Grooper's unique method of pre-processing and re-processing raw results from the OCR engine to get better results out of it. Using Synthesis, portions of the document can be OCR'd independently from the full text OCR, portions of the image dropped out from the first OCR pass can be re-run, and certain results can be reprocessed. The results from the Synthesis operation then get combined with the full text OCR results from the OCR Engine into a single text flow.

Synthesis is a collection of five separate OCR processing operations:

  • Font Pitch Detection
  • Bound Region Processing
  • Iterative Processing
  • Cell Validation
  • Segment Reprocessing

As separate operations, the user can choose to enable all five operations, only one, or any combination. Synthesis is enabled on OCR Profiles using the "Synthesis" property. This property is enabled by default on OCR Profiles (and can be disabled if you so choose). However, each Synthesis operation needs to be configured independently in order to function.

For more information on each operation, visit the Synthesis article.

How do you configure OCR settings? - The OCR Profile

Now that we've talked a little bit about OCR in general, OCR engines, and some additional considerations such as image processing and Grooper's Synthesis settings, how do we tell Grooper to apply all of them? That is done by creating and configuring an OCR Profile.

An OCR Profile is an object created in Grooper to store various settings controlling how OCR is performed.

This includes:

  • Setting which OCR Engine is used
  • Determining whether a temporary IP Profile is used for image cleanup before the OCR engine runs
  • Grooper's unique Synthesis settings
    • Determining if and how multiple OCR results are pre-processed and re-processed
  • If and how results are filtered to toss out undesirable results
  • Any configurable settings available from the OCR Engine


Below you will see one of the default OCR Profiles that ship with Grooper named "Full Text - Accurate", with these settings highlighted in each tab.

Here, you will list which OCR Engine will perform character recognition.

This OCR Profile is set to Transym OCR 4, using the Transym 4.0 OCR software to recognize characters.

One of the things that sets Grooper apart from other document processing platforms is the high degree of configuration options when it comes to image processing. The basic idea, here, is to give the OCR engine a "cleaned up" version of the document to use for OCR. When configured on an OCR Profile this is "temporary" in that the archival version of the document is not changed. Once OCR is finished, the document will revert to its original form. The image will only be altered for the purposes of obtaining OCR results.

These image processing settings are defined with a different type of profile called an IP Profile, which is then referenced by the OCR Profile's IP Profile property.

This OCR Profile uses a pre-built IP Profile called "OCR Cleanup".

Another thing that sets Grooper apart when it comes to OCR is our suite of Synthesis operations. These are different capabilities Grooper has to pre-process and re-process OCR results to improve the OCR engine's results.

This OCR Profile uses a variety of these Synthesis properties, all of which are highlighted in yellow. To learn more about this suite of properties, what they do, how they improve OCR results, and how to configure them, visit the Synthesis article.

The Result Filtering settings allow you to isolate certain characters and remove them from your results. Maybe you want to discard any characters that do not meet a minimum confidence score. Maybe you want to discard all characters below a certain font size. Maybe you want to discard all characters within a certain distance to the edge of the page. You can do those things (and more) using these Result Filtering settings.
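The three filters mentioned above can be sketched as follows. This Python example is a rough approximation and not Grooper's Result Filtering implementation. The `OcrChar` fields, the `filter_results` function, its parameter names, and the units are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OcrChar:
    text: str
    confidence: float  # 0.0 to 1.0
    font_size: float   # in points
    x: float           # horizontal position on the page
    y: float           # vertical position on the page


def filter_results(
    chars: List[OcrChar],
    page_width: float,
    page_height: float,
    min_confidence: float = 0.0,
    min_font_size: float = 0.0,
    edge_margin: float = 0.0,
) -> List[OcrChar]:
    """Keep only characters that pass every enabled filter."""
    kept = []
    for c in chars:
        if c.confidence < min_confidence:
            continue  # the engine was not confident about this character
        if c.font_size < min_font_size:
            continue  # too small, often noise or fine print
        near_edge = (
            c.x < edge_margin
            or c.y < edge_margin
            or c.x > page_width - edge_margin
            or c.y > page_height - edge_margin
        )
        if near_edge:
            continue  # too close to the page edge, often scanner artifacts
        kept.append(c)
    return kept


# Hypothetical results for one page measuring 850 x 1100 units.
results = [
    OcrChar("A", 0.95, 10.0, 100.0, 100.0),  # good character
    OcrChar("~", 0.20, 10.0, 200.0, 100.0),  # low confidence
    OcrChar(".", 0.90, 3.0, 300.0, 100.0),   # tiny font
    OcrChar("x", 0.90, 10.0, 2.0, 100.0),    # hugging the left edge
]
filtered = filter_results(
    results, 850.0, 1100.0, min_confidence=0.5, min_font_size=6.0, edge_margin=10.0
)
```

With every threshold left at its default of zero, nothing on the page is filtered out, which matches a profile that does not use these settings.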

This OCR Profile does not use any of these settings. However, they are highlighted below.

Each OCR Engine has its own set of properties available to Grooper as well. These properties change from OCR engine to OCR engine, depending on which settings the OCR engine's software exposes to Grooper. However, they are always in the right window panel of the OCR Profile.

This OCR Profile uses Transym 4.0, whose settings are seen in the highlighted portion.