OCR (Concept): Difference between revisions

From Grooper Wiki
 
(47 intermediate revisions by 3 users not shown)
<blockquote style="font-size:14pt">
{{Migrated}}
OCR stands for Optical Character Recognition.  It allows text from paper documents to be digitized to be searched or edited by other software applications.  OCR converts typed or printed text from digital images of physical documents into machine readable, encoded text.  This conversion allows Grooper to search text characters from the image, providing the capability to separate images into documents, classify them and extract data from them.
{{2023:{{PAGENAME}}}}
</blockquote>
 
== About ==
 
The quick explanation of OCR is it analyzes pixels on an image and translates those pixels into text.  Most importantly, it translates pixels into ''machine readable'' text.  Grooper can be described as a document modeling platform.  You use the platform to model how pages are separated out into documents, how one document gets put into one category or another, and how extractable data is structured on the document.  Once you have this model of what a document is, how it fits into a larger document set, and where the data is on it, you can use it to programmatically process any document that fits the model.
 
In order to do any of that, you have to be able to read the text on the page.  How do you know an invoice is an invoice?  A simple way could be locating the word "invoice" (or other text associated with the invoice).  You, as a human, do this by looking at the ink on a page (or pixels for a digital document) and reading the word "invoice".  Grooper does this by using a Data Extractor (and regular expression) to read the machine readable text for the page.  OCR is how each page gets that machine readable text in order to model the document set and process it.
 
=== The General Process ===
 
In Grooper, OCR is performed by the [[Recognize]] activity, referencing an OCR Profile which contains all the settings to get the OCR results, including which OCR Engine is used.  The OCR Profile also has settings to optionally process those results to increase the accuracy of the OCR Engine used.  The general process of OCR'ing a document is as follows in Grooper:
 
1) The document image is handed to the Recognize activity, which references an OCR Profile, containing the settings to perform the OCR operation.
 
2) The OCR Engine (set on the OCR Profile) converts the pixels on the image into machine readable text for the full page.
 
3) Grooper reprocesses the OCR Engine's results and runs additional OCR passes using the OCR Profile's Synthesis properties.
 
4) The raw OCR results from the OCR Engine and Grooper's Synthesis results are combined into a single text flow.
 
5) Undesirable results can be filtered out using Grooper's Results Filtering options.
 
{|style="margin:auto" cellpadding=10
|rowspan=2|
{|
|+The Recognize activity is handed the document image and performs OCR.
|[[File:Ocr results 1.png|border|500px]]
|}
|valign=top|
{|
|+The results are seen here in a text flow.
|[[File:Ocr results 2.png|border|500px]]
|}
|
{|
|+The results are seen here in a "Layout View", using the character positions and font sizes obtained to overlay where they are on the document.
|[[File:Ocr results 3.png|border|500px]]
|}
|}
 
=== OCR vs. Native Text ===
 
OCR gets text specifically from images, whether they were printed and scanned or imported from a digital source.  However, what if the document was created digitally and imported in its original digital form?  Wouldn't it have been created on a computer, using machine readable text?  Most likely, yes!  If a form was created using a product like Adobe Acrobat and filled in using a computer, the text comprising the document and the filled fields is encoded within the document itself.  This is called "Native Text".  This text is already machine readable.  So there is no reason to OCR the document.  Instead, the native text is extracted via Grooper's native text extraction.  Native text has a number of advantages over OCR.  OCR is not perfect.  As you will see, OCR is a fairly complicated process with a number of opportunities to misread a document.  Grooper has plenty of advancements to get around these errors and produce a better result, but OCR will rarely be as accurate as the original native text from a digital document.
 
However, be careful.  Just because a PDF document has machine readable text behind it does not mean that text is native text.  If the document was OCR'd by a different platform, the text may have been inserted into the PDF (Grooper also has this capability upon exporting a document).  In these cases, we still recommend OCR'ing the document to take advantage of Grooper's superior OCR capabilities and get a more accurate result.
 
Whether you get machine readable text through OCR or Native Text Extraction, both are done via the [[Recognize]] activity.  In the case of OCR, you will need to create an OCR Profile containing all the settings to perform OCR and reference it during the Recognize activity.  Native Text Extraction is enabled by default, but can be disabled if you wish to use OCR instead.
 
=== What is an OCR Engine? ===
 
OCR Engines are software applications that perform the actual recognition of characters on images, analyzing the pixels on the image and figuring out what text characters they match.
 
OCR Engines themselves have three phases:
 
'''Pre-Processing'''
 
First and foremost, OCR applications require a black and white image in order to determine what pixels on a page are text.  So, color and grayscale images must be converted to black and white.  This is done by a process called "thresholding" which determines a middle point between light pixels and dark pixels on the page.  Lighter pixels are then turned into white and darker ones are turned into black pixels.  You are left with only black and white pixels, with (ideally) all text in black and everything else faded into a white background. 
 
{|cellpadding="10" style="margin:auto"
|-style="text-align:center" valign="top"
|The original scanned image...
|style="width:400px"|...is turned black and white to divide the page into black pixels (text) and white pixels (the background).
|-
|[[file:threshold 1.png|400px]]||[[file:threshold 2.png|border|400px]]
|}
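
The thresholding step described above can be sketched in a few lines of Python.  This is only an illustration: the function name and the sample pixel values are made up, and the cutoff is assumed to be the midpoint between the darkest and lightest pixel on the page.  Real OCR Engines typically use more robust methods to pick the cutoff.

```python
def threshold(gray_rows):
    """Binarize a grayscale image given as rows of 0-255 intensities.

    Returns rows of 1 (black, candidate text) and 0 (white, background).
    """
    pixels = [p for row in gray_rows for p in row]
    # Assumption: the "middle point" is halfway between the darkest and
    # lightest pixel on the page.
    cutoff = (min(pixels) + max(pixels)) / 2
    return [[1 if p < cutoff else 0 for p in row] for row in gray_rows]


# Hypothetical 3x4 grayscale patch: light background with a few dark pixels.
page = [
    [250, 245, 240, 248],
    [246,  30,  42, 251],
    [249,  35, 200, 247],
]
binary = threshold(page)
```

Note that the light gray pixel (200) falls on the white side of the cutoff, which is the "faded into a white background" behavior described above.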
 
Some OCR Engines also contain de-skewing, despeckling, line removal, aspect ratio normalization, or other pre-processing functions to improve OCR results.
 
{|cellpadding="10" cellspacing="5"
|-style="background-color:#36b0a7; color:white"
|style="font-size:14pt"|'''FYI'''||Grooper has its own pre-processing capabilities through its Image Processing operations.  OCR Engines typically place these pre-processing functions in a "black box" for users.  At best, the OCR Engine may allow you to turn the property "on" or "off" but may not allow you to configure it further to fine tune its results.  Custom Image Processing can be performed using IP Profiles made of highly configurable IP Commands.
|}
 
One of the most important aspects of pre-processing is "segmentation".  This is the process of breaking up a page into first lines, then words, and, finally, individual characters.
 
In general, this process involves distinguishing between text and the white space between text.  Lines of text are distinguished by the bands of white space that run horizontally between one line and the next.  This can be seen using a histogram projection profile.
 
[[file:ocr segment 1.png|border|center]]
 
The gray peaks on the left side of the image show the number of black pixels in each row of the page.  The larger the peak, the larger the number of black pixels on that line.  OCR "sees" a line break wherever there is a gap between those collections of pixels.
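
As a rough sketch of the projection profile idea, the Python below counts the black pixels in each row of a binarized page and treats every run of non-empty rows as one line of text.  The function names and the tiny sample page are hypothetical, and a real engine would also have to tolerate noise and skew.

```python
def row_profile(binary_rows):
    """Count black pixels (1s) in each pixel row of a binarized page."""
    return [sum(row) for row in binary_rows]


def segment_lines(binary_rows):
    """Return (start, end) row index pairs, end exclusive, one per text line."""
    lines, start = [], None
    for y, count in enumerate(row_profile(binary_rows)):
        if count > 0 and start is None:
            start = y                  # first inked row of a new line
        elif count == 0 and start is not None:
            lines.append((start, y))   # a blank row closes the current line
            start = None
    if start is not None:              # page ended while still inside a line
        lines.append((start, len(binary_rows)))
    return lines


# Hypothetical page: two "lines" of ink separated by one blank row.
page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
profile = row_profile(page)
lines = segment_lines(page)
```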
 
Words can be broken up in a similar way.  One expects a small amount of space between characters.  How we tell the difference between "rn" and "m", after all, is just that tiny amount of space between the "r" and "n".  Between words, however, that space should be a bit larger.  So, words are segmented at points where there are larger than normal amounts of white space between characters on a line.
 
[[file:ocr segment 3.png|border|center]]
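
The same idea applies across a single line of text.  The sketch below measures each run of empty pixel columns and only treats a gap as a word break when it is at least a chosen width.  The names and the `min_word_gap` default are assumptions for illustration, not settings from any particular OCR Engine.

```python
def column_gaps(line_rows):
    """Return (start, width) for each run of empty columns between inked columns."""
    width = len(line_rows[0])
    ink = [any(row[x] for row in line_rows) for x in range(width)]
    gaps, start = [], None
    for x, has_ink in enumerate(ink):
        if not has_ink and start is None:
            start = x
        elif has_ink and start is not None:
            if start > 0:              # ignore the left margin
                gaps.append((start, x - start))
            start = None
    return gaps                        # a gap running to the right edge is ignored


def split_words(line_rows, min_word_gap=3):
    """Return the column where each new word starts (after a wide enough gap)."""
    return [start + width for start, width in column_gaps(line_rows)
            if width >= min_word_gap]


# Hypothetical one-row line: a 1-column gap (between characters)
# and a 3-column gap (between words).
line = [[1, 1, 0, 1, 0, 0, 0, 1, 1, 0]]
gaps = column_gaps(line)
breaks = split_words(line)
```

The narrow gap is kept as a character boundary inside one word, while only the wide gap produces a word break.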
 
In a perfect world, characters would be segmented out at this point as well.  After all, there is still some space between each character, just a little smaller than between each word.  You can easily see this with fixed-pitch fonts, where every character occupies the same width.
 
 
However, the world of printed text is rarely that perfect. 
 
Looking at the image below, there is no white space between the "a" and "z" or "z" and "e" in "Hazel".  Just looking at the histogram projection, there's no break in the pixels to define where one character stops and another begins.  There's a slight break in the "n" in "Daniels". So, there is some white space in the middle of the character where there shouldn't be.  But, that shouldn't mean those are two separate characters.
 
[[File:Ocr segment 4.png|border|center]]
 
If the characters were separated out using the normal segmenting we've seen previously, we might expect a very poor result.  However, ultimately, we get the result we expect, "Hazel Daniels".
 
[[File:Ocr segment 5.png|border|center]]
 
Modern OCR Engines perform more sophisticated character level segmenting than just looking for small gaps between characters.  Characters connected by small artifacts can be isolated from each other and characters that are broken apart can be linked together.  This is done both by analyzing the peaks and valleys of pixel densities to determine where a gap "should" be as well as further segmenting the word to look at the context of portions of a character before and after to make a decision as to where one character starts and stops.
 
Once the OCR Engine has segmented the entire image into individual character segments, it can use character recognition to determine what character corresponds to that segment.  However, this is where a lot of OCR errors can crop up.  Depending on the quality of the image or the original document, characters can be joined and disconnected in many different ways.  The OCR Engine may not ''perfectly'' separate out one segment of a word as the "right" character.
 
{|cellpadding="10" cellspacing="5"
|-style="background-color:#36b0a7; color:white"
|style="font-size:14pt"|'''FYI'''||Some OCR errors are unavoidable.  Document quality, scan quality, non-standard fonts and other issues can interfere with the OCR Engine producing 100% accurate results.  Part of Grooper's job is to massage the OCR Engine's results, through Image Processing, OCR Synthesis and Fuzzy Matching, into more accurate ones.
|}
 
'''Character Recognition'''
 
Once the OCR Engine parses out the image into lines, and then words, and finally character segments, it must make a decision about what text character that character segment actually is.  We're ready to do the "character recognition" part of "Optical Character Recognition".
 
There are two basic types of recognition algorithms: matrix matching and feature extraction.
 
Matrix matching compares an N×N matrix of pixels on a page to a library of stored character glyph examples.  This is also known as "pattern recognition" or "image correlation".
 
{|cellpadding="10" style="margin:auto"
|-style="text-align:center" valign="top"
|style="width:150px"|The character on the document's image...
|style="width:150px"|...is compared to a stored example...
|style="width:300px"|...by comparing a matrix of pixels, between the character on the image and the stored example.
|}
{|cellpadding="10" style="margin:auto"
|[[file:ocr matrix 1.png|border]]||[[file:ocr matrix 2.png|border]]||[[file:ocr matrix 3.png|border]]||[[file:ocr matrix 4.png|border]]
|}
 
The OCR Engine then makes a decision about what character that matrix of pixels is.  In this case, a "G".  Matrix matching does have some pitfalls, however.  Because it is comparing text to the stored glyph pixel by pixel, the text needs to be similar to the stored glyph's font and scale in order to match.  While there may be hundreds of example glyphs of various fonts and scales for a single character, this can cause problems when matching text on poor quality images or using uncommon fonts.
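
A minimal sketch of matrix matching, assuming the character segment has already been scaled to the same grid size as the stored glyphs.  The 3x3 glyph library here is a toy invented for illustration; real libraries hold many examples per character at much higher resolution.

```python
def match_score(segment, glyph):
    """Fraction of pixels that agree between two equally sized binary grids."""
    total = len(segment) * len(segment[0])
    same = sum(s == g
               for seg_row, glyph_row in zip(segment, glyph)
               for s, g in zip(seg_row, glyph_row))
    return same / total


def recognize(segment, library):
    """Return (character, score) for the best-matching stored glyph."""
    return max(((char, match_score(segment, glyph))
                for char, glyph in library.items()),
               key=lambda pair: pair[1])


# Hypothetical stored glyphs.
LIBRARY = {
    "L": [[1, 0, 0],
          [1, 0, 0],
          [1, 1, 1]],
    "T": [[1, 1, 1],
          [0, 1, 0],
          [0, 1, 0]],
}

# A slightly damaged "L": the bottom-right pixel is missing.
segment = [[1, 0, 0],
           [1, 0, 0],
           [1, 1, 0]]
char, score = recognize(segment, LIBRARY)
```

The damaged segment still matches "L" best, but with a score below 1.0.  A different font or scale would change many more pixels at once, which is the pitfall described above.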
 
The second type of recognition algorithm, feature extraction, decomposes characters into their component "features" like lines, line direction, line intersections or closed loops.  These features are compared to vector-like representations of a character, rather than pixel representations of the character.
 
{|cellpadding="10" style="margin:auto"
|-style="text-align:center" valign="top"
|style="width:260px"|Instead of pixels...
|style="width:260px"|...features matching how the character is drawn...
|style="width:260px"|...are compared to how those features are used to draw stored glyphs.
|}
{|cellpadding="10" style="margin:auto"
|[[file:ocr feature 1.png|border]]||[[file:ocr feature 2.png|border]]||[[file:ocr feature 3.png|border]]
|}
 
Just like with matrix matching, the OCR engine makes a decision about what character matches the extracted features.  In this case, again, a "G".  In engines using a combination of matrix matching and feature extraction, the results of both algorithms are combined to produce the best matching result.  Each character is given a "confidence score", which corresponds to how closely the character segment's pixels matched either the stored glyph's matrix or features or combination of the two.
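
To make "features" a little more concrete, the sketch below extracts just one of the features named above: the number of closed loops in a character.  It counts white regions that are completely enclosed by ink.  This is a simplified illustration with made-up glyphs, not how any specific OCR Engine implements feature extraction.

```python
def count_loops(glyph):
    """Count enclosed white regions (closed loops) in a binary glyph."""
    h, w = len(glyph), len(glyph[0])
    # Pad with a white border so all outside background is one connected region.
    grid = ([[0] * (w + 2)]
            + [[0] + list(row) + [0] for row in glyph]
            + [[0] * (w + 2)])
    seen = set()

    def flood(start):
        """Mark every white pixel 4-connected to start as seen."""
        stack = [start]
        seen.add(start)
        while stack:
            y, x = stack.pop()
            for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                if (0 <= ny < h + 2 and 0 <= nx < w + 2
                        and grid[ny][nx] == 0 and (ny, nx) not in seen):
                    seen.add((ny, nx))
                    stack.append((ny, nx))

    flood((0, 0))                      # mark the outside background
    loops = 0
    for y in range(h + 2):
        for x in range(w + 2):
            if grid[y][x] == 0 and (y, x) not in seen:
                loops += 1             # unvisited white region is enclosed by ink
                flood((y, x))
    return loops


# Hypothetical toy glyphs.
GLYPH_O = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
GLYPH_C = [[1, 1, 1], [1, 0, 0], [1, 1, 1]]
GLYPH_8 = [[1, 1, 1], [1, 0, 1], [1, 1, 1], [1, 0, 1], [1, 1, 1]]
```

Unlike pixel-by-pixel comparison, the loop count stays the same when the glyph is drawn larger or in a different font, as long as the loops themselves stay closed.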
 
This presents another layer of potential errors.  Given a document's quality, fonts used, and even the variable size of fonts on a page, the OCR Engine may recognize a character as the wrong glyph.
 
{|style="margin:auto"
|-style="text-align:center"
|What is this character?  Is it a "G"?  Is it a "C"?  Is it a "0"?  Is it an "O"?  Is it just garbage?
|-
|[[file:ocr bad.png|center]]
|}
 
The OCR Engine has to make a decision, which ultimately may not line up with what it is on the page.  Especially in situations like this, where it may be difficult for even a human being to read the character, the OCR Engine will have a hard time recognizing the character.
 
'''Post-Processing'''
 
Without any context around a character, these OCR errors can make sense.  The letters "a" and "o" can look fairly similar, especially using certain fonts.  However, the word "ballboy" is a real word, and "bollboy" is utter nonsense.
 
{|style="margin:auto"
|Similar characters may get misread by OCR...||...which becomes obvious given the context of that character inside a word.
|-
|[[File:Ocr post 1.png|border|center]]||[[File:Ocr post 2.png|border|center]]
|}
 
The most common post-processing done by OCR Engines is basic spell correction.  Errors from poor character recognition often show up as small spelling mistakes.  Many OCR Engines (including all commercial OCR Engines) compare their results to a lexicon of common English words and make replacements where possible.  Some OCR Engines even have grammar correction capabilities.
 
Note, this correction is only going to apply to words in the OCR Engine's lexicon.  Proper nouns (unless in the lexicon) will not be corrected.
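
A small sketch of lexicon-based correction, using Python's standard `difflib` module as a stand-in for whatever matching a real OCR Engine uses.  The word list and the 0.8 similarity cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical lexicon; a real engine ships a much larger word list.
LEXICON = ["ballboy", "invoice", "total", "amount"]


def correct(word, lexicon=LEXICON, cutoff=0.8):
    """Replace a word with its closest lexicon entry, if one is close enough."""
    if word.lower() in lexicon:
        return word                    # already a known word
    matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


fixed = correct("bollboy")       # close to a lexicon word, so it is replaced
unchanged = correct("Grooper")   # proper noun, nothing similar in the lexicon
```

Words with no close lexicon entry, such as the proper noun here, pass through untouched, which matches the limitation noted above.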

Latest revision as of 11:41, 28 August 2024

This article was migrated from an older version and has not been updated for the current version of Grooper.

This tag will be removed upon article review and update.

This article is about the current version of Grooper.

Note that some content may still need to be updated.


OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized, in order to be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine readable, encoded text.

This conversion allows Grooper to search text characters from the image, providing the capability to process these documents and the information they contain.

The Grooper activity that performs OCR is Recognize. Including a Recognize step in your Batch Process will allow you to OCR image-based content.

About

The quick explanation of OCR is it analyzes pixels on an image and translates those pixels into text. Most importantly, it translates pixels into machine readable text. Grooper can be described as a document modeling platform. You use the platform to model how pages are separated out into documents, how one document gets put into one category or another, and how extractable data is structured on the document. Once you have this model of what a document is, how it fits into a larger document set, and where the data is on it, you can use it to programmatically process any document that fits the model.

In order to do any of that, you have to be able to read the text on the page.

In a general sense, documents exist to communicate information to the reader. As human beings, we understand this information through the simple act of reading them, understanding the language they are written in. And we can separate loose pages into documents, differentiate between different types of documents, and parse the information on them. Grooper is ultimately going to do something very similar, using language as the fundamental unit of information, and regular expression as a way to parse that information. However, before we can get to that point, Grooper (or any other software) doesn't know the difference between a bunch of pixels and a bunch of text characters.

Once OCR is performed, Grooper will have a set of machine readable characters it can work with, instead of just a bunch of pixels. How do you know an invoice document is an invoice? A simple way could be locating the word "invoice" (or other text associated with the invoice). You, as a human, do this by looking at the ink on a page (or pixels for a digital document) and reading the word "invoice". Grooper does this by using a Data Extractor (and regular expression) to read the machine readable text for the page. OCR is how each page gets that machine readable text in order to model the document set and process it.
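
As a plain-Python illustration of the invoice example (not Grooper's Data Extractor itself), a regular expression can only find the word "invoice" once the page exists as machine readable text.  The pattern and function name below are hypothetical.

```python
import re

# Match the whole word "invoice" in any letter case.
INVOICE_PATTERN = re.compile(r"\binvoice\b", re.IGNORECASE)


def looks_like_invoice(page_text):
    """Return True if the page's machine readable text contains 'invoice'."""
    return INVOICE_PATTERN.search(page_text) is not None


is_invoice = looks_like_invoice("INVOICE #1001\nTotal due: $45.00")
is_other = looks_like_invoice("Packing slip\nItems shipped: 3")
```

Handing this function raw pixels instead of text would give it nothing to search, which is why OCR has to come first.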

The General Process: How does Grooper OCR documents?

In Grooper, OCR is performed by the Recognize activity, referencing an OCR Profile which contains all the settings to get the OCR results, including which OCR Engine is used. The OCR Profile also has settings to optionally process those results to increase the accuracy of the OCR Engine used. The general process of OCR'ing a document is as follows in Grooper:

1) The document image is handed to the Recognize activity, which references an OCR Profile, containing the settings to perform the OCR operation.

2) The OCR Engine (set on the OCR Profile) converts the pixels on the image into machine readable text for the full page.

3) Grooper reprocesses the OCR Engine's results and runs additional OCR passes using the OCR Profile's Synthesis properties.

4) The raw OCR results from the OCR Engine and Grooper's Synthesis results are combined into a single text flow.

5) Undesirable results can be filtered out using Grooper's Results Filtering options.

The Recognize activity is handed the document image and performs OCR.
The results are seen here in a text flow.
The results are seen here in a "Layout View". Using the character positions and font sizes obtained during OCR, the results are overlaid where they are on the document.

OCR vs. Native Text

OCR gets text specifically from images, whether they were printed and scanned or imported from a digital source. However, what if the document was created digitally and imported in its original digital form? Wouldn't it have been created on a computer, using machine readable text? Most likely, yes! If a form was created using a product like Adobe Acrobat and filled in using a computer, the text comprising the document and the filled fields is encoded within the document itself. This is called "Native Text". This text is already machine readable. So there is no reason to OCR the document. Instead, the native text is extracted via Grooper's native text extraction. Native text has a number of advantages over OCR. OCR is not perfect. As you will see, OCR is a fairly complicated process with a number of opportunities to misread a document. Grooper has plenty of advancements to get around these errors and produce a better result, but OCR will rarely be as accurate as the original native text from a digital document.

However, be careful. Just because a PDF document has machine readable text behind it does not mean that text is native text. If the document was OCR'd by a different platform, the text may have been inserted into the PDF (Grooper also has this capability upon exporting a document). In these cases, we still recommend OCR'ing the document to take advantage of Grooper's superior OCR capabilities and get a more accurate result.

Whether you get machine readable text through OCR or Native Text Extraction, both are done via the Recognize activity. In the case of OCR, you will need to create an OCR Profile containing all the settings to perform OCR and reference it during the Recognize activity. Native Text Extraction is enabled by default, but can be disabled if you wish to use OCR instead.


What is an OCR Engine?

OCR Engines are software applications that perform the actual recognition of characters on images, analyzing the pixels on the image and figuring out what text characters they match.

OCR Engines themselves have four phases:

1) Pre-Processing: In this phase, the OCR engine prepares the image to be read by turning color and grayscale images to black and white and potentially removing artifacts getting in the way of OCR, such as specks and lines.

2) Segmenting: Next, text pixels are broken up (or "segmented") into lines, then individual words and finally characters.

3) Character Recognition: Here, the OCR Engine takes those pixel character segments, compares them to examples of character glyphs, and makes a decision about which machine readable text character matches the segment.

4) Post-Processing: Commercial OCR Engines also analyze the OCR results and attempt to correct inaccurate results, such as performing basic spellchecking.

For more in depth information on how OCR Engines work, visit the OCR Engine article.

The Transym 4 and Transym 5 OCR engines are typically included in Grooper's licensing. Transym OCR 4 provides highly accurate English-only OCR while Transym OCR 5 provides multi-language OCR for 38 different languages. Google's open source Tesseract engine is available in version 2.72 and beyond. Microsoft's Azure OCR is also supported but requires separate licensing.

Image Processing and OCR

Regardless of how good an OCR Engine is, OCR is very rarely perfect. Characters can be segmented out of words incorrectly. Artifacts such as table lines, check boxes or even just specks from image noise can interfere with character segmenting and character recognition. Even when characters are segmented out correctly, the OCR Engine's character recognition can make the wrong decision about what the character is.

Image Processing (often abbreviated as "IP") can assist the OCR operation by providing a "cleaner" image to the OCR Engine. The general idea is to give the OCR engine just the text pixels, so that is all the engine needs to process.

This image is much easier for OCR to process... ...than this image.


Images are altered using an IP Profile, which contains a step by step list of IP Commands, each of which performs a specific alteration to the image. IP Profiles are highly configurable. There are multiple different IP Commands, each of which has its own configurable properties as well. In the example above, the image was altered using an IP Profile with six steps, each step containing a different IP Command.

However, for the example above, the IP Profile's result is drastically different from the original image. While it certainly helps the OCR result, it's likely, at the end of the process, you want to export a document that looks more like the "before" picture than the "after". Luckily, Image Processing can be performed in two ways:

  1. Permanent for archival purposes.
  2. Temporary for non-destructive OCR cleanup.
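
The difference between the two modes can be sketched as follows. The commands here (`clear_border`, `remove_specks`) and the `run_profile` helper are hypothetical stand-ins, not actual Grooper IP Commands. The point is only that a profile is an ordered list of steps, and that the temporary mode hands OCR a processed copy while the archival image stays as it was.

```python
import copy


def clear_border(image):
    """Hypothetical command: blank the outermost ring of pixels."""
    h, w = len(image), len(image[0])
    return [[0 if y in (0, h - 1) or x in (0, w - 1) else p
             for x, p in enumerate(row)]
            for y, row in enumerate(image)]


def remove_specks(image):
    """Hypothetical command: drop black pixels with no black 4-neighbour."""
    h, w = len(image), len(image[0])

    def has_neighbour(y, x):
        return any(0 <= ny < h and 0 <= nx < w and image[ny][nx]
                   for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)))

    return [[p if p and has_neighbour(y, x) else 0 for x, p in enumerate(row)]
            for y, row in enumerate(image)]


def run_profile(image, commands):
    """Apply each command in order and return the processed image."""
    result = copy.deepcopy(image)      # never mutate the caller's image
    for command in commands:
        result = command(result)
    return result


# Hypothetical scan: a border frame, two text pixels, and one stray speck.
archival = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 1],
]
profile = [clear_border, remove_specks]

# Temporary: OCR reads the cleaned copy, the archival image is untouched.
ocr_input = run_profile(archival, profile)

# Permanent would instead overwrite the archival image:
# archival = run_profile(archival, profile)
```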

For more information on Image Processing, visit the Image Processing article.

OCR Synthesis

The Synthesis functionality is Grooper's unique method of pre-processing and re-processing raw results from the OCR engine to get better results out of it. Using Synthesis, portions of the document can be OCR'd independently from the full text OCR, portions of the image dropped out from the first OCR pass can be re-run, and certain results can be reprocessed. The results from the Synthesis operation then get combined with the full text OCR results from the OCR Engine into a single text flow.

Synthesis is a collection of five separate OCR processing operations:

  • Font Pitch Detection
  • Bound Region Processing
  • Iterative Processing
  • Cell Validation
  • Segment Reprocessing

As separate operations, the user can choose to enable all five operations, choose to use only one, or any combination. Synthesis is enabled on OCR Profiles, using the "Synthesis" property. This property is enabled by default on OCR Profiles (and can be disabled if you so choose). However, each Synthesis operation needs to be configured independently in order to function.

For more information on each operation, visit the Synthesis article.

How do you configure OCR settings? - The OCR Profile

Now that we've talked a little bit about OCR in general, OCR engines, and some additional considerations such as image processing and Grooper's Synthesis settings, how do we tell Grooper how to execute all these considerations? That is done by creating and configuring an OCR Profile.

An OCR Profile is an object created in Grooper to store various settings controlling how OCR is performed.

This includes:

  • Setting which OCR Engine is used
  • Determining whether a temporary IP Profile is used for image cleanup before the OCR engine runs
  • Grooper's unique Synthesis settings
    • Determining if and how multiple OCR results are pre-processed and re-processed
  • If and how results are filtered, to toss out undesirable results.
  • Any configurable settings available from the OCR Engine


Below you will see one of the default OCR Profiles that ship with Grooper named "Full Text - Accurate", with these settings highlighted in each tab.

Here, you will list which OCR Engine will perform character recognition.

This OCR Profile is set to Transym OCR 4, using the Transym 4.0 OCR software to recognize characters.

One of the things that sets Grooper apart from other document processing platforms is the high degree of configuration options when it comes to image processing. The basic idea, here, is to give the OCR engine a "cleaned up" version of the document to use for OCR. When configured on an OCR Profile this is "temporary" in that the archival version of the document is not changed. Once OCR is finished, the document will revert to its original form. The image will only be altered for the purposes of obtaining OCR results.

These image processing settings are defined with a different type of profile called an IP Profile, which is then referenced by the OCR Profile's IP Profile property.

This OCR Profile uses a pre-built IP Profile called "OCR Cleanup".

Another thing that sets Grooper apart when it comes to OCR is our suite of Synthesis operations. These are different capabilities Grooper has to pre-process and re-process OCR results to improve the OCR engine's results.

This OCR Profile uses a variety of these Synthesis properties, all of which are highlighted in yellow. To learn more about this suite of properties, what they do, how they improve OCR results, and how to configure them, visit the Synthesis article.

The Result Filtering settings allow you to isolate certain characters and remove them from your results. Maybe you want to discard any characters that do not meet a minimum confidence score. Maybe you want to discard all characters below a certain font size. Maybe you want to discard all characters within a certain distance to the edge of the page. You can do those things (and more) using these Result Filtering settings.
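
The three filters mentioned above can be pictured with a short sketch. The `OcrChar` fields, units and parameter names are assumptions made for this example and are not Grooper's actual property names.

```python
from dataclasses import dataclass


@dataclass
class OcrChar:
    text: str
    confidence: float   # 0.0 to 1.0
    font_size: float    # points
    x: float            # distance from the left page edge, in inches


def filter_results(chars, min_confidence=0.0, min_font_size=0.0,
                   edge_margin=0.0, page_width=8.5):
    """Keep only the characters that pass every enabled filter."""
    return [c for c in chars
            if c.confidence >= min_confidence
            and c.font_size >= min_font_size
            and edge_margin <= c.x <= page_width - edge_margin]


# Hypothetical OCR results for one page.
chars = [
    OcrChar("T", 0.98, 12, 1.0),    # good character
    OcrChar("~", 0.31, 12, 1.2),    # low confidence
    OcrChar("x", 0.90, 4, 3.0),     # tiny font
    OcrChar("|", 0.95, 12, 0.05),   # hugging the page edge
]
kept = filter_results(chars, min_confidence=0.5, min_font_size=6,
                      edge_margin=0.25)
kept_text = [c.text for c in kept]
```

With every threshold left at its default of zero, nothing is discarded, which mirrors a profile that does not use any filtering.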

This OCR Profile does not use any of these settings. However, they are highlighted below.

Each OCR Engine has its own set of properties available to Grooper as well. These properties change from OCR engine to OCR engine, depending on which settings are exposed to Grooper from the OCR engine's software. However, they are always in the right window panel of the OCR Profile.

This OCR Profile uses Transym 4.0, whose settings are seen in the highlighted portion.