2021:OCR Engine (Property): Difference between revisions

From Grooper Wiki
Revision as of 16:16, 4 October 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


An "OCR engine" is the part of OCR software that recognizes text from images. OCR engines analyze the image's pixels to determine where text is on the page and what each character is. In Grooper, OCR engines are selected when configuring an OCR Profile's OCR Engine property.

The Transym OCR 4, Transym OCR 5, Tesseract and Azure OCR engines are included in Grooper.

  • Transym OCR 4 provides highly accurate English-only OCR.
  • Transym OCR 5 provides multi-language OCR for 28 different languages.
  • Tesseract is Google's open source OCR engine.
  • Azure OCR uses Microsoft Azure AI Vision to OCR documents (an Azure API key is required to connect Grooper to Azure AI Vision).

About traditional OCR engines

OCR Engines perform the "heavy lift" of the OCR operation by getting the raw character data off document images. They analyze the pixels on the image and figure out what text characters they match.

OCR Engines themselves have four phases:

  1. Pre-Processing
  2. Segmenting
  3. Character Recognition
  4. Post-Processing

Pre-Processing

First and foremost, OCR applications require a black and white image in order to determine what pixels on a page are text. So, color and grayscale images must be converted to black and white. This is done by a process called "thresholding" which determines a middle point between light pixels and dark pixels on the page. Lighter pixels are then turned into white and darker ones are turned into black pixels. You are left with only black and white pixels, with (ideally) all text in black and everything else faded into a white background.

The original scanned image... ...is turned black and white to divide the page into black pixels (text) and white pixels (the background).
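
As a rough illustration of thresholding, the sketch below converts a small grayscale grid to black and white. It assumes the simplest possible rule, a cutoff halfway between the darkest and lightest pixel on the page. Real OCR engines generally use more sophisticated methods, and the function name and sample data are made up for this example.

```python
def threshold(gray, cutoff=None):
    """Convert a grayscale image (rows of 0-255 values) to 1 = black, 0 = white."""
    pixels = [p for row in gray for p in row]
    if cutoff is None:
        # Assumed rule: the midpoint between the darkest and lightest pixel.
        cutoff = (min(pixels) + max(pixels)) / 2
    # Pixels darker than the cutoff become black (1); the rest become white (0).
    return [[1 if p < cutoff else 0 for p in row] for row in gray]


# Hypothetical 3x4 "scan": light background with a few dark text pixels.
page = [
    [250, 245, 240, 250],
    [240,  30,  40, 245],
    [250,  35, 200, 250],
]
binary = threshold(page)
for row in binary:
    print(row)
```

The light gray pixel (200) falls above the cutoff and fades into the white background, which is the behavior described above.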

Some OCR Engines also contain de-skewing, despeckling, line removal, aspect ratio normalization, or other pre-processing functions to improve OCR results.

FYI

Grooper has its own pre-processing capabilities through its Image Processing operations. OCR Engines typically place these pre-processing functions in a "black box" for users. At best, the OCR Engine may allow you to turn the property "on" or "off" but may not allow you to configure it further to fine-tune its results. Custom Image Processing can be performed using IP Profiles made of highly configurable IP Commands.

Segmenting

One of the most important aspects of pre-processing is "segmentation". This is the process of breaking up a page into first lines, then words, and, finally, individual characters.

In general, this process involves distinguishing between text and the white space between text. Lines of text are distinguished by the horizontal bands of white space that separate one line from the next. This can be seen using a histogram projection profile.

The gray peaks on the left side of the image show the number of black pixels in each row of the page. The larger the peak, the larger the number of black pixels on that line. OCR "sees" a line break where there are gaps between those collections of pixels.
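
A minimal sketch of line segmentation with a projection profile follows. It assumes a binarized page stored as rows of 1 (black) and 0 (white), and treats any row with no black pixels as a gap between lines. The function name and sample page are illustrative only, not part of any OCR engine's API.

```python
def find_text_lines(binary):
    """Return (top, bottom) row spans of text lines; bottom is exclusive."""
    # Projection profile: the number of black pixels in each pixel row.
    profile = [sum(row) for row in binary]
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                    # first row of a new text line
        elif count == 0 and start is not None:
            lines.append((start, y))     # an empty row closes the line
            start = None
    if start is not None:
        lines.append((start, len(profile)))
    return lines


# Hypothetical page: two text lines separated by an empty row.
page = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(find_text_lines(page))
```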

Words can be broken up in a similar way. One expects a small amount of space between characters. How we tell the difference between "rn" and "m", after all, is just that tiny amount of space between the "r" and "n". Between words, however, that space should be a bit larger. So, words are segmented at points where there are larger than normal amounts of white space between characters on a line.
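
The same idea can be sketched for words by projecting a single text line onto its columns. The sketch below assumes that a run of at least `word_gap` empty columns marks a word break, while narrower gaps are treated as spacing between characters. The threshold value and names are assumptions for illustration.

```python
def split_words(line, word_gap=3):
    """Return (left, right) column spans of words in one text line; right is exclusive."""
    width = len(line[0])
    # Column profile: the number of black pixels in each pixel column.
    cols = [sum(row[x] for row in line) for x in range(width)]
    words, start, gap = [], None, 0
    for x, count in enumerate(cols):
        if count > 0:
            if start is None:
                start = x                # first inked column of a new word
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= word_gap:          # gap is wide enough to end the word
                words.append((start, x - gap + 1))
                start, gap = None, 0
    if start is not None:
        words.append((start, width - gap))
    return words


# Hypothetical one-row line: a 1-column gap inside a word, a 3-column gap between words.
line = [[1, 1, 0, 1, 0, 0, 0, 1, 1, 0]]
print(split_words(line))
```

The one-column gap stays inside the first word, much like the small space between "r" and "n", while the three-column gap splits the line into two words.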

In a perfect world, characters would be segmented out at this point as well. After all, there is still some space between each character, just a little smaller than between each word. You can easily see this with fixed-pitch fonts. However, the world of printed text is rarely that perfect.

Looking at the image below, there is no white space between the "a" and "z" or "z" and "e" in "Hazel". Just looking at the histogram projection, there's no break in the pixels to define where one character stops and another begins. There's a slight break in the "n" in "Daniels". So, there is some white space in the middle of the character where there shouldn't be. But, that shouldn't mean those are two separate characters.

If the characters were separated out using the normal segmenting we've seen previously, we might expect a very poor result. However, ultimately, we get the result we expect, "Hazel Daniels".

Modern OCR Engines perform more sophisticated character-level segmenting than just looking for small gaps between characters. Characters connected by small artifacts can be isolated from each other, and characters that are broken apart can be linked together. This is done in two ways: by analyzing the peaks and valleys of pixel density to determine where a gap "should" be, and by further segmenting the word to look at the portions of a character before and after a candidate break, in order to decide where one character stops and the next starts.

Once the OCR Engine has segmented the entire image into individual character segments, it can use character recognition to determine what character corresponds to that segment. However, this is where a lot of OCR errors can crop up. Depending on the quality of the image or the original document, characters can be joined and disconnected in many different ways. The OCR Engine may not perfectly separate out one segment of a word as the "right" character.

FYI

Some OCR errors are unavoidable. Document quality, scan quality, non-standard fonts and other issues can keep the OCR Engine from producing 100% accurate results. Part of Grooper's job is to massage the OCR Engine's results, through Image Processing, OCR Synthesis and Fuzzy Matching, into more accurate ones.

Character Recognition

Once the OCR Engine parses out the image into lines, and then words, and finally character segments, it must make a decision about what text character that character segment actually is. We're ready to do the "character recognition" part of "Optical Character Recognition".

There are two basic types of recognition algorithms: matrix matching and feature extraction.

Matrix matching compares an N×N matrix of pixels on a page to a library of stored character glyph examples. This is also known as "pattern recognition" or "image correlation".

The character on the document's image... ...is compared to a stored example... ...by comparing a matrix of pixels, between the character on the image and the stored example.

The OCR Engine then makes a decision about what character that matrix of pixels is. In this case, a "G". Matrix matching does have some pitfalls, however. Because it is comparing text to the stored glyph pixel by pixel, the text needs to be similar to the stored glyph's font and scale in order to match. While there may be hundreds of example glyphs of various fonts and scales for a single character, this can cause problems when matching text on poor quality images or using uncommon fonts.
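
To make the pixel-by-pixel comparison concrete, here is a toy sketch of matrix matching. It assumes every character segment has already been scaled to the same grid size as the stored glyphs, and it scores a match as the fraction of pixels that agree. The three-glyph library and the scoring rule are invented for illustration and do not reflect how any particular engine stores or scores glyphs.

```python
def to_matrix(rows):
    """Turn rows of '#' (black) and '.' (white) into a 1/0 pixel matrix."""
    return [[1 if c == "#" else 0 for c in row] for row in rows]


# Hypothetical glyph library on a 3x5 grid.
LIBRARY = {
    "C": to_matrix(["###", "#..", "#..", "#..", "###"]),
    "O": to_matrix(["###", "#.#", "#.#", "#.#", "###"]),
    "L": to_matrix(["#..", "#..", "#..", "#..", "###"]),
}


def match_score(segment, glyph):
    """Fraction of pixels that agree between the segment and a stored glyph."""
    pairs = [(a, b) for seg_row, glyph_row in zip(segment, glyph)
             for a, b in zip(seg_row, glyph_row)]
    return sum(a == b for a, b in pairs) / len(pairs)


def recognize(segment, library=LIBRARY):
    """Return the best-matching character and its score."""
    scores = {char: match_score(segment, glyph) for char, glyph in library.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# A slightly damaged "O" with one pixel missing on the right side.
segment = to_matrix(["###", "#.#", "#.#", "#..", "###"])
char, score = recognize(segment)
print(char, round(score, 3))
```

The damaged segment still matches "O" best, but "C" scores only one pixel lower, which hints at how poor image quality can tip the decision toward the wrong glyph.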

The second type of recognition algorithm, feature extraction, decomposes characters into their component "features" like lines, line direction, line intersections or closed loops. For example, an "O" is a closed loop, but a "C" is an open loop. These features are compared to vector-like representations of a character, rather than pixel representations of the character. This is a newer recognition technology. Because it looks at features that make up a character, rather than a pixel by pixel comparison to a glyph of a certain font, this method is not as reliant on the font used on the page matching a particular stored example.

Instead of pixels... ...features matching how the character is drawn... ...are compared to how those features are used to draw stored glyphs.
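
One feature mentioned above, the closed loop, can be detected with a simple flood fill. The sketch below floods the white background inward from the image border, so any white region the flood cannot reach must be enclosed by ink. This is only one possible way to compute one feature; the function name and the tiny glyphs are assumptions for the example.

```python
def count_closed_loops(glyph):
    """Count enclosed white regions in a glyph given as rows of 1 (ink) and 0 (background)."""
    height, width = len(glyph), len(glyph[0])
    seen = [[False] * width for _ in range(height)]

    def fill(y, x):
        # Iterative flood fill over 4-connected background pixels.
        stack = [(y, x)]
        while stack:
            cy, cx = stack.pop()
            if not (0 <= cy < height and 0 <= cx < width):
                continue
            if seen[cy][cx] or glyph[cy][cx] == 1:
                continue
            seen[cy][cx] = True
            stack.extend([(cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)])

    # Flood the outside background starting from every border pixel.
    for y in range(height):
        fill(y, 0)
        fill(y, width - 1)
    for x in range(width):
        fill(0, x)
        fill(height - 1, x)

    # Any background pixel still unvisited lies inside a closed loop.
    loops = 0
    for y in range(height):
        for x in range(width):
            if glyph[y][x] == 0 and not seen[y][x]:
                loops += 1
                fill(y, x)
    return loops


letter_o = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
letter_c = [[1, 1, 1], [1, 0, 0], [1, 1, 1]]
print(count_closed_loops(letter_o), count_closed_loops(letter_c))
```

Because the count depends on the shape of the strokes rather than on exact pixel positions, the same test works for an "O" drawn in any font or size, as long as the loop is actually closed on the image.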

Just like with matrix matching, the OCR engine makes a decision about what character matches the extracted features. In this case, again, a "G". In engines using a combination of matrix matching and feature extraction, the results of both algorithms are combined to produce the best matching result. Each character is given a "confidence score", which corresponds to how closely the character segment's pixels matched the stored glyph's matrix, its features, or a combination of the two.

This presents another layer of potential errors. Given a document's quality, fonts used, and even the variable size of fonts on a page, the OCR Engine may recognize a character as the wrong glyph.

What is this character? Is it a "G"? Is it a "C"? Is it a "0"? Is it an "O"? Is it just garbage?

The OCR Engine has to make a decision, which ultimately may not line up with what it is on the page. Especially in situations like this, where it may be difficult for even a human being to read the character, the OCR Engine will have a hard time recognizing the character.


Post-Processing

Without any context around a character, these OCR errors can make sense. The letters "a" and "o" can look fairly similar, especially in certain fonts. However, the word "ballboy" is a real word, and "bollboy" is utter nonsense.

Similar characters may get misread by OCR... ...which becomes obvious given the context of that character inside a word.

The most common post-processing done by OCR Engines is basic spell correction. Errors from poor character recognition often show up as small spelling mistakes. For many OCR Engines (all commercial OCR Engines), results are compared to a lexicon of common English words, and replacements are made if possible.

Note, this correction is only going to apply to words in the OCR Engine's lexicon. Proper nouns (unless in the lexicon) will not be corrected.
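
A lexicon-based correction can be sketched with Python's standard `difflib` module. The four-word lexicon and the 0.8 similarity cutoff are assumptions for the example; a real OCR Engine's lexicon and matching rules will differ.

```python
import difflib

# Hypothetical lexicon; a real one would hold many thousands of words.
LEXICON = ["ballboy", "ballot", "baseball", "football"]


def correct(word, lexicon=LEXICON, cutoff=0.8):
    """Replace a word with its closest lexicon entry, if one is close enough."""
    if word in lexicon:
        return word
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    # No sufficiently close entry: leave the OCR result unchanged.
    return matches[0] if matches else word


print(correct("bollboy"))   # a misread "a" is corrected from the lexicon
print(correct("Grooper"))   # a proper noun outside the lexicon is left alone
```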

FYI When the OCR Engine is not capable of confidently making a spell correction, Grooper may be able to use Fuzzy RegEx to get around OCR errors.

OCR Engines: What is the best OCR engine?

The short answer is that the best OCR engine is the one that best fits your needs. Grooper offers a variety of OCR engine integrations, and there are several considerations to evaluate. Of course, you want your OCR engine to be accurate. However, the more accurate an OCR process is, the more computationally expensive it tends to be: it takes more computing power (and hence processing time) to get more accurate results. Highly accurate OCR results tend to cost more too. For example, Google's Tesseract is an open-source OCR engine, which is great in terms of it being free. However, its results simply are not as accurate as ABBYY FineReader, which is a "premium" OCR engine option available to Grooper.

All Grooper installations ship with Transym's OCR engine. We favor Transym because it provides accurate OCR results, with efficient use of computational resources. It is also very reasonably priced for its accuracy and performance.

See below for more details on all the OCR engines available in Grooper.

Transym 4.0

All Grooper installations come standard with Transym 4.0. This engine provides highly accurate English language OCR results of printed text at no additional cost to the user. It is the standard by which we judge other engines. It does particularly well with high density images, which contain a great deal of text and characters to parse through (including large amounts of numerical and non-semantic strings). Transym also has the benefit of being computationally efficient. For the accuracy of its results, its speed is unparalleled.

  • Machine print recognition only. Does not recognize handwriting.
  • Fully installed with Grooper
  • English language only

Transym 5.0

Transym 5.0 is also included in every Grooper installation. This engine expands Transym 4.0 to support multiple languages, including additional characters for Greek, Cyrillic, and Eastern European alphabets.

  • Machine print recognition only. Does not recognize handwriting.
  • Fully installed with Grooper
  • Supports 38 different languages

Tesseract

Tesseract is Google's open source OCR engine for printed text. Tesseract (version 3.04) is included in all Grooper installations starting in version 2.72. Tesseract is generally not as accurate as Transym, but can perform better in certain situations. Tesseract is unique in that you can train the engine to recognize fonts, providing better accuracy for non-standard fonts not handled well by other OCR engines. Out of the box, Tesseract in Grooper can be configured to read MICR, OCR-A, and OCR-B fonts. Tesseract is also much slower than Transym. It's not uncommon for Tesseract to OCR a page ten times slower than Transym.

  • Machine print recognition only. Does not recognize handwriting.
  • Fully installed with Grooper.
    • Currently, Grooper supports Tesseract version 3.04
  • By default, supports English only. However, additional languages can be downloaded here. A ZIP file containing all available languages can be downloaded here.
    • Downloaded files should be placed in the "{Grooper Install Directory}\Tesseract\tessdata" folder on each machine which will be running OCR.
    • Note: not all languages that are downloadable are supported by Grooper. E.g. Vertical languages, right-to-left languages, etc. are not supported.
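
If language files need to be placed on several machines, the copy step can be scripted. The sketch below assumes the downloaded language files use Tesseract's `.traineddata` extension and that the caller supplies the Grooper install directory. The function name and the paths in the usage comment are hypothetical.

```python
from pathlib import Path
import shutil


def install_language_files(download_dir, grooper_install_dir):
    """Copy downloaded Tesseract language files into Grooper's tessdata folder."""
    tessdata = Path(grooper_install_dir) / "Tesseract" / "tessdata"
    tessdata.mkdir(parents=True, exist_ok=True)
    copied = []
    # Only .traineddata files are copied; other downloaded files are ignored.
    for source in sorted(Path(download_dir).glob("*.traineddata")):
        shutil.copy2(source, tessdata / source.name)
        copied.append(source.name)
    return copied


# Example call with placeholder paths (adjust for your environment):
# install_language_files(r"C:\Downloads\tessdata", r"C:\path\to\Grooper")
```

The script would need to be run on each machine that performs OCR, since the files must exist locally on every one of them.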

Azure OCR

Starting in version 2.72, the Microsoft Azure OCR engine is available in Grooper. Azure OCR uses Microsoft's Azure AI Vision to perform printed text or handwritten character recognition. From our testing of this OCR engine, it does particularly well at recognizing handwritten text. It does, however, require a subscription key at an additional cost to the user.

  • Machine print and handwriting recognition
  • Requires a Microsoft Computer Vision API key.
    • Want to try it out? You can create a free trial Azure account here. There will be some limitations on the number of API calls per month (in other words, number of documents you can process per month) and/or trial period length.
  • Supports 26 different languages

Layered OCR

While not technically an OCR engine itself, this is listed as an option for your OCR engine when configuring an OCR Profile. Layered OCR allows you to merge the results from multiple different OCR Profiles. Layered OCR performs OCR by establishing baseline OCR results from a primary or main OCR Profile and merges the results from one or more OCR Layers, each containing their own OCR Profile.

To learn how to layer OCR results and the configuration requirements for Layered OCR, please visit the Layered OCR article.

Glossary

Data Extractor: Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Fuzzy RegEx: Fuzzy RegEx is Grooper's use of fuzzy logic within Value Extractors that leverage regular expressions to match patterns. Fuzzy RegEx allows extractors to overcome defects in a document's OCR results to accurately return results. Fuzzy RegEx is enabled by enabling the Fuzzy Matching property.

Image Processing: Image Processing is an Activity that enhances Batch Page images and optimizes them for better OCR text recognition and data extraction results.

IP Command: IP Commands specify an image processing (IP) operation (such as image cleanup, format conversion or feature detection) and are used to construct IP Steps in an IP Profile. IP Commands are configured using an IP Step's Command property.

IP Profile: IP Profiles are a step-by-step list of image processing operations (IP Commands). They are used for several image processing related operations, but primarily for:

  1. Permanently enhancing an image during the Image Processing activity (usually to get rid of defects in a scanned image, such as skewing or borders).
  2. Cleaning up an image in-memory during the Recognize activity without altering the image to improve OCR accuracy.
  3. Computer vision operations that collect layout data (table line locations, OMR checkboxes, barcode value and more) utilized in data extraction.

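Layered OCR: Layered OCR is an OCR Engine option that merges the results from multiple OCR Profiles. It establishes baseline OCR results from a primary OCR Profile and merges in the results from one or more OCR Layers, each with its own OCR Profile.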

Machine: Machine nodes represent servers that have connected to the Grooper Repository. They are essential for distributing task processing loads across multiple servers. Grooper creates Machine nodes automatically whenever a server makes a new connection to a Grooper Repository's database. Once added, Machine nodes can be used to view server information and to manage Grooper Service instances.

OCR Engine: An "OCR engine" is the part of OCR software that recognizes text from images. OCR engines analyze the image's pixels to determine where text is on the page and what each character is. In Grooper, OCR engines are selected when configuring an OCR Profile's OCR Engine property.

OCR Profile: OCR Profiles store configuration settings for optical character recognition (OCR). They are used by the Recognize activity to convert images of text on Batch Pages into machine-encoded text. OCR Profiles are highly configurable, allowing fine-grained control over how OCR occurs, how pre-OCR image cleanup occurs, and how Grooper's OCR Synthesis occurs. All this works to the end goal of highly accurate OCR text data, which is used to classify documents, extract data and more.

OCR Synthesis: OCR Synthesis refers to a suite of OCR related functionality unique to Grooper. The OCR Synthesis suite will pre-process and re-process raw results from the OCR Engine and synthesize its results into a single, more accurate OCR result.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized so that it can be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.