OCR (Concept)

OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized so it can be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text. This conversion allows Grooper to search the text characters on the image, providing the capability to separate images into documents, classify them, and extract data from them.

About

The general process of OCR'ing a document is as follows in Grooper:

1) The document image is handed to the Recognize activity, which references an OCR Profile containing the settings used to perform the OCR operation.

2) The OCR Engine (set on the OCR Profile) converts the pixels on the image into machine-readable text for the full page.

3) Grooper reprocesses the OCR Engine's results and runs additional OCR passes using the OCR Profile's Synthesis properties.

4) The raw OCR results from the OCR Engine and Grooper's Synthesis results are combined into a single text flow.

5) Undesirable results can be filtered out using Grooper's Results Filtering options.
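
To make the data flow concrete, here is a minimal sketch of those five steps in Python. Every name in it (run_ocr_engine, run_synthesis_passes, and so on) is hypothetical and invented for illustration; this is not Grooper's actual API, only the shape of the pipeline:

    # Illustrative sketch of the Recognize activity's data flow.
    # Every name and type here is hypothetical, not Grooper's real API.

    def run_ocr_engine(image):
        # 2) Stand-in for the OCR Engine: returns (text, confidence) pairs.
        return [("Hello", 0.99), ("W0rld", 0.62)]

    def run_synthesis_passes(image, raw):
        # 3) Stand-in for additional OCR passes driven by the profile's
        #    Synthesis properties, e.g. re-reading low-confidence regions.
        return [("World", 0.91) if conf < 0.7 else (text, conf)
                for text, conf in raw]

    def merge_results(raw, synthesis):
        # 4) Combine both passes into one text flow, keeping the
        #    higher-confidence reading for each token.
        return [max(pair, key=lambda r: r[1]) for pair in zip(raw, synthesis)]

    def apply_result_filters(results, min_confidence=0.5):
        # 5) Drop undesirable (low-confidence) results.
        return [r for r in results if r[1] >= min_confidence]

    def recognize(image):
        # 1) The Recognize activity drives the whole pipeline.
        raw = run_ocr_engine(image)
        synthesis = run_synthesis_passes(image, raw)
        return apply_result_filters(merge_results(raw, synthesis))

    print(recognize(image=None))  # [('Hello', 0.99), ('World', 0.91)]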

What is an OCR Engine?

OCR Engines are software applications that perform the actual recognition of characters on images, analyzing the pixels on the image and determining what text characters they match.

OCR Engines themselves have three phases:

Pre-Processing

First and foremost, OCR applications require a black and white image in order to determine which pixels on a page are text. So, color and grayscale images must first be converted to black and white. This is done by a process called "thresholding", which determines a middle point between the light pixels and dark pixels on the page. Lighter pixels are then turned white and darker ones are turned black. You are left with only black and white pixels, with (ideally) all text in black and everything else faded into a white background.

The original scanned image is turned black and white, dividing the page into black pixels (text) and white pixels (the background).
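
As a rough illustration of the idea, the midpoint-style thresholding described above can be sketched in a few lines of Python with NumPy. This is a simplified stand-in, not any engine's actual implementation; production thresholding typically picks the split point statistically (e.g. Otsu's method):

    import numpy as np

    def threshold_to_black_and_white(gray):
        """Binarize a grayscale page: 0 = black (text), 255 = white (background).

        Simplified sketch: the split point is the midpoint between the
        lightest and darkest pixels on the page.
        """
        gray = np.asarray(gray, dtype=np.uint8)
        midpoint = (int(gray.min()) + int(gray.max())) // 2
        # Lighter pixels become white, darker pixels become black.
        return np.where(gray > midpoint, 255, 0).astype(np.uint8)

    page = np.array([[30, 40, 200],
                     [35, 210, 220],
                     [25, 45, 215]], dtype=np.uint8)
    print(threshold_to_black_and_white(page))
    # [[  0   0 255]
    #  [  0 255 255]
    #  [  0   0 255]]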

Some OCR Engines also contain de-skewing, despeckling, line removal, aspect ratio normalization, or other pre-processing functions to improve OCR results.
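
For a feel of what one of these functions does, here is a hedged sketch of a very simple despeckle pass in Python: an isolated black pixel with no black neighbors is assumed to be scanner noise and flipped to white. Real despeckle functions (and Grooper's IP Commands) are considerably more sophisticated, typically removing whole blobs below a size threshold:

    import numpy as np

    def despeckle(bw):
        """Remove isolated black pixels from a binarized page (0=black, 255=white).

        Toy version: a black pixel with no black 8-neighbors is treated
        as a speck of noise and turned white.
        """
        bw = np.asarray(bw)
        out = bw.copy()
        padded = np.pad(bw, 1, constant_values=255)   # pad so edges have neighbors
        for y, x in zip(*np.where(bw == 0)):          # every black pixel
            neighborhood = padded[y:y + 3, x:x + 3]   # 3x3 window around (y, x)
            if np.count_nonzero(neighborhood == 0) == 1:  # only the pixel itself
                out[y, x] = 255
        return out

    page = np.full((4, 4), 255, dtype=np.uint8)
    page[1, 1] = 0                    # a lone speck of noise
    print(despeckle(page).min())      # 255 -- the speck is gone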

FYI Grooper has its own pre-processing capabilities through its Image Processing operations. OCR Engines typically place these pre-processing functions in a "black box" for users. At best, the OCR Engine may allow you to turn a property "on" or "off" but may not allow you to configure it further to fine-tune its results. Custom Image Processing can be performed using IP Profiles made of highly configurable IP Commands.

Character Recognition

There are two basic types of recognition algorithms: matrix matching and feature extraction.

Matrix matching compares an NxN matrix of pixels on a page to a library of stored character glyph examples. This is also known as "pattern recognition" or "image correlation".

The character on the document's image is compared to a stored example by matching the matrix of pixels between the two.

The OCR Engine then makes a decision about which character that matrix of pixels is. In this case, a "G". Some kind of confidence or similarity score is also assigned. The example above is quite similar to the stored glyph; it would score something like 99%. Matrix matching does have some pitfalls, however. Because it compares text to the stored glyph pixel by pixel, the text needs to be similar to the stored glyph's font and scale in order to match. While there may be hundreds of example glyphs of various fonts and scales for a single character, this can still cause problems when matching text on poor-quality images or with uncommon fonts.
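
Here is a minimal sketch of matrix matching in Python, using a tiny hypothetical glyph library (not any particular engine's implementation): each candidate character is compared to every stored template pixel by pixel, and the fraction of agreeing pixels serves as the similarity score.

    import numpy as np

    # Hypothetical 5x5 glyph library: 1 = black pixel, 0 = white pixel.
    GLYPHS = {
        "G": np.array([[0, 1, 1, 1, 0],
                       [1, 0, 0, 0, 0],
                       [1, 0, 1, 1, 0],
                       [1, 0, 0, 1, 0],
                       [0, 1, 1, 1, 0]]),
        "C": np.array([[0, 1, 1, 1, 0],
                       [1, 0, 0, 0, 0],
                       [1, 0, 0, 0, 0],
                       [1, 0, 0, 0, 0],
                       [0, 1, 1, 1, 0]]),
    }

    def match_character(candidate):
        """Compare an NxN pixel matrix against every stored glyph, pixel
        by pixel, and return the best character with its similarity score."""
        best_char, best_score = None, 0.0
        for char, glyph in GLYPHS.items():
            score = np.mean(candidate == glyph)   # fraction of matching pixels
            if score > best_score:
                best_char, best_score = char, score
        return best_char, best_score

    # A slightly noisy "G": one pixel differs from the stored glyph.
    noisy_g = GLYPHS["G"].copy()
    noisy_g[3, 3] = 0
    print(match_character(noisy_g))  # ('G', 0.96)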

The second type of recognition algorithm, feature extraction, decomposes characters into their component "features", like lines, line direction, line intersections, or closed loops. These features are compared to vector-like representations of a character, rather than pixel representations of the character.

Instead of pixels, the features describing how the character is drawn are compared to how those features are used to draw stored glyphs.
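
By way of contrast, here is a toy sketch of feature-based matching in Python. The features and their values are invented for illustration; a real engine derives features such as loops, line directions, and intersections directly from the character's strokes, then finds the closest stored representation.

    # Toy feature-based matcher. The feature set is invented for
    # illustration; real engines extract richer features from strokes.

    # Feature vector: (closed loops, straight strokes, curved strokes)
    STORED_FEATURES = {
        "G": (0, 1, 1),   # one horizontal bar, one open curve
        "O": (1, 0, 1),   # one closed loop
        "L": (0, 2, 0),   # two straight strokes
    }

    def classify(features):
        """Return the stored character whose feature vector is nearest
        (squared Euclidean distance) to the extracted features."""
        def distance(char):
            stored = STORED_FEATURES[char]
            return sum((a - b) ** 2 for a, b in zip(features, stored))
        return min(STORED_FEATURES, key=distance)

    # Features extracted from a scanned character: no loops, one straight
    # stroke, one curve -- nearest stored vector is "G".
    print(classify((0, 1, 1)))  # G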


Post-Processing