OCR (Concept)

OCR stands for Optical Character Recognition. It digitizes text from paper documents so that it can be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text. This conversion allows Grooper to search the text characters from the image, providing the capability to separate images into documents, classify them, and extract data from them.

About

In short, OCR analyzes the pixels on an image and translates them into text. Most importantly, that text is machine readable. Grooper can be described as a document modeling platform. You use the platform to model how pages are separated out into documents, how one document gets put into one category or another, and how extractable data is structured on the document. Once you have this model of what a document is, how it fits into a larger document set, and where the data is on it, you can use it to programmatically process any document that fits the model.

In order to do any of that, you have to be able to read the text on the page. How do you know an invoice is an invoice? A simple way could be locating the word "invoice" (or other text associated with the invoice). You, as a human, do this by looking at the ink on a page (or pixels for a digital document) and reading the word "invoice". Grooper does this by using a Data Extractor (and a regular expression) to read the machine readable text for the page. OCR is how each page gets that machine readable text in order to model the document set and process it.
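As a rough illustration of locating the word "invoice" in machine readable text, here is a minimal Python sketch using a regular expression. It is not Grooper's Data Extractor; the pattern, the function name looks_like_invoice, and the sample text are all hypothetical.

```python
import re

# Hypothetical pattern: the word "invoice" as a whole word, in any letter case.
INVOICE_PATTERN = re.compile(r"\binvoice\b", re.IGNORECASE)

def looks_like_invoice(page_text: str) -> bool:
    """Return True if the machine readable page text mentions 'invoice'."""
    return INVOICE_PATTERN.search(page_text) is not None

# Sample text standing in for what OCR might return for one page.
ocr_text = "ACME Corp\nINVOICE No. 1001\nTotal due: $250.00"
print(looks_like_invoice(ocr_text))            # True
print(looks_like_invoice("Packing slip #77"))  # False
```

The point of the sketch is that the check only works on text. Without OCR (or native text), there is nothing for the pattern to search.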

The General Process

In Grooper, OCR is performed by the Recognize activity, referencing an OCR Profile which contains all the settings to get the OCR results, including which OCR Engine is used. The OCR Profile also has settings to optionally process those results to increase the accuracy of the OCR Engine used. The general process of OCR'ing a document is as follows in Grooper:

1) The document image is handed to the Recognize activity, which references an OCR Profile, containing the settings to perform the OCR operation.

2) The OCR Engine (set on the OCR Profile) converts the pixels on the image into machine readable text for the full page.

3) Grooper reprocesses the OCR Engine's results and runs additional OCR passes using the OCR Profile's Synthesis properties.

4) The raw OCR results from the OCR Engine and Grooper's Synthesis results are combined into a single text flow.

5) Undesirable results can be filtered out using Grooper's Results Filtering options.

The Recognize activity is handed the document image and performs OCR.

The results are seen here in a text flow.

The results are seen here in a "Layout View", using the character positions and font sizes obtained to overlay where they are on the document.

OCR vs. Native Text

OCR gets text specifically from images, whether they were printed and scanned or imported from a digital source. However, what if the document was created digitally and imported in its original digital form? Wouldn't it have been created on a computer, using machine readable text? Most likely, yes! If a form was created using a product like Adobe Acrobat and filled in using a computer, the text comprising the document and the filled fields is encoded within the document itself. This is called "Native Text". This text is already machine readable, so there is no reason to OCR the document. Instead, the native text is extracted via Grooper's native text extraction.

Native text has a number of advantages over OCR. OCR is not perfect. As you will see, OCR is a fairly complicated process with a number of opportunities to misread a document. Grooper has plenty of features to get around these errors and produce a better result, but OCR will rarely be as accurate as the original native text from a digital document.

However, be careful. Just because a PDF document has machine readable text behind it does not mean that text is native text. If the document was OCR'd by a different platform, the text may have been inserted into the PDF (Grooper also has this capability when exporting documents). In these cases, we still recommend OCR'ing the document to take advantage of Grooper's superior OCR capabilities and get a more accurate result.

Regardless of whether you get machine readable text through OCR or Native Text Extraction, both are done via the Recognize activity. In the case of OCR, you will need to create an OCR Profile containing all the settings to perform OCR and reference it during the Recognize activity. Native Text Extraction is enabled by default, but can be disabled if you wish to use OCR instead.

What is an OCR Engine?

OCR Engines are software applications that perform the actual recognition of characters on images, analyzing the pixels on the image and figuring out what text characters they match.

OCR Engines themselves have three phases:

1) Pre-Processing: In this phase, the OCR engine prepares the image to be read by turning color and grayscale images to black and white and potentially removing artifacts that get in the way of OCR, such as specks and lines. Text is also segmented into lines, then words, and finally characters in this phase.

2) Character Recognition: Here, the OCR engine takes those pixel character segments, compares them to examples of character glyphs, and makes a decision about which machine readable text character matches the segment.

3) Post-Processing: Commercial OCR engines also analyze the OCR results and attempt to correct inaccurate results, such as performing basic spellchecking.
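To make the Character Recognition phase more concrete, here is a toy Python sketch of comparing a pixel segment to example glyphs. It is a deliberately simplified illustration, not how any particular OCR engine works: the 3x3 glyph templates, the GLYPHS table, and the scoring rule (count the pixels that agree) are all assumptions made for this example.

```python
# Toy glyph templates on a 3x3 grid (1 = ink, 0 = background).
# These templates are hypothetical and purely illustrative.
GLYPHS = {
    "I": ((0, 1, 0),
          (0, 1, 0),
          (0, 1, 0)),
    "L": ((1, 0, 0),
          (1, 0, 0),
          (1, 1, 1)),
    "T": ((1, 1, 1),
          (0, 1, 0),
          (0, 1, 0)),
}

def match_score(segment, glyph):
    """Count the pixels where the segment and the glyph template agree."""
    return sum(
        s == g
        for seg_row, glyph_row in zip(segment, glyph)
        for s, g in zip(seg_row, glyph_row)
    )

def recognize(segment):
    """Return the character whose template agrees with the segment the most."""
    return max(GLYPHS, key=lambda ch: match_score(segment, GLYPHS[ch]))

# A "T" with one noisy ink pixel in the bottom-left corner.
noisy_t = ((1, 1, 1),
           (0, 1, 0),
           (1, 1, 0))
print(recognize(noisy_t))  # T
```

Even in this toy version, a stray pixel lowers the score of the correct glyph. With enough noise, a different glyph wins, which is the kind of misread that cleaner input helps avoid.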

For more in depth information on how OCR engines work, visit the OCR Engine article.

The Transym 4 and Transym 5 OCR engines are included in Grooper's licensing. Transym OCR 4 provides highly accurate English-only OCR while Transym OCR 5 provides multi-language OCR for 28 different languages. Google's open-source Tesseract engine is available in version 2.72 and beyond. ABBYY FineReader, Prime OCR, and Azure OCR are also supported but require separate installations and separate licensing.

Image Processing and OCR

Regardless of how good an OCR engine is, OCR is very rarely perfect. Characters can be segmented out wrong. Artifacts such as table lines, check boxes or even just specks from image noise can interfere with breaking out character segments from words and lines. Even when they are segmented out correctly, the OCR engine's character recognition can make the wrong decision about what the character is.

Image Processing (often abbreviated as "IP") can assist the OCR operation by providing a "cleaner" image to the OCR engine. The general idea is to give the OCR engine just the text pixels, so that is all the engine needs to process.

This image is much easier for OCR to process...	...than this image.

Images are altered using an IP Profile, which contains a step by step list of IP Commands, each of which performs a specific alteration to the image. IP Profiles are highly configurable. There are many different IP Commands, each of which has its own configurable properties as well. In the example above, the image was altered using an IP Profile with six steps, each step containing a different IP Command.

This is the list of steps in this IP Profile, each one named for the IP Command used: Auto Border Crop, Binarize, Shape Removal, Line Removal, Speck Removal, and Blob Removal.

Order of operation matters! The image is altered step by step, from the first to the last. The first step hands the second step the results of its IP Command. The second step runs using the altered image, not the original image. The second step then hands its result to the third step, and so on.
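The effect of step order can be sketched in a few lines of Python. This is only an illustration of chaining steps, not Grooper's IP Profile implementation; the brighten and binarize functions, their values (+50, threshold 128), and the run_profile helper are hypothetical stand-ins for IP Commands.

```python
from typing import Callable, List

Image = List[List[int]]          # grayscale pixels, 0 (black) to 255 (white)
Step = Callable[[Image], Image]  # a step takes an image and returns a new one

def brighten(image: Image) -> Image:
    """Hypothetical step: add 50 to every pixel, capped at 255."""
    return [[min(255, px + 50) for px in row] for row in image]

def binarize(image: Image) -> Image:
    """Hypothetical step: pixels at or above 128 become white, the rest black."""
    return [[255 if px >= 128 else 0 for px in row] for row in image]

def run_profile(image: Image, steps: List[Step]) -> Image:
    """Apply each step to the result of the previous step, in list order."""
    for step in steps:
        image = step(image)
    return image

page = [[90, 200]]
print(run_profile(page, [brighten, binarize]))  # [[255, 255]]
print(run_profile(page, [binarize, brighten]))  # [[50, 255]]
```

The same two steps produce different images depending on which runs first, because each step only ever sees the output of the one before it.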

For this IP Profile, the first step runs the "Auto Border Crop" command. It crops the image, removing its border.

In this case, the border was actually part of the document. Usually, borders appear around documents because of how they were scanned. However, the goal is still the same: remove superfluous, non-text pixels interfering with the OCR operation.

Before	After

The second step runs the "Binarize" command. The Binarize command turns the image black and white.

OCR requires a black and white image to analyze pixels and segment them out into characters. It needs a binary representation of the image: "is text" or "is not text". While OCR engines will binarize an image as part of their pre-processing phase, doing so in an IP Profile gives you control over how that operation is done. There is more than one way to binarize an image, and you have no control over it if you let the OCR engine do it for you.
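To show why the choice of binarization method matters, here is a small Python sketch comparing two simple approaches: a fixed threshold and a threshold taken from the image's mean brightness. Neither is claimed to be what Grooper's Binarize command or any OCR engine does; the function names and pixel values are assumptions for illustration.

```python
def binarize_fixed(image, threshold=128):
    """Pixels at or above the threshold become white (255), the rest black (0)."""
    return [[255 if px >= threshold else 0 for px in row] for row in image]

def binarize_mean(image):
    """Use the image's own average brightness as the threshold."""
    pixels = [px for row in image for px in row]
    threshold = sum(pixels) / len(pixels)
    return binarize_fixed(image, threshold)

# A dim scan: the "paper" is gray (110) and the "ink" is dark (30).
dim_scan = [
    [110, 110, 110],
    [110,  30, 110],
    [110, 110, 110],
]
print(binarize_fixed(dim_scan))  # every pixel turns black, so the ink is lost
print(binarize_mean(dim_scan))   # paper turns white, the ink pixel stays black
```

On this dim image the fixed threshold erases the distinction between paper and ink, while the mean-based threshold keeps it. Being able to pick the method per document set is the kind of control the paragraph above describes.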

Before	After

The next step runs the "Shape Removal" command. Shape Removal takes trained examples of shapes, in this case the company logo, locates them on a document and removes them.

While probably not strictly necessary for this example, removing the shape gives the OCR engine one less thing to look at when segmenting out characters and deciding which characters the pixels represent.

Before	A dropout mask is created for the detected sample shape.	After

The last three commands are Line Removal, Speck Removal, and Blob Removal. These three commands find three different types of pixel artifacts (lines, specks and blobs) and remove them, further isolating pixels that are only text characters.

Line Removal
Before	A dropout mask is created for detected lines.	After

Speck Removal
Before	A dropout mask is created for small specks. Here, mostly getting rid of dotted lines.	After

Blob Removal
Before	A dropout mask is created for detected blobs of a defined size.	After
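One common way to approach speck removal, not necessarily how Grooper implements it, is to find connected groups of ink pixels and erase the groups that are too small to be text. The Python sketch below assumes a binary image where 1 is ink and 0 is background; the remove_specks function and its max_speck_size parameter are hypothetical.

```python
def remove_specks(image, max_speck_size=2):
    """Erase connected groups of ink pixels (1s) no larger than max_speck_size."""
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood fill to collect one 4-connected group of ink pixels.
            component, stack = [], [(y, x)]
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and image[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            # Small groups are treated as specks and dropped from the output.
            if len(component) <= max_speck_size:
                for cy, cx in component:
                    cleaned[cy][cx] = 0
    return cleaned

noisy = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
for row in remove_specks(noisy):
    print(row)
```

In this example the two isolated pixels are erased and the 2x2 group is kept. Erasing groups above a minimum size instead of below a maximum would be one way to sketch blob removal along the same lines.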

You are left with an image that the OCR engine can much more easily break up into line, word, and finally character segments, vastly improving the accuracy of character recognition.

A portion of OCR results without applying the IP Profile	The same results with the IP Profile applied.

However, for the example above, the IP Profile's result is drastically different from the original image. While it certainly helps the OCR result, it's likely, at the end of the process, you want to export a document that looks more like the "before" picture than the "after". Luckily, Image Processing can be performed in two ways:

Permanent for archival purposes.
Temporary for OCR cleanup.

Permanent

Permanent Image Processing is done via the Image Processing activity. It is, as the name implies, a permanent alteration of the document's image. The Image Processing activity will reference an IP Profile and permanently apply its IP Commands to the document images. Once that image is changed, it is changed for the remainder of the Batch Process. There is no going back! IP Profiles used by the Image Processing activity should only use commands acceptable for final export.

The three most commonly used categories of IP Commands for permanent cleanup are:

Border Cleanup - These commands clean up border artifacts around an image by cropping the image or filling in the border with a given color.
Color Adjustment - These commands adjust the color values of the image, including brightness, color saturation, and contrast.
Image Transforms - These commands change the image's size and orientation.

Of the commands in those categories, a couple in each are particularly common:

Border Cleanup - Auto Border Crop, Border Fill
Color Adjustment - Brightness Contrast, Contrast Stretch
Image Transforms - Auto Deskew, Auto Orient

Temporary

Temp text

OCR Synthesis

stuffs