OCR Profile (Node Type)

From Grooper Wiki
An example of a configured OCR Profile's property settings

An OCR Profile defines the settings for performing OCR.

This includes:

  • Setting which OCR Engine is used
  • Determining whether a temporary IP Profile is used for image cleanup before the OCR engine runs
  • Grooper's unique Synthesis settings
    • Determining if and how multiple OCR results are pre-processed and re-processed
  • Determining if and how results are filtered to discard undesirable characters
  • Any configurable settings available from the OCR Engine

About

At first glance, an OCR Profile may look like a wall of properties, and in some ways, it is. An OCR Profile is a way to save a collection of properties that determine how OCR results are obtained. Let's break these properties down, using a configured OCR Profile as an example.

Below is "Full Text - Accurate", one of the default OCR Profiles that ship with Grooper, with these settings highlighted in each tab.

Here, you specify which OCR Engine will perform character recognition.

This OCR Profile is set to Transym OCR 4, using the Transym 4.0 OCR software to recognize characters.

One of the things that sets Grooper apart from other document processing platforms is its high degree of configurability when it comes to image processing. The basic idea is to give the OCR engine a "cleaned up" version of the document to use for OCR. When configured on an OCR Profile, this cleanup is "temporary" in that the archival version of the document is not changed. The image is only altered for the purposes of obtaining OCR results; once OCR is finished, the document reverts to its original form.

These image processing settings are defined with a different type of profile called an IP Profile, which is then referenced by the OCR Profile's IP Profile property.

This OCR Profile uses a pre-built IP Profile called "OCR Cleanup".
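The "temporary" nature of this cleanup can be illustrated with a minimal Python sketch. This is not Grooper's API: the function names, the nested-list image representation (1 = black pixel, 0 = white pixel), and the stand-in "engine" are all hypothetical, chosen only to show that the cleanup is applied to a working copy while the original image stays untouched.

```python
from copy import deepcopy


def remove_horizontal_lines(image, min_fraction=0.9):
    """Hypothetical cleanup step: blank out rows that are almost entirely black."""
    cleaned = []
    for row in image:
        if sum(row) >= min_fraction * len(row):
            cleaned.append([0] * len(row))  # treat the row as a table line
        else:
            cleaned.append(list(row))
    return cleaned


def ocr_with_temporary_cleanup(image, cleanup_steps, ocr_engine):
    """Run the cleanup steps on a copy, then hand that copy to the engine."""
    working = deepcopy(image)  # the archival image is never modified
    for step in cleanup_steps:
        working = step(working)
    return ocr_engine(working)


def count_black_pixels(image):
    """Stand-in for a real OCR engine, used only to make the sketch runnable."""
    return sum(sum(row) for row in image)


page = [
    [0, 1, 0, 0],
    [1, 1, 1, 1],  # a solid horizontal line
    [0, 0, 1, 0],
]
result = ocr_with_temporary_cleanup(page, [remove_horizontal_lines], count_black_pixels)
```

After the call, `page` still contains its solid line; only the copy given to the stand-in engine had it removed.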

Another thing that sets Grooper apart when it comes to OCR is our suite of Synthesis operations. These are different capabilities Grooper has to pre-process and re-process OCR results to improve the OCR engine's results.

This OCR Profile uses a variety of these Synthesis properties, all of which are highlighted in yellow. To learn more about this suite of properties, what they do, how they improve OCR results, and how to configure them, visit the Synthesis article.

The Result Filtering settings allow you to isolate certain characters and remove them from your results. Maybe you want to discard any characters that do not meet a minimum confidence score. Maybe you want to discard all characters below a certain font size. Maybe you want to discard all characters within a certain distance to the edge of the page. You can do those things (and more) using these Result Filtering settings.

This OCR Profile does not use any of these settings. However, they are highlighted below.
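To make the idea of Result Filtering concrete, here is a minimal Python sketch of the three filters mentioned above (minimum confidence, minimum font size, and distance from the page edge). The `OcrChar` record, the `filter_results` function, and all parameter names and units are assumptions made for illustration; they are not Grooper's property names or data model.

```python
from dataclasses import dataclass


@dataclass
class OcrChar:
    """Hypothetical record for one recognized character."""
    text: str
    confidence: float  # 0.0 to 1.0
    height: float      # glyph height, in points
    x: float           # position on the page, in points
    y: float


def filter_results(chars, page_width, page_height,
                   min_confidence=0.0, min_height=0.0, edge_margin=0.0):
    """Keep only characters that pass every enabled filter."""
    kept = []
    for c in chars:
        if c.confidence < min_confidence:
            continue  # recognized with too little confidence
        if c.height < min_height:
            continue  # font too small
        near_edge = (
            c.x < edge_margin
            or c.y < edge_margin
            or c.x > page_width - edge_margin
            or c.y > page_height - edge_margin
        )
        if edge_margin > 0 and near_edge:
            continue  # too close to the edge of the page
        kept.append(c)
    return kept


chars = [
    OcrChar("A", 0.95, 10, 100, 100),
    OcrChar("?", 0.30, 10, 120, 100),  # low confidence
    OcrChar(".", 0.90, 3, 140, 100),   # very small glyph
    OcrChar("B", 0.92, 10, 5, 100),    # near the left edge
]
kept = filter_results(chars, page_width=612, page_height=792,
                      min_confidence=0.8, min_height=6, edge_margin=18)
```

With the default arguments, every filter is disabled and all characters are kept, which mirrors a profile that does not use any filtering.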

Each OCR Engine has its own set of properties available to Grooper as well. These properties change from OCR engine to OCR engine, depending on which settings the OCR engine's software exposes to Grooper. However, they are always in the right window panel of the OCR Profile.

This OCR Profile uses Transym 4.0, whose settings are seen in the highlighted portion.

The OCR Testing Tab

When you select an OCR Profile in the Node Tree, it also has an "OCR Testing" tab for verifying the results of the profile. This tab pulls up a testing module that lets you select documents from a Test Batch, OCR individual pages, and view extra diagnostic information that will help you fine-tune your property settings.

  1. To access this testing module, first select an OCR Profile in the Node Tree. Here, we've selected the "Full Text - Accurate" OCR Profile that comes with all Grooper installations.
    • You can follow this path in the Node Tree to find this OCR Profile:
    • Root Node > Global Resources > OCR Profiles > Downloads > Full Text - Accurate
  2. Click the "OCR Testing" tab to bring up the OCR Profile testing module.
  3. From here, you can select a Test Batch of documents using the Batch selector dropdown menu.
  4. Select a Test Batch from the list. Here, we're selecting a Test Batch named "OCR Example".
  5. Select a page in the batch.
    • You will see the selected page appear in the "Document Viewer" window.
  6. Press the "OCR Page" button.

Once OCR is finished, you will see OCR results appear in the "Layout View" tab at the bottom of the screen.

  1. Notice that, in addition to the "Layout View", "Text View" and "Character View" tabs, there is now a "Diagnostics" tab. This tab only appears when you perform this ad hoc testing of an OCR Profile on an individual page.
  2. It provides a variety of diagnostic images, such as the "IP Image" selected here. This shows you the pre-processed version of this page handed to the OCR engine (as altered by the IP Profile set on the OCR Profile).

As you can see here, this diagnostic image does not contain any table lines, whereas the actual page does. These table lines were removed temporarily by a Line Removal step in the IP Profile (a very common temporary image processing adjustment to improve OCR results).

The other diagnostic images have to do with Grooper's Synthesis settings. You can learn more about these settings and these diagnostic images by visiting the Synthesis article.

Last but certainly not least, there is always an "Execution Log" file at the bottom of this diagnostics panel. This is a text file detailing information about the OCR operation, and it can be particularly helpful when configuring Grooper's Synthesis settings.

Use Cases

OCR Profiles are required to obtain machine-readable text from any image-based content. Depending on the image quality or source document quality, the OCR Profile may range from a relatively simple one, perhaps just setting the OCR engine to be used, to a more complex one that takes advantage of temporary image processing, Grooper's Synthesis suite, or Result Filtering settings.

The only time you won't use an OCR Profile to obtain machine-readable text is if you are only processing documents with full native text. These are digital documents, such as a PDF created with encoded text already present, whose text can be extracted via the Native Text Extraction functionality of the Recognize activity.
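The choice between native text and OCR can be pictured with a small Python sketch. It is only an illustration of the decision described above, not how the Recognize activity is implemented; `get_page_text` and its arguments are hypothetical names.

```python
def get_page_text(native_text, run_ocr):
    """Return (text, source), preferring native text when the page has any.

    native_text: text already encoded in the document, or None/empty.
    run_ocr: a zero-argument callable that performs OCR on the page image.
    """
    if native_text and native_text.strip():
        return native_text, "native"
    return run_ocr(), "ocr"


def fake_ocr():
    """Stand-in for running an OCR Profile on a scanned page."""
    return "text from OCR"


digital_page = get_page_text("Invoice 1001", fake_ocr)
scanned_page = get_page_text(None, fake_ocr)
```

In this sketch, OCR only runs for pages that carry no native text at all.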

How To

Create an OCR Profile

Execute an OCR Profile