2023:OCR Profile (Node Type)

An OCR Profile defines the settings for performing OCR.

This includes:

Setting which OCR Engine is used
Determining whether a temporary IP Profile is used for image cleanup before the OCR engine runs
Configuring Grooper's unique Synthesis settings
- Determining if and how multiple OCR results are pre-processed and re-processed
Determining if and how results are filtered to discard undesirable results
Configuring any settings made available by the OCR Engine
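
As a way to picture how these groups of settings sit together, here is a minimal Python sketch. It is a conceptual model only and not Grooper's API: the class names (`OcrProfile`, `SynthesisSettings`), every field name, and the engine name "SomeEngine" are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SynthesisSettings:
    """Hypothetical stand-in for the Synthesis settings group."""
    enabled: bool = True


@dataclass
class OcrProfile:
    """Illustrative model only; field names are assumptions, not Grooper properties."""
    name: str
    ocr_engine: Optional[str] = None   # which OCR engine runs
    ip_profile: Optional[str] = None   # temporary image cleanup before OCR, if any
    synthesis: SynthesisSettings = field(default_factory=SynthesisSettings)
    result_filters: list = field(default_factory=list)    # rules that discard unwanted results
    engine_settings: dict = field(default_factory=dict)   # engine-specific options

    def is_runnable(self) -> bool:
        # An engine must be chosen before any OCR results can be produced.
        return self.ocr_engine is not None


blank = OcrProfile(name="OCR Profile Example")
configured = OcrProfile(name="OCR Profile Example", ocr_engine="SomeEngine")
```

In this sketch, a freshly created profile has every group present but empty, and only the engine choice is needed before it can produce results.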

About

At first glance, an OCR Profile may look like a wall of properties, and in some ways, it is. An OCR Profile is a way to save a collection of properties that determine how OCR results are obtained. Let's break these properties down, using a configured OCR Profile as an example.

The OCR Testing Tab

When you select an OCR Profile in the Node Tree, it also provides an "OCR Testing" tab for verifying the profile's results. This tab opens a testing module where you can select documents from a Test Batch, OCR individual pages, and view extra diagnostic information that will help you fine-tune your property settings.

How to Find the Testing Module

To access this testing module, first select an OCR Profile in the Node Tree. Here, we've selected the "Full Text - Accurate" OCR Profile that comes with all Grooper installations.
- You can follow this path in the Node Tree to find this OCR Profile:
- Root Node > Global Resources > OCR Profiles > Downloads > Full Text - Accurate
Click the "OCR Testing" tab to bring up the OCR Profile testing module.

Selecting a Test Batch

From here, you can select a Test Batch of documents using the Batch selector dropdown menu.
Select a Test Batch from the list. Here, we're selecting a Test Batch named "OCR Example".

Testing a Page

Select a page in the batch.
- You will see the selected page appear in the "Document Viewer" window.
Press the "OCR Page" button.

Reviewing the OCR Results

Once OCR is finished, you will see OCR results appear in the "Layout View" tab at the bottom of the screen.

Notice that, alongside the "Layout View", "Text View" and "Character View" tabs, there is now a "Diagnostics" tab. This tab appears only when you test an OCR Profile ad hoc on an individual page.
It provides a variety of diagnostic images, such as the "IP Image" selected here. This image shows the pre-processed version of the page that was handed to the OCR engine (as altered by the IP Profile set on the OCR Profile).

As you can see here, this diagnostic image does not contain any table lines, whereas the actual page does. These table lines were removed temporarily by a Line Removal step in the IP Profile (a very common temporary image processing adjustment to improve OCR results).
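
The idea of temporary cleanup can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Grooper's Line Removal works: the page is a tiny grid of 0/1 pixels, a "line" is any row that is almost entirely dark, and the function names and the stand-in engine are hypothetical. The point it shows is that the engine sees a cleaned copy while the stored page keeps its lines.

```python
from copy import deepcopy


def remove_horizontal_lines(image, min_fill=0.9):
    """Return a cleaned copy in which rows that are almost entirely dark are blanked.

    `image` is a list of rows; 1 is a dark pixel, 0 is background.
    """
    cleaned = deepcopy(image)
    for y, row in enumerate(cleaned):
        if row and sum(row) / len(row) >= min_fill:
            cleaned[y] = [0] * len(row)
    return cleaned


def ocr_with_temporary_cleanup(image, ocr_engine):
    """Run the (hypothetical) engine on a cleaned copy; the original image is untouched."""
    ip_image = remove_horizontal_lines(image)
    return ocr_engine(ip_image), ip_image


def count_dark_pixels(image):
    """Stand-in for an OCR engine: it just counts the dark pixels it was given."""
    return sum(sum(row) for row in image)


page = [
    [1, 1, 1, 1, 1, 1],  # table line
    [0, 1, 0, 0, 1, 0],  # text pixels
    [1, 1, 1, 1, 1, 1],  # table line
]
result, ip_image = ocr_with_temporary_cleanup(page, count_dark_pixels)
```

Here `ip_image` plays the role of the "IP Image" diagnostic: it is what the engine received, while `page` still contains its table lines afterwards.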

The other diagnostic images relate to Grooper's Synthesis settings. You can learn more about these settings and their diagnostic images by visiting the Synthesis article.

Finally, there is always an "Execution Log" file at the bottom of this diagnostics panel. This text file details the OCR operation, and it can be particularly helpful when configuring Grooper's Synthesis settings.

Use Cases

OCR Profiles are required to obtain machine-readable text from any image-based content. Depending on the image quality or source document quality, the OCR Profile may range from a relatively simple configuration, perhaps just setting the OCR engine to be used, to a more complex one that takes advantage of temporary image processing, Grooper's Synthesis suite, or Result Filtering settings.

The only time you won't use an OCR Profile to obtain machine-readable text is if you are only processing documents with full native text. These are digital documents, such as a PDF created with encoded text already present, whose text can be extracted via the Native Text Extraction functionality of the Recognize activity.

How To

Create an OCR Profile

Add a New OCR Profile to the Node Tree

Creating an OCR Profile is fairly straightforward. OCR Profiles may be created and stored in a Content Model's local resources folder or in the OCR Profiles folder in the Node Tree (which is found in the Global Resources folder). However, the most common place to create an OCR Profile is in the OCR Profiles folder.

Navigate to the OCR Profiles folder in the Global Resources folder in the Node Tree, following the path below.
- `Root Node > Global Resources > OCR Profiles`
Right-click the OCR Profiles folder, mouse over "Add" and select "OCR Profile..."
Name the OCR Profile whatever you like and select "OK" to create it.
- For this exercise, we named ours "OCR Profile Example".
This will create a blank OCR Profile in the OCR Profiles folder.

Configure the OCR Profile

At a bare minimum, you will need to select an OCR engine in order to obtain OCR results. To do this, select the OCR Engine property and choose an OCR engine from the dropdown menu.
Configure the rest of the OCR Profile's properties according to your documents' needs. General information about these properties can be found in the About section of this article.

Execute an OCR Profile

Now that you have made and configured an OCR Profile, how do you execute it? OCR results are obtained by the Recognize activity. This activity will perform OCR on documents based on the settings in an OCR Profile. You will run this activity in one of two ways in Grooper:

Manually ("ad hoc") while testing and configuring within Grooper Design Studio.
As a step in a Batch Process.

Manual - Single Page

Anywhere you can access a Batch Viewer in Grooper, you can execute various activities manually on a page, folder, or entire batch. This manual execution of activities is typical when building and testing your solution design in Grooper Design Studio.

To manually apply an OCR Profile to a single page:

Navigate to a page in a Batch and right-click it.
Select "Activities" from the selection list.
Select "Recognize" from the second selection list.

On the following pop-up window:

Select the OCR Profile property, and select the OCR Profile you wish to apply from the dropdown list.
Press the "Ok" button to run the Recognize activity using the selected OCR Profile on the page.

Manual - All Pages in a Batch

To manually apply an OCR Profile to all pages within a Batch:

Navigate to a Batch and right-click it.
Select "Contents" from the selection list.
Select "Apply Activity..." from the second selection list.

On the following pop-up window:

Change the Activity Type property to Recognize.
Expand the Activity settings and change the OCR Profile property to the OCR Profile you wish to use.
Change the Scope property to Page.
Press the "Execute" button to run the Recognize activity on all pages in the Batch, using the selected OCR Profile's settings.
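
To make the effect of setting the Scope to Page concrete, here is a small Python sketch of the idea: walk everything under the Batch and run the activity once per page, however deeply the page is nested in folders. This is an assumption-laden illustration, not Grooper code; the dictionary layout, `iter_pages`, `apply_recognize`, and `fake_recognize` are all hypothetical.

```python
def iter_pages(node):
    """Yield every page under a batch or folder node, depth first."""
    if node["type"] == "page":
        yield node
        return
    for child in node.get("children", []):
        yield from iter_pages(child)


def apply_recognize(batch, ocr_profile_name, recognize):
    """Call `recognize` once per page, mimicking a Scope of Page. Returns the page count."""
    count = 0
    for page in iter_pages(batch):
        page["text"] = recognize(page, ocr_profile_name)
        count += 1
    return count


def fake_recognize(page, profile):
    """Stand-in for the real activity; it only records which profile was used."""
    return f"{page['name']} via {profile}"


batch = {"type": "batch", "children": [
    {"type": "folder", "children": [
        {"type": "page", "name": "p1"},
        {"type": "page", "name": "p2"},
    ]},
    {"type": "page", "name": "p3"},
]}

processed = apply_recognize(batch, "OCR Profile Example", fake_recognize)
```

Folders are only traversed, never processed themselves, which is the behavior the Page scope is meant to convey in this sketch.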

Batch Process

For automated Batch Processing, you'll want to add a Recognize step and configure it to use the OCR Profile.

On a working Batch Process, press the "Add" button to add a new step.
In the "Step Properties" panel, use the Activity Type property to select Recognize from the dropdown menu.
In the "Activity Properties" panel, use the OCR Profile property to select the OCR Profile you wish to apply.