2023:OCR Profile (Node Type)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

An example of a configured OCR Profile's property settings

OCR Profiles store configuration settings for optical character recognition (OCR). They are used by the Recognize activity to convert images of text on Batch Pages into machine-encoded text. OCR Profiles are highly configurable, allowing fine-grained control over how OCR occurs, how pre-OCR image cleanup occurs, and how Grooper's OCR Synthesis occurs. All of this serves the end goal of highly accurate OCR text data, which is used to classify documents, extract data and more.

This includes:

  • Setting which OCR Engine is used
  • Determining whether a temporary IP Profile is used for image cleanup before the OCR engine runs
  • Grooper's unique Synthesis settings
    • Determining if and how multiple OCR results are pre-processed and re-processed
  • Determining if and how results are filtered to toss out undesirable ones
  • Any configurable settings available from the OCR Engine
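As a rough mental model, the groups of settings listed above can be pictured as fields on a single configuration object. The sketch below is illustrative only: `OcrProfileSketch`, `ResultFilter`, `describe`, and every field name are hypothetical and are not part of Grooper's object model or any Grooper API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResultFilter:
    """Hypothetical filter setting: discard results below a confidence threshold."""
    min_confidence: float = 0.0


@dataclass
class OcrProfileSketch:
    """Illustrative stand-in for an OCR Profile (not Grooper's real class)."""
    ocr_engine: str                                # which OCR engine is used
    ip_profile: Optional[str] = None               # temporary pre-OCR image cleanup, if any
    synthesis_enabled: bool = False                # whether Synthesis processing is applied
    result_filter: Optional[ResultFilter] = None   # if and how results are filtered
    engine_settings: dict = field(default_factory=dict)  # engine-specific options


def describe(profile):
    """Summarize which optional parts of the profile are switched on."""
    return ", ".join([
        f"engine={profile.ocr_engine}",
        f"cleanup={'on' if profile.ip_profile is not None else 'off'}",
        f"synthesis={'on' if profile.synthesis_enabled else 'off'}",
        f"filter={'on' if profile.result_filter is not None else 'off'}",
    ])
```

Only the engine is required in this sketch, which mirrors the point made later in the article that selecting an OCR engine is the bare minimum configuration.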

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains a Project with resources used in examples throughout this article. The second contains one or more Batches of sample documents.

Glossary

Activity: Grooper Activities define specific document processing operations done to a Batch, Batch Folder, or Batch Page. In a Batch Process, each Batch Process Step executes a single Activity (determined by the step's "Activity" property).

  • Batch Process Steps are frequently referred to by the name of their configured Activity followed by the word "step". For example: "Classify step".

Batch Page: Batch Page nodes represent individual pages within a Batch. Batch Pages are created in one of two ways: (1) when images are scanned into a Batch using the Scan Viewer, or (2) when they are split from a PDF or TIFF file using the Split Pages activity.

  • Batch Pages are frequently referred to simply as "pages".

Batch Process: Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a Batch. Each step is either a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps, which are added as child nodes.
  • A Batch Process is often referred to as simply a "process".

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Document Viewer: The Grooper Document Viewer is the portal to your documents. It is the UI that allows you to see a Batch Folder's (or a Batch Page's) image, text content, and more.

Execute: Execute is an Activity that runs one or more specified object commands. This gives access to a variety of Grooper commands in a Batch Process for which there is no Activity, such as the "Sort Children" command for Batch Folders or the "Expand Attachments" command for email attachments.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

IP Profile: IP Profiles are a step-by-step list of image processing operations (IP Commands). They are used for several image processing related operations, but primarily for:

  1. Permanently enhancing an image during the Image Processing activity (usually to get rid of defects in a scanned image, such as skewing or borders).
  2. Cleaning up an image in memory during the Recognize activity to improve OCR accuracy, without altering the stored image.
  3. Computer vision operations that collect layout data (table line locations, OMR checkboxes, barcode values and more) utilized in data extraction.

Line Removal: Line Removal is an IP Command that locates and removes horizontal and vertical lines from documents. The detected line locations are stored as part of the page's layout data.

Node Tree: The Node Tree is the hierarchical list of Grooper node objects found in the left panel in the Design Page. It is the basis for navigation and creation in the Design Page.

OCR Engine: An "OCR engine" is the part of OCR software that recognizes text from images. OCR engines analyze the image's pixels to determine where text is on the page and what each character is. In Grooper, OCR engines are selected when configuring an OCR Profile's OCR Engine property.

OCR Profile: OCR Profiles store configuration settings for optical character recognition (OCR). They are used by the Recognize activity to convert images of text on Batch Pages into machine-encoded text. OCR Profiles are highly configurable, allowing fine-grained control over how OCR occurs, how pre-OCR image cleanup occurs, and how Grooper's OCR Synthesis occurs. All of this serves the end goal of highly accurate OCR text data, which is used to classify documents, extract data and more.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized so that it can be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as Content Models, Batch Processes, and profile objects are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Recognize: Recognize is an Activity that obtains machine-readable text from Batch Pages and Batch Folders. When properly configured with an OCR Profile, Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also reference an IP Profile to collect "layout data" like lines, checkboxes, and barcodes. Other Activities then use this machine-readable text and layout data for document analysis and data extraction.

Root: The Grooper database Root node is the topmost element of the Grooper Repository. All other nodes in a Grooper Repository are its children/descendants. The Grooper Root also stores several settings that apply to the Grooper Repository, including the license serial number or license service URL and Repository Options.

Test Batch: "Test Batch" is a specialized Import Provider designed to facilitate the import of content from an existing Batch in the test environment. This provider is most commonly used for testing, development, and validation scenarios, and is not intended for production use.

  • Looking for information on "production" vs "test" Batches in Grooper? See here.

About

At first glance, an OCR Profile may look like a wall of properties, and in some ways, it is. An OCR Profile is a way to save a collection of properties that determine how OCR results are obtained. Let's break these properties down, using a configured OCR Profile as an example.

Below you will see one of the default OCR Profiles that ship with Grooper named "Full Text - Accurate", with these settings highlighted in each tab.


The OCR Testing Tab

When you select an OCR Profile in the Node Tree, it also provides an "OCR Testing" tab for verifying the profile's results. This pulls up a testing module that lets you select documents from a Test Batch, OCR individual pages, and view extra diagnostic information to help fine-tune your property settings.

  1. To access this testing module, first select an OCR Profile in the Node Tree. Here, we've selected the "Full Text - Accurate" OCR Profile that comes with all Grooper installations.
    • You can follow this path in the Node Tree to find this OCR Profile:
    • Root Node > Projects > Essentials > Profiles > Full Text - Accurate
  2. Click the "Tester" tab.
  3. From here, click the Batch selection icon.
  4. Select a Test Batch from the dropdown list. Here, we're selecting a Test Batch named "Application for Cow Ownership".
  5. Click the test icon.
  6. You will see the selected page appear in the "Document Viewer" window.

Once OCR is finished, you will see OCR results appear in the "Layout View" tab at the bottom of the screen. You can look at the "Text View" or "Character View" by selecting the corresponding tabs.

  1. After testing the OCR, you can click the "Diagnostics" icon to take a look at the diagnostic tools available.

  1. The "Diagnostics" window, which opens in a new tab, provides a variety of diagnostic images, such as the "IP Image" selected here. This shows you the pre-processed version of this page handed to the OCR engine (as altered by the IP Profile set on the OCR Profile).

As you can see here, this diagnostic image does not contain any table lines, whereas the actual page does. These table lines were removed temporarily by a Line Removal step in the IP Profile (a very common temporary image processing adjustment to improve OCR results).
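The idea of temporary line removal can be sketched in a few lines: work on a copy of the page image, erase long runs of black pixels, and record where they were so the locations remain available as layout data. This is a deliberately simplified illustration that handles horizontal lines only on a tiny 0/1 pixel grid; the function name and the `min_length` parameter are hypothetical, and this is not how Grooper's Line Removal command is implemented.

```python
def remove_horizontal_lines(image, min_length):
    """Return (cleaned_copy, lines) without modifying `image`.

    `image` is a list of rows; each pixel is 1 (black) or 0 (white).
    A line is a horizontal run of black pixels at least `min_length` long.
    Each detected line is reported as (row, start_col, end_col), end inclusive.
    """
    cleaned = [row[:] for row in image]  # copy, so the original page is untouched
    lines = []
    for r, row in enumerate(image):
        c = 0
        while c < len(row):
            if row[c] == 1:
                start = c
                while c < len(row) and row[c] == 1:
                    c += 1  # advance to the end of this black run
                if c - start >= min_length:
                    lines.append((r, start, c - 1))  # keep the location as layout data
                    for k in range(start, c):
                        cleaned[r][k] = 0  # erase the line in the copy only
            else:
                c += 1
    return cleaned, lines
```

Short runs, such as the strokes of characters, fall below `min_length` and are left alone, which is why the text survives while the table lines disappear from the image handed to the OCR engine.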

The other diagnostic images have to do with Grooper's Synthesis settings. You can learn more about these settings and these diagnostic images by visiting the Synthesis article.

Last but certainly not least, there is always an "Execution Log" file at the bottom of this diagnostics panel. This is a text file detailing information about the OCR operation, and it can be particularly helpful when configuring Grooper's Synthesis settings.


Use Cases

OCR Profiles are required to obtain machine-readable text from any image-based content. Depending on the quality of the image or source document, the OCR Profile may range from a relatively simple configuration, perhaps just setting the OCR engine to be used, to a more complex one that takes advantage of temporary image processing, Grooper's Synthesis suite, or Result Filtering settings.

The only time you won't use an OCR Profile to obtain machine readable text is if you are only processing documents with full native text. These would be digital documents like a PDF created with encoded text already present that can be extracted via the Native Text Extraction functionality of the Recognize activity.
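The decision described above can be sketched as a simple fallback: use the text already encoded in the file when it exists, and run OCR only when it does not. The function and parameter names here are hypothetical, and the real Recognize activity may make this choice at a finer granularity than a whole page.

```python
def recognize_page(native_text, run_ocr):
    """Return (text, source) for one page.

    native_text: text already encoded in the file, or None/"" for image-only pages
    run_ocr:     zero-argument callable that performs OCR and returns text
    """
    if native_text and native_text.strip():
        # Digital text is already present, so no OCR (and no OCR Profile) is needed.
        return native_text, "native"
    # Image-only content: OCR is the only way to obtain text.
    return run_ocr(), "ocr"
```

Passing OCR in as a callable keeps the expensive step lazy, so it never runs for pages that already carry native text.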

How To

Create an OCR Profile

Add a New OCR Profile to the Node Tree

Creating an OCR Profile is fairly straightforward. OCR Profiles may be created and stored in a Content Model's Local Resources folder.

  1. Navigate to the Local Resources folder in your Content Model folder in the Node Tree. Feel free to add folders under your Local Resources folder for organization. Here we have created a folder called "OCR Profiles". Right click on the folder.
  2. Mouse over "Add"...
  3. ...and click "OCR Profile..."
  4. Name the OCR Profile whatever you like.
    • For this exercise we just named ours "OCR Profile Example."
  5. Click "EXECUTE" to create it.

This will create a blank OCR Profile in the OCR Profiles folder.

Configure the OCR Profile

  1. At a bare minimum, you will need to select an OCR engine in order to obtain OCR results. To do this, select the OCR Engine property and choose an OCR engine from the dropdown menu.

Configure the rest of the OCR Profile's properties according to your documents' needs. General information about these properties can be found in the About section of this article.


Execute an OCR Profile

Now that you have made and configured an OCR Profile, how do you execute it? OCR results are obtained by the Recognize activity. This activity will perform OCR on documents based on the settings in an OCR Profile. You will run this activity in one of two ways in Grooper:

  1. Manually ("ad hoc") while testing and configuring within Grooper Design Studio.
  2. As a step in a Batch Process.

Anywhere you can get to a Batch Viewer in Grooper, you can execute various activities manually on a page, folder, or entire Batch. This manual execution of activities is typical when building and testing your solution design in Grooper Design Studio.

To manually apply an OCR Profile to a single page:

  1. Navigate to a page in a Batch and right click it.
  2. Select "Activities" and then "Cleanup & Recognition" from the selection list.
  3. Click "Recognize" from the final selection list.

On the following pop up window:

  1. Click the hamburger menu icon next to the OCR Profile property.
  2. Select the OCR Profile you wish to apply from the dropdown list.
  3. Click the "EXECUTE" button to run the Recognize activity using the selected OCR Profile on the page.

For automated Batch Processing, you'll want to add a Recognize step and configure it to use the OCR Profile.

  1. Right click on a Batch Process.
  2. Hover over "Add Activity" and then "Cleanup & Recognition".
  3. Click on "Recognize..." to add a Recognize step.
  4. When the "Add Activity" window pops up, click "EXECUTE".
  5. In the "Activity Properties" panel, click the hamburger menu next to the OCR Profile property.
  6. Select the OCR Profile desired from the drop down menu.
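Conceptually, the steps above add a Recognize step to an ordered list of steps and then point that step at an OCR Profile. The following sketch models that relationship and includes a small check for Recognize steps that have no profile selected yet. All class, method, and property names are hypothetical and do not correspond to Grooper's scripting or object model.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """Illustrative Batch Process Step: one Activity plus its property settings."""
    activity: str
    properties: dict = field(default_factory=dict)

    @property
    def name(self):
        # Steps are commonly referred to as "<Activity> step".
        return f"{self.activity} step"


@dataclass
class BatchProcessSketch:
    """Illustrative Batch Process: an ordered list of steps."""
    steps: list = field(default_factory=list)

    def add_activity(self, activity, **properties):
        step = Step(activity, dict(properties))
        self.steps.append(step)
        return step

    def recognize_steps_without_profile(self):
        """Recognize steps that have no OCR Profile selected yet."""
        return [
            s for s in self.steps
            if s.activity == "Recognize" and not s.properties.get("ocr_profile")
        ]
```

For example, calling `add_activity("Recognize")` and then setting `properties["ocr_profile"] = "OCR Profile Example"` on the returned step corresponds to adding the step and selecting the profile in the "Activity Properties" panel.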