Recognize (Activity)

From Grooper Wiki
Revision as of 13:50, 18 April 2025 by Rpatton (talk | contribs) (// via Wikitext Extension for VSCode)

This article is about the current version of Grooper.

Note that some content may still need to be updated.



WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Recognize is an Activity that obtains machine-readable text from Batch Pages and Batch Folders. When properly configured with an OCR Profile, Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also reference an IP Profile to collect "layout data" like lines, checkboxes, and barcodes. Other Activities then use this machine-readable text and layout data for document analysis and data extraction.

You may download and import the file(s) below into your own Grooper environment (version 2023.1). There is a Batch with the example document(s) discussed in this tutorial, as well as a Project configured according to its instructions.


About

Text data is important for almost everything you do in Grooper. As part of the Second Phase of Grooper (Condition), the Recognize activity obtains text data from two kinds of sources: scanned or imported image-based documents, and imported digital documents that already have native machine-readable text embedded. For image-based documents, the activity can reference and execute an OCR Profile, performing Optical Character Recognition (OCR) according to its settings. For digital documents, the activity will extract the embedded text.

Recognize can also locate and store layout data on a document. To do so, an IP Profile using IP Commands such as Line Detection or Box Detection must be referenced.

The Recognize activity can:

  1. obtain machine-readable text from images by performing OCR on a page or on images embedded in a PDF
  2. extract native text embedded in a document directly
  3. extract Layout Data with a referenced IP Profile
  4. perform any combination of the above at the Batch Folder or Batch Page level

FYI

For mixed documents with a combination of native text and images, Recognize will always collect the native text and then OCR the images of the document.

All of this makes text on a document readable to Grooper.

The Recognize activity, like any activity, must be assigned to a Batch Process Step as part of a Batch Process.

How To: Configure the Recognize Activity

Prereqs

There are a few things to consider prior to configuring your Recognize activity.

How are you Acquiring your documents into Grooper?

  • Scanning - Scanned documents come into Grooper as images. This will impact how the Recognize Step is configured.
  • Import - Imported documents can have native text embedded in them, or they may be an image file without any native text.

Do your documents have native text?

  • Native Text Present - This will significantly simplify your Recognize Step configuration.
  • No Native Text Present - If there is no native text, then OCR must be run and an OCR Profile needs to be configured.
  • Both Native Text and Image-Based Text Present - If you also want the text that appears only in the document's images (and so is not part of the native text), you will need to run OCR with an OCR Profile.

Do the documents contain layout data you wish Grooper to recognize?

  • Layout Data includes things like table lines, barcodes, and OMR check boxes that, while they do not contain text information, are helpful in understanding the format of a document.
  • Layout Data can only be obtained through Image Processing, so an IP Profile will need to be configured and referenced by your Recognize Step.
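
The checklist above can be summarized as a small decision function. The following Python sketch is illustrative only: `DocumentTraits` and `required_profiles` are hypothetical names, not part of Grooper, and the function simply restates which profiles the prerequisites call for.

```python
from dataclasses import dataclass


@dataclass
class DocumentTraits:
    """Answers to the prerequisite questions (hypothetical names)."""
    has_native_text: bool    # text is already embedded in the file
    has_image_text: bool     # some text exists only as pixels in images
    wants_layout_data: bool  # table lines, barcodes, checkboxes, etc.


def required_profiles(traits: DocumentTraits) -> dict:
    """Map the checklist answers to the profiles Recognize will need."""
    return {
        # OCR is the only way to read text that exists only in images.
        "ocr_profile": traits.has_image_text,
        # Layout data can only be obtained through Image Processing.
        "ip_profile": traits.wants_layout_data,
    }


# Example: a scanned form with checkboxes and no native text.
scanned_form = DocumentTraits(
    has_native_text=False, has_image_text=True, wants_layout_data=True
)
print(required_profiles(scanned_form))
```

A purely digital PDF with no layout needs would return `False` for both entries, which matches the "Native Text Present" case: no OCR Profile or IP Profile is required.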

Adding a Recognize Batch Process Step

Before you can configure your Recognize Activity, you must assign it to a Batch Process Step in a Batch Process in your Project.


Obtaining Native Text from documents

By default, the Recognize Activity is set to obtain all embedded native text from a document. If you wish to change this, you can edit the Native Text Extraction property.

  • Full is the default setting on the Recognize Step and will extract all text from the document, including form fields and annotations.
  • Simple will only extract native text segments from the document.
  • None will not extract any native text from the document and OCR must be run to obtain text data.



Running OCR to obtain Text from documents

If you are processing solely image files, all you need to do is set an OCR Profile. Select the OCR Profile property of the Recognize activity and expand the dropdown list. Select the desired OCR Profile from the list.


If your documents are PDFs with image-based content containing text information that is not embedded in the PDF, you will also need to set an OCR Profile here. Furthermore, under PDF Options, you will need to set the OCR Assist property to either Auto or Always.

  • Auto will selectively apply OCR to PDF pages containing images.
  • Always will perform OCR on all pages.

OCR results are then combined with native text extraction to form a single complete text output for the document.
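
To make the interaction between Native Text Extraction and OCR Assist concrete, here is a rough conceptual model in Python. It is not Grooper code: the `PdfPage` class, the `recognize_page` function, and the `OFF` value (a stand-in for any OCR Assist setting other than Auto or Always) are assumptions made for illustration, and the OCR result is faked with a plain string.

```python
from dataclasses import dataclass
from enum import Enum


class NativeText(Enum):
    FULL = "Full"      # text segments plus form fields and annotations
    SIMPLE = "Simple"  # text segments only
    NONE = "None"      # no native text is extracted


class OcrAssist(Enum):
    OFF = "Off"        # assumed placeholder: neither Auto nor Always
    AUTO = "Auto"      # OCR only the pages that contain images
    ALWAYS = "Always"  # OCR every page


@dataclass
class PdfPage:
    text_segments: str = ""         # ordinary embedded text
    form_annotation_text: str = ""  # form field and annotation text
    image_text: str = ""            # what OCR would read from the images
    has_images: bool = False


def recognize_page(page, native_mode, ocr_assist, has_ocr_profile):
    """Return the combined text this simplified model would produce."""
    parts = []
    if native_mode in (NativeText.FULL, NativeText.SIMPLE):
        parts.append(page.text_segments)
    if native_mode is NativeText.FULL:
        parts.append(page.form_annotation_text)
    # OCR needs a profile, and the OCR Assist setting must allow it.
    run_ocr = has_ocr_profile and (
        ocr_assist is OcrAssist.ALWAYS
        or (ocr_assist is OcrAssist.AUTO and page.has_images)
    )
    if run_ocr:
        parts.append(page.image_text)
    # Drop empty pieces and join the rest into one text output.
    return "\n".join(p for p in parts if p)


page = PdfPage("Invoice 1001", "Approved", "Total: $50", has_images=True)
print(recognize_page(page, NativeText.FULL, OcrAssist.AUTO, True))
```

In this model, Full with Auto yields the native text followed by the OCR text, while None with Always yields only the OCR text. How Grooper merges overlapping native and OCR results is not covered here; the sketch simply concatenates them.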

Obtaining Layout Data

Layout data is obtained from image feature detection commands on an IP Profile. If performing OCR, include an IP Profile containing these commands using the IP Profile property of the OCR Profile.


If you are not using OCR, you may set an IP Profile using the Alternate IP property under the "PDF Options" heading. In the image below, an IP Profile named "Layout Data" is referenced. If you are using the supplied materials for this article, use the "OCR Cleanup" IP Profile instead.


Layout data may need to be collected on the page level rather than the folder level to avoid problems during data extraction. In that case, split each PDF document with the "Split Pages" activity before running Recognize. This will create individual page objects nested under the PDF document folder. Then, set Recognize's Scope to "Page" before running.


Layout View

Below is an example of the Layout View of a document after the Recognize activity has run. The highlighted text segments were identified by the OCR Engine. The bold red lines represent the layout data that was detected by the referenced IP Profile.

To access the Layout View of a Document, when in a Document Viewer, click on the Renditions icon and select "Layout View" from the menu.


Text Data

This is a "text view" of the text extracted during the Recognize activity. This text will be used to separate pages into documents, classify documents according to how they are defined in a Content Model, extract Data Fields, or support any number of other Grooper activities that use text data.

Character Data

This is the individual character data for each character recognized, including character confidence and position information.

The character data is saved as a .txt file on the associated Batch Page or Batch Folder object, depending on the level at which the Recognize activity ran.
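
As a rough illustration of why per-character confidence and position are useful, the Python sketch below models character data in memory and flags uncertain characters. Everything in it is an assumption for illustration: the `RecognizedChar` fields, the 0.0 to 1.0 confidence scale, and the 0.8 threshold are hypothetical, and the sketch does not read or reflect the actual format of the file Grooper saves.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RecognizedChar:
    """One recognized character (hypothetical structure)."""
    value: str
    confidence: float  # assumed scale: 0.0 (no confidence) to 1.0 (certain)
    left: int          # position of the character's bounding box, in pixels
    top: int
    width: int
    height: int


def low_confidence(chars, threshold=0.8):
    """Return the characters whose confidence falls below the threshold."""
    return [c for c in chars if c.confidence < threshold]


def mean_confidence(chars):
    """Average confidence across all characters, or 0.0 if there are none."""
    return mean(c.confidence for c in chars) if chars else 0.0


chars = [
    RecognizedChar("A", 0.99, left=10, top=20, width=8, height=12),
    RecognizedChar("8", 0.55, left=19, top=20, width=8, height=12),
]
for c in low_confidence(chars):
    print(f"Review '{c.value}' at ({c.left}, {c.top})")
```

The position fields are what would let a reviewer or a downstream step locate a doubtful character on the page image.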

Layout Data

This is the layout information obtained from the activity. Layout information can yield visual information useful to further Grooper activities, such as whether or not checkboxes on a form are checked, or table line positions for extracting tabular data.

The layout information is saved as a .json file on the associated Batch Page or Batch Folder object, depending on the Scope at which the Recognize activity ran.