Recognize (Activity)

From Grooper Wiki

This article is about the current version of Grooper.

Note that some content may still need to be updated.


Recognize is an Activity that obtains machine-readable text from Batch Pages and Batch Folders. When properly configured with an OCR Profile, Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also reference an IP Profile to collect "layout data" like lines, checkboxes, and barcodes. Other Activities then use this machine-readable text and layout data for document analysis and data extraction.

You may download and import the file(s) below into your own Grooper environment (version 2025). There is a Batch with the example document(s) discussed in this tutorial, as well as a Project configured according to its instructions.


About

Text data is important for almost everything you do in Grooper. As part of the Second Phase of Grooper (Condition), the Recognize activity obtains text data from scanned or imported image-based documents and imported digital documents with native machine-readable text already embedded. For image-based documents, the activity can reference and execute an OCR Profile, performing Optical Character Recognition (OCR) according to its settings. For digital documents, the activity will extract the embedded text.

Recognize can also locate and store layout data on a document. To do so, an IP Profile using IP Commands such as Line Detection or Box Detection must be referenced.

The Recognize activity can:

  1. Obtain machine-readable text from images by performing OCR on a page or on embedded images in a PDF
  2. Extract native text embedded in a document directly
  3. Extract Layout Data with a referenced IP Profile
  4. Perform any combination of the above at the document folder or page level

All of this makes text on a document readable to Grooper.

The Recognize activity, like any activity, must be assigned to a Batch Process Step as part of a Batch Process.

How to configure the Recognize Activity

Prereqs

There are a few things to consider prior to configuring your Recognize activity.

How are you Acquiring your documents into Grooper?

  • Scanning - Scanned documents come into Grooper as images. This will impact how the Recognize Step is configured.
  • Import - Imported documents can have native text embedded in them, or they may be an image file without any native text.

Do your documents have native text?

  • 100% Native Text Present - This will significantly simplify your Recognize Step configuration since OCR is not required to obtain text from the document.
  • No Native Text Present - If there is no native text, then OCR must be run and an OCR Profile needs to be configured.
  • Both Native Text and Images with non-native Text Present - If you want to obtain text from images in the documents that have no native text attached, you will need to run OCR with an OCR Profile.

Do the documents contain layout data you wish Grooper to recognize?

  • Layout Data includes things like table lines, barcodes, and OMR check boxes that, while they do not contain text information, are helpful in understanding the format of a document.
  • Layout Data can only be obtained through Image Processing, so an IP Profile will need to be configured and referenced by your Recognize Step.
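
The checklist above can be sketched as a small decision helper. This is only an illustration of the reasoning, not Grooper's API: the `DocumentSet` class, its fields, and `required_configuration` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class DocumentSet:
    """Hypothetical description of the documents a Batch will contain."""
    has_native_text: bool    # some or all pages carry embedded native text
    needs_image_text: bool   # text must be read from scans or embedded images
    wants_layout_data: bool  # table lines, barcodes, OMR check boxes, etc.


def required_configuration(docs: DocumentSet) -> dict:
    """Map the prerequisite questions to what the Recognize Step would need."""
    return {
        # Native text, when present, can be extracted without OCR.
        "native_text_extraction": docs.has_native_text,
        # OCR (and therefore an OCR Profile) is needed for image-based text.
        "ocr_profile": docs.needs_image_text,
        # Layout data comes only from Image Processing, so an IP Profile is needed.
        "ip_profile": docs.wants_layout_data,
    }


# Scanned forms with check boxes: OCR Profile and IP Profile both needed.
scanned_forms = DocumentSet(has_native_text=False, needs_image_text=True, wants_layout_data=True)
print(required_configuration(scanned_forms))
```

A set of fully digital PDFs with no layout requirements would return `False` for both profiles, which matches the simplest configuration described above.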

Adding a Recognize Batch Process Step

Before you can configure your Recognize Activity, you must assign it to a Batch Process Step in a Batch Process in your Project.



Obtaining Native Text from documents

By default, the Recognize Activity is set to obtain all embedded native text from a document. If you wish to change this, you can edit the Native Text Extraction property.

  • Full is the default setting on the Recognize Step and will extract all text from the document, including form fields and annotations.
  • Simple will only extract native text segments from the document.
  • None will not extract any native text from the document and OCR must be run to obtain text data.



Running OCR to obtain Text from documents

If the documents you are processing are image files (scanned documents, TIFF files, etc.) or are files that contain images you wish to extract text from, you will need to run OCR on the documents to obtain the text.

Running OCR on a document requires a configured OCR Profile.

To reference an OCR Profile on your Recognize Step, open the "Activity" sub properties then click the hamburger icon to the right of the OCR Profile property. Navigate to and select the OCR Profile object you wish to reference.


If your documents have both native text and images that contain non-native text, there is one more property you need to consider. The OCR Assist property gives you the option of performing OCR on all pages of your document, or on just the pages that have image content.

  • Auto will selectively apply OCR to PDF pages containing images.
  • Always will perform OCR on all PDF pages.
  • None will not perform OCR on PDF pages.

OCR results will be combined with native text extraction to return all of the recognized text.
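
The interaction between Native Text Extraction and OCR Assist on a PDF page can be modeled roughly as follows. This is a sketch of the behavior described above, not Grooper's implementation; the enum and function names are hypothetical.

```python
from enum import Enum


class NativeTextExtraction(Enum):
    FULL = "Full"      # all native text, including form fields and annotations
    SIMPLE = "Simple"  # native text segments only
    NONE = "None"      # no native text extraction


class OcrAssist(Enum):
    AUTO = "Auto"      # OCR only PDF pages containing images
    ALWAYS = "Always"  # OCR every PDF page
    NONE = "None"      # never OCR PDF pages


def text_sources_for_pdf_page(native_mode, ocr_assist, page_has_images):
    """Return which text sources contribute to a PDF page's recognized text."""
    sources = []
    if native_mode is not NativeTextExtraction.NONE:
        sources.append("native")
    runs_ocr = ocr_assist is OcrAssist.ALWAYS or (
        ocr_assist is OcrAssist.AUTO and page_has_images
    )
    if runs_ocr:
        sources.append("ocr")
    # Both sources, when present, are combined into the final text.
    return sources


print(text_sources_for_pdf_page(NativeTextExtraction.FULL, OcrAssist.AUTO, True))
print(text_sources_for_pdf_page(NativeTextExtraction.FULL, OcrAssist.AUTO, False))
```

In this model, a page with images under Full and Auto draws on both native text and OCR, while a page without images draws on native text only.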



The Correct Orientation property

When working with images, especially documents acquired via scanner, pages sometimes come into Grooper in the wrong orientation. A landscape page may have been scanned in a portrait orientation, or a page might be upside down. For such documents, it is recommended that the Correct Orientation property be changed from False to True. With this property set to True, Grooper will attempt to properly orient document pages as they are Recognized.

Obtaining Layout Data

To obtain layout data from a document, you must reference an IP Profile that contains steps for detecting the features you want included in the layout data. There are a few ways to do this depending on how you are Recognizing text from your documents.

If you are performing OCR and are specifically using the Azure OCR Engine in your OCR Profile, nothing special needs to be done to obtain layout data. Azure OCR has a built-in feature that runs a second pass of Traditional OCR on the documents after Azure OCR to obtain more accurate text positioning data. As part of this second pass, which runs in the background, an IP Profile that contains feature detection is also run on the documents.


If performing Traditional OCR, you can include an IP Profile containing commands for feature detection in the referenced OCR Profile object itself.


If you are NOT using OCR, you can set an IP Profile for feature detection using the Alternate IP property under the "PDF Options" heading.

FYI

The Alternate IP property is only useful for obtaining layout data. If you wish to apply permanent or temporary IP for OCR purposes, you will need to use one of the other methods of applying an IP Profile.
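
The three cases above can be summarized in a short lookup. This is a hedged sketch of the guidance in this section only; the function name and the `"azure"` / `"traditional"` engine labels are hypothetical and do not correspond to Grooper property values.

```python
def layout_ip_profile_location(uses_ocr: bool, ocr_engine: str = "") -> str:
    """Say where feature detection is configured for collecting layout data."""
    if not uses_ocr:
        # Native-text-only processing: use the Alternate IP property.
        return "Alternate IP property under PDF Options on the Recognize Step"
    if ocr_engine == "azure":
        # Azure OCR's background second pass already runs feature detection.
        return "No extra configuration needed"
    if ocr_engine == "traditional":
        # Traditional OCR: reference the IP Profile from the OCR Profile itself.
        return "IP Profile referenced inside the OCR Profile"
    raise ValueError(f"Unknown OCR engine label: {ocr_engine!r}")


print(layout_ip_profile_location(uses_ocr=False))
print(layout_ip_profile_location(uses_ocr=True, ocr_engine="traditional"))
```

Unknown engine labels raise an error rather than guessing, since this section only covers the three cases listed.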




It is recommended to collect Layout data on the Page level rather than the Folder level to avoid problems during data extraction. Before running the Recognize activity, PDFs should be split using the Split Pages Activity. This will create individual Page objects nested under the PDF Batch Folder. Then, Recognize's Scope property should be set to Page before running.

How to view the Recognized information

Layout View

Below is an example of the Layout View of a document after the Recognize activity has run. The highlighted text segments were identified by the OCR Engine. The bold red lines represent the layout data that was detected by the referenced IP Profile.


To access the Layout View:

  1. Go to any area of Grooper that has a Document Viewer.
  2. In the top right corner of the Document Viewer, click on the Renditions icon.
  3. Select "Layout" from the drop down.


Layout Data

The actual layout data is saved as a .json file on the associated Batch Page or Batch Folder object, depending on the Scope at which the Recognize activity ran. If you open up the .json file, the layout data should look something like the image below.


Text View

This is a "Text View" of the recognized text from running the Recognize activity. This is the actual text that will be used to separate pages into organized documents, classify documents, extract data, or a number of other Grooper activities using text data.


To access the Text View:

  1. Go to any area of Grooper that has a Document Viewer.
  2. In the top right corner of the Document Viewer, click on the Renditions icon.
  3. Select "Text" from the drop down.


Character Data

This is the individual character data for each character recognized, including character confidence and position information.

The character data is saved as a .txt file on the associated Batch Page or Batch Folder object, depending on the level at which the Recognize activity ran.

To access the Character Data, open the .txt file attached to the Batch object. For a view with the information organized by character, open the Tester tab of an OCR Profile, which offers a Character View tab.