2023.1:Recognize (Activity)

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

The Recognize activity's property panel.

Recognize is an Activity that obtains machine-readable text from Batch Pages and Batch Folders. Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also be configured to collect "layout data" like lines, checkboxes, and barcodes. Various other Activities then use this machine-readable text and layout data for document analysis and data extraction.

You may download and import the file(s) below into your own Grooper environment (version 2023.1). There is a Batch with the example document(s) discussed in this tutorial, as well as a Project configured according to its instructions.


About

Text data is critically important for most everything you do in Grooper. The Recognize activity obtains text data from both scanned (or otherwise image-based) documents and digital documents with native machine-readable text already embedded. For image-based documents, the activity will execute an OCR Profile, performing OCR according to its settings. For digital documents, the activity will extract the embedded text.

Recognize can also locate and store layout data. Layout data is obtained from a referenced IP Profile that uses IP Commands such as Line Detection or Box Detection.

The Recognize activity can:

  1. get machine-readable text from images by performing OCR on a page or on images embedded in a PDF,
  2. extract native PDF text directly,
  3. extract layout data, or
  4. do any combination of these at the document folder or page level.

Recognize is useful whenever a document's text needs to be made readable to Grooper, either from a native electronic format or via OCR from an image or scan. It provides a single, simplified way to gather information (text and layout) from documents, and it is a backbone activity for most other use cases in Grooper.

Below is an example of an image processed by the Recognize activity and its results.

Image

This is the original image after being run through the Recognize activity. The highlighted portions are text segments the OCR Profile identified and captured text data from. You can view OCR and extracted PDF text by right-clicking a page or folder in a Batch (depending on which level the activity ran), selecting "Item Properties" and pressing the ellipsis button at the end of the Results property.

Layout View

These are the character results displayed in a "layout view". The layout view is a visual representation that mimics the document's structure from the image, using the text data and character positions obtained from the Recognize activity.

Text Data

This is a "text view" of the text extracted by OCR during the Recognize activity. This text will be used to separate pages into documents, classify documents according to how they are defined in a Content Model, extract Data Fields, or perform any number of other Grooper activities that use text data.

Character Data

This is the individual character data for each character recognized, including character confidence and position information.

The character data is saved as a .txt file on the associated Batch Page or Batch Folder object, depending on the level at which the Recognize activity ran.

Layout Data

This is the layout information obtained from the activity. Layout information can yield visual information useful to further Grooper activities, such as whether or not checkboxes on a form are checked, or table line positions for extracting tabular data.

The layout information is saved as a .json file on the associated Batch Page or Batch Folder object, depending on the Scope at which the Recognize activity ran.

How To: Configure the Activity

Prereqs

First, you will need to identify your plan of attack by analyzing your documents. Here are a few questions to get you started:

  • How are your documents coming into Grooper?
    Primarily, Recognize is for getting text from documents. If your documents are coming in as image files, such as .tiff, you will need to obtain text through OCR. If you are scanning documents using a physical scanner, this will likely be your route. If you need OCR text, you will also need to create an OCR Profile. If your documents are coming in as PDFs, you will need to answer a couple more questions.
  • Do you have PDF documents with native text embedded in them?
    If so, you will want to obtain PDF text through native text extraction. If you have PDF documents that do not have native text embedded, or you do not wish to extract the embedded text because it is inaccurate, you will want to obtain text through OCR. If you need OCR text, you will also need to create an OCR Profile.
  • Do you have PDF documents with images containing text as well as embedded text?
    If so, you will want to use both OCR and native text extraction.
  • Do your documents have layout data, such as table lines, barcodes, and OMR checkboxes, needed to extract data, classify documents, or perform other document processing activities?
    If so, you will need to create an IP Profile containing the associated image feature detection commands, such as Line Detection, Barcode Detection, and Box Detection.
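
The questions above amount to a small decision procedure. The sketch below is only an illustration of that reasoning in Python. It is not a Grooper API: the `DocumentTraits` class, the `plan_recognize` function, and the keys of the returned checklist are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class DocumentTraits:
    """Answers to the planning questions (hypothetical structure)."""
    is_pdf: bool
    has_native_text: bool = False     # the PDF carries embedded text
    trust_native_text: bool = True    # the embedded text is accurate enough to use
    has_image_text: bool = False      # the PDF has images that contain text
    needs_layout_data: bool = False   # lines, barcodes, or checkboxes are required


def plan_recognize(traits: DocumentTraits) -> dict:
    """Return a checklist of what to prepare. These are not Grooper property values."""
    use_native = (
        traits.is_pdf and traits.has_native_text and traits.trust_native_text
    )
    # Image files always need OCR. PDFs need OCR when native text is missing,
    # untrusted, or accompanied by text that only exists inside images.
    use_ocr = (not traits.is_pdf) or (not use_native) or traits.has_image_text
    return {
        "needs_ocr_profile": use_ocr,
        "use_native_text_extraction": use_native,
        "needs_ip_profile": traits.needs_layout_data,
    }
```

For example, `plan_recognize(DocumentTraits(is_pdf=False))` flags that an OCR Profile is needed, while a PDF with trusted native text and no image text needs only native text extraction.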

Obtaining OCR Text

If you are processing solely image files, all you need to do is set an OCR Profile. Select the OCR Profile property of the Recognize activity and expand the dropdown list. Select the desired OCR Profile from the list.


If your documents are PDFs with image-based content containing text information that is not embedded in the PDF, you will also need to set an OCR Profile here. Furthermore, under PDF Options, you will need to set the OCR Assist property to either Auto or Always.

  • Auto will selectively apply OCR to PDF pages containing images.
  • Always will perform OCR on all pages.

OCR results are then combined with native text extraction to form a single complete text output for the document.
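
The difference between Auto and Always can be pictured with a short Python sketch. This is a conceptual model only, not Grooper's implementation: the `PdfPage` class and both functions are hypothetical, the OCR result is a stubbed string, and the newline-joined merge is an assumption made for illustration (the article does not describe how Grooper merges the two text sources internally).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PdfPage:
    """Hypothetical stand-in for one PDF page."""
    native_text: str      # text embedded in the PDF page
    has_images: bool      # whether the page contains any images
    ocr_text: str = ""    # stub for what OCR would return on this page


def pages_to_ocr(pages: List[PdfPage], ocr_assist: str) -> List[int]:
    """Return the indexes of the pages that would receive OCR."""
    if ocr_assist == "Always":
        return list(range(len(pages)))          # every page
    if ocr_assist == "Auto":
        return [i for i, p in enumerate(pages) if p.has_images]
    raise ValueError(f"unsupported setting in this sketch: {ocr_assist!r}")


def combined_text(pages: List[PdfPage], ocr_assist: str) -> List[str]:
    """Merge native text with OCR text per page (assumed merge: newline join)."""
    selected = set(pages_to_ocr(pages, ocr_assist))
    result = []
    for i, page in enumerate(pages):
        parts = [page.native_text]
        if i in selected:
            parts.append(page.ocr_text)
        result.append("\n".join(part for part in parts if part))
    return result
```

With Auto, a text-only page is skipped by OCR and keeps just its native text. With Always, every page is sent through OCR whether or not it contains images.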

Obtaining PDF Text

To extract native text from PDFs, set the Native Text Extraction property to Full or Simple.

  • Full will extract all text from the document, including form fields.
  • Simple will only extract native text segments.

Obtaining Layout Data

Layout data is obtained from image feature detection commands on an IP Profile. If performing OCR, include an IP Profile containing these commands using the IP Profile property of the OCR Profile.


If you are not using OCR, you may set an IP Profile using the Alternate IP property under the "PDF Options" heading. In the image below, an IP Profile named "Layout Data" is referenced. If you are using the supplied materials for this article, use the "OCR Cleanup" IP Profile instead.


Layout data may need to be collected on the page level rather than the folder level to avoid problems during data extraction. Before running the Recognize activity, PDFs should be split using the "Split Pages" activity against the document. This will create individual page objects nested under the PDF document folder. Then, Recognize's Scope should be set to "Page" before running.