The Recognize activity obtains a document's content, including text and Layout Data, and saves it for use by further processing activities, such as document classification and data extraction.
New in version 2.80, Recognize replaces the OCR and PDF Text Extract activities with a computer-vision-based approach to capturing a document's semantic content (all of the information-carrying elements on a document). Recognize can target the entire document layout, including text, barcodes, lines, OMR boxes, and other layout information. You no longer need to apply a separate activity depending on whether you are working with document images or PDFs with native text. The Recognize activity can (1) get machine-readable text from images by performing OCR on a page or on embedded images in a PDF, (2) extract native PDF text directly, (3) extract Layout Data, or (4) do any combination of these at the document folder or page level.
Below is an example of an image processed by the Recognize activity and its results.
This is the original image after being run through the Recognize activity. The highlighted portions are text segments the OCR Profile identified and captured text data from. You can view OCR and extracted PDF text by right-clicking a page or folder in a batch (depending on which level the activity ran), selecting "Item Properties" and pressing the ellipsis button at the end of the "Results" property.
These are the character results displayed in a "layout view". The layout view is a visual representation that mimics the document's structure in the image, using the text data and character positions obtained from the Recognize activity.
This is a "text view" of the text extracted by OCR during the Recognize activity. This text is used to separate pages into documents, classify documents according to how they are defined in a Content Model, extract data fields, or perform any number of other Grooper activities that rely on text data.
This is the individual character data for each character recognized, including character confidence and position information.
The character data is saved as a .txt file on the associated page or folder object, depending on the level at which the Recognize activity ran.
This is the layout information obtained from the activity. Layout information can yield visual information useful to further Grooper activities, such as whether or not checkboxes on a form are checked, or table line positions for extracting tabular data.
The layout information is saved as a .json file on the associated page or folder object, depending on the level at which the Recognize activity ran.
Prior to version 2.80, obtaining text data was performed by the OCR activity for document images and the PDF Text Extract activity for native text on PDF documents. Layout Data, including lines, OMR checkboxes, and barcodes, was obtained through an Image Processing activity. The guiding idea behind the creation of the Recognize activity was the need for a single method of dealing with both electronic and scanned PDF documents while also taking into account other features of the document. The Recognize activity adds a number of abilities not present in previous versions of Grooper:
- The ability to run at the document level (the Folder level).
(Note: Recognize may run slowly on larger documents, for which splitting before recognizing is still recommended.)
- The ability to run on a subset of pages, specified from the beginning or end of the document.
- More sophisticated handling of PDFs that have both images and native text.
- The ability to read layout data from the document and merge it with, or replace, Layout Data generated by IP steps.
Layout Data could also be obtained when running the Full-Text OCR activity prior to 2.80 by performing temporary image processing before the OCR engine ran. This was done by including an IP Profile containing layout detection commands (such as Line Detection or Line Removal) on the "IP Profile" property of the OCR Profile used. When running the OCR activity, the "Save Layout Data" property had to be set to "True" to save the Layout Data to the page object.
Recognize is useful in all cases when a document needs text rendered readable to Grooper, either from a native electronic format or via OCR from an image or scan. Recognize represents a simplified metaphor for gathering information (text and layout) from documents. Recognize is a backbone activity for most other use cases in Grooper.
How To: Configure the Activity
First, you will need to identify your plan of attack by analyzing your documents. Here are a few questions to get you started:
How are your documents coming into Grooper? Primarily, Recognize is for getting text from documents. If your documents are coming in as image files, such as .tiff, you will need to obtain text through OCR. If you are scanning documents using a physical scanner, this will likely be your route. If you need OCR text, you will also need to create an OCR Profile. If your documents are coming in as PDFs, you will need to answer a couple more questions.
Do you have PDF documents with native text embedded in them? If so, you will want to obtain PDF text through native text extraction. If you have PDF documents that do not have native text embedded, or you do not wish to extract the embedded text because it is inaccurate, you will want to obtain text through OCR. If you need OCR text, you will also need to create an OCR Profile.
Do you have PDF documents with images containing text as well as embedded text? If so, you will want to use both OCR and native text extraction.
Do your documents have layout data, such as table lines, barcodes, and OMR checkboxes, that is needed to extract data, classify documents, or perform other document processing activities? If so, you will need to create an IP Profile containing the associated image feature detection commands, such as Line Detection, Barcode Detection, and Box Detection.
If you are processing solely image files, all you need to do is set an OCR Profile. Select the "OCR Profile" property of the Recognize activity and expand the dropdown list. Select the desired OCR Profile from the list.
If your documents are PDFs with image-based content containing text information that is not embedded in the PDF, you will also need to set an OCR Profile here. Furthermore, under "PDF Options", you will need to set the "OCR Assist" property to either "Auto" or "Always".
"Auto" will selectively apply OCR to PDF pages containing images. "Always" will perform OCR on all pages. OCR results are then combined with native text extraction to form a single complete text output for the document.
To extract native text from PDFs, set the "Native Text Extraction" property to "Full" or "Simple".
"Full" will extract all text from the document, including form fields. "Simple" will only extract native text segments.
Layout data is obtained from image feature detection commands on an IP Profile. If performing OCR, include an IP Profile containing these commands using the "IP Profile" property of the OCR Profile.
You will see which layout features are detected by the selected OCR Profile under the "Layout Detection Summary". This is a read-only property that appears only if layout features are detected by the OCR Profile.
If you are not using OCR, you may set an IP Profile using the "Alternate IP" property under the "PDF Options" heading.
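The configuration choices above can be summarized in a short decision sketch. This is illustrative Python, not a Grooper API: the function names, parameters, and dictionary layout are hypothetical, and only the property names and the values "Auto", "Always", "Full", and "Simple" come from the text above.

```python
def plan_recognize_settings(is_pdf, has_native_text=False,
                            has_image_text=False, needs_layout_data=False):
    """Map answers to the planning questions onto Recognize properties.

    Hypothetical helper for illustration only. The values are reminders of
    what to configure, not settings read from or written to Grooper.
    """
    # Image files always need OCR; PDFs need it only for text found in images.
    uses_ocr = (not is_pdf) or has_image_text
    plan = {}
    if uses_ocr:
        plan["OCR Profile"] = "select an OCR Profile"
    if is_pdf and has_native_text:
        plan["Native Text Extraction"] = "Full or Simple"
    if is_pdf and has_image_text:
        plan["OCR Assist"] = "Auto or Always"
    if needs_layout_data:
        if uses_ocr:
            # Layout detection rides along with OCR via the OCR Profile.
            plan["IP Profile (on the OCR Profile)"] = "select an IP Profile"
        else:
            # No OCR in play, so the IP Profile goes under PDF Options.
            plan["Alternate IP"] = "select an IP Profile"
    return plan


def pages_to_ocr(pages_with_images, page_count, ocr_assist):
    """Return the 1-based pages OCR would run on for an OCR Assist mode.

    pages_with_images is an assumed input: the set of page numbers known
    to contain images.
    """
    if ocr_assist == "Always":
        return list(range(1, page_count + 1))
    if ocr_assist == "Auto":
        return [p for p in range(1, page_count + 1) if p in pages_with_images]
    raise ValueError(f"mode not covered by this sketch: {ocr_assist}")
```

For example, `plan_recognize_settings(is_pdf=True, has_native_text=True, needs_layout_data=True)` yields a plan with "Native Text Extraction" and "Alternate IP" entries and no OCR Profile, which matches the case of a native-text PDF whose layout data is collected without OCR.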
|!||Layout data may need to be collected on the page level rather than the folder level to avoid problems during data extraction. Before running the Recognize activity, PDFs should be split using the "Content Action" activity and choosing the "Split" action in its properties. This will create individual page objects nested under the PDF document folder. Then, Recognize's Scope should be set to "Page" before running.|
|OCR Profile||Here, you will select the OCR Profile to be used for text recognition. OCR Profiles can be created and stored in the OCR Profiles folder of the Global Resources folder. Optionally, an IP Profile can be assigned to detect layout data such as lines, checkboxes, barcodes and shapes if needed for use in data extraction.|
|Page Filter||This property restricts the Recognize activity to specific page numbers. If blank, recognition will be performed on all pages. Page numbers may be entered as a comma-separated list. Entering "1" would run on the first page. Entering "1, 3, 5" would run on the first, third, and fifth pages. You may also enter a range of pages as "X to Y". Entering "1 to 5" would run on the first through fifth pages. Negative numbers count from the end of the document instead of the beginning. Entering "-1" would run on the last page, "-1, -3, -5" would run on the last, third from last, and fifth from last pages, "1, -1" would run on the first and last pages, and so on.|
|Layout Detection Summary||This is a read-only property that will appear when you set an OCR Profile. If you have an IP Profile set on your OCR Profile, this will display the types of non-text features detected, such as lines, checkboxes, barcodes and shapes.|
|Native Text Extraction||Full||This property determines how native text from PDF files is extracted. You can extract text from PDFs in one of three ways:
|OCR Assist||Auto||PDFs often consist of "mixed content". While there may be native text embedded in the files, there may also be images with text on them. OCR Assist specifies the extent to which OCR supplements native text extraction. The OCR results are combined with the native text segments to produce a complete document. (Note: This option is only available in the "Full" and "Simple" Native Text Extraction modes, and only if you specify an OCR Profile under the "General" settings.) OCR Assist may be performed in one of three ways:
|Alternate IP||Here, you may set an optional IP Profile to detect a document's layout data. This is for situations where OCR is not used to obtain layout data.|
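The Page Filter syntax described in the property table can be illustrated with a short sketch. The Python below is not Grooper's implementation: the function name `resolve_page_filter` is hypothetical, and its handling of out-of-range pages (skipped), duplicates (kept once, in the order listed), and negative numbers inside an "X to Y" range is an assumption, since the property description does not cover those cases.

```python
import re

# Matches a range such as "1 to 5"; negative endpoints are an assumption.
RANGE_PATTERN = re.compile(r"\s*(-?\d+)\s+to\s+(-?\d+)\s*", re.IGNORECASE)


def resolve_page_filter(page_filter, page_count):
    """Return the 1-based page numbers selected by a Page Filter string."""
    # A blank filter means recognition runs on every page.
    if not page_filter.strip():
        return list(range(1, page_count + 1))

    def to_absolute(number):
        # Negative numbers count back from the end: -1 is the last page.
        if number == 0:
            raise ValueError("0 is not a valid page number")
        return number if number > 0 else page_count + number + 1

    selected = []
    for part in page_filter.split(","):
        match = RANGE_PATTERN.fullmatch(part)
        if match:
            start = to_absolute(int(match.group(1)))
            end = to_absolute(int(match.group(2)))
            candidates = range(start, end + 1)
        else:
            candidates = [to_absolute(int(part))]
        for page in candidates:
            # Assumption: pages outside the document are silently skipped.
            if 1 <= page <= page_count and page not in selected:
                selected.append(page)
    return selected
```

Calling `resolve_page_filter("1, -1", 10)` selects the first and last pages of a ten-page document, mirroring the "1, -1" example in the Page Filter description.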