Zonal Extract

From Grooper Wiki

Zonal Extract extracts text from documents within a user-defined rectangular area on a specified page.

This extraction method works best on documents where the information you need appears in the same place on every document, with little to no variance. All text falling within the zone is returned as the Data Field’s result. So, if your target data sits in the same physical spot across all documents in your set, Zonal Extract is an easy way to capture it without building a complicated extractor.

About

1573227628151-318.png
Anything falling in the extraction zone, in this case the "contact name", will be extracted.
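
As a rough illustration of the idea, the following Python sketch selects the words whose bounding boxes fall inside a rectangular zone. The `Rect` and `Word` types, the coordinates, and the full-containment rule are assumptions made for this example; they do not describe Grooper's internal implementation, which may, for instance, also accept words that only partially overlap the zone.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page units (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Rect") -> bool:
        # True only when `other` lies completely inside this rectangle.
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


@dataclass(frozen=True)
class Word:
    """A recognized word and its location on the page (hypothetical type)."""
    text: str
    box: Rect


def extract_zone_text(words: Iterable[Word], zone: Rect) -> str:
    """Join, in reading order, the words that fall inside the zone."""
    return " ".join(w.text for w in words if zone.contains(w.box))


# Made-up page content: a label, a two-word name, and an unrelated label.
words = [
    Word("Contact", Rect(1.0, 1.0, 1.8, 1.2)),
    Word("Jane", Rect(1.0, 1.3, 1.4, 1.5)),
    Word("Doe", Rect(1.5, 1.3, 1.8, 1.5)),
    Word("Phone", Rect(3.0, 1.0, 3.6, 1.2)),
]
zone = Rect(0.9, 1.25, 2.0, 1.6)
result = extract_zone_text(words, zone)  # "Jane Doe"
```

Only the two words inside the zone are returned; the label above the zone and the text to its right are ignored.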


Zonal Extract can also use line locations on the page to "snap" to an enclosed box near where the zone is placed. The example below shows the same extraction zone as the one above, with automatic line snapping enabled. This way, if the position of something varies slightly from page to page (perhaps due to alignment issues when a document was scanned), you can still place a zone in the general area you know will fall within a larger target box on the page, and the extraction zone will expand to the logical corners of that box.


1573229815226-789.png


Optionally, text from the zone can be reprocessed by a separate OCR profile. This can help with tricky OCR issues that aren't easily handled by other methods. Think about a check. The routing and account numbers at the bottom form the "MICR line", so called because of the special font and ink those characters use so that banking computer systems can read them quickly. While specialized equipment reads them easily, typical OCR engines sometimes have trouble with the font. Some OCR engines, however, handle these fonts very well, and even a differently configured profile using the same engine can do the job. The MICR line is also often found in the same spot on different check types. If you can count on that, you can place an extraction zone using Zonal Extract and choose a different OCR profile from the one used on the rest of the check to read that information accurately.

Check.jpg 1573228825171-712.png


Before you begin

Zonal Extract is an extraction method to populate Data Fields or Data Columns in a Data Model. As such, you will need to create a Content Model with a Data Model with a Data Field or Data Column data element.

Technically, you could use this extraction method without OCR text or extracted PDF text since there is an option to set an OCR profile during extraction. However, it is most likely you will want to have obtained text data from your documents during a Recognize activity first.

If you want to expand the extraction zone to the borders of a box, you must obtain the document’s Layout Data. To do this, run a Line Detection or Line Removal command on an IP Profile during a Recognize or Image Processing activity. These activities must be run on the Page level.

Set the Value Extractor to Zonal Extract

Navigate to the Data Field or Data Column you wish to populate in your Data Model. Select the "Value Extractor" property.

From the dropdown list, choose "Zonal Extract".


1573232570476-946.png


Draw the Extraction Zone

1. Expand the Value Extractor by double clicking it to reveal Zonal Extract's properties.

2. Set the page number you want to extract from. You may only enter one number. You can enter positive or negative page numbers. Positive numbers will start at the beginning of the document ("1" will be the first page, "2" the second, and so on).

Negative numbers will start at the end of the document ("-1" will be the last page, "-2" the second to last, and so on).


1573232671165-228.png


3. The "Region of Interest" property controls the size and location of the extraction zone. Select the ellipsis button at the end of the property to bring up the "Edit Zones" window.


1573232696656-420.png


You can use the controls here to draw a rectangle on the screen from which text will be extracted.


1573234659299-490.png
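
The page-number convention from step 2 can be sketched in a few lines of Python. The function name is hypothetical, and the handling of 0 and of out-of-range numbers is an assumption of this sketch; the source does not say how Grooper treats those cases.

```python
def resolve_page_index(page_number: int, page_count: int) -> int:
    """Translate a 1-based or negative page number into a 0-based index.

    Positive numbers count from the start of the document, negative
    numbers from the end. Rejecting 0 and out-of-range values is an
    assumption made for this sketch.
    """
    if page_number == 0:
        raise ValueError("page number must be positive or negative, not 0")
    if page_number > 0:
        index = page_number - 1
    else:
        index = page_count + page_number
    if not 0 <= index < page_count:
        raise ValueError(f"page {page_number} is outside a {page_count}-page document")
    return index


first = resolve_page_index(1, 5)            # first page
last = resolve_page_index(-1, 5)            # last page
second_to_last = resolve_page_index(-2, 5)  # second to last page
```

For a five-page document, "1" resolves to the first page, "-1" to the fifth, and "-2" to the fourth.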


Configuring Auto Snap

Auto Snap expands the rectangular extraction zone by automatically snapping its edges to nearby lines around the zone. Select "Auto Snap" and change the property to "Enabled" to use this feature.


1573233172321-572.png


You can control the zone's size and behavior using the "Auto Snap Distance" and "Auto Snap Margin" properties.


1573233217811-104.png


"Auto Snap Distance" determines the maximum distance the edges of the zone can travel when snapping to a line. This can be useful when dealing with documents with poor image quality, where the line locations are not reliably known. Imagine an example like the one below. In this example, we're trying to use Zonal Extract to pull the last name off a set of documents. Here's how the zone should be snapping to nearby lines.


1573234702905-847.png


Let's say that in 100% of documents structured like this, Grooper's Line Detection found the vertical line before the "3. Middle Initial" box, but in only 90% of the documents did it find the vertical line before the "2. First Name" box. Without a maximum snap distance, in the 10% where that line was not found, the zone would snap all the way to the second vertical line, and the contents of both the "1. Last Name" box and the "2. First Name" box would be extracted. The zone would look like the one in the image below.


1573234728931-443.png
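
The effect of a maximum snap distance can be sketched as follows. This is an illustration only: the function name, the coordinates, measuring the distance from the drawn edge, and leaving the edge where it was drawn when no line is in range are all assumptions of this sketch, not documented Grooper behavior.

```python
from typing import Optional, Sequence


def snap_edge(edge: float, lines: Sequence[float], direction: int,
              max_distance: Optional[float] = None) -> float:
    """Snap one zone edge outward to the nearest detected line.

    direction is -1 to search toward smaller coordinates (a left edge)
    or +1 to search toward larger coordinates (a right edge).
    If no line qualifies, the edge stays where it was drawn (assumption).
    """
    # Keep only lines lying in the search direction.
    candidates = [x for x in lines if (x - edge) * direction >= 0]
    if max_distance is not None:
        # Discard lines farther away than the allowed snap distance.
        candidates = [x for x in candidates if abs(x - edge) <= max_distance]
    if not candidates:
        return edge
    return min(candidates, key=lambda x: abs(x - edge))


# Hypothetical vertical line positions, in inches: the left border of the
# "1. Last Name" box, the line before "2. First Name", and the line
# before "3. Middle Initial".
all_lines = [0.5, 3.0, 5.5]
missing_line = [0.5, 5.5]  # Line Detection missed the line at 3.0
drawn_right_edge = 2.5

good = snap_edge(drawn_right_edge, all_lines, +1)             # snaps to 3.0
overshoot = snap_edge(drawn_right_edge, missing_line, +1)     # snaps to 5.5
limited = snap_edge(drawn_right_edge, missing_line, +1, 1.0)  # stays at 2.5
```

With the limit in place, a missed line costs at most a slightly narrower zone instead of pulling a neighboring box into the result.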


"Auto Snap Margin" shrinks the size of the zone after it has snapped to detected lines. This way, you can limit what is captured by the zone. The example below uses Auto Snap Margin to exclude the label "1. Last Name" from the result. The "Top" property was changed to "0.3in" to shrink the top of the zone by 0.3 inches.


1573235025888-649.png
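
A margin of this kind amounts to moving each edge of the snapped rectangle inward by a per-side amount. The sketch below assumes a (left, top, right, bottom) tuple in inches with the vertical coordinate increasing down the page; the function name, the `left`, `right`, and `bottom` parameters, and the coordinates are hypothetical and only mirror the "Top" setting mentioned above.

```python
from typing import Tuple

Zone = Tuple[float, float, float, float]  # (left, top, right, bottom), inches


def apply_margin(zone: Zone, left: float = 0.0, top: float = 0.0,
                 right: float = 0.0, bottom: float = 0.0) -> Zone:
    """Shrink a snapped zone inward by the given per-side margins."""
    z_left, z_top, z_right, z_bottom = zone
    shrunk = (z_left + left, z_top + top, z_right - right, z_bottom - bottom)
    # Guard against margins so large that the rectangle would invert.
    if shrunk[0] >= shrunk[2] or shrunk[1] >= shrunk[3]:
        raise ValueError("margins leave no area to extract from")
    return shrunk


snapped = (0.5, 2.0, 3.0, 2.8)          # zone after snapping to the box lines
trimmed = apply_margin(snapped, top=0.3)  # top edge moves down by 0.3 in
```

Only the top edge moves; the other three edges stay on the detected lines.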


Set the Extract Mode

The default value for the "Extract Mode" property is "None". This will perform no extraction and simply place a zone on the page.

If you want to extract text from the document, change "Extract Mode" to "Full Text" or "OCR".


1573235121565-470.png


"Full Text" will extract existing text from the document obtained from the Recognize activity.


1573235160771-319.png


"OCR" allows you to set an OCR profile that re-runs OCR on the extraction zone only. This can be useful for documents where most of the page is read easily by one OCR profile, but the text in the extraction zone is read better by another. Set the profile to use in the "OCR Profile" property.


1573235197666-761.png


In the example below, the document was run through Recognize with an OCR Profile using the Transym OCR engine to accurately find the document's labels. But when run in "Full Text" mode, the text extracted from the zone came out as "bly a t t". This is because "Wyatt" is printed in the OCR-A font, which Transym does not handle well. When run in "OCR" mode with an OCR Profile using the Tesseract engine, "Wyatt" was extracted correctly, because Tesseract has added functionality to handle the OCR-A font. This way, the capabilities of both profiles and both OCR engines can be used to get the most accurate data off the document.


1573235025888-649.png
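
The three Extract Mode settings can be thought of as a simple dispatch. In the sketch below, the enum, the function, and the two stand-in callables are hypothetical; the callables merely return the strings from the example above and do not call any real OCR engine or Grooper API.

```python
from enum import Enum
from typing import Callable, Optional


class ExtractMode(Enum):
    NONE = "None"
    FULL_TEXT = "FullText"
    OCR = "OCR"


def extract_from_zone(mode: ExtractMode,
                      recognized_text: Callable[[], str],
                      rerun_ocr: Optional[Callable[[], str]] = None) -> Optional[str]:
    """Return the zone's text according to the selected mode."""
    if mode is ExtractMode.NONE:
        return None                   # the zone is placed, nothing is extracted
    if mode is ExtractMode.FULL_TEXT:
        return recognized_text()      # reuse text from the Recognize activity
    if rerun_ocr is None:
        raise ValueError("OCR mode requires an OCR Profile")
    return rerun_ocr()                # re-read the zone with a second profile


# Stand-ins for the two sources of text in the example above.
def text_from_recognize() -> str:
    return "bly a t t"   # what the first profile produced for the zone


def text_from_second_profile() -> str:
    return "Wyatt"       # what the second profile produced for the zone


nothing = extract_from_zone(ExtractMode.NONE, text_from_recognize)
full_text = extract_from_zone(ExtractMode.FULL_TEXT, text_from_recognize)
ocr_text = extract_from_zone(ExtractMode.OCR, text_from_recognize,
                             text_from_second_profile)
```

The rest of the document keeps the text from the first profile; only the zone is read with the second one.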


Remaining properties for FullText and OCR extraction modes

When extracting text from an extraction zone, additional properties become available: "Value Extractor", "Line Separator", and "Output Full Region".

The "Value Extractor" property allows you to write or reference a data extractor that runs only on what is inside the extraction zone. This can be useful to further narrow the Zonal Extract results. This property can be set to "Text Pattern" or "Reference".


1573235483933-738.png
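
Conceptually, the secondary extractor is a pattern applied to the zone's text rather than to the whole page. The sketch below uses Python's standard `re` module as a stand-in; the zone text and the pattern are invented for illustration, and Grooper's own "Text Pattern" configuration is not shown here.

```python
import re
from typing import Optional


def narrow_zone_text(zone_text: str, pattern: str) -> Optional[str]:
    """Return the first capture group matched inside the zone text, if any."""
    match = re.search(pattern, zone_text)
    return match.group(1) if match else None


# Hypothetical zone that captured both a label and its value.
zone_text = "Contact Name: Jane Doe"
value = narrow_zone_text(zone_text, r"Contact Name:\s*(.+)")  # "Jane Doe"
```

Because the pattern only ever sees the text inside the zone, it can stay simple without matching unrelated text elsewhere on the page.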


The "Line Separator" property specifies how multiple lines of text are broken up. By default, if a zone captures multiple lines of text, they are concatenated, one line immediately following the other with no space or break in between. If you want the lines separated, you must specify a separator here, such as a space, a comma, or a vertical bar ("|"). The escape sequences \r, \n, \t, \f, and \s may also be used.


1573235683417-459.png
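
The behavior described above is essentially a string join. In this sketch, the function names and sample lines are hypothetical, and treating \s as a single space is an assumption; the source only lists it among the accepted sequences.

```python
from typing import Sequence

# Escape sequences named in the text, mapped to the characters they stand for.
# Mapping \s to one space is an assumption of this sketch.
ESCAPES = {r"\r": "\r", r"\n": "\n", r"\t": "\t", r"\f": "\f", r"\s": " "}


def decode_separator(raw: str) -> str:
    """Replace escape sequences typed into the property with real characters."""
    for token, char in ESCAPES.items():
        raw = raw.replace(token, char)
    return raw


def join_zone_lines(lines: Sequence[str], separator: str = "") -> str:
    """Concatenate the zone's lines, inserting the decoded separator."""
    return decode_separator(separator).join(lines)


lines = ["123 Main St", "Tulsa, OK"]
no_separator = join_zone_lines(lines)         # "123 Main StTulsa, OK"
with_bar = join_zone_lines(lines, "|")        # "123 Main St|Tulsa, OK"
with_newline = join_zone_lines(lines, r"\n")  # lines kept on separate rows
```

The first result shows why a blank separator is rarely what you want for multi-line zones: the end of one line runs straight into the start of the next.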


The "Output Full Region" property determines whether the entire extraction zone is highlighted or just the area containing the text extracted. Setting this to "True" can be very helpful when configuring the size and position of the extraction zone.

Seen below, the property is set to "False", highlighting only what was extracted.

Results in the Document View

1573235688765-905.png

Extraction results

1573236150179-639.png


Setting the property to "True" shows the entire zone. This way you can see what could be extracted if text fell within the field.


1573235025888-649.png
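
One way to picture the difference between the two settings: with "False", the highlight is the bounding box of the extracted words, and with "True", it is the zone itself. The sketch below assumes (left, top, right, bottom) tuples and made-up coordinates; the function name and the choice to return nothing when no text was extracted are assumptions.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def highlight_region(zone: Box, extracted_boxes: Sequence[Box],
                     output_full_region: bool = False) -> Optional[Box]:
    """Pick the rectangle to highlight for an extraction result."""
    if output_full_region:
        return zone                      # show the whole potential zone
    if not extracted_boxes:
        return None                      # nothing extracted, nothing to show
    lefts, tops, rights, bottoms = zip(*extracted_boxes)
    # Tightest rectangle that encloses every extracted word.
    return (min(lefts), min(tops), max(rights), max(bottoms))


zone = (0.5, 2.0, 3.0, 2.8)
word_boxes = [(0.7, 2.4, 1.2, 2.6), (1.3, 2.4, 1.9, 2.6)]

text_only = highlight_region(zone, word_boxes)          # (0.7, 2.4, 1.9, 2.6)
full_zone = highlight_region(zone, word_boxes, True)    # (0.5, 2.0, 3.0, 2.8)
```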


Property Details

Property Default Value Information
Page Number 1 Set the page number you want to extract from here. You may only enter one number. You can enter positive or negative page numbers. Positive numbers will start at the beginning of the document ("1" will be the first page, "2" the second, and so on).

Negative numbers will start at the end of the document ("-1" will be the last page, "-2" the second to last, and so on).

Region Of Interest The "Region of Interest" property controls the size and location of the extraction zone. Select the ellipsis button at the end of the property to bring up the "Edit Zones" window. From there, you can draw a rectangle on a page. All text falling within that rectangle will be extracted.
Auto Snap Disabled Enabling this property will create the extraction zone by automatically snapping to nearby lines around the zone. When enabled, the maximum distance the operation will travel when looking for a line is controlled by the "Auto Snap Distance" property, and the resulting extraction zone can be shrunk using the "Auto Snap Margin" property.
Extract Mode None This controls how data is extracted from the zone. It can be one of the following modes:
  • None - No extraction is performed. A zone is simply placed on the page.
  • FullText - Existing text generated from a Recognize activity, either from OCR or text from a PDF, will be extracted from the zone.
  • OCR - Text falling in the zone will be reprocessed by a secondary OCR Profile.
Value Extractor (none) Optionally, a secondary data extractor can run on the results extracted from the extraction zone. This can be useful to further narrow the Zonal Extract results. This can be set to "Text Pattern" or "Reference".
Line Separator Determines how multiple lines of text are broken up. If left blank, the lines are concatenated, one added after another with no character in between. (For example, a space could be something you want to add as a way to break up one line from another.)
Output Full Region False By default, any review module will highlight just the extracted text. Set Output Full Region to “True” to show the entire potential extraction zone. This can be useful when configuring the zone.