Anchored Extract

From Grooper Wiki

Anchored Extract extracts text in a rectangular region (often referred to as a "zone" or "extraction zone") near a text label (referred to as the "anchor").

For Anchored Extract, labels on a document are used to position a rectangular zone around the text to be extracted. The extracted data is "anchored" to a portion of the document found by an "Anchor Extractor". The box's size and position are set relative to the location of the anchor.

About

Anchored Extract was replaced by the Read Zone extractor in version 2.90. The same results can be achieved with a slightly different configuration of that Value Extractor. Please visit the Read Zone article for more information if you're trying to perform Anchored Extract in version 2.90.


1572969733810-490.png

In this example, the Anchor Extractor found the label "Collection Date" on the page. The anchor label (Collection Date) is outlined in blue. The green box is the extraction zone. Any text falling in that region is extracted. The zone's height, width and relative location to the anchor are all editable properties to get the correct data to fall into place. In this case the date "06/12/85" will be the final output for this field.

Anchored Extract is especially useful for structured forms where information resides in labeled boxes. In these cases, the document's Layout Data (detected through an Image Processing command) can be used to position the extraction zone to the nearest borders around the anchor. As seen in the examples below, the extraction zone automatically "snaps" to the nearest box around the anchor label.

1572970644825-434.png

Anchored Extract can also exclude the anchor's text from the result. So, even though the extraction zone contains both the date and the anchor, the final output would still be just "06/12/85".

1572970610483-599.png

You can also alter the size of the snapped box by controlling the box's maximum snap distance and changing the zone's margins.


Version Differences

Prior to 2.80, this capability was available by setting up a Data Element Profile in a Document Type object. The Anchored Extract extractor provides both a simpler setup and more robust configurations.  Rather than setting it up on individual Document Types, you only need to set it as the Value Extractor on a Data Field or Data Column in a Data Model.

Use Cases

Use Case 1

Any structured form where you can reliably count on extracted data to be in the same position relative to a label can take advantage of Anchored Extract. The extractor is doubly useful for data inside tables or other boxed-in regions, because it can "snap" the extraction zone to the surrounding borders. The traditional approach of using Key-Value Pairs to understand Horizontal and Vertical relationships between the label and value will fail in cases where a form's designer chose to align the label to the top-left corner of a bounding box while aligning the value to the bottom-right corner. The Anchored Extract technique will still understand the relationship without having to worry about these formatting challenges.

Use Case 2

Text inside the extraction zone can be reprocessed by a second OCR Profile. This is extremely useful on documents where the labels are easily extracted by one OCR Profile, but the values themselves are more accurately read by a different one. For example, one OCR engine may perform better on one font, while a second may do better on another. Grooper 2.80 comes installed with the Transym and Tesseract OCR engines. Transym does a great job recognizing most fonts. However, it can do a poor job of recognizing the OCR-A font. Tesseract has unique functionality to handle the OCR-A font.

In the example below, the text reading "Wyatt" inside the extraction zone could be reprocessed by an OCR Profile using the Tesseract engine to accurately extract the name "Wyatt".

1572984549501-392.png

How To:  Configure the extractor

Before you begin

Anchored Extract is an extraction method to populate Data Fields or Data Columns in a Data Model. As such, you will need to create a Content Model with a Data Model with a Data Field or Data Column data element.

The anchor is found using a data extractor. In order to find a value, you will need OCR text or extracted native text from a PDF. To do this, run your document through the Recognize activity.

If you want to snap the extraction region within the borders of a box, you must obtain the document’s Layout Data. To do this, run a Line Detection or Line Removal command on an IP Profile during a Recognize or Image Processing activity. These activities must be run on the Page level.

Set the Value Extractor to Anchored Extract

Navigate to the Data Field or Data Column you wish to populate in your Data Model. Select the "Value Extractor" property.

From the dropdown list, choose "Anchored Extract".


1572974111013-607.png


Set the Anchor Extractor

1. Expand the "Value Extractor" property by double clicking it to reveal the Anchor Extractor's properties. Select the "Anchor Extractor" property. The value this extractor returns will be the “anchor.” The extraction zone will be placed relative to the anchor's physical location on the page. You may choose "Text Pattern" or "Reference" from the dropdown menu.


1572976508566-438.png


2a. Choose "Text Pattern" to use the Pattern Editor to construct a simple extractor local to the data element. Select the ellipsis button at the end of the "Pattern" property to bring up the Pattern Editor.


1572976624954-880.png


2b. Or, you may use a Data Type or Field Class extractor in the Node Tree by choosing "Reference". Select the "Extractor" property and use the dropdown menu to point to the extractor's location in the Node Tree. In this example, the extractor used was titled "Anchor Label" and located as a child of the Data Field "Anchored Extract Field".


1572975851275-771.png


3. You may want to exclude the anchor's value from the result. To do so, change the "Exclude Anchor" property from "False" to "True".


1572976364202-986.png


As seen below, including the anchor's text may produce undesirable results.


1572976292595-892.png


Here is the same extractor with the "Exclude Anchor" property enabled.


1572976308709-578.png
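
The effect of "Exclude Anchor" can be pictured with a small sketch. This is only an illustration working on plain strings. The function name `exclude_anchor` is hypothetical, and Grooper itself presumably works from the anchor's position on the page rather than from string replacement.

```python
def exclude_anchor(zone_text: str, anchor_text: str) -> str:
    """Remove the first occurrence of the anchor's text, then tidy whitespace.

    Hypothetical helper for illustration; not part of Grooper.
    """
    remainder = zone_text.replace(anchor_text, "", 1)
    # Collapse any leftover runs of whitespace into single spaces.
    return " ".join(remainder.split())


# The zone contains both the label and the value.
zone_text = "Collection Date 06/12/85"

print(exclude_anchor(zone_text, "Collection Date"))  # 06/12/85
```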


Configure the extraction zone - Offset Region

You have two methods of drawing the rectangular region around the text you want to extract: "Offset Region" or "Auto Snap".

Use the "Offset Region" property to specify a rectangular zone relative to the position of the anchor. Double click the "Offset Region" to expand its properties.


1572977144828-545.png


Use the "Left" and "Top" properties to specify location. The zone created starts at the top left corner of the anchor's location. If you typed "0in" for both the Left and Top properties, the zone would be placed directly on the anchor. If you put "1in" for the Left and Top properties, the left edge would move 1 inch to the right and the top edge would move 1 inch down. In other words, the zone would move 1 inch to the right and 1 inch down. Negative values have the opposite effect. Entering "-1in" for the Top and Left properties would cause the zone to move 1 inch to the left and 1 inch up.

The "Width" and "Height" properties are more straightforward. They control the size of the rectangular zone. So, entering "1in" for these properties would make the zone a square 1 inch wide by 1 inch high.

Examine the values entered in the example below.


1572978598717-305.png


The resulting zone is seen below in green. Its left edge is placed 0.7 inches to the right of the upper-left corner of the anchor. Its top edge is placed 0 inches from the upper-left corner, so the top of the zone lines up with the top of the anchor. The size of the extraction zone is set to 1.5 inches wide by 0.5 inches high, so the green box is 1.5 inches long and 0.5 inches tall.


1572978613137-360.png
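
The arithmetic behind "Offset Region" can be sketched in a few lines. This is a minimal illustration, not Grooper's implementation: the `Rect` type, the `offset_zone` function, and the anchor's page position are all assumptions made for the example. Only the offset values (0.7in, 0in, 1.5in, 0.5in) come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in inches; the origin is the page's top-left corner."""
    left: float
    top: float
    width: float
    height: float


def offset_zone(anchor: Rect, left: float, top: float,
                width: float, height: float) -> Rect:
    """Place a zone relative to the anchor's top-left corner.

    Positive `left` moves the zone right and positive `top` moves it down;
    negative values move it left and up.
    """
    return Rect(anchor.left + left, anchor.top + top, width, height)


# Hypothetical anchor position: 1.0in from the left edge, 2.0in from the top.
anchor = Rect(left=1.0, top=2.0, width=0.6, height=0.15)

# Values from the example: Left 0.7in, Top 0in, Width 1.5in, Height 0.5in.
zone = offset_zone(anchor, left=0.7, top=0.0, width=1.5, height=0.5)
print(zone)

# Negative offsets move the zone left and up.
moved = offset_zone(anchor, left=-1.0, top=-1.0, width=1.0, height=1.0)
print(moved)
```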


Configure the extraction zone - Auto Snap

The second method available to create the extraction zone is "Auto Snap". This will create the rectangular region by automatically snapping to nearby lines around the anchor. Select "Auto Snap" and change the property to "Enabled" to use this feature.


1572978967763-199.png


You can control the zone's size and behavior using the "Auto Snap Distance" and "Auto Snap Margin" properties.


1572981374839-492.png


"Auto Snap Distance" determines the maximum distance from the anchor the edges of the zone can snap to. This can be useful when dealing with documents with poor image quality where the line locations are not reliably known. Imagine an example like the one below. In this example, we're trying to use Anchored Extract to pull the last name off a set of documents. Here's how the zone should be snapping to nearby lines.


1572981180387-797.png


Let's say that on 100% of documents structured like this, Grooper's Line Detection found the vertical line before the "3. Middle Name" box, but on only 90% of them did it find the vertical line before the "2. First Name" box. Without a maximum snap distance, on the 10% where it didn't find that line, the zone would snap all the way to the second vertical line, and the contents of both the "1. Last Name" box and the "2. First Name" box would be extracted. The zone would look as seen in the image below.


1572981074150-815.png
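
The role of a maximum snap distance in this scenario can be sketched as follows. The function `snap_right_edge` and all coordinates are hypothetical. In particular, the fallback used when no line lies within the maximum distance (stopping the edge at that distance) is an assumption for illustration, and Grooper's actual behavior may differ.

```python
from typing import Optional, Sequence


def snap_right_edge(anchor_right: float, line_xs: Sequence[float],
                    max_distance: Optional[float]) -> float:
    """Return the x position (in inches) of the zone's right edge.

    The edge snaps to the nearest detected vertical line to the right of
    the anchor. ASSUMPTION: if no line lies within `max_distance`, the
    edge stops at `anchor_right + max_distance`.
    """
    candidates = [x for x in line_xs if x > anchor_right]
    nearest = min(candidates, default=None)
    if max_distance is None:
        if nearest is None:
            raise ValueError("no line to snap to and no maximum distance set")
        return nearest
    limit = anchor_right + max_distance
    if nearest is None or nearest > limit:
        return limit
    return nearest


# Hypothetical layout: the line before "2. First Name" is at 3.0in and
# the line before "3. Middle Name" is at 5.0in.
anchor_right = 1.0
all_lines = [3.0, 5.0]
missed_line = [5.0]  # Line Detection missed the line at 3.0in

print(snap_right_edge(anchor_right, all_lines, None))    # 3.0
print(snap_right_edge(anchor_right, missed_line, None))  # 5.0, spills into the next box
print(snap_right_edge(anchor_right, missed_line, 2.5))   # 3.5, capped by the distance
```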


"Auto Snap Margin" shrinks the size of the zone after it has snapped to detected lines. This way, you can limit what is captured by the zone. The example below uses Auto Snap Margin to ignore the anchor from the result. The "Left" property was changed to "0.7in" to shrink the left side by 0.7 inches.


1572981658402-863.png
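
Shrinking a snapped zone by a margin amounts to moving each edge inward. The sketch below assumes a simple `Zone` type with edges measured in inches from the page's top-left corner. The type, the `apply_margin` function, and the snapped coordinates are hypothetical. Only the 0.7in left margin comes from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """Zone edges in inches, measured from the page's top-left corner."""
    left: float
    top: float
    right: float
    bottom: float


def apply_margin(zone: Zone, left: float = 0.0, top: float = 0.0,
                 right: float = 0.0, bottom: float = 0.0) -> Zone:
    """Move each edge inward by its margin, shrinking the zone."""
    shrunk = Zone(zone.left + left, zone.top + top,
                  zone.right - right, zone.bottom - bottom)
    if shrunk.left >= shrunk.right or shrunk.top >= shrunk.bottom:
        raise ValueError("margins leave no area to extract from")
    return shrunk


# Hypothetical zone after snapping to the box around the anchor.
snapped = Zone(left=1.0, top=2.0, right=3.0, bottom=2.5)

# A 0.7in left margin pushes the left edge past the anchor label.
trimmed = apply_margin(snapped, left=0.7)
print(trimmed)
```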


Set the Extract Mode

The default value for the "Extract Mode" property is "None". This will perform no extraction and simply place a zone on the page.

If you want to extract text from the document, change "Extract Mode" to "Full Text" or "OCR".


1572984693165-558.png


"Full Text" will extract existing text from the document obtained from the Recognize activity. This is the most commonly used method to extract text from the zone. If you already have OCR'd text or native text from a PDF for your documents, this will likely be the mode you choose.


1572984701397-285.png
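
Conceptually, "Full Text" mode selects the already-recognized text that falls inside the zone. The sketch below illustrates that idea under stated assumptions: the `Word` and `Zone` types, the word coordinates, and the rule that a word counts as inside the zone when its center point is inside are all hypothetical, not Grooper's documented behavior.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Word:
    """A recognized word with its bounding box in inches."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


@dataclass(frozen=True)
class Zone:
    left: float
    top: float
    right: float
    bottom: float


def words_in_zone(words: List[Word], zone: Zone) -> List[Word]:
    """Keep words whose center point lies inside the zone (an assumed rule)."""
    kept = []
    for w in words:
        cx = (w.left + w.right) / 2
        cy = (w.top + w.bottom) / 2
        if zone.left <= cx <= zone.right and zone.top <= cy <= zone.bottom:
            kept.append(w)
    # Reading order: top to bottom, then left to right.
    return sorted(kept, key=lambda w: (w.top, w.left))


# Hypothetical recognized words on the page.
page_words = [
    Word("Collection", 1.0, 2.0, 1.6, 2.15),
    Word("Date", 1.65, 2.0, 1.95, 2.15),
    Word("06/12/85", 2.1, 2.0, 2.7, 2.15),
    Word("Sample", 1.0, 3.0, 1.5, 3.15),
]
zone = Zone(left=2.0, top=1.9, right=3.5, bottom=2.4)

print(" ".join(w.text for w in words_in_zone(page_words, zone)))  # 06/12/85
```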


"OCR" will allow you to set an OCR profile to reprocess OCR on the text in the extraction zone. This can be useful for documents where a document's labels are easily found by one OCR profile, but the actual value is read better by a different profile. It will overwrite the existing text on your documents from the Recognize activity. You will need to set the OCR Profile to be used on the "OCR Profile" property.


1572984707541-705.png


In the example below, the document was run through Recognize with an OCR Profile using the Transym OCR engine, which accurately found the anchor "1. First Name". But when the extractor was run in "Full Text" mode, the result was "bly a t t". This is because "Wyatt" is printed in the OCR-A font, and Transym does not handle that font well. When the extractor was run in "OCR" mode with an OCR Profile using the Tesseract engine, "Wyatt" was extracted correctly. Tesseract has added functionality to handle the OCR-A font. This way, the capabilities of both profiles and both OCR engines can be used to get the most accurate data off the document.


1572984549501-392.png


Remaining properties for FullText and OCR extraction modes

When extracting text from an extraction zone, there are additional properties available: "Value Extractor", "Line Separator" and "Output Full Region".

The "Value Extractor" property allows you to write or reference a data extractor that runs only on what is inside the extraction zone. This can be useful to further narrow the Anchored Extractor’s results. This property can be set to "Text Pattern" or "Reference".


1572985955015-323.png
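
A secondary extractor that narrows the zone's text can be thought of as a pattern applied only to that text. The sketch below uses Python's standard `re` module to pull a date out of a zone that also contains the label. The pattern and the function name `narrow_result` are illustrative assumptions, not a pattern taken from Grooper.

```python
import re
from typing import Optional

# Illustrative pattern for dates written like 06/12/85.
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{2}\b")


def narrow_result(zone_text: str) -> Optional[str]:
    """Return the first date found in the zone's text, or None if there is none."""
    match = DATE_PATTERN.search(zone_text)
    return match.group(0) if match else None


print(narrow_result("Collection Date 06/12/85"))  # 06/12/85
print(narrow_result("Collection Date"))           # None
```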


The "Line Separator" property specifies how multiple lines of text are broken up. If you have a zone with multiple lines of text, like the example below, they will be returned one line immediately following the other ("Benjamin FinjaminCrigamin Flimfam" in this case). You must specify a separator here, such as a space or a comma, if you want the lines broken up (the regular expression characters \r \n \t \f and \s may also be used).


1572987108863-974.png
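
The effect of the separator is easy to see with a plain string join. In this sketch, the split of the zone's text into two lines is inferred from the concatenated result quoted above, and the function name `join_lines` is hypothetical.

```python
from typing import Sequence


def join_lines(lines: Sequence[str], separator: str = "") -> str:
    """Join the zone's lines of text using the given separator."""
    return separator.join(lines)


# Assumed line breakdown of the zone's text.
zone_lines = ["Benjamin Finjamin", "Crigamin Flimfam"]

print(join_lines(zone_lines))        # Benjamin FinjaminCrigamin Flimfam
print(join_lines(zone_lines, " "))   # Benjamin Finjamin Crigamin Flimfam
print(join_lines(zone_lines, ", "))  # Benjamin Finjamin, Crigamin Flimfam
```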


The "Output Full Region" property determines whether the entire extraction zone is highlighted or just the area containing the text extracted. Setting this to "True" can be very helpful when configuring the size and position of the extraction zone.

Seen below, the property is set to "False", highlighting only what was extracted.


1572986363919-148.png


Setting the property to "True" shows the entire zone. This way you can see what could be extracted if text fell within the field.


1572984549501-392.png
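
The difference between the two settings can be sketched as a choice between the whole zone and the bounding box of the extracted text. The `Box` type, the `highlight_region` function, and the coordinates are hypothetical. Returning nothing when no text was extracted and the property is "False" is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float


def highlight_region(zone: Box, extracted: List[Box],
                     output_full_region: bool) -> Optional[Box]:
    """Pick the region to highlight for review.

    True  -> the entire extraction zone.
    False -> the tight bounding box around the extracted words, or None
             when nothing was extracted (an assumed behavior).
    """
    if output_full_region:
        return zone
    if not extracted:
        return None
    return Box(
        left=min(b.left for b in extracted),
        top=min(b.top for b in extracted),
        right=max(b.right for b in extracted),
        bottom=max(b.bottom for b in extracted),
    )


# Hypothetical zone and word boxes, in inches.
zone = Box(2.0, 1.9, 3.5, 2.4)
words = [Box(2.1, 2.0, 2.4, 2.15), Box(2.45, 2.0, 2.7, 2.15)]

print(highlight_region(zone, words, output_full_region=False))
print(highlight_region(zone, words, output_full_region=True))
```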


Property Details

Property | Default Value | Information
Anchor Extractor | (none) | The value this extractor returns will be the “anchor.” The extraction zone will be placed relative to the anchor's physical location on the page. You may choose "Text Pattern" or "Reference" from the dropdown menu.
Exclude Anchor | False | The anchor can be optionally excluded from the Anchored Extractor’s results. Set the Exclude Anchor property to “True” to remove the text returned by the Anchor Extractor from the results.
Offset Region | | Expand the "Offset Region" property to specify a rectangular extraction zone relative to the position of the anchor. Control the zone's position using the "Left" and "Top" properties. Control its size using the "Width" and "Height" properties.
Auto Snap | Disabled | Enabling this property will create the rectangular region by automatically snapping to nearby lines around the anchor. When enabled, the maximum distance the operation will travel when looking for a line is controlled by the "Auto Snap Distance" property, and the resulting extraction zone can be shrunk using the "Auto Snap Margin" property.
Extract Mode | None | This controls how data is extracted from the zone. It can be one of the following modes:
  • None - No extraction is performed. A zone is simply placed on the page.
  • FullText - Existing text generated from a Recognize activity, either from OCR or text from a PDF, will be extracted from the zone. This is the most commonly used method to extract text from the zone. If you already have OCR'd text or native text from a PDF for your documents, this will likely be the mode you choose.
  • OCR - Text falling in the zone will be reprocessed by a secondary OCR Profile. Existing text from a Recognize activity will be overwritten, using the OCR Profile you specify.
Value Extractor | (none) | Optionally, a secondary data extractor can run on the results extracted from the extraction zone. This can be useful to further narrow the Anchored Extractor’s results. This can be set to "Text Pattern" or "Reference".
Line Separator | | Determines how multiple lines of text are broken up. If left blank, lines are concatenated one after another with no character in between. For example, you could enter a space to separate one line from the next.
Output Full Region | False | By default, any review module will highlight just the extracted text. Set Output Full Region to “True” to show the entire potential extraction zone. This can be useful when configuring the zone.