2.90:Read Zone (Value Extractor)

Read Zone allows you to extract text data in a rectangular region (called an "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text from the same location on a document, or a zone relative to an extracted text anchor or shape location on the document.

Read Zone is a Value Extractor option available to Data Fields in a Data Model. It is also an option for the Positive Extractor and Negative Extractor properties of a Document Type.

About

Highly structured documents organize information into a series of data fields. These fields will have a label identifying what the field contains, such as "Name", and a corresponding value, such as "John Doe". While the values for these fields will change from document to document, their position on the document will remain constant.

The Read Zone extractor extracts data using this feature of document layouts.

As long as you can be reasonably assured the data you want to find will be in the same spot from document to document, you don't necessarily need anything fancier than extracting whatever text is in that known location.

Read Zone populates data in Data Fields by drawing a rectangle on a location on a page. Whatever text was obtained from the Recognize activity (either via OCR or native text extraction) that falls within the boundaries of that rectangle (or "zone") populates the Data Field.
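The idea can be sketched in a few lines of Python. This is only an illustration, not Grooper's implementation: the `Rect` and `Word` types, the inch-based coordinates, the sample word positions, and the rule that a word counts as "inside" the zone when its center falls within it are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    width: float
    height: float

    def contains_point(self, x: float, y: float) -> bool:
        return (self.left <= x <= self.left + self.width
                and self.top <= y <= self.top + self.height)


@dataclass(frozen=True)
class Word:
    text: str
    box: Rect


def read_zone(words: List[Word], zone: Rect) -> str:
    """Join, in reading order, the words whose centers fall inside the zone."""
    hits = [w for w in words
            if zone.contains_point(w.box.left + w.box.width / 2,
                                   w.box.top + w.box.height / 2)]
    hits.sort(key=lambda w: (w.box.top, w.box.left))
    return " ".join(w.text for w in hits)


# Hypothetical recognized words; positions are inches from the page's top left.
words = [
    Word("1.", Rect(1.0, 1.0, 0.15, 0.15)),
    Word("Last", Rect(1.2, 1.0, 0.3, 0.15)),
    Word("Name", Rect(1.55, 1.0, 0.4, 0.15)),
    Word("Cleaugh", Rect(1.0, 1.3, 0.7, 0.2)),
]
zone = Rect(left=0.9, top=1.25, width=2.0, height=0.3)
print(read_zone(words, zone))
```

Moving the same zone up by 0.3 inches would return the label text instead of the value, which is the failure mode that anchored zones are meant to avoid.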

For the zone drawn on the document...	...the text data falling within that zone will be extracted.

Read Zone can also anchor this extraction zone to another location on the document. For example, due to issues with printing or scanning, the location of the value may shift from document to document. A fixed zone could extract the data fine on one document but be slightly off on another.

The margins here are different from the document above...	...resulting in the wrong extracted data.

Several configuration options allow you to place the extraction zone relative to another piece of information. This serves as an "anchor" for the zone. Instead of a fixed position on all documents, the zone is placed relative to this anchor's position. For example, in this case the label "1. Last Name" could be an anchor. If you can pattern match that field label with a regular expression, the zone you draw on the document will extract the value relative to that label's position.

Anchored off the field label...	...the zone falls on the right page location.

FYI

Read Zone is new to version 2.90. Similar functionality was performed by Zonal Extract and Anchored Extract in version 2.80 or using "Data Element Profiles" in older versions.

Use Cases

The Read Zone Value Extractor can be an effective way to extract data from highly structured documents. As long as the physical positions of a field label and its corresponding value are relatively fixed on a document, this can be a reliable way to get data out of your documents. In many cases, little or even no regular expression work is required to pull information from your documents.

Read Zone can also be a great way to target fields where traditional Key-Value Pair approaches have difficulty due to poor document layouts. The traditional approach of using Key-Value Pair collated Data Types to understand Horizontal and Vertical relationships between the label and value can fail or produce undesirable results in certain cases where a form's designer chose to align the label to the top-left corner of a bounding box, while aligning the value to the bottom right corner of a bounding box. Read Zone using the Relative Region or Text Region options can still understand the relationship without having to worry about these formatting challenges.

Snapping to Lines

If the data you want falls within a bounding box, effectively encapsulated in a box on the page, you can leverage Grooper's image processing capabilities to use the line locations to extract all data within the line boundaries. This can make configuration of Read Zone much easier. If you can place the initial extraction zone somewhere inside the larger box around it, the Auto Snap functionality will automatically expand the zone to fill the box's space.

In the case below, Read Zone could be configured to place the initial zone by locating the label "5. E-mail Address" and then expand the zone to the edges of the full box. This will extract the email address, and the label can be excluded from the returned data easily.

Before snapping to lines	After snapping to lines

For more information, visit the Auto Snap - Using Line Layout Data to Your Advantage section of the How To tutorials in this article.

OCR Reprocessing

Text inside the extraction zone can be reprocessed by a second OCR Profile. This is extremely useful on documents where the labels are easily extracted by one OCR Profile, but the values themselves are more accurately read by a different one. For example, one OCR engine may perform better on the font used to identify labels, but a second may do better at the one used for values. Grooper 2.80 and later comes installed with Transym and Tesseract OCR engines. Transym does a great job recognizing most fonts. However, it can do a poor job at recognizing the OCRA font. Tesseract has unique functionality to handle the OCRA font.

In the example below, the text reading "Wyatt" inside the extraction zone could be reprocessed by an OCR Profile using the Tesseract engine to accurately extract the name "Wyatt".

For more information, visit the Re-OCRing the Zone section of the How To tutorials in this article.

How To

If you wish to follow along with the tutorials in this section, you may download the zip file linked below and import it into your own Grooper Repository. For more information on importing Grooper objects into a Grooper Repository, visit the Import or Export Grooper Objects article.

Media:Read Zone (2.90).zip

Enable Read Zone

Read Zone is an option for the Value Extractor property of a Data Field.

To use this extractor, select a Data Field in a Data Model.
Select the Value Extractor property.
Choose Read Zone from the dropdown menu.

The Read Zone extractor has four Location property options. You must choose one of these options in order for Read Zone to function.

Expand the Read Zone sub-properties.
Choose your Location option.

Each one has slightly different functionality and configurations. The four Location options are as follows:

Fixed Region
Relative Region
Shape Region
Text Region

Each option is detailed in the How To sections below.

Fixed Region

The Fixed Region option is the simplest to set up. As the name implies, the extraction zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box where you want to extract data.

Draw The Zone Bounds

Expand out the Location sub-properties and select the Bounds property.
Press the ellipsis button at the end.
This will bring up the "Edit Zone" window. Press the "Select Region" button, if it is not already selected.
With your mouse, draw a box around the text you want to select. Remember, any text falling inside of this box will be extracted. Any outside of the box will be missed. Make sure your box is the appropriate size to capture all field values for this document.

You will place a green box on the page. Any text falling within this box will be extracted. You can move the box around the page and use the transform controls on the corners and edges of the box to edit its width and height (or use the Left, Top, Width, and Height properties).

Press the "Ok" button when finished placing the zone.

Test Extraction

With a Document Folder selected, press the Test Extraction button to verify our results.

Success! The last name "Cleaugh" is extracted from the OCR text of this document.

Notice the green box around "Cleaugh" on the page only extends to the size of the text value extracted. This can lead to some confusion as to what is or is not being extracted, and why, while testing your Read Zone configurations. When configuring Read Zone, it can be useful to see the full size of the box you drew earlier. That is why the Output Full Region property exists.

Turn the Output Full Region property to True.
Press the "Test Extraction" button again.
This changes absolutely nothing in terms of what data is extracted, but can be useful in your configuration testing. We will keep this property set to True for this example and the other Location option examples.

A Word Of Caution

Remember, the Fixed Region location's extraction zone stays in the same physical location on the page from document to document. If the text you're trying to extract shifts locations due to scanning irregularities or a new document format, this method has the potential to extract the wrong data.

For example, take the two documents here. They are the same document, but one has very different margins than the other. While the extraction zone we configured earlier falls on the last name in the document on the left, it does not in the document on the right.

Whatever text falls within that extraction zone is extracted. As you can see, for the second document, the text "1. Last Name" is extracted instead of "Cleaugh".

If your documents are not totally uniform and you're running into issues like this, you may want to explore the other Location options detailed in the tutorials below.

Relative Region

Instead of setting the extraction zone in a fixed location for every document, the Relative Region mode will anchor the zone to a text label on the document. Its position will change relative to the label's position on the document, but will still have the same drawn dimensions.

Set the Label Extractor

The first thing you must do is return a label on the document with an extractor. This is the "anchor" for the extraction zone.

With the Relative Region method's sub-properties expanded, expand the Anchor properties.
Select the Label Extractor property. This extractor will locate the text in the document you want to anchor the extraction zone to.
- This can be a simple internal Text Pattern or a Reference to an extractor elsewhere in the Node Tree. In our case, the label can be returned with a simple regular expression, so we have chosen Text Pattern.

Regardless of whether you use a Text Pattern or a Reference extractor, your goal will be the same. You want to return some field label identifying the value you want to extract. If we want to ultimately extract the first name from these documents, we know that name will be next to the label "2. First Name".

Even if the position of the text in that field changes due to irregularities like we've seen before, the value's location (here "Anissa") should stay more or less the same relative to the label "2. First Name".

In this case, a simple regular expression in the Value Pattern will match that label's text.

2\. First Name

Note: The pattern also produces matches on the second and third pages of this document. In our case, this will not matter. The relative distance between the label and the value is functionally the same on each page and produces the same result, and the first value on the first page is what will be returned to our Data Model. However, be aware of this as a potential issue. You may need to narrow down your results to the proper label using various extraction techniques, such as a page filter.
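The behavior of the label pattern can be checked outside Grooper with a short script. This sketch uses Python's `re` module, which is not the regex engine Grooper uses, and the page text is made up for illustration. It shows the pattern matching once per page and why the dot after the "2" is escaped.

```python
import re

# Made-up page text standing in for the recognized text of a three-page document.
pages = [
    "1. Last Name Cleaugh\n2. First Name Anissa\n",
    "Continued\n2. First Name Anissa\n",
    "2. First Name Anissa\n",
]

label = re.compile(r"2\. First Name")

# Collect (page number, character offset) for every match on every page.
matches = [(page_no, m.start())
           for page_no, text in enumerate(pages, start=1)
           for m in label.finditer(text)]
print(matches)

# Without the backslash, "." matches any character, so stray text can match too.
unescaped = re.compile(r"2. First Name")
print(unescaped.search("2x First Name") is not None)
print(label.search("2x First Name") is not None)
```

Taking `matches[0]` mirrors the note above: the first result on the first page is the one used, but all three are candidates until they are filtered.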

Set the Label Location

Once the anchor's text is returned, Grooper has positional coordinates for the anchor. Next, you must set where the extraction zone's location is in relation to this anchor.

First, you must set the Relative To property.
- By default, this is set to TopLeft. This means the anchor point will be the top left corner of the logical boundaries of the anchor extractor's result. You can change this to a variety of other positions, such as the bottom right corner or middle center of the result.
Select the Location property and press the ellipsis button at the end.
This will bring up the "Select Anchor Dialog" window. In the "Results" panel, you will see the list of results returned by your anchor extractor. Select the result whose position you want to use.
Notice there is a red dot in the top left corner of the anchor result selected. This is the relative anchor point that will be used for placing the extraction zone. It is in the top left corner of the result because we kept the Relative To property set to TopLeft.
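As a rough sketch of what the Relative To setting means geometrically, the anchor point can be computed from the result's rectangle. The `Rect` type and the inch units are assumptions, and apart from TopLeft the position names in the table below are guesses at how the remaining options might be spelled.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    left: float
    top: float
    width: float
    height: float


# Fraction of the width and height at which each named position sits.
# Only "TopLeft" is a name taken from the text; the rest are assumed.
POSITIONS = {
    "TopLeft": (0.0, 0.0), "TopCenter": (0.5, 0.0), "TopRight": (1.0, 0.0),
    "MiddleLeft": (0.0, 0.5), "MiddleCenter": (0.5, 0.5), "MiddleRight": (1.0, 0.5),
    "BottomLeft": (0.0, 1.0), "BottomCenter": (0.5, 1.0), "BottomRight": (1.0, 1.0),
}


def anchor_point(result: Rect, relative_to: str = "TopLeft") -> Tuple[float, float]:
    """Return the (x, y) point on the result's rectangle to measure from."""
    fx, fy = POSITIONS[relative_to]
    return (result.left + fx * result.width, result.top + fy * result.height)


# Hypothetical location of the label result, in inches.
label_box = Rect(left=1.0, top=2.0, width=1.5, height=0.25)
print(anchor_point(label_box))                  # top left corner of the label
print(anchor_point(label_box, "BottomRight"))   # bottom right corner of the label
```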

Set the Zone Bounds

Next, just like for the Fixed Region mode, you must draw a box for the extraction zone.

Select the Bounds property and press the ellipsis button at the end.
This will bring up the "Edit Zone" window. Press the "Select Region" button, if it is not already selected.
With your mouse, draw a box around the text you want to select. Remember, any text falling inside of this box will be extracted. Any outside of the box will be missed. Make sure your box is the appropriate size to capture all field values for this document.
Press the "Ok" button when finished placing the zone.

Test Extraction

These are the bare minimum requirements for setting up the Relative Region mode's extraction. This configuration will properly extract the First Name field for both the document with normal margins...

...and the one with the abnormal margins. The extraction zone is no longer a fixed location from document to document, but is placed relative to the text anchor's location. The anchor is outlined in blue on the document.

Note: The Output Full Region property here is set to True, displaying the full extraction zone in green on the document.
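One way to picture why the anchored zone survives a margin shift: the drawn zone is stored as an offset from the anchor point, and that offset is reapplied to wherever the anchor is found on each document. The sketch below assumes this model; the function names and coordinates are hypothetical and only illustrate the arithmetic.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    left: float
    top: float
    width: float
    height: float


Offset = Tuple[float, float, float, float]


def offset_from_anchor(anchor_xy: Tuple[float, float], zone: Rect) -> Offset:
    """At design time: remember where the drawn zone sits relative to the anchor."""
    ax, ay = anchor_xy
    return (zone.left - ax, zone.top - ay, zone.width, zone.height)


def place_zone(anchor_xy: Tuple[float, float], offset: Offset) -> Rect:
    """At run time: rebuild the zone from the anchor found on this document."""
    ax, ay = anchor_xy
    dx, dy, width, height = offset
    return Rect(ax + dx, ay + dy, width, height)


# Sample document: the label's top left corner is at (1.0, 2.0) inches and the
# zone was drawn just below it.
offset = offset_from_anchor((1.0, 2.0), Rect(1.0, 2.25, 2.0, 0.5))

# Second document with different margins: the label sits 0.5in further right
# and 0.75in lower, and the zone follows it.
zone = place_zone((1.5, 2.75), offset)
print(zone)
```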

Text Region

The Text Region option creates an extraction zone using the logical boundaries of an extraction result. This can just return all the text falling within the boundaries of the rectangle around the extractor's result.

This can also be configured to provide results in a way similar to the Relative Region option, using text anchors located by an extractor to position the extraction zone. This means both methods can be used to position the zone relative to a point from document to document. The main difference is in how the zone is drawn.

Set the Text Extractor

The first thing you must do is return a text label on the document with an extractor. This is the "anchor" for the extraction zone.

With the Text Region method's sub-properties expanded, select the Text Extractor property to set an extractor. This extractor's result will draw the box around the portion of the document you want to extract.
- This can be a simple internal Text Pattern or a Reference to an extractor elsewhere in the Node Tree.
- In our case, we can match what we want with a simple regular expression. We have chosen Text Pattern.

Regardless of whether you use a Text Pattern or a Reference extractor, the goal will be the same. For this tutorial, we want to return a field label identifying the value we want to extract. If we want to ultimately extract the date of birth value from these documents, we know the birthdate will be next to the label "4. Birth Date".

Even if the position of the text in that field changes due to irregularities like we've seen before, the value's location (here "3/27/95") should stay more or less the same relative to the label "4. Birth Date".

In this case, a simple regular expression in the Value Pattern will match that label's text.

4\. Birth Date

If you press the "Test Extraction" button, you'll see where the Text Region option starts to differ from the Relative Region option.

With no further configuration, the extraction zone drawn is simply the boundaries of the Text Extractor result. So, we see "4. Birth Date" populating the field.

If we want to use this as a positional anchor to find the actual Birth Date "3/27/95", we must configure the Anchor Point, Translation, and Adjustment properties.

Adjusting the Anchor Point of the Zone

There are a variety of ways we could configure the extraction zone to extract what we want. First, we will look at the Anchor Point property and its effects on the extraction zone.

Similar to Relative Region, we can make a relative anchor point from the text extractor's result.

Select the Anchor Point property and expand the dropdown list to select an anchor point. This will give us a position on the page from which the extraction zone can be drawn.
- We will choose TopLeft for this example.

This will give us two new properties: Move To and Size.

The Move To property will move the extraction zone from whatever anchor point you selected to a new position within the text extractor's result boundaries.

Here, we have set it to BottomLeft.
Upon testing extraction, you can see the extraction zone remains the same size as before, but has been moved. The original Anchor Point of TopLeft has been moved to the BottomLeft corner.

However, the zone is too small to extract anything.

The Size property allows you to alter the size of the extraction zone.

Expand the Size property and change the Width to 1in and Height to 0.5in.
This will create a 1 inch by 0.5 inch extraction zone, starting at the anchor point's location.
The date now falls within that zone and is successfully extracted.
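The interaction of Anchor Point, Move To, and Size can be sketched as follows. This is one reading of the behavior described above, not a statement of how Grooper computes it: the zone's Anchor Point corner is pinned to the Move To corner of the text extractor's result, and Size then fixes the zone's dimensions. The `Rect` type, the `position_zone` function, and the label coordinates are hypothetical.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    width: float
    height: float


# Corner names as (horizontal fraction, vertical fraction) of a rectangle.
CORNERS = {"TopLeft": (0, 0), "TopRight": (1, 0),
           "BottomLeft": (0, 1), "BottomRight": (1, 1)}


def position_zone(result: Rect, anchor_point: str, move_to: str,
                  width: float, height: float) -> Rect:
    # Point on the text result that the zone is pinned to.
    mx, my = CORNERS[move_to]
    pin_x = result.left + mx * result.width
    pin_y = result.top + my * result.height
    # Shift the zone so that its own anchor corner lands on the pin.
    ax, ay = CORNERS[anchor_point]
    return Rect(pin_x - ax * width, pin_y - ay * height, width, height)


# Hypothetical location of the "4. Birth Date" label, in inches.
label = Rect(left=1.0, top=3.0, width=1.0, height=0.25)
zone = position_zone(label, "TopLeft", "BottomLeft", width=1.0, height=0.5)
print(zone)
```

With these settings the zone starts at the label's bottom left corner and extends 1 inch right and 0.5 inches down, which is where the date sits in the tutorial document.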

Adjusting the Translation and Adjustment Properties

You can accomplish the same goal using the Translation and Adjustment properties. These properties also manipulate the size and location of the extraction zone, just in different ways.

The Translation property will move the extraction zone (supplied by the text extractor) along the X and/or Y axis of the page. Logically, we want the extraction zone to be slightly below the label "4. Birth Date".

Expand the Translation properties and enter 0in for the X translation shift and 0.1in for the Y.
Upon testing extraction, you can see we moved the box 0.1 inches down along the Y axis and no inches to the left or right along the X axis.
- Positive values will move the zone down the Y axis and right along the X axis.
- Negative values will move up the Y axis and left along the X axis.

⚠	Even if you do not want to shift the translation across either the X or Y axis, you must enter in a value of "0in" like we did above. You will get an error if you configure one axis but leave the other blank.

The Adjustment property will adjust the size of the extraction zone. Here, you can adjust the size of the Left, Right, Top, and Bottom edges of the extraction zone.

Expand the Adjustment properties and enter 0.5in for the Right adjustment and 0.25in for the Bottom.
Upon testing extraction, you can see we expanded the box's right boundary 0.5 inches and expanded the bottom boundary 0.25 inches.
- Positive values will expand a boundary. Negative values will shrink a boundary.
Now, since the extraction zone fully overlaps the date, it is extracted properly.
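The two steps above amount to simple rectangle arithmetic. The sketch below assumes inch units with the origin at the page's top left; the `translate` and `adjust` functions and the label coordinates are hypothetical stand-ins for the Translation and Adjustment properties.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    width: float
    height: float


def translate(zone: Rect, x: float = 0.0, y: float = 0.0) -> Rect:
    """Move the zone: positive x is right, positive y is down."""
    return Rect(zone.left + x, zone.top + y, zone.width, zone.height)


def adjust(zone: Rect, left: float = 0.0, right: float = 0.0,
           top: float = 0.0, bottom: float = 0.0) -> Rect:
    """Push each edge outward by a positive amount, inward by a negative one."""
    return Rect(zone.left - left, zone.top - top,
                zone.width + left + right, zone.height + top + bottom)


# Hypothetical location of the "4. Birth Date" label result, in inches.
label = Rect(left=1.0, top=3.0, width=1.0, height=0.25)

# Same values as the tutorial: shift 0.1in down, then grow right and bottom edges.
zone = adjust(translate(label, x=0.0, y=0.1), right=0.5, bottom=0.25)
print(zone)
```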

FYI

The main difference between the Adjustment property and the Size property (seen when adjusting the Anchor Point) is that Size is an absolute change to the extraction zone's size, whereas Adjustment is a relative change. If the text extractor's result is larger on one document and smaller on another, Adjustment changes the extraction zone's size relative to the result on each document. The Size property, however, normalizes the size of the extraction zone across all documents.
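A tiny numeric illustration of that difference, using made-up label widths for two documents (the function names are hypothetical):

```python
def adjusted_width(result_width: float, right: float) -> float:
    """Adjustment: a relative change, so the final width depends on the result."""
    return result_width + right


def sized_width(result_width: float, width: float) -> float:
    """Size: an absolute value, so the result's own width is ignored."""
    return width


# Hypothetical widths (in inches) of the same label as found on two documents.
for result_width in (1.0, 1.25):
    print(adjusted_width(result_width, right=0.5),
          sized_width(result_width, width=1.5))
```

The first column changes from document to document while the second stays constant.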

Shape Region

The Shape Region option is extremely similar to the Text Region option. However, instead of using text to anchor the extraction zone, it uses a shape detected from a Shape Detection or Shape Removal IP Command.

Prereqs - Layout Data Collection

In order to use a shape as the anchor point, you must first find the shape and save its location on the page. This can be done with a Shape Detection or Shape Removal IP command in an IP Profile. After applying that IP Profile during an Image Processing or Recognize activity, that data will be saved to the page's "LayoutData.json" file in Grooper.

You can see here the temporary IP Profile used during the Recognize activity for this document.

It has a Shape Detection IP Command as one of its IP Steps.
It is configured to find the "ABC Company" logo on this document, using sample images of that logo.
- Note, we gave the logo a Shape Name of Logo. This will allow us to identify the shape as an anchor point.
The shape is detected.

Assign the Shape Name

First, you need to tell Grooper what shape you're looking for. This will match up with the Shape Name you assigned to the Shape Detection or Shape Removal IP Command in the IP Profile. We named the detected shape "Logo".

Type in the name of the shape you're detecting in the Shape Name property. Here, we entered Logo.
The extraction zone is drawn around the detected shape. Here, the ABC Company logo.
Text obtained by the Recognize activity falling in the zone is returned.
- Here, it's just the text OCR picked up from the logo.

Adjust the Anchor, Location, and Size of the Zone

Shape Region has the same set of properties that Text Region does to adjust the anchor, location, and size of the extraction zone: Anchor Point, Translation, and Adjustment.

For our case, we're going to use the detected shape as the anchor to find the address in the "Address Line 1" box. We don't need to use the Anchor Point property at all in this case. We just need to move the zone down and expand it out to the left a little.

The Translation properties will move the zone.
- Here, we set the X property to 0in and Y property to 1.4in.
- Positive values will move the zone down the Y axis and right along the X axis.
- Negative values will move up the Y axis and left along the X axis.

The Adjustment property will adjust the size of the zone.
- Here, we set the Left property to 1in.
- Positive values will expand a boundary. Negative values will shrink a boundary.
The zone now overlaps the text we want to extract entirely.
The text behind the zone is returned to the Data Field.
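The Shape Region steps can be sketched the same way, with the shape looked up by name before the zone is moved and widened. The dictionary of detected shapes is a hypothetical stand-in for the saved layout data (its real format is not shown here), and `shape_region` is an illustrative function, not a Grooper API.

```python
from typing import Dict, NamedTuple, Optional


class Rect(NamedTuple):
    left: float
    top: float
    width: float
    height: float


def shape_region(shapes: Dict[str, Rect], shape_name: str,
                 x: float = 0.0, y: float = 0.0,
                 left: float = 0.0) -> Optional[Rect]:
    shape = shapes.get(shape_name)
    if shape is None:
        return None  # the named shape was not detected on this page
    # Translation moves the whole zone; the Left adjustment widens it leftward.
    return Rect(shape.left + x - left, shape.top + y,
                shape.width + left, shape.height)


# Hypothetical detected shape location, in inches.
detected = {"Logo": Rect(left=3.0, top=0.5, width=2.0, height=1.0)}

# Same values as the tutorial: X 0in, Y 1.4in, Left 1in.
zone = shape_region(detected, "Logo", x=0.0, y=1.4, left=1.0)
print(zone)
```

Returning `None` for a missing shape is a design choice for the sketch: without the anchor there is nowhere sensible to place the zone.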

Auto Snap - Using Line Layout Data to Your Advantage

Many documents organize fields into cells, effectively placing the value you want to extract in a box. You can use this to your advantage by "snapping" the extraction zone to the lines around it. This can allow you to get close to the right zone first and then expand the zone's boundaries to the edges of the whole box.

Prereqs - Layout Data Collection

In order to snap to lines, you must first find and save that line location information. This can be done with a Line Detection or Line Removal IP command in an IP Profile. After applying that IP Profile during an Image Processing or Recognize activity, that data will be saved to the page's "LayoutData.json" file in Grooper.

Establish the Initial Zone

All Location options (Fixed Region, Relative Region, Shape Region, and Text Region) have an Auto Snap property. Enabling this property will allow snapping the zone to lines surrounding it. However, first you must establish the initial extraction zone.

For this example, we have set the Location property to Text Region and the Text Extractor to a Text Pattern matching "5. E-mail Address".

You can see here, our zone returns the whole label.

Enable Auto Snap

Change the Auto Snap property from Disabled to Enabled.
Since we've obtained line location information saved on this page's "LayoutData.json" file, the extraction zone expands to the line boundaries. Each edge of the zone expands until a line is found, returning the contents of the entire cell.
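The snapping behavior can be approximated with a short sketch. It makes a simplifying assumption that is not from the source: every detected line is treated as spanning the whole page, so a vertical line is just an x position and a horizontal line is just a y position. The `auto_snap` function, the `Rect` type, and all coordinates are hypothetical.

```python
from typing import List, NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def auto_snap(zone: Rect, vertical_lines: List[float],
              horizontal_lines: List[float]) -> Rect:
    """Grow each edge outward to the nearest line; keep the edge if none is found."""
    left = max((x for x in vertical_lines if x <= zone.left), default=zone.left)
    right = min((x for x in vertical_lines if x >= zone.right), default=zone.right)
    top = max((y for y in horizontal_lines if y <= zone.top), default=zone.top)
    bottom = min((y for y in horizontal_lines if y >= zone.bottom), default=zone.bottom)
    return Rect(left, top, right, bottom)


# Initial zone around the label text, and line positions for the table cells.
label_zone = Rect(left=1.1, top=4.05, right=2.3, bottom=4.2)
snapped = auto_snap(label_zone,
                    vertical_lines=[1.0, 4.5, 8.0],
                    horizontal_lines=[4.0, 4.6, 5.2])
print(snapped)
```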

Optionally Exclude the Anchor's Text

Notice that, because the entire zone's text is returned, the result includes the anchor's text returned by the text extractor. Often, this is some kind of label, and you don't really want it returned, as is the case here. That is what the Exclude Anchor property is for.
You can remove the text extracted by the anchor or text extractor by turning the Exclude Anchor property to True. This will delete the anchor's text from the result. Here, this leaves us with just the email address.
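Conceptually, excluding the anchor is a subtraction of the label's text from the zone's text. The sketch below does this with plain string handling, which is an assumption made for illustration; Grooper may well remove the anchor by position rather than by string matching.

```python
def exclude_anchor(zone_text: str, anchor_text: str) -> str:
    """Drop the first occurrence of the anchor's text and tidy the whitespace."""
    remaining = zone_text.replace(anchor_text, "", 1)
    return " ".join(remaining.split())


print(exclude_anchor("5. E-mail Address\ncfears5@sitemeter.com", "5. E-mail Address"))
```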

Re-OCRing the Zone

One use case for Read Zone is the ability to reprocess the text within the zone with a different OCR Profile than the one originally used during the Recognize activity.

Prereqs - Create a Secondary OCR Profile

The document used in this tutorial uses a specialized font for the field values, the OCR-A font. While this font was originally created for OCR, modern OCR engines often have a hard time recognizing it. However, the Tesseract engine can be trained on fonts, allowing you to improve the OCR accuracy of non-standard fonts.

Training data for the OCR-A font ships with all Grooper installs (post version 2.72). Here we have a very simple secondary OCR Profile, using Tesseract OCR for the OCR Engine, and the OCRA font checked as a Special Fonts option.

The main OCR results were supplied by an OCR Profile using the Transym OCR 4.0 engine.

Establish the Zone

Any of the four Location options can perform another OCR pass on the extraction zone.

For this example, we've established our extraction zone using the Text Region Location option. The Text Extractor locates the text anchor "Docket Number" on the page. Auto Snap is Enabled to expand the boundaries of the zone to the edges of the box surrounding the anchor. The anchor's text is removed by enabling the Exclude Anchor property.

This should return the docket number "055-761349". However, the main OCR Profile, using Transym, did not recognize the number well, returning "055 - 7613 L 9".

Assign the Secondary OCR Profile

To reprocess this portion of the document with a different OCR Profile, use the OCR Profile property to assign a secondary OCR Profile.

Here, we have set the OCR Profile to the profile using the Tesseract OCR engine.
OCR runs on the portion of the document covered by the extraction zone.
The new OCR results are returned to the Data Field.
- In this case, the secondary OCR profile provides the accurate result.
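The control flow of the second OCR pass can be sketched without committing to any particular engine. Everything here is hypothetical: the `Rect` type, the 300 DPI figure, and the `stub_engine` function, which simply stands in for cropping the page image to the zone and running the secondary OCR engine on it.

```python
from typing import Callable, NamedTuple, Optional, Tuple

PixelBox = Tuple[int, int, int, int]


class Rect(NamedTuple):
    left: float
    top: float
    width: float
    height: float


def to_pixels(zone: Rect, dpi: int) -> PixelBox:
    """Convert an inch-based zone to a (left, top, right, bottom) pixel box."""
    return (round(zone.left * dpi), round(zone.top * dpi),
            round((zone.left + zone.width) * dpi),
            round((zone.top + zone.height) * dpi))


def zone_text(zone: Rect, dpi: int, first_pass_text: str,
              second_pass: Optional[Callable[[PixelBox], str]] = None) -> str:
    """Use the first-pass text unless a secondary engine is assigned."""
    if second_pass is None:
        return first_pass_text
    return second_pass(to_pixels(zone, dpi))


def stub_engine(box: PixelBox) -> str:
    # Placeholder: a real implementation would crop the page image to `box`
    # and run the secondary OCR engine on the crop.
    return "055-761349"


zone = Rect(left=1.0, top=2.0, width=2.5, height=0.5)
print(zone_text(zone, 300, "055 - 7613 L 9"))
print(zone_text(zone, 300, "055 - 7613 L 9", second_pass=stub_engine))
```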

Data Instancing: Using Read Zone's Value Extractor Property

While Read Zone is an option for a Data Field's Value Extractor property, you may have noticed Read Zone has its own Value Extractor sub-property! While the term "Value Extractor" usually pertains to an extractor supplying data values to a Data Field, it can be a catch-all term for any extractor that finds and returns a value.

As a sub-property of Read Zone the Value Extractor property allows a Text Pattern or Reference extractor to parse data from only the extraction zone's data instance. Instead of matching and returning data against the whole document, this Value Extractor will only match and return data inside the extraction zone.

For example, this Data Field uses Read Zone to return the email address in the "5. E-mail Address" field of the document. We could use the Value Extractor of Read Zone to match only the local part of the email address (the "cfears5" in "cfears5@sitemeter.com" in this case).

The Value Extractor can be set to either Text Pattern or Reference. In this case, we chose Text Pattern and used a simple regex to match everything up to the "@" symbol: [^@]+

Notice, when we enter this pattern into the Value Pattern of the Pattern Editor, we do not match just the local part of that email address. Rather, we get three results matching a great deal of the document.

That is because the Pattern Editor is currently matching against the document's data instance. However, when the pattern actually runs during extraction, it will execute against the zone's data instance, which in this case only contains the text "cfears5@sitemeter.com".

With this pattern set as a Text Pattern for the Value Extractor property...
Press the "Test Extraction" button.
Since this extractor only runs against the extraction zone's data instance, only the text inside that instance is matched, producing "cfears5".
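The difference between matching against the whole document and matching against only the zone's text can be reproduced with Python's `re` module (not the engine Grooper uses). The surrounding document text is made up; only the email address comes from the example above.

```python
import re

pattern = re.compile(r"[^@]+")

# Made-up document text; the zone contains only the email address.
document_text = "5. E-mail Address\ncfears5@sitemeter.com\n6. Phone"
zone_text = "cfears5@sitemeter.com"

# Against the whole document, the pattern swallows everything between "@" signs.
print(pattern.findall(document_text))

# Against the zone's text alone, the first match is just the local part.
print(pattern.search(zone_text).group())
```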

Differences Between Read Zone and OCR Reader

In a lot of ways, Read Zone and the OCR Reader Post Processing option for Data Types are very similar to each other. Both can use text anchors and extraction zones to return data inside the drawn boundary on the page.

Both have a Value Extractor property to parse data within the zone's data instance as well. However, the results are collated and returned in one drastically different way.

Read Zone will return only the first match returned.
OCR Reader will return all matches concatenated together.

The differences are not necessarily good or bad. It just depends on your needs which one you will want to use.

For example, we can create a Data Type and set the Post Processing property to OCR Reader and configure it to produce an extraction zone covering the "5. E-mail Address" field, seen here. It returns the full email "cfears5@sitemeter.com" just like our example above.
Here, we configured the Value Extractor with the exact same Value Pattern as we used for Read Zone: [^@]+ This pattern actually matches both the local part of the email address ("cfears5") and the domain ("sitemeter.com"). Instead of returning only the first result, OCR Reader returns all matches concatenated together, or in this case "cfears5sitemeter.com".
Just like many other places in Grooper, you may use the Value Separator property to insert a character (or several characters) between each result. For example, entering the pipe character (|) here will create a pipe-delimited list of results. You can see here, this changes the output to "cfears5|sitemeter.com".
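The three outcomes described in this section can be mimicked in Python to make the collation difference concrete. This only imitates the described behavior with `re.findall` and `str.join`; it says nothing about how either feature is implemented.

```python
import re

zone_text = "cfears5@sitemeter.com"
matches = re.findall(r"[^@]+", zone_text)

first_only = matches[0]               # Read Zone style: first match only
concatenated = "".join(matches)       # OCR Reader style: all matches joined
pipe_delimited = "|".join(matches)    # OCR Reader with "|" as the Value Separator

print(first_only)
print(concatenated)
print(pipe_delimited)
```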

Version Differences

Read Zone is a new Value Extractor option available to Data Fields in Grooper Version 2.90. In version 2.80, similar functionality could be achieved via the Anchored Extract and Zonal Extract options. In versions older than 2.80, similar functionality could be achieved using "Data Element Profiles" of Document Type objects.