2.90:Read Zone (Value Extractor)
Read Zone allows you to extract text data in a rectangular region (called an "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text from the same location on every document, or a zone positioned relative to an extracted text anchor or shape location on the document.
Read Zone is a Value Extractor option available to Data Fields in a Data Model. It is also an option for the Positive Extractor and Negative Extractor properties of a Document Type.
About
|
Highly structured documents organize information into a series of data fields. These fields will have a label identifying what the field contains, such as "Name", and a corresponding value, such as "John Doe". While the values for these fields will change from document to document, their position on the document will remain constant. |
|
The Read Zone extractor extracts data using this feature of document layouts. As long as you can be reasonably assured the data you want to find will be in the same spot from document to document, you don't necessarily need anything fancier than extracting whatever text is in that known location. |
Read Zone populates data in Data Fields by drawing a rectangle on a location on a page. Whatever text was obtained from the Recognize activity (either via OCR or native text extraction) that falls within the boundaries of that rectangle (or "zone") populates the Data Field.
| For the zone drawn on the document... | ...the text data falling within that zone will be extracted. |
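The idea of "whatever text falls within the rectangle" can be sketched in a few lines of Python. This is only an illustration, not Grooper's implementation: the `Box`, `Word`, and `read_zone` names are hypothetical, coordinates are arbitrary pixel units, and the rule that a word must lie entirely inside the zone is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in page units (pixels in this sketch)."""
    left: int
    top: int
    width: int
    height: int

    def contains(self, other: "Box") -> bool:
        # True when `other` lies entirely inside this box (assumed rule).
        return (
            self.left <= other.left
            and self.top <= other.top
            and other.left + other.width <= self.left + self.width
            and other.top + other.height <= self.top + self.height
        )


@dataclass(frozen=True)
class Word:
    """One recognized word and where it sits on the page."""
    text: str
    box: Box


def read_zone(words: List[Word], zone: Box) -> str:
    """Return the text of every word inside the zone, in reading order."""
    inside = [w for w in words if zone.contains(w.box)]
    inside.sort(key=lambda w: (w.box.top, w.box.left))
    return " ".join(w.text for w in inside)


# Hypothetical page: a label on one line and its value on the next.
page_words = [
    Word("Name", Box(100, 200, 60, 20)),
    Word("John", Box(100, 230, 50, 20)),
    Word("Doe", Box(160, 230, 40, 20)),
]
zone = Box(90, 225, 200, 30)
print(read_zone(page_words, zone))  # John Doe
```

Moving the same zone up by 30 units would return the label "Name" instead of the value, which is exactly the failure a shifted document causes with a fixed zone.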
Read Zone can also anchor this extraction zone to another location on the document. For example, due to issues with printing or scanning, the location of the value may shift from document to document. It is quite possible that a fixed zone could extract the data correctly on one document but be slightly off on another.
| The margins here are different from the document above... | ...resulting in the wrong extracted data. |
Several configuration options allow you to place the extraction zone relative to another piece of information. This serves as an "anchor" for the zone. Instead of a fixed position on all documents, the zone is placed relative to this anchor's position. For example, in this case the label "1. Last Name" could be an anchor. If you can pattern match that field label with a regular expression, the zone you draw on the document will extract the value relative to that label's position.
| Anchored off the field label... | ...the zone falls on the right page location. |
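As a rough sketch of the anchoring idea, the zone can be stored as an offset from the label measured on one sample document, then re-created on each new document from wherever the label is found. The function names, the `(text, left, top)` line format, and the regular expression below are all assumptions made for illustration, not Grooper's API.

```python
import re
from typing import List, Optional, Tuple

Point = Tuple[int, int]            # (left, top)
Zone = Tuple[int, int, int, int]   # (left, top, width, height)


def find_label(lines: List[Tuple[str, int, int]], pattern: str) -> Optional[Point]:
    """Return the (left, top) of the first text line matching the label pattern."""
    for text, left, top in lines:
        if re.search(pattern, text):
            return (left, top)
    return None


def zone_offset(sample_label: Point, sample_zone: Zone) -> Point:
    """Distance from the label to the zone, measured once on a sample document."""
    return (sample_zone[0] - sample_label[0], sample_zone[1] - sample_label[1])


def place_zone(label: Point, offset: Point, size: Tuple[int, int]) -> Zone:
    """Re-create the zone on another document relative to that document's label."""
    return (label[0] + offset[0], label[1] + offset[1], size[0], size[1])


LABEL = r"1\.\s*Last\s+Name"  # hypothetical pattern for the field label

# Sample document: the zone is drawn just below the label.
sample = [("1. Last Name", 100, 200), ("2. First Name", 400, 200)]
sample_zone = (100, 225, 250, 30)
offset = zone_offset(find_label(sample, LABEL), sample_zone)

# Second document with wider margins: everything is shifted by (60, 60).
shifted = [("1. Last Name", 160, 260), ("2. First Name", 460, 260)]
print(place_zone(find_label(shifted, LABEL), offset, sample_zone[2:]))
# (160, 285, 250, 30)
```

The zone keeps its drawn width and height; only its origin follows the label.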
| FYI | Read Zone is new to version 2.90. Similar functionality was performed by Zonal Extract and Anchored Extract in version 2.80 or using "Data Element Profiles" in older versions. |
Use Cases
The Read Zone Value Extractor can be an effective way to extract data from highly structured documents. As long as the physical positions of a field label and its corresponding value are relatively fixed on a document, this can be a reliable way to get data out of your documents. In many cases, little or even no regular expression work is required to pull information from your documents.
Read Zone can also be a great way to target fields where traditional Key-Value Pair approaches have difficulty due to poor document layouts. The traditional approach of using Key-Value Pair collated Data Types to understand Horizontal and Vertical relationships between the label and value can fail or produce undesirable results in certain cases where a form's designer chose to align the label to the top-left corner of a bounding box, while aligning the value to the bottom right corner of a bounding box. Read Zone using the Relative Region or Text Region options can still understand the relationship without having to worry about these formatting challenges.
Snapping to Lines
If the data you want is enclosed in a box on the page, you can leverage Grooper's image processing capabilities and use the line locations to extract all data within the line boundaries. This can make configuration of Read Zone much easier. If you can place the initial extraction zone somewhere inside the larger box around it, the Auto Snap functionality will automatically expand the zone to fill the box's space.
In the case below, Read Zone could be configured to place the initial zone by locating the label "5. E-mail Address" and then expand the zone to the edges of the full box. This will extract the email address, and the label can be excluded from the returned data easily.
| Before snapping to lines | After snapping to lines |
For more information, visit the Auto Snap - Using Line Layout Data to Your Advantage section of the How To tutorials in this article.
OCR Reprocessing
Text inside the extraction zone can be reprocessed by a second OCR Profile. This is extremely useful on documents where the labels are easily read by one OCR Profile, but the values themselves are more accurately read by a different one. For example, one OCR engine may perform better on the font used for labels, while a second may do better on the font used for values. Grooper 2.80 and later come installed with the Transym and Tesseract OCR engines. Transym does a great job recognizing most fonts. However, it can do a poor job recognizing the OCR-A font. Tesseract has unique functionality to handle the OCR-A font.
In the example below, the text reading "Wyatt" inside the extraction zone could be reprocessed by an OCR Profile using the Tesseract engine to accurately extract the name "Wyatt".

For more information, visit the Re-OCRing the Zone section of the How To tutorials in this article.
How To
If you wish to follow along with the tutorials in this section, you may download the zip file linked below and import it into your own Grooper Repository. For more information on importing Grooper objects into a Grooper Repository, visit the Import or Export Grooper Objects article.
Enable Read Zone
|
Read Zone is an option for the Value Extractor property of a Data Field.
|
|
|
The Read Zone extractor has four Location property options. You must choose one of these options in order for Read Zone to function.
Each one has slightly different functionality and configurations. The four Location options are as follows:
- Fixed Region
- Relative Region
- Text Region
- Shape Region
Each option is detailed in the How To sections below. |
Fixed Region
The Fixed Region option is the simplest to set up. As the name implies, the extraction zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box where you want to extract data.
Draw The Zone Bounds
|
|
|
You will place a green box on the page. Any text falling within this box will be extracted. You can move the box around the page and use the transform controls on the corners and edges of the box to edit its width and height. You can also set these values with the Left, Top, Width, and Height properties.
|
Test Extraction
Success! The last name "Cleugh" is extracted from the OCR text of this document.
|
|
|
A Word Of Caution
|
Remember, the Fixed Region location's extraction zone stays in the same physical location on the page from document to document. If the text you're trying to extract shifts locations due to scanning irregularities or a new document format, this method has the potential to extract the wrong data. For example, take the two documents here. They are the same document, but one has very different margins than the other. While the extraction zone we configured earlier falls on the last name on the left, it does not on the document on the right. |
|
|
Whatever text falls within that extraction zone is extracted. As you can see, for the second document, the text "1. Last Name" is extracted instead of "Cleugh". If your documents are not totally uniform and you're running into issues like this, you may want to explore the other Location options detailed in the tutorials below. |
Relative Region
Instead of setting the extraction zone in a fixed location for every document, the Relative Region mode will anchor the zone to a text label on the document. Its position will change relative to the label's position on the document, but will still have the same drawn dimensions.
Set the Label Extractor
|
The first thing you must do is return a label on the document with an extractor. This is the "anchor" for the extraction zone.
|
|
|
Regardless of whether you use a Text Pattern or a Reference extractor, your goal will be the same. You want to return some field label identifying the value you want to extract. If we want to ultimately extract the first name from these documents, we know that name will be next to the label "2. First Name". Even if the position of the text in that field changes due to irregularities like we've seen before, the value's location (here "Anissa") should stay more or less the same relative to the label "2. First Name". In this case, a simple regular expression in the Value Pattern will match that label's text.
Note: The pattern also produces matches on the second and third pages of this document. In our case, this will not matter. The relative distance between the label and the value is functionally the same on each page and produces the same result, and the first value on the first page is what will be returned to our Data Model. However, be aware of this as a potential issue. You may need to narrow down your results to the proper label using various extraction techniques, such as a page filter. |
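To make the multiple-match caveat concrete, here is a small sketch of a label pattern matching on several pages, with the first match winning unless a page filter is applied. The page texts, the pattern, and the `find_labels` / `first_label` helpers are hypothetical and only approximate the behavior described above.

```python
import re
from typing import List, NamedTuple, Optional


class LabelMatch(NamedTuple):
    page: int   # 1-based page number
    text: str   # matched label text


def find_labels(pages: List[str], pattern: str) -> List[LabelMatch]:
    """Collect every label match on every page, in page order."""
    rx = re.compile(pattern)
    return [
        LabelMatch(number, m.group(0))
        for number, text in enumerate(pages, start=1)
        for m in rx.finditer(text)
    ]


def first_label(matches: List[LabelMatch], page: Optional[int] = None) -> Optional[LabelMatch]:
    """Return the first match, optionally restricted to a single page."""
    for match in matches:
        if page is None or match.page == page:
            return match
    return None


# Hypothetical three-page document where the same label repeats on each page.
pages = [
    "1. Last Name    2. First Name",
    "2. First Name (continued)",
    "2. First Name",
]
matches = find_labels(pages, r"2\.\s*First\s+Name")

print(len(matches))                        # 3
print(first_label(matches).page)           # 1
print(first_label(matches, page=2).page)   # 2
```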
Set the Label Location
|
Once the anchor's text is returned, Grooper has positional coordinates for the anchor. Next, you must set the label's location so that the extraction zone can be positioned in relation to this anchor.
|
Set the Zone Bounds
|
Next, just like for the Fixed Region mode, you must draw a box for the extraction zone.
|
Test Extraction
|
These are the bare minimum requirements for setting up the Relative Region mode's extraction. This will properly extract the First Name field for both the document with normal margins... |
|
|
...and the one with the abnormal margins. The extraction zone is no longer a fixed location from document to document, but is placed relative to the text anchor's location. The anchor is outlined in blue on the document. Note: The Output Full Region property here is set to True, displaying the full extraction zone in green on the document. |
Text Region
The Text Region option creates an extraction zone using the logical boundaries of an extraction result. This can just return all the text falling within the boundaries of the rectangle around the extractor's result.
This can also be configured to provide results in a way similar to the Relative Region option, using text anchors located by an extractor to position the extraction zone. This means both methods can be used to position the zone relative to a point that moves from document to document. The main difference is in how the zone is drawn.
Set the Text Extractor
|
The first thing you must do is return a text label on the document with an extractor. This is the "anchor" for the extraction zone.
|
|
|
Regardless of whether you use a Text Pattern or a Reference extractor, the goal will be the same. For this tutorial, we want to return some field label identifying the value we want to extract. If we want to ultimately extract the date of birth value from these documents, we know the birthdate will be next to the label "4. Birth Date". Even if the position of the text in that field changes due to irregularities like we've seen before, the value's location (here "3/27/95") should stay more or less the same relative to the label "4. Birth Date". In this case, a simple regular expression in the Value Pattern will match that label's text.
|
|
|
If you press the "Test Extraction" button, you'll see where the Text Region option starts to differ from the Relative Region property. With no further configuration, the extraction zone drawn is simply the boundaries of the Text Extractor result. So, we see "4. Birth Date" populating the field. If we want to use this as a positional anchor to find the actual Birth Date "3/27/95", we must configure the Anchor Point, Translation, and Adjustment properties. |
Adjusting the Anchor Point of the Zone
|
There are a variety of ways we could configure the extraction zone to extract what we want. First, we will look at the Anchor Point property and its effects on the extraction zone. Similar to Relative Region, we can set a relative anchor point from the text extractor's result.
|
|
|
This will give us two new properties: Move To and Size. The Move To property will move the extraction zone from whatever anchor point you selected to a new position within the text extractor's result boundaries.
However, the zone is too small to extract anything. |
|
|
The Size property allows you to alter the size of the extraction zone.
|
Adjusting the Translation and Adjustment Properties
|
You can accomplish the same goal using the Translation and Adjustment properties. These properties also manipulate the size and location of the extraction zone, just in different ways. The Translation property will move the extraction zone (supplied by the text extractor) along the X and/or Y axis of the page. Logically, we want the extraction zone to be slightly below the label "4. Birth Date".
|
|||
|
The Adjustment property will adjust the size of the extraction zone. Here, you can adjust the size of the Left, Right, Top, and Bottom edges of the extraction zone.
|
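The combined effect of Translation and Adjustment can be sketched as two small rectangle operations: first shift the zone, then move its edges. The `Zone`, `translate`, and `adjust` names are hypothetical, the units are arbitrary, and the sign convention (positive values push an edge outward) is an assumption rather than Grooper's documented behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """Extraction zone described by its four edges, in page units."""
    left: int
    top: int
    right: int
    bottom: int


def translate(zone: Zone, dx: int = 0, dy: int = 0) -> Zone:
    """Move the whole zone along the X and/or Y axis without resizing it."""
    return Zone(zone.left + dx, zone.top + dy, zone.right + dx, zone.bottom + dy)


def adjust(zone: Zone, left: int = 0, top: int = 0, right: int = 0, bottom: int = 0) -> Zone:
    """Resize the zone edge by edge; positive values push that edge outward (assumed)."""
    return Zone(zone.left - left, zone.top - top, zone.right + right, zone.bottom + bottom)


# Start from the bounds of the label returned by the text extractor (hypothetical values).
label_bounds = Zone(left=100, top=200, right=260, bottom=220)

# Shift the zone down so it sits below the label...
moved = translate(label_bounds, dy=25)
# ...then widen it to the right and extend it downward to cover the value.
final = adjust(moved, right=100, bottom=10)

print(moved)  # Zone(left=100, top=225, right=260, bottom=245)
print(final)  # Zone(left=100, top=225, right=360, bottom=255)
```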
Shape Region
The Shape Region option is extremely similar to the Text Region option. However, instead of using text to anchor the extraction zone, it uses a shape detected from a Shape Detection or Shape Removal IP Command.
Prereqs - Layout Data Collection
|
In order to use a shape as the anchor point, you must first find the shape and save its location on the page. This can be done with a Shape Detection or Shape Removal IP command in an IP Profile. After applying that IP Profile during an Image Processing or Recognize activity, that data will be saved to the page's "LayoutData.json" file in Grooper. You can see here the temporary IP Profile used during the Recognize activity for this document.
|
Assign the Shape Name
|
First, you need to tell Grooper what shape you're looking for. This will match up with the Shape Name you assigned to the Shape Detection or Shape Removal IP Command in the IP Profile. We named the detected shape "Logo".
|
Adjust the Anchor, Location, and Size of the Zone
|
Shape Region has the same set of properties for adjusting the anchor, location, and size of the extraction zone as Text Region does: Anchor Point, Translation, and Adjustment. For our case, we're going to use the detected shape as the anchor to find the address in the "Address Line 1" box. We don't need to use the Anchor Point property at all in this case. We just need to move the zone down and expand it out to the left a little.
|
|
|
Auto Snap - Using Line Layout Data to Your Advantage
Many documents organize fields into cells, effectively placing the value you want to extract in a box. You can use this to your advantage by "snapping" the extraction zone to the lines around it. This can allow you to get close to the right zone first and then expand the zone's boundaries to the edges of the whole box.
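A simplified sketch of the snapping idea: each edge of the initial zone is pushed outward to the nearest detected line. Here lines are reduced to bare Y positions (horizontal lines) and X positions (vertical lines), as if every line spanned the whole page. Real line data has endpoints, and the `snap_to_lines` helper is hypothetical rather than Grooper's algorithm.

```python
from typing import Iterable, Tuple

Zone = Tuple[int, int, int, int]  # (left, top, right, bottom)


def snap_to_lines(zone: Zone, h_lines: Iterable[int], v_lines: Iterable[int]) -> Zone:
    """Expand each edge outward to the nearest line; keep the edge if no line is found."""
    left, top, right, bottom = zone
    h_lines = list(h_lines)
    v_lines = list(v_lines)
    new_top = max((y for y in h_lines if y <= top), default=top)
    new_bottom = min((y for y in h_lines if y >= bottom), default=bottom)
    new_left = max((x for x in v_lines if x <= left), default=left)
    new_right = min((x for x in v_lines if x >= right), default=right)
    return (new_left, new_top, new_right, new_bottom)


# Initial zone around a field label, sitting inside a larger ruled cell.
initial = (120, 310, 260, 330)
horizontal = [300, 360, 420]   # Y positions of detected horizontal lines
vertical = [100, 500]          # X positions of detected vertical lines

print(snap_to_lines(initial, horizontal, vertical))  # (100, 300, 500, 360)
```

The initial zone only needs to land somewhere inside the cell; the surrounding lines determine the final bounds.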
Prereqs - Layout Data Collection
In order to snap to lines, you must first find and save that line location information. This can be done with a Line Detection or Line Removal IP command in an IP Profile. After applying that IP Profile during an Image Processing or Recognize activity, that data will be saved to the page's "LayoutData.json" file in Grooper.
Establish the Initial Zone
|
All Location options (Fixed Region, Relative Region, Shape Region, and Text Region) have an Auto Snap property. Enabling this property allows the zone to snap to the lines surrounding it. However, you must first establish the initial extraction zone. For this example, we have set the Location property to Text Region and set the Text Extractor to a Text Pattern matching "5. E-mail Address". As you can see here, our zone returns the whole label. |
Enable Auto Snap
|
Optionally Exclude the Anchor's Text
|
|
|
Re-OCRing the Zone
One use case for Read Zone is the ability to reprocess the text within the zone with a different OCR Profile than the one originally used during the Recognize activity.
|
The document used in this tutorial uses a specialized font for the field values, the OCR-A font. While this font was originally designed for OCR, modern OCR engines often have a hard time recognizing it. However, the Tesseract engine can be trained on fonts, allowing you to improve the OCR accuracy of non-standard fonts. Training data for the OCR-A font ships with all Grooper installs (after version 2.72). Here we have a very simple secondary OCR Profile, using Tesseract OCR for the OCR Engine, and the OCRA font checked as a Special Fonts option.
|
|
Any of the four Location options can perform another OCR pass on the extraction zone. For this example, we've established our extraction zone using the Text Region Location option. The Text Extractor locates the text anchor "Docket Number" on the page. Auto Snap is enabled to expand the boundaries of the zone to the edges of the box surrounding the anchor. The anchor's text is removed by enabling the Exclude Anchor property. This should return the docket number "055-761349". However, the main OCR Profile, using Transym, did not recognize the number well, returning "055 - 7613 L 9". |
Assign the Secondary OCR Profile
|
To reprocess this portion of the document with a different OCR Profile, use the OCR Profile property to assign a secondary OCR Profile.
|
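Conceptually, re-OCRing the zone means cropping the page image to the zone and handing only that region to a second engine. The sketch below shows that flow with a plain nested-list "image" and a stand-in function in place of a real OCR engine. None of these names correspond to Grooper or Tesseract APIs, and the stand-in simply returns the value from this tutorial.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]           # rows of grayscale pixel values
Zone = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def crop(image: Image, zone: Zone) -> Image:
    """Keep only the pixels inside the extraction zone."""
    left, top, right, bottom = zone
    return [row[left:right] for row in image[top:bottom]]


def reocr_zone(image: Image, zone: Zone, secondary_ocr: Callable[[Image], str]) -> str:
    """Run the secondary engine on the cropped zone and return its cleaned-up text."""
    return secondary_ocr(crop(image, zone)).strip()


def fake_secondary_engine(region: Image) -> str:
    # Stand-in for a second OCR engine (hypothetical). A real engine would
    # recognize characters from the pixels in `region`.
    return "055-761349\n"


# Blank 80x60 page and a zone covering a 40x10 pixel area.
page = [[255] * 80 for _ in range(60)]
zone = (10, 20, 50, 30)

cropped = crop(page, zone)
print(len(cropped[0]), len(cropped))                   # 40 10
print(reocr_zone(page, zone, fake_secondary_engine))   # 055-761349
```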
Data Instancing: Using Read Zone's Value Extractor Property
While Read Zone is an option for a Data Field's Value Extractor property, you may have noticed Read Zone has its own Value Extractor sub-property! While the term "Value Extractor" usually pertains to an extractor supplying data values to a Data Field, it can be a catch-all term for any extractor that finds and returns a value.
|
As a sub-property of Read Zone, the Value Extractor property allows a Text Pattern or Reference extractor to parse data from only the extraction zone's data instance. Instead of matching and returning data against the whole document, this Value Extractor will only match and return data inside the extraction zone. For example, this Data Field uses Read Zone to return the email address in the "5. E-mail Address" field of the document. We could use the Value Extractor of Read Zone to match only the local part of the email address (the "cfears5" in "cfears5@sitemeter.com" in this case). |
|
|
The Value Extractor can be set to either Text Pattern or Reference. In this case, we chose Text Pattern and used a simple regex to match everything up to the "@" symbol. Notice that when we enter this pattern into the Value Pattern of the Pattern Editor, we do not match just the local part of that email address. Rather, we get three results matching a great deal of the document. That is because the Pattern Editor is currently matching against the document's data instance. However, when the extractor actually runs, this pattern will execute against the zone's data instance, which in this case only contains the text "cfears5@sitemeter.com". |
|
|
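The difference between matching against the document's data instance and the zone's data instance can be illustrated with Python's `re` module. The pattern below is a hypothetical stand-in for "everything up to the @ symbol" and is not the pattern used in the tutorial. The document text is also abbreviated for illustration.

```python
import re

# Hypothetical pattern: one or more non-"@" characters that are followed by "@".
LOCAL_PART = re.compile(r"[^@]+(?=@)")

# Abbreviated stand-in for the full document text.
document_text = "5. E-mail Address\ncfears5@sitemeter.com\n6. Phone"

# Text inside the extraction zone only.
zone_text = "cfears5@sitemeter.com"

# Against the whole document, the match swallows the label and line break too.
print(repr(LOCAL_PART.search(document_text).group(0)))
# '5. E-mail Address\ncfears5'

# Against the zone's text, the same pattern returns just the local part.
print(LOCAL_PART.search(zone_text).group(0))
# cfears5
```

The pattern itself does not change; only the text it runs against does, which is why a loose pattern can still be precise inside a zone.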
Differences Between Read Zone and OCR Reader
In a lot of ways, Read Zone and the OCR Reader Post Processing option for Data Types are very similar to each other. Both can use text anchors and extraction zones to return data inside the drawn boundary on the page.
Both have a Value Extractor property to parse data within the zone's data instance as well. However, the results are collated and returned in one drastically different way.
- Read Zone will return only the first match returned.
- OCR Reader will return all matches concatenated together.
Neither behavior is inherently better. Which one you use depends on your needs.
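The two collation behaviors can be mimicked with a short sketch: one function returns only the first match inside the zone, the other joins every match. The function names, the pattern, and the single-space separator are assumptions, since the text above does not specify how the matches are joined.

```python
import re
from typing import Optional


def first_match(pattern: str, zone_text: str) -> Optional[str]:
    """Mimics the Read Zone behavior described above: only the first match is returned."""
    match = re.search(pattern, zone_text)
    return match.group(0) if match else None


def all_matches_joined(pattern: str, zone_text: str, sep: str = " ") -> str:
    """Mimics the OCR Reader behavior described above: every match, concatenated.

    The separator is an assumption made for this sketch.
    """
    return sep.join(m.group(0) for m in re.finditer(pattern, zone_text))


zone_text = "cfears5@sitemeter.com"
pattern = r"[a-z0-9]+"  # hypothetical pattern producing several matches in the zone

print(first_match(pattern, zone_text))         # cfears5
print(all_matches_joined(pattern, zone_text))  # cfears5 sitemeter com
```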
|
For example, we can create a Data Type, set the Post Processing property to OCR Reader, and configure it to produce an extraction zone covering the "5. E-mail Address" field, seen here. It returns the full email "cfears5@sitemeter.com" just like our example above. |
|
|
|
|
Version Differences
Read Zone is a new Value Extractor option available to Data Fields in Grooper Version 2.90. In version 2.80, similar functionality could be achieved via the Anchored Extract and Zonal Extract options. In versions older than 2.80, similar functionality could be achieved using "Data Element Profiles" of Document Type objects.














































