2.90:Read Zone (Value Extractor)
Read Zone allows you to extract text data in a rectangular region (called an "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text from the same location on every document, or a zone positioned relative to an extracted text anchor or shape location on the document.
Read Zone is a Value Extractor option available to Data Fields in a Data Model.
About
Highly structured documents organize information into a series of data fields. These fields will have a label identifying what the field contains, such as "Name", and a corresponding value, such as "John Doe". While the values for these fields will change from document to document, their position on the document will remain constant.
The Read Zone extractor extracts data using this feature of document layouts. As long as you can be reasonably assured the data you want to find will be in the same spot from document to document, you don't necessarily need anything fancier than extracting whatever text is in that known location.
Read Zone populates data in Data Fields by drawing a rectangle on a location on a page. Whatever text was obtained from the Recognize activity (either via OCR or native text extraction) that falls within the boundaries of that rectangle (or "zone") populates the Data Field.
| For the zone drawn on the document... | ...the text data falling within that zone will be extracted. |
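The idea of a zone can be sketched in a few lines of Python. This is only an illustration, not Grooper's implementation: the `Rect` and `Word` types, the `read_zone` function, the coordinates, and the rule that a word must lie fully inside the zone to be extracted are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page units (hypothetical type, inches here)."""
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height

    def contains(self, other: "Rect") -> bool:
        # Assumption: a word counts only if its whole box is inside the zone.
        return (self.left <= other.left and self.top <= other.top
                and other.right <= self.right and other.bottom <= self.bottom)


@dataclass(frozen=True)
class Word:
    """One recognized word and where it sits on the page (hypothetical type)."""
    text: str
    box: Rect


def read_zone(words: list[Word], zone: Rect) -> str:
    """Join, in reading order, the words whose boxes fall inside the zone."""
    inside = [w for w in words if zone.contains(w.box)]
    inside.sort(key=lambda w: (w.box.top, w.box.left))
    return " ".join(w.text for w in inside)


# Made-up word positions standing in for the text layer of a page.
words = [
    Word("1.", Rect(1.0, 1.0, 0.2, 0.2)),
    Word("Last", Rect(1.25, 1.0, 0.4, 0.2)),
    Word("Name", Rect(1.7, 1.0, 0.5, 0.2)),
    Word("Cleugh", Rect(1.0, 1.3, 0.7, 0.2)),
]
zone = Rect(left=0.9, top=1.25, width=2.0, height=0.3)
print(read_zone(words, zone))  # Cleugh
```

The zone is described by Left, Top, Width, and Height, mirroring the property names used for the zone later in this article.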
Read Zone can also anchor this extraction zone to another location on the document. For example, due to issues with printing or scanning, the location of the value may shift from document to document. It's entirely possible that a zone could extract the data correctly on one document but be slightly off on another.
| The margins here are different from the document above... | ...resulting in the wrong extracted data. |
Several configuration options allow you to place the extraction zone relative to another piece of information. This serves as an "anchor" for the zone. Instead of a fixed position on all documents, the zone is placed relative to this anchor's position. For example, in this case the label "1. Last Name" could be an anchor. If you can pattern match that field label with a regular expression, the zone you draw on the document will extract the value relative to that label's position.
| Anchored off the field label... | ...the zone falls on the right page location. |
| FYI | Read Zone is new to version 2.90. Similar functionality was performed by Zonal Extract and Anchored Extract in version 2.80 or using "Data Element Profiles" in older versions. |
How To
Enable Read Zone
Read Zone is an option for the Value Extractor property of a Data Field.
The Read Zone extractor has four Location property options. You must choose one of these options in order for Read Zone to function.
Each one has slightly different functionality and configurations. The four Location options are Fixed Region, Relative Region, Shape Region, and Text Region.
Each option is detailed in the How To sections below.
Fixed Region
The Fixed Region option is the simplest to set up. As the name implies, the extraction zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box where you want to extract data.
Draw The Zone Bounds
You will place a green box on the page. Any text falling within this box will be extracted. You can move the box around the page and use the transform controls on the corners and edges of the box to edit its width and height (or set the Left, Top, Width, and Height properties directly).
Test Extraction
Success! The last name "Cleugh" is extracted from the OCR text of this document.
A Word Of Caution
Remember, the Fixed Region location's extraction zone stays in the same physical location on the page from document to document. If the text you're trying to extract shifts locations due to scanning irregularities or a new document format, this method has the potential to extract the wrong data. For example, take the two documents here. They are the same document, but one has very different margins from the other. While the extraction zone we configured earlier falls on the last name on the document on the left, it does not on the document on the right.
Whatever text falls within that extraction zone is extracted. As you can see, for the second document, the text "1. Last Name" is extracted instead of "Cleugh". If your documents are not totally uniform and you're running into issues like this, you may want to explore the other Location options detailed in the tutorials below.
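The failure described here can be reproduced with a small Python sketch. Nothing below is Grooper code: the `words_in_zone` and `shift` helpers, the tuple layout, and every coordinate are hypothetical, chosen so that a 0.3 inch vertical shift moves the label into the zone where the value used to be.

```python
def words_in_zone(words, zone):
    """Return the text of words lying entirely inside the zone.

    words: list of (text, left, top, right, bottom) tuples
    zone:  (left, top, right, bottom)
    Both layouts are assumptions made for this example.
    """
    zl, zt, zr, zb = zone
    hits = [w for w in words
            if zl <= w[1] and zt <= w[2] and w[3] <= zr and w[4] <= zb]
    hits.sort(key=lambda w: (w[2], w[1]))  # top-to-bottom, then left-to-right
    return " ".join(w[0] for w in hits)


def shift(words, dx, dy):
    """Simulate different margins by moving every word by (dx, dy)."""
    return [(t, l + dx, tp + dy, r + dx, b + dy) for t, l, tp, r, b in words]


page = [
    ("1.", 1.0, 1.0, 1.2, 1.2),
    ("Last", 1.25, 1.0, 1.65, 1.2),
    ("Name", 1.7, 1.0, 2.2, 1.2),
    ("Cleugh", 1.0, 1.3, 1.7, 1.5),
]
fixed_zone = (0.9, 1.25, 2.9, 1.55)  # same coordinates for every document

print(words_in_zone(page, fixed_zone))     # Cleugh

shifted = shift(page, dx=0.0, dy=0.3)      # content sits lower on this scan
print(words_in_zone(shifted, fixed_zone))  # 1. Last Name
```

The zone never moved. Only the content did, which is exactly why a fixed zone is reliable only when the documents are uniform.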
Relative Region
Instead of setting the extraction zone in a fixed location for every document, the Relative Region mode will anchor the zone to a text label on the document. Its position will change relative to the label's position on the document, but will still have the same drawn dimensions.
Set the Label Extractor
The first thing you must do is return a label on the document with an extractor. This is the "anchor" for the extraction zone.
Regardless of whether you use a Text Pattern or a Reference extractor, your goal will be the same. You want to return some field label identifying the value you want to extract. If we want to ultimately extract the first name from these documents, we know that name will be next to the label "2. First Name". Even if the position of the text in that field changes due to irregularities like we've seen before, the value's location (here "Anissa") should stay more or less the same relative to the label "2. First Name". In this case, a simple regular expression in the Value Pattern will match that label's text.
Note: This pattern also produces matches on the second and third pages of this document. In our case, this will not matter. The relative distance between the label and the value is functionally the same on each page and produces the same result, and the first value on the first page is what will be returned to our Data Model. However, be aware of this as a potential issue. You may need to narrow down your results to the proper label using various extraction techniques, such as a page filter.
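To make the multiple-match caveat concrete, here is a small sketch using Python's standard `re` module rather than Grooper's own extractors. The page texts for pages 2 and 3, the `find_label` helper, and its `allowed_pages` parameter are invented for illustration. The pattern is one plausible way to match the label, not necessarily the one used in the screenshots.

```python
import re

# Hypothetical page text; only the label and value on page 1 come from the article.
pages = [
    "1. Last Name\nCleugh\n2. First Name\nAnissa",
    "Continuation sheet\n2. First Name\nAnissa",
    "Signature page\n2. First Name\nAnissa",
]

# Escape the period so it matches a literal "." and allow flexible spacing.
label_pattern = re.compile(r"2\.\s*First Name")


def find_label(pages, pattern, allowed_pages=None):
    """Return (page_number, start_offset, matched_text) for every label hit.

    allowed_pages is a stand-in for a page filter: when given, hits on
    other pages are ignored.
    """
    hits = []
    for page_no, text in enumerate(pages, start=1):
        if allowed_pages is not None and page_no not in allowed_pages:
            continue
        for match in pattern.finditer(text):
            hits.append((page_no, match.start(), match.group()))
    return hits


all_hits = find_label(pages, label_pattern)
print([h[0] for h in all_hits])          # [1, 2, 3]

first_page_only = find_label(pages, label_pattern, allowed_pages={1})
print([h[0] for h in first_page_only])   # [1]
```

Restricting the search to page 1 leaves a single anchor, which removes any ambiguity about which label the zone is positioned against.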
Set the Label Location
Once the anchor's text is returned, Grooper has positional coordinates for the anchor. Next, you must set where the anchor is located in relation to the extraction zone.
Set the Zone Bounds
Next, just like for the Fixed Region mode, you must draw a box for the extraction zone.
Test Extraction
These are the bare minimum requirements for setting up the Relative Region mode's extraction. This will properly extract the First Name field for both the document with normal margins...
...and the one with the abnormal margins. The extraction zone is no longer a fixed location from document to document, but is placed relative to the text anchor's location. The anchor is outlined in blue on the document. Note: The Output Full Region property here is set to True, displaying the full extraction zone in green on the document.
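The anchoring behavior can be sketched as follows. This is a simplified model under stated assumptions, not Grooper's algorithm: the zone is stored as an offset from the label's top-left corner plus a fixed size, and `relative_zone`, `words_in_zone`, `extract_relative`, and all coordinates are hypothetical.

```python
def relative_zone(label_box, offset, size):
    """Place a zone of fixed size at a fixed offset from the label's top-left.

    label_box: (left, top, right, bottom) of the anchor text
    offset:    (dx, dy) from the label's top-left to the zone's top-left
    size:      (width, height) of the zone
    """
    left = label_box[0] + offset[0]
    top = label_box[1] + offset[1]
    return (left, top, left + size[0], top + size[1])


def words_in_zone(words, zone):
    """Return the text of (text, left, top, right, bottom) words inside the zone."""
    zl, zt, zr, zb = zone
    hits = [w for w in words
            if zl <= w[1] and zt <= w[2] and w[3] <= zr and w[4] <= zb]
    hits.sort(key=lambda w: (w[2], w[1]))
    return " ".join(w[0] for w in hits)


def extract_relative(words, label_text, offset, size):
    """Find the label (here by exact text), then read the zone anchored to it."""
    label = next(w for w in words if w[0] == label_text)
    return words_in_zone(words, relative_zone(label[1:], offset, size))


OFFSET = (-0.1, 0.25)  # zone starts slightly left of and below the label
SIZE = (2.0, 0.3)      # drawn dimensions stay the same on every document

normal_doc = [("2. First Name", 3.0, 1.0, 4.2, 1.2), ("Anissa", 3.0, 1.3, 3.7, 1.5)]
shifted_doc = [("2. First Name", 3.4, 1.3, 4.6, 1.5), ("Anissa", 3.4, 1.6, 4.1, 1.8)]

for doc in (normal_doc, shifted_doc):
    print(extract_relative(doc, "2. First Name", OFFSET, SIZE))  # Anissa (both times)
```

Because the zone is computed from wherever the label is found, the different margins on the second document no longer matter.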
Shape Region
Text Region
The Text Region option creates an extraction zone using the logical boundaries of an extraction result. This can just return all the text falling within the boundaries of the rectangle around the extractor's result.
This can also be configured to provide results in a way similar to the Relative Region option, using text anchors located by an extractor to position the extraction zone. This means both methods can be used to position the zone relative to a point that moves from document to document. The main difference is in how the zone is drawn.
Set the Text Extractor
The first thing you must do is return a text label on the document with an extractor. This is the "anchor" for the extraction zone.
Regardless of whether you use a Text Pattern or a Reference extractor, the goal will be the same. For this tutorial, we want to return some field label identifying the value we want to extract. If we want to ultimately extract the date of birth value from these documents, we know the birthdate will be next to the label "4. Birth Date". Even if the position of the text in that field changes due to irregularities like we've seen before, the value's location (here "3/27/95") should stay more or less the same relative to the label "4. Birth Date". In this case, a simple regular expression in the Value Pattern will match that label's text.
If you press the "Test Extraction" button, you'll see where the Text Region option starts to differ from the Relative Region option. With no further configuration, the extraction zone drawn is simply the boundaries of the Text Extractor result. So, we see "4. Birth Date" populating the field. If we want to use this as a positional anchor to find the actual Birth Date "3/27/95", we must configure the Anchor Point, Translation, and Adjustment properties.
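A short sketch shows why the unconfigured zone returns the label itself. It is a model built on assumptions, not Grooper's code: the extractor result is represented as the words that make up the label, the zone is the bounding box around them, and `bounding_box`, `words_in_zone`, and the coordinates are hypothetical.

```python
def bounding_box(boxes):
    """Smallest (left, top, right, bottom) rectangle enclosing all given boxes."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


def words_in_zone(words, zone):
    """Return the text of (text, left, top, right, bottom) words inside the zone."""
    zl, zt, zr, zb = zone
    hits = [w for w in words
            if zl <= w[1] and zt <= w[2] and w[3] <= zr and w[4] <= zb]
    hits.sort(key=lambda w: (w[2], w[1]))
    return " ".join(w[0] for w in hits)


words = [
    ("4.", 5.0, 1.0, 5.2, 1.2),
    ("Birth", 5.25, 1.0, 5.75, 1.2),
    ("Date", 5.8, 1.0, 6.2, 1.2),
    ("3/27/95", 5.0, 1.3, 5.7, 1.5),
]

# Pretend the text extractor matched the first three words (the label).
label_words = words[:3]
zone = bounding_box([w[1:] for w in label_words])

print(zone)                        # (5.0, 1.0, 6.2, 1.2)
print(words_in_zone(words, zone))  # 4. Birth Date
```

The value "3/27/95" sits below that box, so the zone has to be moved or resized before it can capture the birth date.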
Adjusting the Anchor Point of the Zone
There are a variety of ways we could configure the extraction zone to extract what we want. First, we will look at the Anchor Point property and its effects on the extraction zone. Similar to Relative Region, we can make a relative anchor point from the text extractor's result.
This will give us two new properties: Move To and Size. The Move To property will move the extraction zone from whatever anchor point you selected to a new position within the text extractor's result boundaries.
However, the zone is too small to extract anything.
The Size property allows you to alter the size of the extraction zone.
Adjusting the Translation and Adjustment Properties
You can accomplish the same goal using the Translation and Adjustment properties. These properties also manipulate the size and location of the extraction zone, just in different ways. The Translation property will move the extraction zone (supplied by the text extractor) along the X and/or Y axis of the page. Logically, we want the extraction zone to be slightly below the label "4. Birth Date".
The Adjustment property will adjust the size of the extraction zone. Here, you can adjust the Left, Right, Top, and Bottom edges of the extraction zone.
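The combined effect of the two properties can be sketched as two small geometric operations. This is a hedged illustration only: `translate`, `adjust`, and `words_in_zone` are hypothetical helpers, the coordinates are made up, and the sign convention (positive adjustment values push an edge outward) is an assumption rather than documented Grooper behavior.

```python
def translate(box, dx=0.0, dy=0.0):
    """Move a (left, top, right, bottom) box along the X and/or Y axis."""
    left, top, right, bottom = box
    return (left + dx, top + dy, right + dx, bottom + dy)


def adjust(box, left=0.0, top=0.0, right=0.0, bottom=0.0):
    """Resize a box edge by edge. Assumption: positive values grow the box."""
    l, t, r, b = box
    return (l - left, t - top, r + right, b + bottom)


def words_in_zone(words, zone):
    """Return the text of (text, left, top, right, bottom) words inside the zone."""
    zl, zt, zr, zb = zone
    hits = [w for w in words
            if zl <= w[1] and zt <= w[2] and w[3] <= zr and w[4] <= zb]
    hits.sort(key=lambda w: (w[2], w[1]))
    return " ".join(w[0] for w in hits)


words = [
    ("4.", 5.0, 1.0, 5.2, 1.2),
    ("Birth", 5.25, 1.0, 5.75, 1.2),
    ("Date", 5.8, 1.0, 6.2, 1.2),
    ("3/27/95", 5.0, 1.3, 5.7, 1.5),
]

label_zone = (5.0, 1.0, 6.2, 1.2)        # boundaries of the extractor result

moved = translate(label_zone, dy=0.3)    # slide the zone down onto the value
padded = adjust(moved, left=0.05, top=0.05, right=0.05, bottom=0.05)

print(words_in_zone(words, padded))      # 3/27/95
```

Translation changes only where the zone sits, while Adjustment changes how much area it covers. A small outward adjustment gives the zone some tolerance for values that are wider or slightly offset from the label.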
Auto Snap - Using Line Layout Data to Your Advantage
Re-OCRing the Zone
Version Differences
Read Zone is a new Value Extractor option available to Data Fields in Grooper Version 2.90. In version 2.80, similar functionality could be achieved via the Anchored Extract and Zonal Extract options. In versions older than 2.80, similar functionality could be achieved using "Data Element Profiles" of Document Type objects.