2.90:Read Zone (Value Extractor): Difference between revisions

Revision as of 08:38, 12 October 2020

Read Zone allows you to extract text data in a rectangular region (called an "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text from the same location on every document, or a zone positioned relative to an extracted text anchor or shape location on the document.

Read Zone is a Value Extractor option available to Data Fields in a Data Model.

About

Highly structured documents organize information into a series of data fields. These fields will have a label identifying what the field contains, such as "Name", and a corresponding value, such as "John Doe". While the values for these fields will change from document to document, their position on the document will remain constant.

The Read Zone extractor extracts data using this feature of document layouts.

As long as you can be reasonably assured the data you want to find will be in the same spot from document to document, you don't necessarily need anything fancier than extracting whatever text is in that known location.

Read Zone populates data in Data Fields by drawing a rectangle on a location on a page. Whatever text was obtained from the Recognize activity (either via OCR or native text extraction) that falls within the boundaries of that rectangle (or "zone") populates the Data Field.

For the zone drawn on the document...	...the text data falling within that zone will be extracted.
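The idea can be illustrated with a minimal Python sketch. This is not Grooper's implementation or API. The `Rect`, `Word`, and `read_zone` names, the pixel coordinates, and the sample value "Doe" are all hypothetical, and the sketch assumes a word counts as "inside" the zone only when its whole bounding box is contained by the zone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical rectangle in page pixels (not a Grooper type)."""
    left: int
    top: int
    width: int
    height: int

    @property
    def right(self) -> int:
        return self.left + self.width

    @property
    def bottom(self) -> int:
        return self.top + self.height

    def contains(self, other: "Rect") -> bool:
        # True only when `other` lies entirely inside this rectangle.
        return (self.left <= other.left and self.top <= other.top
                and other.right <= self.right and other.bottom <= self.bottom)


@dataclass(frozen=True)
class Word:
    """One recognized word and where it sits on the page."""
    text: str
    box: Rect


def read_zone(words, zone):
    """Join the text of every word whose box falls inside the zone."""
    inside = [w for w in words if zone.contains(w.box)]
    # Approximate reading order: top to bottom, then left to right.
    inside.sort(key=lambda w: (w.box.top, w.box.left))
    return " ".join(w.text for w in inside)


# Made-up recognition results: a field label with its value underneath.
words = [
    Word("1.", Rect(100, 100, 20, 20)),
    Word("Last", Rect(125, 100, 40, 20)),
    Word("Name", Rect(170, 100, 50, 20)),
    Word("Doe", Rect(100, 130, 40, 20)),
]
zone = Rect(95, 125, 300, 35)  # drawn around the value only
print(read_zone(words, zone))
```

Here the label words sit above the zone, so only the value is returned.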

Read Zone can also anchor this extraction zone to another location on the document. For example, due to issues with printing or scanning, the location of the value may shift from document to document. It's entirely possible that the zone could extract the data correctly on one document but be slightly off on another.

The margins here are different from the document above...	...resulting in the wrong extracted data.

Several configuration options allow you to place the extraction zone relative to another piece of information. This serves as an "anchor" for the zone. Instead of a fixed position on all documents, the zone is placed relative to this anchor's position. For example, in this case the label "1. Last Name" could be an anchor. If you can pattern match that field label with a regular expression, the zone you draw on the document will extract the value relative to that label's position.

Anchored off the field label...	...the zone falls on the right page location.
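A rough Python sketch of the anchoring idea follows. It is an illustration under stated assumptions, not how Grooper is implemented: `Segment`, `read_relative_zone`, the offsets, and the sample value "Doe" are hypothetical, and each label or value is treated as a single text segment with a pixel position.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical run of recognized text with its page position in pixels."""
    text: str
    left: int
    top: int
    width: int
    height: int


def read_relative_zone(segments, anchor_pattern, dx, dy, width, height):
    """Place a zone at an offset from the anchor and return the text inside it."""
    anchor = next(
        (s for s in segments if re.fullmatch(anchor_pattern, s.text)), None
    )
    if anchor is None:
        return None  # no anchor found, so the zone cannot be placed
    z_left, z_top = anchor.left + dx, anchor.top + dy
    z_right, z_bottom = z_left + width, z_top + height
    hits = [
        s for s in segments
        if s is not anchor
        and z_left <= s.left and z_top <= s.top
        and s.left + s.width <= z_right and s.top + s.height <= z_bottom
    ]
    return " ".join(s.text for s in hits)


LABEL = r"1\.\s*Last\s+Name"

# The same form twice; the second scan is shifted right and down.
doc_a = [Segment("1. Last Name", 100, 100, 120, 20),
         Segment("Doe", 100, 130, 40, 20)]
doc_b = [Segment("1. Last Name", 140, 125, 120, 20),
         Segment("Doe", 140, 155, 40, 20)]

for doc in (doc_a, doc_b):
    print(read_relative_zone(doc, LABEL, dx=-5, dy=25, width=300, height=35))
```

Because the zone is measured from wherever the label is found, both scans yield the same value even though the margins differ.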

FYI

Read Zone is new to version 2.90. Similar functionality was provided by Zonal Extract and Anchored Extract in version 2.80, or by "Data Element Profiles" in older versions.

How To

Enable Read Zone

Read Zone is an option for the Value Extractor property of a Data Field.

To use this extractor, select a Data Field in a Data Model.
Select the Value Extractor property.
Choose Read Zone from the dropdown menu.

The Read Zone extractor has four Location property options. You must choose one of these options in order for Read Zone to function.

Expand the Read Zone sub-properties.
Choose your Location option.

Each one has slightly different functionality and configurations. The four Location options are as follows:

Fixed Region
Relative Region
Shape Region
Text Region

Each option is detailed in the How To sections below.

Fixed Region

Draw The Zone

The Fixed Region option is the simplest to set up. As the name implies, the extraction zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box where you want to extract data.

Expand out the Location sub-properties and select the Bounds property.
Press the ellipsis button at the end.
This will bring up the "Edit Zone" window. Press the "Select Region" button, if it is not already selected.
With your mouse, draw a box around the text you want to select. Remember, any text falling inside of this box will be extracted. Any outside of the box will be missed. Make sure your box is the appropriate size to capture all field values for this document.

You will place a green box on the page. Any text falling within this box will be extracted. You can move the box around the page and use the transform controls on the corners and edges of the box to edit its width and height. You can also enter values directly in the Left, Top, Width, and Height properties.

Press the "Ok" button when finished placing the zone.
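To make the relationship between a drawn box and the Left, Top, Width, and Height properties concrete, here is a small Python sketch. The `bounds_from_drag` function is hypothetical and the units are arbitrary; it only assumes that Left and Top describe the upper-left corner of the box, whichever direction the mouse was dragged.

```python
def bounds_from_drag(x0, y0, x1, y1):
    """Turn two drag corners into Left/Top/Width/Height values.

    Hypothetical helper for illustration; the corners may be given in any order.
    """
    return {
        "Left": min(x0, x1),
        "Top": min(y0, y1),
        "Width": abs(x1 - x0),
        "Height": abs(y1 - y0),
    }


# Dragging from the lower-right corner up to the upper-left corner
# describes the same box as dragging the other way.
print(bounds_from_drag(300, 200, 100, 150))
print(bounds_from_drag(100, 150, 300, 200))
```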

Test Extraction

With a Document Folder selected, press the Test Extraction button to verify our results.

Success! The last name "Cleaugh" is extracted from the OCR text of this document.

Notice the green box around "Cleaugh" on the page only extends to the size of the text value extracted, not the full zone you drew earlier. This can lead to some confusion about what is or is not being extracted, and why, while testing your Read Zone configurations. When configuring Read Zone, it can be useful to see the full size of the box you drew. That is why the Output Full Region property exists.

Set the Output Full Region property to True.
Press the "Test Extraction" button again.
This changes absolutely nothing in terms of what data is extracted, but can be useful in your configuration testing. We will keep this property set to True for this example and the other Location option examples.
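The difference between the two display modes can be sketched in Python. This is an assumption-laden illustration rather than Grooper's code: the `extract` function, its `output_full_region` flag, and the tuple layout `(left, top, right, bottom)` are hypothetical stand-ins for the behavior described above.

```python
def extract(words, zone, output_full_region=False):
    """Return (text, highlight_box) for the words fully inside the zone.

    `words` is a list of (text, (left, top, right, bottom)) tuples.
    The extracted text is the same either way; only the reported box differs.
    """
    z_left, z_top, z_right, z_bottom = zone
    hits = [
        (text, box) for text, box in words
        if z_left <= box[0] and z_top <= box[1]
        and box[2] <= z_right and box[3] <= z_bottom
    ]
    text = " ".join(t for t, _ in hits)
    if output_full_region:
        return text, zone  # show the whole drawn zone
    if not hits:
        return text, None  # nothing extracted, nothing to highlight
    # Otherwise shrink the box to just cover the extracted words.
    tight = (
        min(b[0] for _, b in hits),
        min(b[1] for _, b in hits),
        max(b[2] for _, b in hits),
        max(b[3] for _, b in hits),
    )
    return text, tight


words = [("Cleaugh", (100, 130, 180, 150))]
zone = (95, 125, 395, 160)
print(extract(words, zone))
print(extract(words, zone, output_full_region=True))
```

Both calls return the same text. Only the highlighted box changes, which is why the setting is safe to leave on while testing.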

A Word Of Caution

Remember, the Fixed Region location's extraction zone stays in the same physical location on the page from document to document. If the text you're trying to extract shifts location due to scanning irregularities or a new document format, this method has the potential to extract the wrong data.

For example, take the two documents here. They are the same document, but one has very different margins than the other. While the extraction zone we configured earlier falls on the last name on the left, it does not on the document on the right.

Whatever text falls within that extraction zone is extracted. As you can see, for the second document, the text "1. Last Name" is extracted, instead of "Cleugh".

If your documents are not totally uniform and you're running into issues like this, you may want to explore the other Location options detailed in the tutorials below.
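The failure mode described above can be reproduced with a short Python sketch. The coordinates, the `read_fixed_zone` and `shift` helpers, and the size of the margin shift are all made up for illustration. The only point is that a zone with fixed coordinates returns whatever text happens to land inside it.

```python
def read_fixed_zone(segments, zone):
    """Return text of segments fully inside a zone with fixed page coordinates.

    `segments` is a list of (text, (left, top, right, bottom)) tuples.
    """
    z_left, z_top, z_right, z_bottom = zone
    return " ".join(
        text for text, (left, top, right, bottom) in segments
        if z_left <= left and z_top <= top
        and right <= z_right and bottom <= z_bottom
    )


def shift(segments, dx, dy):
    """Simulate different margins by moving every segment by (dx, dy)."""
    return [
        (text, (left + dx, top + dy, right + dx, bottom + dy))
        for text, (left, top, right, bottom) in segments
    ]


ZONE = (95, 125, 395, 160)  # drawn around the value on the first document

original = [
    ("1. Last Name", (100, 100, 220, 120)),
    ("Cleugh", (100, 130, 180, 150)),
]
shifted = shift(original, 0, 28)  # same form, larger top margin

print(read_fixed_zone(original, ZONE))
print(read_fixed_zone(shifted, ZONE))
```

On the shifted copy the label has moved into the zone and the value has moved out of it, so the label is returned instead of the name.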

@@ Line 25: / Line 25: @@
 ''Read Zone'' populates data in '''Data Fields''' by drawing a rectangle on a location on a page.  Whatever text was obtained from the '''[[Recognize]]''' activity (either via OCR or native text extraction) that falls within the boundaries of that rectangle (or "zone") populates the '''Data Field'''.
-{|cellpadding=10 cellpadding=5
+{|cellpadding=10 cellpadding=5 style="margin:auto"
 |-style="text-align:center"
 |For the zone drawn on the document...||...the text data falling within that zone will be extracted.
@@ Line 37: / Line 37: @@
 ''Read Zone'' also has the capability to anchor this extraction zone to another location on the document.  For example, due to issues with printing or scanning, the location of the value may shift from document to document.  It's more than possible that zone could extract the data fine on one document but be slightly off on another.
-{|cellpadding=10 cellpadding=5
+{|cellpadding=10 cellpadding=5 style="margin:auto"
 |-style="text-align:center"
 |The margins here are different from the document above...||...resulting in the wrong extracted data.
@@ Line 49: / Line 49: @@
 Several configuration options allow you to place the extraction zone relative to another piece of information.  This serves as an "anchor" for the zone.  Instead of a fixed position on all documents, the zone is placed relative to this anchor's position.  For example, in this case the label "1. Last Name" could be an anchor.  If you can pattern match that field label with regular expression, the zone you draw on the document will extract the value relative to that label's position.
-{|cellpadding=10 cellpadding=5
+{|cellpadding=10 cellpadding=5 style="margin:auto"
 |-style="text-align:center"
 |Anchored off the field label...||...the zone falls on the right page location.
@@ Line 143: / Line 143: @@
 |}
+</tab>
+<tab name="A Word of Caution" style="margin:20px">
+=== A Word Of Caution ===
+{|cellpadding=10 cellspacing=5
+|style="width:40%" valign=top|
+Remember, the ''Fixed Region'' location's extraction zone stays in the same physical location on the page from document to document.  If the text you're trying to extract shifts location due to scanning irregularities or a new document format, this method has the potential to extract the wrong data.
+For example, take the two documents here.  They are the same document, but one has very different margins than the other.  While the extraction zone we configured earlier falls on the last name on the left, it does not on the document on the right.
+|
+[[File:Read-zone-how-to-07.png]]
+|-
+|valign=top|
+Whatever text falls within that extraction zone is extracted.  As you can see, for the second document, the text "1. Last Name" is extracted, instead of "Cleugh".
+If your documents are not totally uniform and you're running into issues like this, you may want to explore the other '''''Location''''' options detailed in the tutorials below.
+|
+[[File:Read-zone-how-to-08.png]]
+|}
 </tab>
