2.90:Read Zone (Value Extractor): Difference between revisions

From Grooper Wiki
Revision as of 14:49, 12 October 2020

Read Zone allows you to extract text data in a rectangular region (called an "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text from the same location on a document, or a zone relative to an extracted text anchor or shape location on the document.

Read Zone is a Value Extractor option available to Data Fields in a Data Model.

About

Highly structured documents organize information into a series of data fields. These fields will have a label identifying what the field contains, such as "Name", and a corresponding value, such as "John Doe". While the values for these fields will change from document to document, their position on the document will remain constant.

The Read Zone extractor extracts data using this feature of document layouts.

As long as you can be reasonably assured the data you want to find will be in the same spot from document to document, you don't necessarily need anything fancier than extracting whatever text is in that known location.

Read Zone populates data in Data Fields by drawing a rectangle on a location on a page. Whatever text was obtained from the Recognize activity (either via OCR or native text extraction) that falls within the boundaries of that rectangle (or "zone") populates the Data Field.

For the zone drawn on the document... ...the text data falling within that zone will be extracted.
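The idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Grooper's implementation. The `Rect` and `Word` types, the `read_zone` function, the coordinates (inches measured from the top left of the page), and the rule that a word must fall entirely inside the zone are all assumptions made for the example.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """A rectangle in page inches; y grows downward from the top of the page."""
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height

    def contains(self, other: "Rect") -> bool:
        """True when `other` lies entirely inside this rectangle."""
        return (self.left <= other.left and self.top <= other.top
                and other.right <= self.right and other.bottom <= self.bottom)


class Word(NamedTuple):
    """One recognized word and its position (hypothetical structure)."""
    text: str
    box: Rect


def read_zone(words: list[Word], zone: Rect) -> str:
    """Join, in the order given, the words that fall inside the zone."""
    return " ".join(w.text for w in words if zone.contains(w.box))


# Made-up word positions for a label and the value printed beneath it.
words = [
    Word("1.", Rect(1.0, 1.0, 0.125, 0.125)),
    Word("Last", Rect(1.25, 1.0, 0.25, 0.125)),
    Word("Name", Rect(1.625, 1.0, 0.375, 0.125)),
    Word("Cleaugh", Rect(1.0, 1.25, 0.75, 0.125)),
]
zone = Rect(0.875, 1.125, 2.0, 0.375)  # drawn around the value only
print(read_zone(words, zone))
```

Only the value's word box sits inside the zone, so the label words are left out. If every word on the page shifted down by a quarter inch, the same zone would catch the label instead of the value.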

Read Zone can also anchor this extraction zone to another location on the document. This matters because issues with printing or scanning can shift the location of the value from document to document. A fixed zone may extract the data correctly on one document but be slightly off on another.

The margins here are different from the document above... ...resulting in the wrong extracted data.

Several configuration options allow you to place the extraction zone relative to another piece of information. This serves as an "anchor" for the zone. Instead of a fixed position on all documents, the zone is placed relative to this anchor's position. For example, in this case the label "1. Last Name" could be an anchor. If you can pattern match that field label with a regular expression, the zone you draw on the document will extract the value relative to that label's position.

Anchored off the field label... ...the zone falls on the right page location.

FYI Read Zone is new to version 2.90. Similar functionality was performed by Zonal Extract and Anchored Extract in version 2.80 or using "Data Element Profiles" in older versions.

How To

Enable Read Zone

Read Zone is an option for the Value Extractor property of a Data Field.

  1. To use this extractor, select a Data Field in a Data Model.
  2. Select the Value Extractor property.
  3. Choose Read Zone from the dropdown menu.

The Read Zone extractor has four Location property options. You must choose one of these options in order for Read Zone to function.

  1. Expand the Read Zone sub-properties.
  2. Choose your Location option.

Each one has slightly different functionality and configurations. The four Location options are as follows:

  • Fixed Region
  • Relative Region
  • Shape Region
  • Text Region

Each option is detailed in the How To sections below.

Fixed Region

The Fixed Region option is the simplest to set up. As the name implies, the extraction zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box where you want to extract data.

Draw The Zone Bounds

  1. Expand out the Location sub-properties and select the Bounds property.
  2. Press the ellipsis button at the end.
  3. This will bring up the "Edit Zone" window. Press the "Select Region" button, if it is not already selected.
  4. With your mouse, draw a box around the text you want to select. Remember, any text falling inside of this box will be extracted. Any outside of the box will be missed. Make sure your box is the appropriate size to capture all field values for this document.

You will place a green box on the page. Any text falling within this box will be extracted. You can move the box around the page and use the transform controls on the corners and edges of the box to edit its width and height (you can also use the Left, Top, Width, and Height properties).

  1. Press the "Ok" button when finished placing the zone.

Test Extraction

  1. With a Document Folder selected, press the Test Extraction button to verify our results.

Success! The last name "Cleaugh" is extracted from the OCR text of this document.

  1. Notice the green box around "Cleaugh" on the page only extends to the size of the text value extracted. When configuring Read Zone, it can be useful to see the full size of the box you drew earlier instead. Seeing only the outline of the extracted text can cause confusion about what is or is not being extracted, and why, while testing your Read Zone configurations. That is why the Output Full Region property exists.

  1. Turn the Output Full Region property to True.
  2. Press the "Test Extraction" button again.
  3. This changes absolutely nothing in terms of what data is extracted, but can be useful in your configuration testing. We will keep this property set to True for this example and the other Location option examples.

A Word Of Caution

Remember, the Fixed Region location's extraction zone stays in the same physical location on the page from document to document. If the text you're trying to extract shifts locations due to scanning irregularities or a new document format, this method has the potential to extract the wrong data.

For example, take the two documents here. They are the same document, but one has very different margins than the other. While the extraction zone we configured earlier falls on the last name in the document on the left, it does not in the document on the right.

Whatever text falls within that extraction zone is extracted. As you can see, for the second document, the text "1. Last Name" is extracted, instead of "Cleugh".

If your documents are not totally uniform and you're running into issues like this, you may want to explore the other Location options detailed in the tutorials below.

Relative Region

Instead of setting the extraction zone in a fixed location for every document, the Relative Region mode will anchor the zone to a text label on the document. Its position will change relative to the label's position on the document, but will still have the same drawn dimensions.


Set the Label Extractor

The first thing you must do is return a label on the document with an extractor. This is the "anchor" for the extraction zone.

  1. With the Relative Region method's sub-properties expanded, expand the Anchor properties.
  2. Select the Label Extractor property. This extractor will locate the text in the document you want to anchor the extraction zone to.
    • This can be a simple internal Text Pattern or a Reference to an extractor elsewhere in the Node Tree. In our case, the label can be returned with a simple regular expression. We have chosen Text Pattern.

Regardless of whether you use a Text Pattern or a Reference extractor, your goal will be the same. You want to return some field label identifying the value you want to extract. If we want to ultimately extract the first name from these documents, we know that name will be next to the label "2. First Name".

Even if the position of the text in that field changes due to irregularities like we've seen before, the value's location (here "Anissa") should stay more or less the same relative to the label "2. First Name".

In this case, a simple regular expression in the Value Pattern will match that label's text.

2\. First Name

Note: This pattern also produces matches on the second and third pages of this document. In our case, this will not matter. The relative distance between the label and the value is functionally the same on each page and produces the same result, and the first value on the first page is what will be returned to our Data Model. However, be aware of this as a potential issue. You may need to narrow down your results to the proper label using various extraction techniques, such as a page filter.
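As a quick check of what this pattern does and does not match, here is a small sketch using Python's `re` module as a stand-in for Grooper's pattern editor. The sample page text is invented. The point it illustrates is the escaped dot: `\.` matches only a literal period, while an unescaped `.` would match any character.

```python
import re

# Invented sample text standing in for a page's recognized text.
page_text = "1. Last Name\nCleaugh\n2. First Name\nAnissa\n"

label = re.compile(r"2\. First Name")
match = label.search(page_text)
print(match.group(0))  # the label text that will anchor the zone

# Without the backslash, "." matches any character, so the pattern is looser.
loose = re.compile(r"2. First Name")
print(bool(loose.search("12x First Name")))  # matches: "x" satisfies "."
print(bool(label.search("12x First Name")))  # no match: a literal "." is required
```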

Set the Label Location

Once the anchor's text is returned, Grooper has positional coordinates for the anchor. Next, you must set where the anchor point is in relation to this anchor's result.

  1. First, you must set the Relative To property.
    • By default, this is set to TopLeft. This means the anchor point will be the top left corner of the logical boundaries of the anchor extractor's result. You can change this to a variety of other positions, such as the bottom right corner or middle center of the result.
  2. Select the Location property and press the ellipsis button at the end.
  3. This will bring up the "Select Anchor Dialog" window. In the "Results" panel, you will see the list of results returned by your anchor extractor. Select the result whose position you want to use.
  4. Notice there is a red dot in the top left corner of the anchor result selected. This is the relative anchor point that will be used for placing the extraction zone. It is in the top left corner of the result because we kept the Relative To property set to TopLeft.
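To make the Relative To idea concrete, the sketch below computes the anchor point for a result's bounding box. It is an illustration only. The `anchor_point` function is hypothetical, the box values are invented, and apart from TopLeft the position names (such as BottomRight and MiddleCenter) are assumed spellings of the positions described above.

```python
def anchor_point(left, top, width, height, relative_to="TopLeft"):
    """Return the (x, y) point on a bounding box named by `relative_to`."""
    xs = {"Left": left, "Center": left + width / 2, "Right": left + width}
    ys = {"Top": top, "Middle": top + height / 2, "Bottom": top + height}
    # Build all nine named points, e.g. "TopLeft", "MiddleCenter", "BottomRight".
    points = {y_name + x_name: (x, y)
              for y_name, y in ys.items() for x_name, x in xs.items()}
    if relative_to not in points:
        raise ValueError(f"unknown anchor position: {relative_to!r}")
    return points[relative_to]


# Invented bounding box (inches) for the label "2. First Name".
label_box = (1.0, 2.0, 1.5, 0.25)
print(anchor_point(*label_box))                  # top left corner (the default)
print(anchor_point(*label_box, "BottomRight"))
print(anchor_point(*label_box, "MiddleCenter"))
```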

Set the Zone Bounds

Next, just like for the Fixed Region mode, you must draw a box for the extraction zone.

  1. Select the Bounds property and press the ellipsis button at the end.
  2. This will bring up the "Edit Zone" window. Press the "Select Region" button, if it is not already selected.
  3. With your mouse, draw a box around the text you want to select. Remember, any text falling inside of this box will be extracted. Any outside of the box will be missed. Make sure your box is the appropriate size to capture all field values for this document.
  4. Press the "Ok" button when finished placing the zone.

Test Extraction

These are the bare minimum requirements for setting up the Relative Region mode's extraction. This will properly extract the First Name field for both the document with normal margins...

...and the one with the abnormal margins. The extraction zone is no longer a fixed location from document to document, but is placed relative to the text anchor's location. The anchor is outlined in blue on the document.

Note: The Output Full Region property here is set to True, displaying the full extraction zone in green on the document.
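The behavior can be sketched as follows. This is an illustration under stated assumptions, not Grooper's code: the `Rect` type and `relative_zone` function are hypothetical, the label positions are invented, and the zone is assumed to be stored as an offset from the anchor's top left corner plus a fixed width and height.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """A rectangle in page inches; y grows downward from the top of the page."""
    left: float
    top: float
    width: float
    height: float


def relative_zone(anchor: Rect, offset_x: float, offset_y: float,
                  width: float, height: float) -> Rect:
    """Place a fixed-size zone at a fixed offset from the anchor's top left corner."""
    return Rect(anchor.left + offset_x, anchor.top + offset_y, width, height)


# Invented positions of the same label on two scans with different margins.
label_normal = Rect(1.0, 2.0, 1.0, 0.125)
label_shifted = Rect(1.5, 2.25, 1.0, 0.125)

for label in (label_normal, label_shifted):
    # Same offset and size every time; only the anchor's position changes.
    print(relative_zone(label, 0.0, 0.25, 1.5, 0.25))
```

The zone keeps its drawn dimensions on both documents and moves by exactly as much as the label moved.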

Shape Region

Text Region

The Text Region option creates an extraction zone using the logical boundaries of an extraction result. This can just return all the text falling within the boundaries of the rectangle around the extractor's result.

This can also be configured to provide results in a similar way the Relative Region option does, using text anchors located by an extractor to position the extraction zone's location. This means both methods can be used to position the zone relative to a point from document to document. The main difference is in how the zone is drawn.

Set the Text Extractor

The first thing you must do is return a text label on the document with an extractor. This is the "anchor" for the extraction zone.

  1. With the Text Region method's sub-properties expanded, select the Text Extractor property to set an extractor. This extractor's result will draw the box around the portion of the document you want to extract.
    • This can be a simple internal Text Pattern or a Reference to an extractor elsewhere in the Node Tree.
    • In our case, we can match what we want with a simple regular expression. We have chosen Text Pattern.

Regardless of whether you use a Text Pattern or a Reference extractor, your goal will be the same. For this tutorial, we want to return some field label identifying the value we want to extract. If we want to ultimately extract the date of birth value from these documents, we know the birthdate will be next to the label "4. Birth Date".

Even if the position of the text in that field changes due to irregularities like we've seen before, the value's location (here "3/27/95") should stay more or less the same relative to the label "4. Birth Date".

In this case, a simple regular expression in the Value Pattern will match that label's text.

4\. Birth Date

If you press the "Test Extraction" button, you'll see where the Text Region option starts to differ from the Relative Region option.

With no further configuration, the extraction zone drawn is simply the boundaries of the Text Extractor result. So, we see "4. Birth Date" populating the field.

If we want to use this as a positional anchor to find the actual Birth Date "3/27/95", we must configure the Anchor Point, Translation, and Adjustment properties.

Adjusting the Anchor Point of the Zone

There are a variety of ways we could configure the extraction zone to extract what we want. First, we will look at the Anchor Point property and its effects on the extraction zone.

Similar to Relative Region we can make a relative anchor point from the text extractor's result.

  1. Select the Anchor Point property and expand the dropdown list to select an anchor point. This will give us a position on the page from which the extraction zone can be drawn.
    • We will choose TopLeft for this example.

This will give us two new properties: Move To and Size.

The Move To property will move the extraction zone from whatever anchor point you selected to a new position within the text extractor's result boundaries.

  1. Here, we have set it to BottomLeft.
  2. Upon testing extraction, you can see the extraction zone remains the same size as before, but has been moved. The original Anchor Point of TopLeft has been moved to the BottomLeft corner.

However, the zone is too small to extract anything.

The Size property allows you to alter the size of the extraction zone.

  1. Expand the Size property and change the Width to 1in and Height to 0.5in.
  2. This will create a 1 inch by 0.5 inch extraction zone, starting at the anchor point's location.
  3. The date now falls within that zone and is successfully extracted.
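The Move To and Size steps above can be sketched as simple geometry. This is a rough model, not Grooper's implementation: the `Rect` type and `move_zone` function are hypothetical, the label box is invented, and the general rule used here (the zone's Anchor Point corner is placed on the result's Move To point) is an assumption inferred from the TopLeft to BottomLeft example.

```python
from typing import NamedTuple, Optional, Tuple


class Rect(NamedTuple):
    """A rectangle in page inches; y grows downward from the top of the page."""
    left: float
    top: float
    width: float
    height: float


# Horizontal and vertical position of each named point, as a fraction of the box.
_X = {"Left": 0.0, "Center": 0.5, "Right": 1.0}
_Y = {"Top": 0.0, "Middle": 0.5, "Bottom": 1.0}


def _fractions(name: str) -> Tuple[float, float]:
    """Split a name such as "BottomLeft" into (x fraction, y fraction)."""
    for y_name, fy in _Y.items():
        x_name = name[len(y_name):]
        if name.startswith(y_name) and x_name in _X:
            return _X[x_name], fy
    raise ValueError(f"unknown point name: {name!r}")


def move_zone(result: Rect, anchor_point: str, move_to: str,
              size: Optional[Tuple[float, float]] = None) -> Rect:
    """Put the zone's `anchor_point` on the result's `move_to` point.

    With no `size`, the zone keeps the result's own width and height.
    """
    width, height = size if size is not None else (result.width, result.height)
    ax, ay = _fractions(anchor_point)
    mx, my = _fractions(move_to)
    target_x = result.left + mx * result.width
    target_y = result.top + my * result.height
    return Rect(target_x - ax * width, target_y - ay * height, width, height)


label = Rect(1.0, 2.0, 1.25, 0.25)  # invented box for "4. Birth Date"
print(move_zone(label, "TopLeft", "BottomLeft"))              # moved, same size
print(move_zone(label, "TopLeft", "BottomLeft", (1.0, 0.5)))  # moved and resized
```

In the first call the zone is the same size as the label but sits directly below it. In the second, the 1 inch by 0.5 inch size starts from that same point.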

Adjusting the Translation and Adjustment Properties

You can accomplish the same goal using the Translation and Adjustment properties. These properties also manipulate the size and location of the extraction zone, just in different ways.

The Translation property will move the extraction zone (supplied by the text extractor) along the X and/or Y axis of the page. Logically, we want the extraction zone to be slightly below the label "4. Birth Date".

  1. Expand the Translation properties and enter 0in for the X translation shift and 0.1in for the Y.
  2. Upon testing extraction, you can see we moved the box 0.1 inches down along the Y axis and no inches to the left or right along the X axis.
    • Positive values will move the zone down the Y axis and right along the X axis.
    • Negative values will move up the Y axis and left along the X axis.


Even if you do not want to shift the translation across either the X or Y axis, you must enter in a value of "0in" like we did above. You will get an error if you configure one axis but leave the other blank.

The Adjustment property will adjust the size of the extraction zone. Here, you can move the Left, Right, Top, and Bottom edges of the extraction zone individually.

  1. Expand the Adjustment properties and enter 0.5in for the Right adjustment and 0.25in for the Bottom.
  2. Upon testing extraction, you can see we expanded the box's right boundary 0.5 inches and expanded the bottom boundary 0.25 inches.
    • Positive values will expand a boundary. Negative values will shrink a boundary.
  3. Now, since the extraction zone fully overlaps the date, it is extracted properly.
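The Translation and Adjustment steps above amount to simple rectangle arithmetic, sketched below. The `Rect` type and the `translate` and `adjust` functions are hypothetical names for illustration, and the label's bounding box is invented. The sign conventions follow the text: positive translation moves right and down, and positive adjustment pushes an edge outward.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """A rectangle in page inches; y grows downward from the top of the page."""
    left: float
    top: float
    width: float
    height: float


def translate(zone: Rect, x: float = 0.0, y: float = 0.0) -> Rect:
    """Shift the zone without changing its size."""
    return Rect(zone.left + x, zone.top + y, zone.width, zone.height)


def adjust(zone: Rect, left: float = 0.0, top: float = 0.0,
           right: float = 0.0, bottom: float = 0.0) -> Rect:
    """Move each edge outward by a positive amount, inward by a negative one."""
    return Rect(zone.left - left, zone.top - top,
                zone.width + left + right, zone.height + top + bottom)


label = Rect(1.0, 2.0, 1.0, 0.125)  # invented box for "4. Birth Date"

# The values used in the steps above: X 0in, Y 0.1in, Right 0.5in, Bottom 0.25in.
zone = adjust(translate(label, x=0.0, y=0.1), right=0.5, bottom=0.25)
print(zone)
```

The resulting zone starts 0.1 inches below the label's top edge and is 0.5 inches wider and 0.25 inches taller than the label's own box.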


FYI The main difference between the Adjustment property and the Size property (seen when adjusting the Anchor Point) is that Size is an absolute change to the extraction zone's size, whereas Adjustment is a relative change. If the text extractor's result is larger on one document and smaller on another, Adjustment resizes the zone relative to the result on each document, so the zone's size varies with it. The Size property, however, normalizes the size of the extraction zone across all documents. For example, a 0.5in Right adjustment turns a 1in wide result into a 1.5in wide zone and a 1.25in wide result into a 1.75in wide zone, while a Size Width of 1.5in produces a 1.5in wide zone in both cases.

Auto Snap - Using Line Layout Data to Your Advantage

Many documents organize fields into cells, effectively placing the value you want to extract in a box. You can use this to your advantage by "snapping" the extraction zone to the lines around it. This can allow you to get close to the right zone first and then expand the zone's boundaries to the edges of the whole box.

Prereqs - Layout Data Collection

In order to snap to lines, you must first find and save that line location information. This can be done with a Line Detection or Line Removal IP command in an IP Profile. After applying that IP Profile during an Image Processing or Recognize activity, that data will be saved to the page's "LayoutData.json" file in Grooper.

Establish the Initial Zone

All Location options (Fixed Region, Relative Region, Shape Region, and Text Region) have an Auto Snap property. Enabling this property will allow snapping the zone to lines surrounding it. However, first you must establish the initial extraction zone.

For this example, we have set the Location property to Text Region. We have set the Text Extractor to a Text Pattern matching "5. E-mail Address".

You can see here, our zone returns the whole label.

Enable Auto Snap

  1. Change the Auto Snap property from Disabled to Enabled.
  2. Since we've obtained line location information saved on this page's "LayoutData.json" file, the extraction zone expands to the line boundaries. Each edge of the zone expands until a line is found, returning the contents of the entire cell.
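The snapping behavior can be sketched like this. It is a simplified model, not Grooper's algorithm: the `Rect` type and `auto_snap` function are hypothetical, the line positions are invented, and each detected line is treated as a single coordinate spanning the whole page rather than a segment with its own start and end.

```python
from typing import NamedTuple, Sequence


class Rect(NamedTuple):
    """A rectangle in page inches; y grows downward from the top of the page."""
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height


def auto_snap(zone: Rect, h_lines: Sequence[float], v_lines: Sequence[float]) -> Rect:
    """Expand each edge outward to the nearest line; keep the edge if none is found.

    `h_lines` holds the y positions of horizontal lines and `v_lines` the
    x positions of vertical lines.
    """
    left = max((x for x in v_lines if x <= zone.left), default=zone.left)
    right = min((x for x in v_lines if x >= zone.right), default=zone.right)
    top = max((y for y in h_lines if y <= zone.top), default=zone.top)
    bottom = min((y for y in h_lines if y >= zone.bottom), default=zone.bottom)
    return Rect(left, top, right - left, bottom - top)


# Invented initial zone (the label's box) and cell borders around it.
label_zone = Rect(1.125, 3.0625, 1.25, 0.125)
h_lines = [3.0, 3.5, 4.0]
v_lines = [1.0, 4.0, 7.5]
print(auto_snap(label_zone, h_lines, v_lines))
```

Starting from the label's small box, the zone grows to the cell bounded by the lines at x = 1.0 and 4.0 and y = 3.0 and 3.5.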

Optionally Exclude the Anchor's Text

  1. Notice that because the entire zone's text is returned, the result includes the anchor's text returned by the text extractor. Often, this is some kind of label, and you don't really want it returned, as is the case here. That is what the Exclude Anchor property is for.

  1. You can remove the text extracted by the anchor or text extractor by setting the Exclude Anchor property to True.
  2. This will delete the anchor's text from the result.
    • Here, this leaves us with just the email address.
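The effect of Exclude Anchor can be illustrated with plain string handling. This is only a sketch of the outcome: the `exclude_anchor` function is hypothetical, the email address is invented, and how Grooper removes the anchor internally is not shown in this article.

```python
def exclude_anchor(zone_text: str, anchor_text: str) -> str:
    """Drop the first occurrence of the anchor's text and tidy the whitespace."""
    before, found, after = zone_text.partition(anchor_text)
    if not found:
        # Anchor text is not in the zone; return the zone text unchanged
        # apart from collapsing whitespace.
        return " ".join(zone_text.split())
    return " ".join((before + " " + after).split())


# Invented contents of the snapped cell: the label followed by the value.
cell_text = "5. E-mail Address\njdoe@example.com"
print(exclude_anchor(cell_text, "5. E-mail Address"))
```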

Re-OCRing the Zone

Version Differences

Read Zone is a new Value Extractor option available to Data Fields in Grooper Version 2.90. In version 2.80, similar functionality could be achieved via the Anchored Extract and Zonal Extract options. In versions older than 2.80, similar functionality could be achieved using "Data Element Profiles" of Document Type objects.