2023:Read Zone (Value Extractor): Difference between revisions

From Grooper Wiki
Created page with "{|class="wip-box" | '''WIP''' | This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly. This tag will be removed upon draft completion. |} <blockquote> The ''Read Zone'' extractor allows you to extract text data in a rectangular region (called a "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text f..."
 
m Dgreenwood moved page 2023:Read Zone (Extractor Type) to 2023:Read Zone (Value Extractor) without leaving a redirect
 
(27 intermediate revisions by 3 users not shown)
{|class="wip-box"
{{AutoVersion}}
 
<blockquote>{{#lst:Glossary|Read Zone}}</blockquote>
 
{|class="download-box"
|
|
'''WIP'''
[[File:Asset 22@4x.png]]
|
|
This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.
You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023).  The first contains a '''Project''' with resources used in examples throughout this article. The second contains one or more '''Batches''' of sample documents.
* [[Media:2023 Wiki Read-Zone Project.zip]]
* [[Media:2023 Wiki Read-Zone Batches.zip]]
|}


This tag will be removed upon draft completion.
== About ==
|}
{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|


<blockquote>
''Read Zone'' is useful for extracting data from highly structured documents.  If a document's structure is fixed, it's going to have the same fields in the same physical location from one document to the next.
The ''Read Zone'' extractor allows you to extract text data in a rectangular region (called an "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text from the same location on a document, or a zone relative to an extracted text anchor or shape location on the document.
</blockquote>




For instance, the Application for Cow Ownership form to the right seems to be a fairly fixed form. We expect the "Birth Date" listed on the first page to be more or less in the same spot for every single Cow Ownership document. The value itself may change, but there's only so much room that this value can take up on the document.


''Read Zone'' is useful for extracting data from highly structured documents.  If a document's structure is fixed, it's going to have the same fields in the same physical location from one document to the next.  The Closing Disclosure forms we've been looking at in this article are themselves fairly fixed.  For example, the "Loan Amount" listed on the first page is more or less in the same spot for every single Closing Disclosure.  The dollar amount itself may change, but there's only so much room that amount can take up on the document.


If you can draw a rectangle around the value you want to extract, and the value falls within the boundaries of that rectangle for every single document, extraction may be as simple as just extracting the text in the rectangle's location.  This is referred to as "zonal extraction".  You draw a zone where the value exists on the page and return the text data falling in the zone.
|
[[File:2023 Read Zone - 2023 01 About 01.png|500px|center]]
|}


''Read Zone'' has a few different options for where the box is placed using the '''''Location''''' property.  This can be one of four options:


# ''Fixed Region'' - This option is the simplest to set up. As the name implies, the extraction zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box where you want to extract data.
# ''Relative Region'' - Instead of setting the extraction zone in a fixed location for every document, the ''Relative Region'' mode will anchor the zone to a text label on the document. The extraction zone's position will change relative to the label's position on the document, but will still have the same drawn dimensions.
#* This option is useful for overcoming issues that arise when scanning printed documents.  The position of a value can vary slightly when a document is printed or scanned, even for very structured documents.  This can cause problems when drawing a single fixed region for the extraction zone.  However, if you can anchor the zone to an extractable text value, the zone's position will shift according to that anchor's position.
# ''Text Region'' - The ''Text Region'' option creates an extraction zone using the logical boundaries of an extraction result. This can return all the text falling within the boundaries of the rectangle around the extractor's result.
#* This can also be configured to provide results in a similar way to the ''Relative Region'' option, using text anchors located by an extractor to position the extraction zone. This means both methods can be used to position the zone relative to a point that varies from document to document. The main difference is in how the zone is drawn.
# ''Shape Region'' - The ''Shape Region'' option is extremely similar to the ''Text Region'' option. However, instead of using text to anchor the extraction zone, it uses a shape detected from a '''Shape Detection''' or '''Shape Removal''' '''IP Command'''.
#* This is the least common method used.


The ''Read Zone'' extractor can optionally re-process the text data with an '''OCR Profile'''.  This can be used to perform custom OCR on the extracted text.
== How To ==


=== The Location Property ===
==== Fixed Region ====
This option is the simplest to set up. As the name implies, the extraction zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box where you want to extract data.
[[File:2023 Read Zone - 2023 02 How To 01 Fixed Region 01.png]]
[[File:2023 Read Zone - 2023 02 How To 01 Fixed Region 02.png]]
[[File:2023 Read Zone - 2023 02 How To 01 Fixed Region 03.png]]
[[File:2023 Read Zone - 2023 02 How To 01 Fixed Region 04.png]]
[[File:2023 Read Zone - 2023 02 How To 01 Fixed Region 05.png]]
[[File:2023 Read Zone - 2023 02 How To 01 Fixed Region 05a.png]]
[[File:2023 Read Zone - 2023 02 How To 01 Fixed Region 06.png]]
==== Relative Region ====
Instead of setting the extraction zone in a fixed location for every document, the ''Relative Region'' mode will anchor the zone to a text label on the document. The extraction zone's position will change relative to the label's position on the document, but will still have the same drawn dimensions.
This option is useful for overcoming issues that arise when scanning printed documents.  The position of a value can vary slightly when a document is printed or scanned, even for very structured documents.  This can cause problems when drawing a single fixed region for the extraction zone.  However, if you can anchor the zone to an extractable text value, the zone's position will shift according to that anchor's position.
[[File:2023 Read Zone - 2023 02 How To 02 Relative Region 01.png]]
[[File:2023 Read Zone - 2023 02 How To 02 Relative Region 02.png]]
[[File:2023 Read Zone - 2023 02 How To 02 Relative Region 03.png]]
[[File:2023 Read Zone - 2023 02 How To 02 Relative Region 04.png]]
[[File:2023 Read Zone - 2023 02 How To 02 Relative Region 05.png]]
[[File:2023 Read Zone - 2023 02 How To 02 Relative Region 06.png]]
[[File:2023 Read Zone - 2023 02 How To 02 Relative Region 07.png]]
[[File:2023 Read Zone - 2023 02 How To 02 Relative Region 08.png]]
[[File:2023 Read Zone - 2023 02 How To 02 Relative Region 09.png]]
[[File:2023 Read Zone - 2023 02 How To 02 Relative Region 10.png]]
[[File:2023 Read Zone - 2023 02 How To 02 Relative Region 11.png]]
===== Auto Snap =====
On many documents, such as the Application for Cow Ownership document we have been using in these examples so far, you will have a grid of lines enclosing the data you want to return. This can also be found in things like tables.
Grooper can use these lines as guides to determine what needs to be extracted. You can use this feature by enabling the Auto Snap property.
[[File:2023 Read Zone - 2023 02 How To 03 Auto Snap 01.png]]
[[File:2023 Read Zone - 2023 02 How To 03 Auto Snap 02.png]]
[[File:2023 Read Zone - 2023 02 How To 03 Auto Snap 03.png]]
[[File:2023 Read Zone - 2023 02 How To 03 Auto Snap 04.png]]
[[File:2023 Read Zone - 2023 02 How To 03 Auto Snap 05.png]]
==== Text Region ====
The ''Text Region'' option creates an extraction zone using the logical boundaries of an extraction result. This can return all the text falling within the boundaries of the rectangle around the extractor's result.
This can also be configured to provide results in a similar way to the ''Relative Region'' option, using text anchors located by an extractor to position the extraction zone. This means both methods can be used to position the zone relative to a point that varies from document to document. The main difference is in how the zone is drawn.
[[File:2023 Read Zone - 2023 02 How To 04 Text Region 01.png]]
[[File:2023 Read Zone - 2023 02 How To 04 Text Region 02.png]]
[[File:2023 Read Zone - 2023 02 How To 04 Text Region 03.png]]
[[File:2023 Read Zone - 2023 02 How To 04 Text Region 04.png]]
[[File:2023 Read Zone - 2023 02 How To 04 Text Region 05.png]]
[[File:2023 Read Zone - 2023 02 How To 04 Text Region 06.png]]
[[File:2023 Read Zone - 2023 02 How To 04 Text Region 07.png]]
Now, it might seem that the ''Text Region'' option does pretty much the same thing as the ''Relative Region'' option, but with more steps, and that is correct. Generally, ''Relative Region'' is the better choice, but there is one thing that ''Text Region'' can do a little more easily, and it involves '''''Auto Snap'''''.
[[File:2023 Read Zone - 2023 02 How To 04 Text Region 08.png]]
==== Shape Region ====
The Shape Region option is extremely similar to the ''Text Region'' option. However, instead of using text to anchor the extraction zone, it uses a shape detected from a '''Shape Detection''' or '''Shape Removal''' '''IP Command'''.
This is the least common method used.
[[File:2023 Read Zone - 2023 02 How To 05 Shape Region 01.png]]
[[File:2023 Read Zone - 2023 02 How To 05 Shape Region 02.png]]
[[File:2023 Read Zone - 2023 02 How To 05 Shape Region 03.png]]
[[File:2023 Read Zone - 2023 02 How To 05 Shape Region 04.png]]
[[File:2023 Read Zone - 2023 02 How To 05 Shape Region 05.png]]
[[File:2023 Read Zone - 2023 02 How To 05 Shape Region 06.png]]
Running OCR on a smaller, more specific area can give more accurate results than running OCR on the whole page. We hope to improve the OCR of these stamps by adding an '''OCR Profile''' to the '''Value Reader''' we are configuring.
[[File:2023 Read Zone - 2023 02 How To 05 Shape Region 07.png]]


{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
In this example, a '''Value Reader''' is configured to return the "Loan Amount" value as described on the first page of a Closing Disclosure form, using the ''Read Zone'' '''''Extractor Type'''''.


# ''Read Zone'' is selected as the '''''Extractor Type'''''
[[File:2023 Read Zone - 2023 02 How To 05 Shape Region 08.png]]
# For any ''Read Zone'' configuration you ''must'' configure the '''''Location''''' property.  This determines where the extraction zone is placed on each document.
#* In this case, we configured the ''Relative Region'' option.  We're using the text label "Loan Amount" as the anchor for the drawn extraction zone.  You fully configure whichever '''''Location''''' mode you choose by expanding and configuring its sub-properties.
#** The extracted anchor is seen in the "Document Viewer" outlined in blue.
#** The extraction zone is seen highlighted in green.  Any text falling within that green box is returned as the result.
# The '''''Output Full Region''''' property is very handy.  It doesn't change the result at all; it just shows the full size of the drawn zone on the page.  This is extremely useful when testing and configuring the ''Read Zone'' extractor.
#* If '''''Output Full Region''''' were set to ''False'', ''only'' the text ($ 159,432.62) would be highlighted, not the full drawn zone seen here.
|
[[File:Value-reader-extractor-types-10.png]]
|}

Latest revision as of 14:43, 27 August 2025

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

2025 | 2023 | 2021 | 2.90

Read Zone is a Value Extractor that allows you to extract text data in a rectangular region (called an "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text from the same location on a document, or a zone relative to a text value (such as a label) or a shape location on the document.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains a Project with resources used in examples throughout this article. The second contains one or more Batches of sample documents.

About

Read Zone is useful for extracting data from highly structured documents. If a document's structure is fixed, it's going to have the same fields in the same physical location from one document to the next.


For instance, the Application for Cow Ownership form to the right seems to be a fairly fixed form. We expect the "Birth Date" listed on the first page to be more or less in the same spot for every single Cow Ownership document. The value itself may change, but there's only so much room that this value can take up on the document.


If you can draw a rectangle around the value you want to extract, and the value falls within the boundaries of that rectangle for every single document, extraction may be as simple as just extracting the text in the rectangle's location. This is referred to as "zonal extraction". You draw a zone where the value exists on the page and return the text data falling in the zone.
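
As a rough illustration of the idea behind zonal extraction (not Grooper's implementation), the sketch below assumes OCR output is available as a list of words with bounding boxes measured in inches. The `Rect` and `Word` types, the `read_zone` function, and the sample values are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle; units are assumed to be inches from the page's top-left."""
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def bottom(self) -> float:
        return self.top + self.height

    def contains(self, other: "Rect") -> bool:
        # True only when `other` lies entirely inside this rectangle.
        return (self.left <= other.left and self.top <= other.top
                and other.right <= self.right and other.bottom <= self.bottom)


@dataclass(frozen=True)
class Word:
    text: str
    box: Rect


def read_zone(words, zone):
    """Return the text of every word that falls fully inside the zone."""
    inside = [w for w in words if zone.contains(w.box)]
    # Crude reading order: top-to-bottom, then left-to-right.
    inside.sort(key=lambda w: (w.box.top, w.box.left))
    return " ".join(w.text for w in inside)


# Made-up sample words standing in for a page's OCR results.
words = [
    Word("Birth", Rect(1.0, 2.0, 0.5, 0.2)),
    Word("Date", Rect(1.6, 2.0, 0.4, 0.2)),
    Word("04/12/2019", Rect(1.0, 2.3, 1.0, 0.2)),
    Word("Breed", Rect(4.0, 2.0, 0.5, 0.2)),
]
zone = Rect(0.9, 2.25, 1.5, 0.3)  # the rectangle "drawn" around the value
print(read_zone(words, zone))
```

Only the word whose box sits entirely inside the drawn rectangle is returned; the label and neighboring fields are ignored because they fall outside the zone.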


Read Zone has a few different options for where the box is placed using the Location property. This can be one of four options:

  • Fixed Region
  • Relative Region
  • Text Region
  • Shape Region

The Read Zone extractor can optionally re-process the text data with an OCR Profile. This can be used to perform custom OCR on the extracted text.

The text in the zone can also be itself extracted by a Value Extractor. This allows you to break up the document into a smaller portion and run an extractor on just the zone instead of the full document. Essentially, you use the Read Zone extractor to create a smaller data instance (from the larger document data instance) and use its Value Extractor property to return data from the smaller data instance.
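
The idea of running an extractor on only the zone's text can be sketched as follows. This is a conceptual example, not Grooper's API: the `extract_from_zone` function, the sample zone text, and the date pattern are assumptions made up for illustration.

```python
import re


def extract_from_zone(zone_text, pattern):
    """Run a regular expression against only the text read from the zone."""
    match = re.search(pattern, zone_text)
    return match.group(0) if match else None


# Hypothetical text returned from a zone that also captured the label.
zone_text = "Birth Date 04/12/2019"
DATE_PATTERN = r"\b\d{2}/\d{2}/\d{4}\b"

print(extract_from_zone(zone_text, DATE_PATTERN))
```

Because the pattern only sees the zone's text, other dates elsewhere on the page cannot produce competing matches.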

How To

The Location Property

Fixed Region

This option is the simplest to set up. As the name implies, the extraction zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box where you want to extract data.









Relative Region

Instead of setting the extraction zone in a fixed location for every document, the Relative Region mode will anchor the zone to a text label on the document. The extraction zone's position will change relative to the label's position on the document, but will still have the same drawn dimensions.

This option is useful for overcoming issues that arise when scanning printed documents. The position of a value can vary slightly when a document is printed or scanned, even for very structured documents. This can cause problems when drawing a single fixed region for the extraction zone. However, if you can anchor the zone to an extractable text value, the zone's position will shift according to that anchor's position.
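
A minimal sketch of anchoring a zone to a label follows. It assumes the label's bounding box has already been located on each page; the `Rect` type, the `place_relative_zone` function, and the offsets are hypothetical names and values for this example, not Grooper properties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    width: float
    height: float


def place_relative_zone(anchor, dx, dy, width, height):
    """Position a zone of fixed size at a fixed offset from the anchor's top-left corner."""
    return Rect(anchor.left + dx, anchor.top + dy, width, height)


# The same label found on two scans; the second scan is slightly shifted.
label_on_doc_a = Rect(1.00, 2.00, 0.9, 0.2)
label_on_doc_b = Rect(1.10, 2.05, 0.9, 0.2)

zone_a = place_relative_zone(label_on_doc_a, dx=0.0, dy=0.25, width=1.5, height=0.3)
zone_b = place_relative_zone(label_on_doc_b, dx=0.0, dy=0.25, width=1.5, height=0.3)
print(zone_a)
print(zone_b)
```

The two zones have identical dimensions, but the second one moves by the same amount the label moved, which is what lets the zone tolerate small scanning shifts.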













Auto Snap

On many documents, such as the Application for Cow Ownership document we have been using in these examples so far, you will have a grid of lines enclosing the data you want to return. This can also be found in things like tables.

Grooper can use these lines as guides to determine what needs to be extracted. You can use this feature by enabling the Auto Snap property.
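
The article does not describe how Auto Snap works internally, so the following is only one plausible reading of "using lines as guides": expand a roughly drawn zone to the grid cell that encloses its center. The `snap_to_cell` function and the line coordinates are assumptions for illustration.

```python
from bisect import bisect_right


def snap_to_cell(zone, v_lines, h_lines):
    """Snap a (left, top, right, bottom) zone to the grid cell containing its center.

    v_lines are x-positions of vertical lines; h_lines are y-positions of
    horizontal lines. If the center is not enclosed on all four sides,
    the zone is returned unchanged.
    """
    left, top, right, bottom = zone
    cx, cy = (left + right) / 2, (top + bottom) / 2
    xs, ys = sorted(v_lines), sorted(h_lines)
    i = bisect_right(xs, cx)  # index of the first vertical line right of the center
    j = bisect_right(ys, cy)  # index of the first horizontal line below the center
    if i == 0 or i == len(xs) or j == 0 or j == len(ys):
        return zone
    return (xs[i - 1], ys[j - 1], xs[i], ys[j])


# Hypothetical grid lines detected on a form, in inches.
v_lines = [0.5, 3.0, 5.5]
h_lines = [2.0, 2.6, 3.2]
rough_zone = (1.0, 2.2, 2.0, 2.5)
print(snap_to_cell(rough_zone, v_lines, h_lines))
```

Under this assumption, a loosely drawn box still returns everything in its cell, since the cell's borders rather than the drawn edges define the final zone.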







Text Region

The Text Region option creates an extraction zone using the logical boundaries of an extraction result. This can return all the text falling within the boundaries of the rectangle around the extractor's result.

This can also be configured to provide results in a similar way to the Relative Region option, using text anchors located by an extractor to position the extraction zone. This means both methods can be used to position the zone relative to a point that varies from document to document. The main difference is in how the zone is drawn.
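
To illustrate what "the rectangle around the extractor's result" could mean, the sketch below computes the smallest rectangle enclosing every word an extractor matched. This is an assumption about the concept rather than a description of Grooper's internals; `bounding_zone` and the sample boxes are hypothetical.

```python
def bounding_zone(boxes):
    """Return the smallest (left, top, right, bottom) rectangle enclosing all boxes."""
    lefts, tops, rights, bottoms = zip(*boxes)
    return (min(lefts), min(tops), max(rights), max(bottoms))


# Hypothetical boxes for the two words of a matched label, in inches.
matched_word_boxes = [
    (1.0, 2.0, 1.5, 2.2),
    (1.6, 2.0, 2.0, 2.2),
]
print(bounding_zone(matched_word_boxes))
```

The zone here is derived from the result itself instead of being drawn by hand, which is the main difference from the Relative Region approach.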









Now, it might seem that the Text Region option does pretty much the same thing as the Relative Region option, but with more steps, and that is correct. Generally, Relative Region is the better choice, but there is one thing that Text Region can do a little more easily, and it involves Auto Snap.



Shape Region

The Shape Region option is extremely similar to the Text Region option. However, instead of using text to anchor the extraction zone, it uses a shape detected from a Shape Detection or Shape Removal IP Command.

This is the least common method used.








Running OCR on a smaller, more specific area can give more accurate results than running OCR on the whole page. We hope to improve the OCR of these stamps by adding an OCR Profile to the Value Reader we are configuring.