Difference between revisions of "Value Reader - 2021"


Revision as of 15:38, 3 May 2021

2021

This article is in development for the upcoming version of Grooper, Grooper 2021. The Value Reader is a new data extraction object in 2021. This information is incomplete and/or may change by the time of release.
Graphic depicting the Grooper Value Reader

The Value Reader is a data extraction tool in Grooper. This object allows users to return values from a document in a variety of ways, including regular expression pattern matching, optical mark recognition, and barcode detection.

Value Readers are Grooper's "one stop shop" for data extraction. They return a single result or a list of results, numerical or lexical, from a page or document folder's text data (obtained via OCR or native text extraction from the Recognize activity).

About

The Value Reader is a new extraction object introduced in Grooper 2021. It is designed to expand on the extractor functionality of Grooper's regular expression pattern matching capabilities to include newer extraction capabilities, such as extracting values next to OMR (optical mark recognition) checkboxes and barcode values. In previous versions, this functionality was split across multiple objects (or properties of multiple objects). The Value Reader extractor combines these disparate functionalities into a single extractor object with increased functionality. This object forms the foundation for extracting information from a document, using a variety of different methods.

  • Do you need to extract a date? A Value Reader can do that!
  • Do you need to extract anything matching a list of values? A Value Reader can do that!
  • Do you need to extract English language unigrams (or bigrams etc)? A Value Reader can do that!
  • Do you need to extract a value from a barcode? A Value Reader can do that!
  • Do you need to extract the label next to a checked checkbox? A Value Reader can do that!

Do you need to find any value at all? You're going to use some kind of configuration of the Value Reader to do it.

Value Readers locate results using a variety of Extractor Types. The very first thing you will do when creating a Value Reader is decide which Extractor Type suits your extraction needs.

  1. With a Value Reader created and selected in the node tree...
  2. You will see the Extractor Type property at the top of the Value Reader's UI.
  3. Use the drop down menu to select which Extractor Type you wish to use.

Value-reader-01.png

User Interface

Regardless of the Extractor Type selected, the Value Reader UI can be divided into five sections:

  1. The Value Reader Tool Bar.
    • This tool bar is used to select the Extractor Type, test its configuration, and save the edited settings.
      • The "Test Single" button will run the extractor against a document folder or page selected in the "Batch Selector".
      • You can also use the Value-reader-04.png button to automatically execute the extractor every time the extractor is edited or a new document folder or page is selected.
      • The "Test All" button will run the extractor against all document folders in the Test Batch selected in the "Batch Selector". If the extractor fails to produce any result, the document folder will be flagged.
  2. The Extractor Type configuration window.
    • Here, the selected Extractor Type is configured to return data. Depending on the specific Extractor Type selected, this panel will change somewhat. Each Extractor Type has its own set of required and optional properties to extract text from a document.
  3. The "Batch Selector" window.
    • Here, a Test Batch is selected to test the extractor.
  4. The "Document Viewer" window.
    • This provides the user a visual interface with a document folder or page selected in the "Batch Selector". Notably, results are highlighted in green on the page.
    • You can also switch to the "Text Input" tab to view the text data obtained via OCR or native text extraction from the Recognize activity.
    • The "Diagnostics" tab provides additional information to the Design Studio user about the extraction. This can be a useful troubleshooting tool when configuring various Extractor Types.
  5. The "Results Panel" window.
    • This panel shows you a list of all results returned by the extractor.

Value-reader-05.png

Extractor Types

The Extractor Type options fall into one of five categories.

Category Extractor Types Comments

Text Parsing Extractors

  • Pattern Match
  • List Match
  • Word Match
  • Labeled Value
  • Read Substring
  • Field Match

These Extractor Types primarily rely on regular expressions, lists of values (such as a Lexicon of field labels), or other forms of text parsing to return values.

  • Note: This does not mean other Extractor Types do not or cannot use regular expressions or parse text as part of their functionality. Far from it. (In very general terms, Grooper's "data extraction" is itself a form of text parsing in one way or another.) These Extractor Types just rely on it more foundationally.

OMR Extractors

  • Labeled OMR
  • Ordered OMR
  • Zonal OMR

These Extractor Types allow you to return values using optical mark recognition. These are useful for extracting values on documents that use checkboxes to detail information.

Barcode Extractors

  • Find Barcode
  • Read Barcode

These Extractor Types allow you to return a value stored in a barcode.

Zonal Extractors

  • Highlight Zone
  • Read Zone

These Extractor Types are used to draw a logical rectangle somewhere on a document and return the text falling inside. These are useful for extracting values on highly structured documents where field values are consistently located in the same position on the page for every document.

The Reference extractor

  • Reference

The Reference option allows you to return the results of another extractor, whether that is a Value Reader, a Data Type, or a Field Class.

Text Parsing Extractors

Pattern Match

The Pattern Match extractor relies on regular expression (regex) pattern matching to return values. This is truly the foundation for almost all data extraction in Grooper. A regex pattern entered in the Value Pattern will run against the selected document, page, or data instance's text data. Matching results will be returned as this extractor's values.

You can also enter Prefix and Suffix Patterns to only return data if the text matched by the Value Pattern also matches a regex pattern before or after. These are useful for anchoring the value you want to return next to some other piece of text. For example, a Prefix Pattern of \n could be used to only return results at the start of a new line because the \n character precedes every new line in the text data. Furthermore, only the data matched by the Value Pattern is returned.

The Output Format allows you to alter the output result for data cleansing or other purposes.

The "Properties" tab allows you to further configure the regex extraction. Here, you can enable Tab Marking, Fuzzy RegEx mode, filter results based on page location, determine case sensitivity, use a Lexicon to perform lookup operations, and more.

In this example, a Value Reader is configured to return currency values, using the Pattern Match Extractor Type.

  1. Pattern Match is selected as the Extractor Type
  2. The Value Pattern is entered here.
    • In this case, the regex pattern \d{1,3}(,\d{3}){0,2}\.\d{2} matches decimal values from 999,999,999.99 to 0.00.
  3. The Prefix Pattern is entered here.
    • Here, the pattern matches an optional, space-padded dollar sign.
  4. The Suffix Pattern is entered here.
    • The [^%] matches anything not a percent sign, throwing out percentage values.
  5. The Output Format is formatted here.
    • Unused in this example.
  6. Properties are configured using the "Properties" tab.
    • Unused in this example.

Value-reader-extractor-types-01.png
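
The interplay of the Value, Prefix, and Suffix Patterns in this example can be approximated outside of Grooper. The following is a minimal Python sketch, not Grooper's implementation. The `pattern_match` helper and the sample text are hypothetical, and the Prefix Pattern `(?:\s?\$\s?)?` is only one assumed reading of "optional, space-padded dollar sign". The prefix and suffix are wrapped in non-capturing groups so that only the text matched by the Value Pattern is returned.

```python
import re

def pattern_match(text, value, prefix="", suffix=""):
    """Return only the text matched by `value`, and only where `prefix`
    matches immediately before it and `suffix` immediately after it."""
    combined = re.compile(f"(?:{prefix})(?P<value>{value})(?:{suffix})")
    return [m.group("value") for m in combined.finditer(text)]

# Value Pattern taken from the example above.
VALUE = r"\d{1,3}(,\d{3}){0,2}\.\d{2}"
# Assumed prefix: an optional dollar sign with optional surrounding whitespace.
PREFIX = r"(?:\s?\$\s?)?"
# Suffix from the example: any character that is not a percent sign.
SUFFIX = r"[^%]"

sample = "Subtotal $1,234.56 \nTax rate 8.25%\nCash to Close $ 80,156.74\n"
results = pattern_match(sample, VALUE, PREFIX, SUFFIX)
print(results)  # ['1,234.56', '80,156.74']
```

The percentage `8.25%` is discarded because the character after it fails the Suffix Pattern, while the dollar sign and padding never appear in the returned values.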


FYI Prior to Grooper version 2021, the Pattern Match extractor's functionality was delivered in one of two ways:
1. By the Data Format object.
2. Configuring extractor properties and selecting Internal or Text Pattern as the extractor type.

Each of these methods used a "Pattern Editor" UI screen to configure a regular expression. In version 2021, the Data Format object and the Internal and Text Pattern extractor types are gone. The Pattern Match extractor replaces their functionality to return results matching a regular expression pattern.

List Match

The List Match extractor returns values matching one or more items in a defined list. This could be used to match a list of field labels on a form, a list of company names, a list of document titles, or any other list of words or phrases. You can even enable the use of regular expression syntax to match a list of regex patterns.

Just like with Pattern Match, you can enter Prefix and Suffix Patterns to only return an item in the list if a regex pattern also matches before or after. These are useful for anchoring the value you want to return next to some other piece of text. For example, a Prefix Pattern of \n could be used to only return results at the start of a new line because the \n character precedes every new line in the text data. Furthermore, only the list item is returned, not the text matched by the Prefix and Suffix Patterns.

The Output Format allows you to alter the output result for data cleansing or other purposes.

The "Properties" tab allows you to further configure the list matching. Here, you can enable Tab Marking, Fuzzy RegEx mode, filter results based on page location, determine case sensitivity, reference a Lexicon for the list, and more.

In this example, a Value Reader is configured to return the field labels in the top portion of this Closing Disclosure form, using the List Match Extractor Type

  1. List Match is selected as the Extractor Type
  2. The list is entered in the Local Entries editor.
    • In this case, one label after another, line by line.
  3. The Prefix and Suffix Patterns are entered here.
    • Here, a \s character is used to only return items in the list if they fall between whitespace characters (\r, \n, \t, \f, or a single space character).
  4. Properties are configured using the "Properties" tab.
    • Unused in this example.

Value-reader-extractor-types-02.png
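
As a rough illustration of list matching with whitespace anchors, here is a minimal Python sketch. It is not how Grooper implements List Match. The `list_match` helper, the label entries, and the sample text are all hypothetical. The prefix and suffix are expressed as lookarounds so that two neighboring list items can share the same whitespace character. Python's lookbehind requires a fixed-width pattern, which `\s` satisfies.

```python
import re

def list_match(text, entries, prefix=r"\s", suffix=r"\s"):
    """Return list items found in `text` that sit between the prefix and
    suffix patterns. Only the list item itself is returned."""
    # Escape each entry so it is matched literally; try longer entries first.
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(entry) for entry in ordered)
    pattern = re.compile(f"(?<={prefix})(?:{alternation})(?={suffix})")
    return [m.group(0) for m in pattern.finditer(text)]

# Hypothetical local entries, one label per line in the editor.
entries = ["Date Issued", "Closing Date", "Loan Term"]
sample = "\nDate Issued 4/15/2013\nPreClosing Date n/a\nLoan Term 30 years\n"
labels = list_match(sample, entries)
print(labels)  # ['Date Issued', 'Loan Term']
```

"Closing Date" is skipped here because it is preceded by a letter rather than whitespace, which is the anchoring effect the Prefix Pattern provides.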

Commonly, a Lexicon is used as the list for List Match. This allows you to point to a Lexicon object rather than manually entering in the list using the Local Entries property. This is excellent when matching large lists of items or when a single list is used by multiple extractors.

  1. For example, the Lexicon highlighted here...
  2. ...has 46 entries for various line items that may or may not appear in the "Loan Cost" section of a Closing Disclosure.

Value-reader-extractor-types-03.png

We could create a Value Reader configured to return any of these items just by pointing to the Lexicon.

  1. This is done in the "Properties" tab.
  2. The Lexicon is referenced using the Vocabulary properties of the List Match extractor.
  3. Using the Included Lexicons property, you can reference one or multiple existing Lexicons.

Value-reader-extractor-types-04.png


FYI Prior to Grooper version 2021, the List Match extractor's functionality was accomplished using the FuzzyList Mode option when configuring a regular expression. As with the Pattern Match extractor, this was delivered in one of two ways:
1. By the Data Format object.
2. Configuring extractor properties and selecting Internal or Text Pattern as the extractor type.

Each of these methods used a "Pattern Editor" UI screen to configure a regular expression. FuzzyList was an option for the regex Mode in the "Properties" tab. In version 2021, the Data Format object and the Internal and Text Pattern extractor types are gone. The List Match extractor replaces their functionality to return results matching a local list or reference lexicon of values.

Label Match

The Label Match extractor is extremely similar to the List Match extractor in that the extractor matches one or more items in a defined list. However, it is designed specifically to work with the Labeling Behavior functionality (also referred to as "Label Sets"). It will use the fuzzy extraction settings defined on the Content Model if a Labeling Behavior is enabled. This way, you can have a single set of fuzzy extraction settings, including confidence score settings and fuzzy weighting settings, referenced by multiple extractors, rather than configuring those settings for each extractor individually.

  • For more information on fuzzy extraction, visit the Fuzzy RegEx article.

For the Label Match extractor to return a result, two conditions must be met.

  1. The document folder must be classified.
    • In other words, it must have a Document Type assigned to it.
  2. That Document Type must have a Labeling Behavior enabled.
    • Either on the Document Type or, more typically, its parent Content Model.

  1. The Content Model selected here has enabled a Labeling Behavior.
  2. Labeling Behavior is enabled using the Behaviors property...
  3. ...and added using the collection editor seen here.
    • For more information on the Labeling Behavior, how to enable it, its configuration, and its utility, visit the Labeling Behavior article.
  4. The Label Match extractor will use all the fuzzy extraction and text wrapping settings defined here.

Value-reader-extractor-types-label match-01.png

In this example, a Value Reader is configured to return a small list of field labels on an invoice, using the Label Match Extractor Type

  1. Label Match is selected as the Extractor Type
  2. The list is entered in the Local Entries editor (just like you do with the List Match extractor).
    • Or, you can reference a Lexicon of list items using the "Properties" tab.
  3. The Prefix and Suffix Patterns are entered here.
    • ^|[^\w] is the default Prefix Pattern.
    • $|[^\w] is the default Suffix Pattern.
  4. The document we have selected is classified as an "Invoice" Document Type.
  5. This is a Document Type in the Content Model with the Labeling Behavior enabled.
  6. Upon execution, notice some results are returned with a confidence below 100%.
    • This is due to the fuzzy matching settings configured from the Labeling Behavior. The Label Similarity property was set to 90%. Any items in the list with a fuzzy matching similarity score above 90% are returned. Any falling below 90% (for example the list item CALLER:) are not.
    • Note this means changing the Labeling Behavior settings will impact ALL Label Match extractors for the Content Model's Document Types.

Value-reader-extractor-types-label match-02.png
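
The effect of a 90% similarity threshold can be illustrated with a small Python sketch. This is not Grooper's fuzzy matching algorithm (see the Fuzzy RegEx article for that). It substitutes `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity score, and the labels and OCR strings are hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in similarity score between 0.0 and 1.0 (not Grooper's metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def label_match(labels, candidates, min_similarity=0.90):
    """Return (label, candidate, score) for every pair scoring at or above
    the shared threshold, mimicking a single Label Similarity setting."""
    results = []
    for candidate in candidates:
        for label in labels:
            score = similarity(label, candidate)
            if score >= min_similarity:
                results.append((label, candidate, round(score, 2)))
    return results

# Hypothetical list entries and hypothetical OCR text with errors.
labels = ["Invoice Number", "CALLER:"]
ocr_text = ["Invoice Nunber", "SELLER:"]
matches = label_match(labels, ocr_text)
print(matches)  # [('Invoice Number', 'Invoice Nunber', 0.93)]
```

Because the threshold is a single shared argument, raising or lowering it changes the outcome for every label at once, which mirrors the note above about the setting affecting all Label Match extractors.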

Word Match

The Word Match extractor is designed for n-gram extraction. An n-gram is "a contiguous sequence of n items from a given sample of text or speech." [1] Typically in Grooper, this refers to extracting words or phrases from a lexicon of terms. Often, this is for the purposes of feature collection for Lexical Classification. The Word Match extractor can capture 1-grams (single words) up to 5-grams (five word phrases). Lexicons are commonly used to dictate a dictionary of allowable returned words. This could be a general Lexicon of common English words or a custom Lexicon, such as one with industry specific terms.

FYI

An n-gram is often referred to by a different name depending on its n size.

1-grams (single words) - unigrams
2-grams (word pairs) - bigrams
3-grams (three word phrases) - trigrams
4-grams (four word phrases) - four-grams
5-grams (five word phrases) - five-grams

As an additional FYI, four-grams are not called "tetragrams" because the term already has usage as a single word consisting of four letters or characters. "Quadrigram" is occasionally used, but four-gram is the more common terminology. Five-grams are not called "pentagrams", because that already has common usage for a geometric figure.

Just like with Pattern Match, you can enter Prefix and Suffix Patterns to only return an n-gram if a regex pattern also matches before or after. These are useful for anchoring the n-gram you want to return next to some other piece of text. For example, a Prefix Pattern of \n could be used to only return n-grams at the start of a new line because the \n character precedes every new line in the text data. Furthermore, only the n-gram is returned, not the text matched by the Prefix and Suffix Patterns.

The Join Pattern property is unique to the Word Match extractor. This determines how terms of bigrams, trigrams, four-grams, and five-grams can be joined. Most often, terms (or grams) are simply joined by a single space, as in the bigram "first second". If you leave this property blank, Grooper will assume n-grams are always separated by a single space. However, you may want to include n-grams that are separated by other characters, such as the hyphen in "first-second". The Join Pattern allows you to enter a regular expression for the allowable characters between two grams. For example, a Join Pattern of [ -] would allow for a single space or hyphen to be between each term, matching "first second" as well as "first-second".

The Output Format allows you to alter the output result for data cleansing or other purposes.

The "Properties" tab allows you to further configure the n-gram matching. Most importantly, the n-gram size is set here as well as any Lexicon used to lookup against the returned values. You can also enable Tab Marking, Fuzzy RegEx mode, filter results based on page location, determine case sensitivity, and more.

In this example, a Value Reader is configured to return bigram field labels, using the Word Match Extractor Type.

  1. Word Match is selected as the Extractor Type
  2. The Word Pattern is entered here.
    • The regex pattern entered here is used to match each single gram in the n-gram. The default pattern \p{L}+ matches any combination of letter characters in any language of any length. In most cases, this pattern will perfectly suit your n-gram extraction needs. However, you can alter this pattern if you need. For example, [a-zA-Z]+ is a very similar pattern that could be used to match English only words, as it does not include characters of foreign scripts. For example, it would not match Greek characters, such as Ω, where \p{L}+ would.
  3. The Prefix Pattern is entered here.
    • In this case, the pattern entered will only match n-grams if they are preceded by a \n \t or beginning of string ^ character.
  4. The Suffix Pattern is entered here.
    • In this case, the pattern entered will only match n-grams if they are followed by a \r \t or end of string $ character.
  5. The Join Pattern is entered here.
    • The pattern here, [ \-], will return n-grams whose grams are separated by a single space character or a hyphen (the backslash simply escapes the hyphen). If left blank, only n-grams whose grams are separated by a single space character are returned.
  6. The Output Format is formatted here.
    • Unused in this example.

Value-reader-extractor-types-05.png

In this case, we also used the "Properties" tab to set the n-gram size to collect bigrams, and only return grams in an English language dictionary.

  1. Navigate to the "Properties" tab.
  2. The Word Lookup property can be used to reference a Lexicon of allowable terms for each gram in the n-gram.
    • Here, we reference the "English Words" Lexicon that ships with every Grooper install in the "Essentials" folder of the Global Resources folder.
  3. The Phrase Size property allows you to specify the size of the n-gram.
    • Here, it is set to 2 to capture bigrams.

Value-reader-extractor-types-06.png
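
The relationship between the Word Pattern, Join Pattern, Phrase Size, and Word Lookup properties can be sketched in a few lines of Python. This is an illustration under assumptions, not Grooper's implementation. The `word_match` helper, the sample text, and the tiny lookup set are hypothetical. Python's `re` module does not support `\p{L}`, so `[^\W\d_]+` is used as a close stand-in for "one or more letters in any script". The sketch also assumes overlapping n-grams are all returned.

```python
import re

def word_match(text, phrase_size=2, word_pattern=r"[^\W\d_]+",
               join_pattern=" ", lookup=None):
    """Return n-grams of `phrase_size` words whose separators fully match
    `join_pattern`. If `lookup` is given, every gram must be in it."""
    words = list(re.finditer(word_pattern, text))
    join = re.compile(join_pattern)
    results = []
    for i in range(len(words) - phrase_size + 1):
        window = words[i:i + phrase_size]
        # Text between each pair of neighboring grams in the window.
        gaps = [text[a.end():b.start()] for a, b in zip(window, window[1:])]
        if not all(join.fullmatch(gap) for gap in gaps):
            continue
        if lookup is not None and any(
                w.group(0).lower() not in lookup for w in window):
            continue
        results.append(text[window[0].start():window[-1].end()])
    return results

sample = "Loan Amount\nInterest-Rate 5%"
print(word_match(sample))                       # ['Loan Amount']
print(word_match(sample, join_pattern=r"[ \-]"))  # ['Loan Amount', 'Interest-Rate']
print(word_match(sample, join_pattern=r"[ \-]", lookup={"loan", "amount"}))
```

With the default join of a single space, the hyphenated pair is ignored. Widening the Join Pattern to `[ \-]` admits it, and supplying a lookup set filters the results down to n-grams made only of allowed words.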


FYI Prior to Grooper version 2021, n-gram extraction configuration was lumped into other regular expression pattern configurations. As with the Pattern Match extractor, this was delivered in one of two ways:
1. By the Data Format object.
2. Configuring extractor properties and selecting Internal or Text Pattern as the extractor type.

Each of these methods used a "Pattern Editor" UI screen to configure a regular expression. The n-gram size and referenced term lexicons were set in the "Properties" tab. In version 2021, the Data Format object and the Internal and Text Pattern extractor types are gone. The Word Match extractor replaces their functionality to return n-grams in an effort to simplify n-gram extraction setup and distinguish it from general regex pattern matching.

Labeled Value

As the name implies, the Labeled Value extractor is designed to return labeled values. A common feature of structured forms is to divide information across a series of fields. But it's not as if you just have a bunch of data randomly strewn throughout the document. Typically, the field's value will be identified by some kind of label. These labels provide the critical context for what the data refers to.

Labeled Value relies on the spatial relationship between the label and the value. Most often labels and their corresponding values are aligned in one of two ways.

1. The value will be to the right of the label.

Value-reader-extractor-types-08.png

2. The value will be below the label.

Value-reader-extractor-types-07.png

Labeled Value uses two extractors itself, one to find the label and another for the value. If the two extractors' results are aligned horizontally or vertically within a certain amount of space (according to how the Labeled Value extractor is configured), the value's result is returned.

In this example, a Value Reader is configured to return the "Cash to Close" amount as described on the first page of a Closing Disclosure form, using the Labeled Value Extractor Type.

  1. Labeled Value is selected as the Extractor Type.
  2. The label is returned by the Label Extractor.
    • As an extractor, this could be any of the 14 Extractor Type options (Pattern Match, List Match, Reference, etc). Most commonly it will be either Pattern Match or a Reference to another Value Reader or Data Type.
    • In this case, a Pattern Match extractor is configured to locate the phrase "Cash to Close".
  3. The value is returned by the Value Extractor.
    • Again, this could be any of the 14 Extractor Type options (Pattern Match, List Match, Reference, etc). And commonly as well, it will be either Pattern Match or a Reference to another Value Reader or Data Type.
    • In this case, a Reference is made to the currency extractor Value Reader example in the Pattern Match tab of this article.
  4. The Layout properties in general determine the spatial relationship between the label and the value. The Maximum Distance property is used to determine the distance the value is from the label.
    • Here, the Right maximum distance is set to 1.5 inches. The value can be a maximum of 1.5 inches to the right of the label. The currency value returned by the Value Extractor (80,156.74) is indeed within 1.5 inches to the right of the label returned by the Label Extractor (Cash to Close) in this case. So the value (80,156.74) is returned.
      • Note, there are all kinds of currency values on the page here, but only the value aligned with the label is returned.
    • The Bottom maximum distance is also set to 2 inches in this case. This is actually unnecessary in this example. This would also capture values that are within 2 inches below a label.
  5. In the "Document Viewer" the label extracted by the Label Extractor is outlined in blue.
  6. The corresponding value extracted by the Value Extractor is highlighted in green.

Value-reader-extractor-types-09.png
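
The spatial test described above can be pictured with a simplified Python sketch. The actual alignment logic Grooper uses is not documented in this article, so everything here is an assumption: the `Box` class, the `labeled_value` helper, the overlap rule, and all coordinates (in inches) are hypothetical. Only the label text, the returned value, and the 1.5 and 2 inch distances come from the example.

```python
from dataclasses import dataclass

@dataclass
class Box:
    text: str
    left: float
    top: float
    right: float
    bottom: float

def overlaps(a_start, a_end, b_start, b_end):
    """True if two one-dimensional ranges share any extent."""
    return a_start < b_end and b_start < a_end

def labeled_value(label, values, max_right=None, max_bottom=None):
    """Return the first value that is to the right of, or below, the label
    within the given maximum distances (assumed alignment rules)."""
    for value in values:
        gap_right = value.left - label.right
        if (max_right is not None and 0 <= gap_right <= max_right
                and overlaps(label.top, label.bottom, value.top, value.bottom)):
            return value.text
        gap_below = value.top - label.bottom
        if (max_bottom is not None and 0 <= gap_below <= max_bottom
                and overlaps(label.left, label.right, value.left, value.right)):
            return value.text
    return None

label = Box("Cash to Close", 1.0, 5.0, 2.2, 5.2)
values = [
    Box("1,802.74", 6.0, 5.0, 6.7, 5.2),   # same row, but too far right
    Box("27,389.20", 2.8, 7.0, 3.6, 7.2),  # lower on the page, not under the label
    Box("80,156.74", 2.8, 5.0, 3.6, 5.2),  # same row, 0.6 inches to the right
]
result = labeled_value(label, values, max_right=1.5, max_bottom=2.0)
print(result)  # 80,156.74
```

Many currency values can exist on the page, but only the one that satisfies the distance and alignment checks relative to the label is returned.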

The Labeled Value extractor has some increased functionality when used in combination with a Labeling Behavior. For more information on the Labeling Behavior and how Labeled Value benefits from it, visit the Labeling Behavior article.
FYI The Labeled Value extractor is extremely similar to the Key-Value Pair Collation Provider for Data Type extractors. Key-Value Pair collated Data Types also use two extractors, one to locate the "key" (the field's label) and one to locate the "value" (the field's value). If those two extractor results are aligned horizontally or vertically (according to the settings configured on the parent Data Type), the value is returned.

Labeled Value can be considered an alternate method to Key-Value Pair extraction. In many ways, it provides a simpler setup to produce the same result (plus added functionality when combined with Label Sets). But it is not a replacement. Key-Value Pair is still a Collation option for Data Type extractors.

Field Match - About

The Field Match extractor allows you to match a value stored in a previously extracted Data Field. Instead of matching a regex pattern and returning a value, this extractor will match the Data Field's text and return a value. Or, you might think of it as the "pattern" is the Data Field's value. You can even utilize capabilities of a Pattern Match extractor. For example, you can use Prefix and Suffix Patterns to anchor the Data Field's value to a specific text based location, just like you can anchor a regex pattern with Prefix and Suffix Patterns with a Pattern Match extractor. You can also parse the Data Field's result with a Parse Pattern.

This could be used for data validation purposes. We've been using Closing Disclosure forms for most of these examples. Let's say the original file we imported for processing was named after the loan's borrower. So if John Doe was applying for the loan, the Closing Disclosure would be named "John Doe.pdf". You might want to know if the borrower's name according to the filename actually lines up with the borrower listed on the document. You could do just that with a Field Match extractor.

In this case, we could create a Data Field using a Default Value expression to return the name in the file name.

  1. Here, we have a Data Field named "Borrower From File Name"
  2. The Default Value expression is configured to return everything in the file name except the file extension.
    • Match(Folder.AttachedFileName, ".*(?=\.pdf)")
  3. The original PDF file imported into Grooper is named "Eddie Kusick and Ody Boeck.pdf"
  4. Everything in that file name but ".pdf" populates the Data Field.

Value-reader-extractor-types-25.png
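
The regex in that expression can be tried on its own. The Python sketch below assumes Grooper's `Match` function returns the first regex match, which this article does not confirm, and the `borrower_from_file_name` helper is hypothetical. The lookahead `(?=\.pdf)` requires ".pdf" to follow the match without including it in the result.

```python
import re

def borrower_from_file_name(attached_file_name):
    """Return everything before the '.pdf' extension, or '' if absent."""
    match = re.search(r".*(?=\.pdf)", attached_file_name)
    return match.group(0) if match else ""

name = borrower_from_file_name("Eddie Kusick and Ody Boeck.pdf")
print(name)  # Eddie Kusick and Ody Boeck
```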

Using a second Data Field, we're going to use a Field Match extractor to make sure the information returned by the "Borrower From File Name" Data Field actually appears on the document.

  1. Here, we have a Data Field named "File Name Borrower Validation".
  2. Its Value Extractor property is set to reference a Value Reader using the Field Match extractor.
    • We will go over how this is configured later.
  3. We've also set this Data Field's Required property to True.
    • This will flag the field if no value is populated. If the "Borrower From File Name" Data Field's value is not found on the document, the Field Match extractor will fail to produce a result. So, the field will be flagged, indicating there's some kind of mismatch with the borrower's name according to the file name and the borrower's name according to the document.
There's a specific order of operations that needs to happen before the Field Match extractor can return a result.

First, the Data Field supplying the result to be matched needs to execute. The Field Match extractor can't match the Data Field's value if it hasn't been found yet.

If we attempted to test extraction with this Data Field selected, we would not return a result. We need to go up a level in the data hierarchy to the Data Model. That way the "Borrower From File Name" Data Field will execute, returning a result. Then, the "File Name Borrower Validation" Data Field, using a Field Match extractor will be able to match the result.

Furthermore, Data Fields in a Data Model execute sequentially. You want to make sure the Data Field referenced by the Field Match extractor is listed before the Data Field executing the Field Match extractor.

Value-reader-extractor-types-26.png

  1. We can now test this result on the Data Model
  2. Press the "Test Extraction" button.
  3. The "Borrower From File Name" Data Field returns the borrower's name from the file name.
  4. The "File Name Borrower Validation" Data Field uses a Field Match extractor, matching the "Borrower From File Name" Data Field's result with text found on the document.

Value-reader-extractor-types-27.png

  1. In the case of this document, the original file's name is "Bad Name.pdf"
  2. Upon testing extraction...
  3. The Field Match extractor fails to produce a result. The result "Bad Name" is not returned by the extractor.
  4. The borrower's name on the document is "Cindi Truwert and Audrey Feak"

Value-reader-extractor-types-28.png

Field Match - Configuration

But how do you build a Field Match extractor? In this example, a Value Reader is configured to return the borrower's name on a Closing Disclosure form if it matches a Data Field returning the borrower's name from the native file's name, using the Field Match Extractor Type. This is the Value Reader using the Field Match extractor described in the About section above.

  1. Field Match is selected as the Extractor Type.
  2. The very first thing you want to do is reference the Data Field whose value you're matching on the document. This is done in the "Properties" tab.
  3. Select the Field property.
  4. Using the dropdown menu, choose the Data Field you want to match.


As far as the example in the About section above goes, that's it. That's all we did to get the results seen in this example. If the value returned by the "Borrower From File Name" Data Field is found on the document, it will be returned. If not, no result will be returned.

Value-reader-extractor-types-29.png

However, let's look at a couple other things you may want to consider when using the Field Match extractor.

  1. Switch to the "Expression Editor" tab.
  2. The Value Pattern in this case is optional. This can act as a "fall back" pattern if the Data Field's result is not matched. Grooper will prioritize returning the Data Field's value if it matches it on the document. However, if it does not, and the regex entered in the Value Pattern does match something, the Value Pattern's result will be returned.
  3. The Parse Pattern will parse the Data Field's result using a regular expression pattern.
    • For example, the regex pattern here [A-Z][a-z]+ [A-Z][a-z]+ would cause the extractor to return "Eddie Kusick" instead of "Eddie Kusick and Ody Boeck".
    • CAUTION!!! Grooper's regex is case insensitive by default in most cases. Not so with the Parse Pattern. The Parse Pattern's regex is always case sensitive.
  4. The Prefix and Suffix Pattern can be used to anchor the result to another regex pattern (just like a Pattern Match extractor)
    • Here, we really want to make sure the name in the file name matches the borrower's name. The pattern Borrower\s will cause the Field Match extractor to only return the Data Field's value if it is preceded by this Prefix Pattern.
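
The Parse Pattern and Prefix Pattern behavior described above can be approximated with Python's `re` module. This is only an analogy: Grooper's regex engine and its exact anchoring rules are not reproduced here, and the sample document text is invented for illustration.

```python
import re

field_value = "Eddie Kusick and Ody Boeck"
parse_pattern = r"[A-Z][a-z]+ [A-Z][a-z]+"

# Python's re is case sensitive by default, like the Parse Pattern.
parsed = re.search(parse_pattern, field_value).group(0)

# The same case-sensitive pattern finds nothing in all-lowercase input.
lower_match = re.search(parse_pattern, field_value.lower())

# Prefix anchoring: accept the value only when "Borrower" plus a
# whitespace character comes immediately before it in the document text.
document_text = "Borrower Eddie Kusick and Ody Boeck\nSeller Jane Roe"  # assumed sample text
prefix_pattern = r"Borrower\s"
anchored = re.search(prefix_pattern + re.escape(parsed), document_text)
unanchored = re.search(prefix_pattern + re.escape("Jane Roe"), document_text)

print(parsed)
print(lower_match)
print(anchored is not None, unanchored is not None)
```

The first match of the two-word pattern is "Eddie Kusick", which is why the parsed result drops the second borrower. A name that appears on the document without the "Borrower" prefix in front of it is rejected.
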
Remember, there's a specific order of operations that needs to happen before the Field Match extractor can return a result.

First, the Data Field supplying the result to be matched needs to execute. The Field Match extractor can't match the Data Field's value if it hasn't been found yet.

If we attempted to test extraction using the "Test Single" button, we would not return a result. We'd have to go to the Data Model to test this extractor. That way the "Borrower From File Name" Data Field will execute, returning a result. Then, the "File Name Borrower Validation" Data Field, using this Field Match extractor, will be able to match the result.

Furthermore, Data Fields in a Data Model execute sequentially. You want to make sure the Data Field referenced by the Field Match extractor is listed before the Data Field executing the Field Match extractor.

Value-reader-extractor-types-30.png


OMR Extractors

OMR stands for "optical mark recognition". Many structured forms utilize checkboxes in order to detail information. Filling out a form is much quicker if you can just check a box to indicate a choice from a list of options rather than printing a response. It also makes it easy on you if you're presented the list of possible options, even if that is as simple as checking a box next to "Yes" or "No".

However, checkboxes don't translate to a text character when recognizing a document's text data through OCR or native text extraction. In order for Grooper to understand if a checkbox is checked, it must digitally recognize the box and its "check state", checked or unchecked. OMR is this digital process of recognizing checkboxes and their check states.

In previous versions of Grooper (pre 2021), checkboxes and their check states were only determined from layout data detected by a Box Detection or Box Removal IP Command and saved to the Batch Page (or in some cases Batch Folder) of a document. Layout data collection is still important for the three OMR extractors.

Improvements have been made to the Labeled OMR extractor allowing it to return labels next to checkboxes without first obtaining the document's layout data. However, it will still use detected checkbox information in the layout data, if present. In most cases, this results in Labeled OMR being the simplest and most effective OMR extractor. Furthermore, it is the only OMR extractor that is capable of extracting radio buttons and their "press state" as checkboxes (Radio buttons are not detectable by Box Detection or Box Removal as they are not boxes).

Ordered OMR and Zonal OMR exist as options where Labeled OMR fails to produce the desired result. However, they typically require more configuration and checkbox information from the layout data is required in order to return a result.

Obtaining Layout Data

Layout data is visual information on a document obtained during an image processing operation. This includes checkboxes and their check states, line locations, and barcodes. Image processing in Grooper is primarily performed to clean up a document in order to obtain better OCR results from printed text, but it's also used to obtain this layout data. This is controlled by creating an IP Profile, which is a step by step list of IP Commands, each one performing a different image processing operation. This includes layout data collecting IP Commands, such as Box Detection and Box Removal.

The IP Profile can be executed permanently, affecting the archival export of the document, or temporarily, reverting back to the original image after OCR is performed.

  • For permanent image processing, the IP Profile is executed by the Image Processing activity.
  • For temporary image processing, the IP Profile is executed by the Recognize activity.
    • The Recognize activity obtains a document's text data via OCR for scanned or image-based documents or native text extraction for digital documents with native machine readable text already present.
      • For scanned or image-based documents, the IP Profile is referenced in the OCR Profile used.
      • For digital documents, you aren't performing OCR. Machine readable text is already present as part of the document's content and Recognize extracts that native digital text. However, the Alternate IP property can be used to reference an IP Profile containing layout data collecting IP Commands.

In either case, if the IP Profile contains layout data collecting IP Commands, layout data will be stored on the processed Batch Page object's "Grooper.Layout.json" file. For example, Box Detection and Box Removal will store checkbox locations and their check states (either checked or unchecked). The OMR extractors will then use this layout data to return labels next to checked boxes.

  1. Here, we have created an IP Profile named "Layout Data".
  2. This IP Profile's second step uses the Box Removal IP Command
    • Note: Whether using Box Detection or Box Removal Grooper will detect checkboxes and store their locations and check states (either checked or unchecked). Box Removal will digitally remove the checkbox from the page's image whereas Box Detection will not.
  3. The box detection settings are determined by the General Settings
  4. In the "Image Diagnostics" panel, the Execution Log of the Box Removal folder will show you the results of the Box Removal command.
  5. Here, you can see all the detected checkboxes, their locations, and their check states.
  6. The Boxes diagnostic image will show you visually where checkboxes are on a page. Detected unchecked boxes will be highlighted in red. Checked boxes are highlighted in green.

Value-reader-extractor-types-16.png

FYI: Layout Data Verification

Most often, a Box Removal command is executed by a temporary IP Profile during the Recognize activity. However, whether it is executed during Image Processing or Recognize, the checkbox information is saved to a Batch Page object's "Grooper.Layout.json" file.

You can verify this with the "Files" tab of the "Advanced" tab when selecting a Batch Page or Batch Folder processed with a Box Detection command.

  1. Select the processed Batch Page in the node tree.
  2. Navigate to the "Advanced" tab.
  3. Navigate to the "Files" tab.
  4. Select the "Grooper.Layout.json" file.
    • If this file is not present, either no layout data was detected by the steps in an IP Profile or the IP Profile was not executed yet.
  5. The detected checkbox information is stored in the JSON file in a way Grooper can quickly access.

Value-reader-extractor-types-17.png

Labeled OMR

The Labeled OMR extractor is designed to be the easiest OMR extractor to set up. In most cases, it's as simple as defining the list of labels next to the checkboxes and determining whether multiple boxes can be checked, just one, or whether a checked box evaluates to a binary true/false value. The label (or labels) next to the checked boxes are then returned.

The Labeled OMR extractor has just two properties necessary for its configuration: Label Extractor and Mode.

The Label Extractor serves to return any labels next to a checkbox. In many cases, this is as simple as using a List Match extractor and entering a list of the text labels on the document.

The Mode property corresponds with how the checkboxes behave on the document. Can just one checkbox out of many be checked? Can multiple? Is there just one checkbox where it being checked means one thing and unchecked another? This can be one of three options: CheckOne, CheckMulti or Boolean

  • CheckOne will target multiple checkboxes but presumes only one may be checked. This is for when documents present a list of options, only one of which may be chosen.
  • CheckMulti will target multiple checkboxes and presumes any number of them may be checked. This is for when documents present a list of options and any number of them can be chosen.
    • All labels are returned as a single concatenated result. These results may be separated by a Separator String. For example, a comma (,) could be used to create a comma-separated list of checked values.
  • Boolean targets only a single checkbox, presuming the checkbox represents a Boolean "true or false" answer. This is for when documents present a single checkbox where checking the box indicates one thing and leaving it unchecked means another.
    • By default the result will evaluate to either "True" or "False" but this can be altered using the Value If Checked and Value If Unchecked properties.
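
The three modes can be summarized with a short Python sketch. It is a simplified model, not Grooper's implementation: the `evaluate_omr` function and its parameter names are hypothetical, and CheckOne is reduced to "return the label only when exactly one box is checked" rather than modeling confidence scores.

```python
def evaluate_omr(boxes, mode, separator=", ",
                 value_if_checked="True", value_if_unchecked="False"):
    """boxes is a list of (label, is_checked) pairs."""
    checked = [label for label, is_checked in boxes if is_checked]

    if mode == "CheckOne":
        # Simplification: a result only when exactly one box is checked.
        return checked[0] if len(checked) == 1 else None

    if mode == "CheckMulti":
        # Every checked label, joined by the separator string.
        return separator.join(checked)

    if mode == "Boolean":
        if len(boxes) != 1:
            raise ValueError("Boolean mode targets a single checkbox")
        return value_if_checked if boxes[0][1] else value_if_unchecked

    raise ValueError(f"Unknown mode: {mode}")


loan_type = [("Conventional", False), ("FHA", True), ("VA", False)]
estimate = [("Property Taxes", True), ("Homeowner's Insurance", True), ("Other", False)]

print(evaluate_omr(loan_type, "CheckOne"))
print(evaluate_omr(estimate, "CheckMulti"))
print(evaluate_omr([("Escrow", False)], "Boolean", value_if_unchecked="No"))
```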

In this example, a Value Reader is configured to return the type of loan applied for on a "Closing Disclosure" form, using the Labeled OMR Extractor Type.

  1. Labeled OMR is selected as the Extractor Type.
  2. The Label Extractor is configured to return the checkbox labels.
    • Here, we used a simple List Match extractor with our three loan types in its Local Entries list (We're ignoring the fill-in "Other" option for the sake of simplifying this example).
      Conventional
      FHA
      VA
    • List Match will be the most common extractor configuration. However, you have full access to every different extractor type. You could use Pattern Match or reference a Data Type, whatever works for your document set. The important thing to keep in mind is you need to return a single result for each individual label. Here, we have three checkbox labels. The Label Extractor thus returns three results, one for each label.
  3. Configure the Mode property according to how the checkboxes behave on the document.
    • Here, there can only be one type of loan for each individual Closing Disclosure. You'd never have a home loan that is both a conventional loan and an FHA loan. So, it is set to CheckOne.
  4. In the "Document Viewer", labels are outlined in blue and detected checkboxes are highlighted in green.
  5. The label next to the checked box is returned.
    • Technically, for CheckOne mode, all these labels are returned, but the checked label is returned at 100% confidence where the unchecked ones are returned at 0%. This extractor will always return the most confident value first which is ultimately what we want.
    • Note: If multiple boxes are checked with CheckOne selected, all labels will return with 0% confidence.

Value-reader-extractor-types-18.png


FYI In previous versions of Grooper, you were required to configure which direction the checkbox was from the label (North, South, East or West).

As of version 2021, this is no longer required. Labeled OMR will now look for the closest box within an inch of the label in any direction.
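
A nearest-box search of this kind might look like the following Python sketch. It assumes coordinates in inches and measures center to center. How Grooper actually measures the distance (centers, edges, or something else) is not stated here, so `closest_box` and its parameters are purely illustrative.

```python
import math

def closest_box(label_center, box_centers, max_distance_in=1.0):
    """Return the box center nearest the label, or None if none is within range."""
    best, best_distance = None, max_distance_in
    for box in box_centers:
        distance = math.dist(label_center, box)  # straight-line distance, any direction
        if distance <= best_distance:
            best, best_distance = box, distance
    return best


label = (2.0, 1.0)
print(closest_box(label, [(1.7, 1.0), (2.0, 3.0)]))  # box to the left is within range
print(closest_box(label, [(4.0, 1.0)]))              # only box is two inches away
```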


On Layout Data and Labeled OMR

Labeled OMR is unique among the OMR extractors in that Box Detection layout data is not necessary in order for it to function. It will use that data if present, but will also attempt to determine if a checkbox is next to a label if it is not.

This is exceptionally useful for non-standard checkboxes difficult or impossible for Box Detection to detect. For example, radio buttons. Radio buttons are just circles. If the button is checked, it has a dot inside it, but otherwise it is just a circle. Box Detection only detects boxes: squares and rectangles. There's no way a radio button's location and check state will be stored in a document's layout data file.

Not to fret! New improvements to the Labeled OMR extractor in version 2021 allow it to detect radio buttons and other non-standard checkboxes.

Here, the Labeled OMR extractor is unchanged from the example described above.

  1. However, this document uses radio buttons instead of checkboxes to detail the Closing Disclosure's loan type.
  2. In both cases, the correctly checked label is returned.

Value-reader-extractor-types-19.png

Ordered OMR

The Ordered OMR extractor is a little more complicated to set up than Labeled OMR but can be used in cases where Labeled OMR is not producing the desired results. However, checkbox data must be present in the document's layout data. Ordered OMR will not function without this data present before executing. Furthermore, Ordered OMR assumes the checkboxes will be ordered one after the other either vertically or horizontally along a single line. For documents with sections of checkboxes broken up into multiple columns (for vertically ordered checkboxes) or multiple lines (for horizontally ordered checkboxes), multiple Ordered OMR extractors may be necessary (one for each column or line of checkboxes).

For Ordered OMR you will indicate where boxes are on a document by drawing a rectangular zone around the checkboxes. All checkboxes must fall within the drawn zone (This distinguishes Ordered OMR from Zonal OMR. For Zonal OMR, a single zone is drawn for each individual checkbox). This zone is configured using the Location property. This can be one of four options:

  1. Fixed Region - This option is the simplest to set up. As the name implies, the rectangular zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box around the checkboxes.
  2. Relative Region - Instead of setting the zone in a fixed location for every document, the Relative Region mode will anchor the zone to a text label on the document. The zone's position will change relative to the label's position on the document, but will still have the same drawn dimensions.
    • This option is useful to overcome issues arising during scanning printed documents. Slight variations can occur as to where a value is when printing or scanning a document, even for very structured documents. This can cause problems when drawing a single fixed region for the zone. However, if you can anchor the zone off an extractable text value, the zone's position will shift according to that anchor's position.
  3. Text Region - The Text Region option creates a rectangular zone using the logical boundaries of an extraction result. This can create the zone within the boundaries of the extractor's result.
    • This can also be configured to provide results in a similar way the Relative Region option does, using text anchors located by an extractor to position the extraction zone's location. This means both methods can be used to position the zone relative to a point from document to document. The main difference is in how the zone is drawn.
  4. Shape Region - The Shape Region option is extremely similar to the Text Region option. However, instead of using text to anchor the extraction zone, it uses a shape detected from a Shape Detection or Shape Removal IP Command.
    • This is the least common method used.

Ordered OMR must also have its Mode property configured. This property behaves the same as Labeled OMR. It determines how many checkboxes should be checked for the checkboxes falling within the rectangular zone. This can be one of three options:

  • CheckOne will target multiple checkboxes but presumes only one may be checked. This is for when documents present a list of options, only one of which may be chosen.
  • CheckMulti will target multiple checkboxes and presumes any number of them may be checked. This is for when documents present a list of options and any number of them can be chosen.
    • All labels are returned as a single comma-separated result.
  • Boolean targets only a single checkbox, presuming the checkbox represents a Boolean "true or false" answer. This is for when documents present a single checkbox where checking the box indicates one thing and leaving it unchecked means another.

Ordered OMR is different from Labeled OMR in that the output values are not located by a Label Extractor. Instead, you will use the Output Values property to list each checkbox's corresponding value. Ordered OMR presumes the checkboxes are either stacked on top of each other in a single column or ordered next to each other across a single horizontal line. Using the Output Values property, you will enter a comma-separated list of the checkboxes' labels from top to bottom or left to right.

  • When selecting the Boolean Mode, only two values may be entered.

The Flow Direction property determines how they are ordered, either Vertical or Horizontal. Vertical is appropriate for boxes stacked on top of each other. Horizontal is appropriate for boxes next to each other along a horizontal line.
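
The pairing of Output Values with checkboxes can be sketched in Python as follows. The sketch assumes each detected box inside the zone is a dictionary with `x`, `y`, and `checked` keys, and that `y` grows downward as in image coordinates. These names and the `ordered_omr` function are hypothetical and do not reflect Grooper's internal data structures.

```python
def ordered_omr(boxes, output_values, flow_direction="Vertical"):
    """Pair boxes, sorted along the flow axis, with the listed output values."""
    values = [value.strip() for value in output_values.split(",")]
    axis = "y" if flow_direction == "Vertical" else "x"
    ordered = sorted(boxes, key=lambda box: box[axis])
    # The first value belongs to the topmost (or leftmost) box, and so on.
    return [value for box, value in zip(ordered, values) if box["checked"]]


boxes_in_zone = [
    {"x": 1.0, "y": 3.0, "checked": True},
    {"x": 1.0, "y": 1.0, "checked": False},
    {"x": 1.0, "y": 2.0, "checked": True},
]
result = ordered_omr(boxes_in_zone, "Property Taxes, Homeowner's Insurance, Other")
print(result)
```

Because the values are matched purely by position, a missed or extra box in the zone shifts every value after it, which is one reason the zone should contain only the intended checkboxes.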

In this example, a Value Reader is configured to return the "This estimate includes" options for a "Closing Disclosure" form, using the Ordered OMR Extractor Type.

  1. Ordered OMR is selected as the Extractor Type.
  2. For any Ordered OMR configuration you must configure the Location property. This determines where the zone is placed on each document. All checkboxes should fall in this zone.
    • In this case, we've selected Fixed Region and drawn a rectangle on the page. These forms are highly structured. We can assume the checkboxes will always fall within the same rectangular coordinates.
  3. You can see the drawn zone in green in the "Document Viewer" pane.
  4. Configure the Mode property according to how the checkboxes behave on the document.
    • Here, multiple boxes can be checked. The estimate could include property taxes, homeowner's insurance, other costs, or any combination of the three. So, it is set to CheckMulti.
  5. Each label is entered as a comma-separated list in the Output Values property.
  6. Whether the checkboxes are ordered vertically or horizontally is set by the Flow Direction property.
    • Here, the checkboxes are stacked on top of each other. So, it is set to Vertical.

Value-reader-extractor-types-20.png

Zonal OMR

For Zonal OMR a rectangular zone must be drawn around the location of each individual checkbox. This typically makes Zonal OMR the most time consuming of the OMR extractors, but may be necessary to target checkboxes on forms whose order is not targetable by Ordered OMR or whose labels cannot be extracted by Labeled OMR. Furthermore, checkbox data must be present in the document's layout data. Zonal OMR will not function without this data present before executing.

Just like Labeled OMR and Ordered OMR, Zonal OMR must also have its Mode property configured. It determines how many checkboxes should be checked for the checkboxes falling within the rectangular zones. This can be one of three options:

  • CheckOne will target multiple checkboxes but presumes only one may be checked. This is for when documents present a list of options, only one of which may be chosen.
  • CheckMulti will target multiple checkboxes and presumes any number of them may be checked. This is for when documents present a list of options and any number of them can be chosen.
    • All labels are returned as a single comma-separated result.
  • Boolean targets only a single checkbox, presuming the checkbox represents a Boolean "true or false" answer. This is for when documents present a single checkbox where checking the box indicates one thing and leaving it unchecked means another.
    • By default the result will evaluate to either "True" or "False" but this can be altered using the Value If Checked and Value If Unchecked properties.

The Anchor property allows all drawn zones for each checkbox to be anchored to an extractible text result. This is similar to how the Relative Region Location option of Ordered OMR can anchor a zone to a relative location on the page rather than a fixed position that remains the same for each and every document.

  • This option is useful to overcome issues arising during scanning printed documents. Slight variations can occur as to where a value is when printing or scanning a document, even for very structured documents. This can cause problems when drawing a single fixed region for the zone. However, if you can anchor the zone off an extractable text value, the zone's position will shift according to that anchor's position.

The rectangular zones are drawn using the OMR Boxes property. This will bring up a collection editor to draw registration zones for each checkbox. Here, you will also enter the output result for each OMR zone. This editor also allows you to anchor zones to a text anchor for each individual zone, meaning you can anchor a single zone here (as well as anchoring the collection of OMR zones using the Anchor property described above).
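
Conceptually, each OMR zone pairs an output value with a rectangle, and an anchor shifts all the rectangles together. The Python sketch below models that idea under stated assumptions: the `Zone` class, its fields, and the `zonal_omr` function are hypothetical, zone positions are stored as offsets from the anchor, and a box counts as inside a zone when its center falls within the rectangle.

```python
from dataclasses import dataclass

@dataclass
class Zone:
    value: str     # output result for this checkbox
    left: float    # horizontal offset from the anchor
    top: float     # vertical offset from the anchor
    width: float
    height: float

    def contains(self, x, y, anchor=(0.0, 0.0)):
        anchor_x, anchor_y = anchor
        return (anchor_x + self.left <= x <= anchor_x + self.left + self.width
                and anchor_y + self.top <= y <= anchor_y + self.top + self.height)


def zonal_omr(zones, checked_box_centers, anchor=(0.0, 0.0)):
    """Return the value of every zone that contains a checked box center."""
    return [zone.value for zone in zones
            if any(zone.contains(x, y, anchor) for x, y in checked_box_centers)]


zones = [
    Zone("Conventional", 1.0, 0.0, 0.2, 0.2),
    Zone("FHA", 1.5, 0.0, 0.2, 0.2),
]

# Same form scanned twice: the second scan is shifted slightly, and the
# anchor (for example, the located "Loan Type" text) shifts along with it.
first_scan = zonal_omr(zones, [(1.6, 2.1)], anchor=(0.5, 2.0))
second_scan = zonal_omr(zones, [(1.7, 2.2)], anchor=(0.6, 2.1))
print(first_scan, second_scan)
```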

In this example, a Value Reader is configured to return the type of loan applied for on a "Closing Disclosure" form, using the Zonal OMR Extractor Type.

  1. Zonal OMR is selected as the Extractor Type.
  2. Configure the Mode property according to how the checkboxes behave on the document.
    • Here, there can only be one type of loan for each individual Closing Disclosure. So, it is set to CheckOne.
  3. The Anchor property is used to anchor the collection of OMR zones to a text result on the document.
    • Note: Using an Anchor is optional. You can choose to configure an anchor extractor or not. In many cases, using one will increase the accuracy of your results, but it is not required. Furthermore, using the Anchor subproperties you can choose to make the anchor required to perform extraction or not if it fails to produce a result.
  4. We chose to anchor the collection of OMR zones to the text "Loan Type" in this case.
  5. One rectangular zone is drawn around each checkbox using the OMR Boxes collection editor.

Value-reader-extractor-types-21.png

  1. Add one zone for each checkbox using the "Add" button.
  2. Use the Value property to enter the output result.
  3. Use the Bounds property to enter the zone's coordinates.
    • Select this property and press the ellipsis button at the end to lasso the zone with your cursor.

In this case, we have four checkboxes. So, we have added four items to our collection list here, one for each loan type.

Value-reader-extractor-types-22.png

Remember how we ignored the "Other" loan type for our Labeled OMR example? The text entered for the "Other" loan type could be anything under the sun, which makes it more difficult to extract a label.

  1. With Zonal OMR, we didn't have to enter the label. We just drew a zone around the corresponding checkbox.
  2. Its output result is what we entered in this OMR box's Value property, "Other".

Note: This doesn't mean you couldn't use Labeled OMR and still successfully extract the text label here (even though it could be anything!). It would just require a more complex extractor than we used for our earlier example.

Value-reader-extractor-types-23.png

Barcode Extractors

Both the Find Barcode and Read Barcode extractors will return a barcode's encoded value as its result. Barcodes encode information according to the "symbology" used. Grooper has the ability to detect and read 29 different symbologies, including Code128, Code39, US postal barcodes, QR codes, and UPC barcodes.

For both the Find Barcode and Read Barcode extractors, you must specify which barcode symbology you're looking for. The difference between the two Extractor Types is when Grooper's barcode detection runs before the value is extracted.

  • Find Barcode - The barcode's value must be obtained by a Barcode Detection or Barcode Removal IP Command in an IP Profile before the extractor executes.
    • When the Find Barcode extractor executes on a document folder, it looks for the barcode value in the Batch Folder's layout data file or its child Batch Pages' layout data files. If present, the barcode's value is returned.
  • Read Barcode - This extractor executes barcode detection every time the extractor executes. This means a layout data file is not necessary for the extractor to return a barcode value.
    • No layout data required. No previous Barcode Detection or Barcode Removal IP Command is necessary.
    • The extractor is configured to locate a barcode in nearly the same way you configure the Barcode Detection IP Command to detect the barcode.

Your decision as to which one you choose to use will largely be based on when you want to expend the processing time required to detect the barcode. Read Barcode performs barcode detection when the extractor executes. For Find Barcode, the extractor presumes the barcode was already detected and present in the document folder's layout data. Layout data is stored in a file named "Grooper.Layout.json" when an IP Command such as Barcode Detection detects layout data on a Batch Page (or in certain cases a Batch Folder).

Find Barcode simply uses the "Grooper.Layout.json" file to return the barcode value. Barcode detection takes time. The processing time to read a document's "Grooper.Layout.json" file might take 3 milliseconds. The processing time to detect a barcode on a document is significantly longer. Let's say 300 milliseconds. Find Barcode just finds the barcode value in the "Grooper.Layout.json" file, taking 3 ms to return a value. On the other hand, Read Barcode must first read the barcode before returning the value, taking 300 ms to return a value. Therefore, Find Barcode has a faster runtime execution than Read Barcode.

However, this presumes the barcode's value is already present in the "Grooper.Layout.json" file. So, you're still going to have to find the barcode at some point in the Batch Process. You just have to ask yourself if you want to spend that processing time during extraction or ahead of time in some kind of image processing operation (either the Image Processing activity for permanent image processing or the Recognize activity for temporary image processing).
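
The trade-off is essentially "detect on every run" versus "detect once, then look up". The Python sketch below models it with a call counter. Everything in it is a stand-in: `detect_barcodes` represents the expensive detection pass, and the `{"barcodes": [...]}` dictionary is an assumed shape, not the actual "Grooper.Layout.json" schema.

```python
def detect_barcodes(page_image):
    """Stand-in for an expensive barcode detection pass (hypothetical)."""
    detect_barcodes.calls += 1
    return [{"symbology": "Code39", "value": "01/12/2020"}]

detect_barcodes.calls = 0


def read_barcode(page_image, symbology):
    # Detection runs every time the extractor executes.
    return [b["value"] for b in detect_barcodes(page_image)
            if b["symbology"] == symbology]


def find_barcode(layout_data, symbology):
    # Only looks up previously stored values; no detection happens here.
    return [b["value"] for b in layout_data.get("barcodes", [])
            if b["symbology"] == symbology]


# Detection performed once ahead of time, with the results stored as layout data.
layout_data = {"barcodes": detect_barcodes("page-1")}
found_first = find_barcode(layout_data, "Code39")
found_second = find_barcode(layout_data, "Code39")
calls_after_find = detect_barcodes.calls

# Read-style extraction pays the detection cost on each execution.
read_first = read_barcode("page-1", "Code39")
read_second = read_barcode("page-1", "Code39")
calls_after_read = detect_barcodes.calls

print(calls_after_find, calls_after_read)
```

The lookup-style extractor also returns nothing when no layout data was stored beforehand, which mirrors the prerequisite described for Find Barcode.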

Read Barcode

In this example, a Value Reader is configured to return the date encoded in a barcode, using the Read Barcode Extractor Type.

  1. Read Barcode is selected as the Extractor Type.
  2. For any configuration, you must define which barcode symbologies are used in your document. This is configured using the Detection Settings properties.
  3. You must choose which barcode reader is used to detect the barcode symbology. There are four Reader properties available: Standard Reader, 1D Reader, 2D Reader, and Postal Reader
    • The 1D Reader, 2D Reader, and Postal Reader work faster and generally provide better results than the Standard Reader but are limited in the barcode symbologies they support.
    • In this case, the barcode uses the "Code 39" symbology, which is supported by the 1D Reader. Thus, the 1D Reader is Enabled.
  4. Use the Barcode Symbologies property to select which barcode symbology you wish to detect.
    • In this case, we have selected Code39.
    • Note: You may select multiple barcode symbologies. However, be careful as this can provide false positive results. Certain barcodes use only slightly different symbologies than others.
  5. Optionally, you may configure the Value Pattern property to write a regular expression pattern to validate the barcode's value.
    • In this case a simple regex matching a date format \d{1,2}/\d{1,2}/(\d{4}|\d{2}) is used to validate the returned value is a date.
    • Note: The regex pattern written here will not parse the value at all, just validate it. If the regex matches any portion of the barcode's value, the whole value is returned.
  6. When the Value Reader executes, the barcode is detected and its encoded value is returned.
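
The "validate, don't parse" behavior of the Value Pattern can be mimicked in Python with `re.search`. The `validate_barcode` helper is a hypothetical name used only for this illustration, and the second and third sample values are invented.

```python
import re

DATE_PATTERN = r"\d{1,2}/\d{1,2}/(\d{4}|\d{2})"

def validate_barcode(value, value_pattern):
    """Return the whole value if the pattern matches any part of it, else None."""
    return value if re.search(value_pattern, value) else None


print(validate_barcode("01/12/2020", DATE_PATTERN))
print(validate_barcode("DATE 01/12/2020 REV A", DATE_PATTERN))  # whole value, not just the date
print(validate_barcode("ABC123", DATE_PATTERN))
```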

Value-reader-extractor-types-12.png


Configuration Prereqs - Obtain Layout Data

In the case of Find Barcode, the barcode must previously be detected by a Barcode Detection or Barcode Removal IP Command with its value stored in the document's layout data.

  1. Here, we have created an IP Profile named "Layout Data".
  2. This IP Profile's first step uses the Barcode Removal IP Command
    • Note: Whether using Barcode Detection or Barcode Removal Grooper will detect the barcode and store its value. Barcode Removal will digitally remove the barcode from the page's image whereas Barcode Detection will not.

From this point, barcode detection is configured exactly the same way as seen in the Read Barcode example.

  1. For any configuration, you must define which barcode symbologies are used in your document. This is configured using the Detection Settings properties.
  2. You must choose which barcode reader is used to detect the barcode symbology. There are four Reader properties available: Standard Reader, 1D Reader, 2D Reader, and Postal Reader
    • The 1D Reader, 2D Reader, and Postal Reader work faster and generally provide better results than the Standard Reader but are limited in the barcode symbologies they support.
    • In this case, the barcode uses the "Code 39" symbology, which is supported by the 1D Reader. Thus, the 1D Reader is Enabled.
  3. Use the Barcode Symbologies property to select which barcode symbology you wish to detect.
    • In this case, we have selected Code39.
    • Note: You may select multiple barcode symbologies. However, be careful as this can provide false positive results. Certain barcodes use only slightly different symbologies than others.
  4. In the "Image Diagnostics" panel, the Execution Log of the Barcode Removal folder will show you the results of the Barcode Removal command.
  5. Here, you can see a "Code39" barcode was found, its positional boundaries, and its encoded value ("01/12/2020").

Value-reader-extractor-types-13.png

Before executing the Find Barcode extractor, the IP Profile with the Barcode Removal or Barcode Detection IP Command must be executed to save the barcode value in the document's layout data.

An IP Profile can be executed in one of two ways:

  1. During the Image Processing activity for permanent image processing.
  2. During the Recognize activity for temporary image processing.
    • For OCR of image-based documents, the IP Profile is assigned to the OCR Profile assigned to the Recognize activity.
    • For native text extraction of digital documents, an IP Profile can be assigned to the Alternate IP property of the Recognize activity.

FYI: Layout Data Verification

Most often, a Barcode Removal command is executed by a temporary IP Profile during the Recognize activity. However, whether it is executed during Image Processing or Recognize, the barcode value is saved to the Batch Page object's "Grooper.Layout.json" file.

You can verify this with the "Files" tab of the "Advanced" tab when selecting a Batch Page or Batch Folder processed with a Barcode Detection command.

  1. Select the processed Batch Page in the node tree.
  2. Navigate to the "Advanced" tab.
  3. Navigate to the "Files" tab.
  4. Select the "Grooper.Layout.json" file.
    • If this file is not present, either no layout data was detected by the steps in an IP Profile or the IP Profile was not executed yet.
  5. The detected barcode information is stored in the JSON file in a way Grooper can quickly access.

Value-reader-extractor-types-14.png

Find Barcode

In this example, a Value Reader is configured to return the date encoded in a barcode, using the Find Barcode Extractor Type. This document's layout data was previously obtained with the IP Profile described above during the Recognize activity.

  1. Find Barcode is selected as the Extractor Type.
  2. Since the barcode's value is already stored in the document's layout data file, all you need to do is define what barcode symbology you're looking for.
    • In this case, we're looking for a Code39 barcode value.
    • Note: You may also choose All to return any and all barcode values in the "Grooper.Layout.json" file.
  3. Because the barcode was detected before the extractor executes, it runs very quickly.
  4. Rather than digitally scanning the page for a barcode, it simply returns the already obtained information stored in the layout data file.
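
As a rough illustration of steps 2 through 4, the sketch below treats Find Barcode as a plain lookup over values that were stored earlier. The JSON shape (`barcodes`, `symbology`, `value`) and the `find_barcodes` helper are hypothetical. This article does not show the real structure of "Grooper.Layout.json", so this should not be read as its actual schema or as Grooper's implementation.

```python
import json

# Hypothetical stand-in for layout data saved by a Barcode Detection or
# Barcode Removal step. The real "Grooper.Layout.json" schema is not shown
# in this article; these key names are assumptions for illustration only.
HYPOTHETICAL_LAYOUT_JSON = """
{"barcodes": [
  {"symbology": "Code39", "value": "01/12/2020"},
  {"symbology": "QRCode", "value": "https://example.com"}
]}
"""


def find_barcodes(layout, symbology="All"):
    """Return stored barcode values, optionally filtered by symbology.

    No image is scanned here: the function only reads values that an
    earlier detection step already saved.
    """
    return [
        entry["value"]
        for entry in layout.get("barcodes", [])
        if symbology == "All" or entry["symbology"] == symbology
    ]


layout = json.loads(HYPOTHETICAL_LAYOUT_JSON)
code39_values = find_barcodes(layout, "Code39")
all_values = find_barcodes(layout)
```

Because the lookup never touches the page image, its cost depends only on how many barcode entries were stored, which mirrors why the extractor runs quickly once detection has already happened.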

Value-reader-extractor-types-15.png

Zonal Extractors

Read Zone

The Read Zone extractor allows you to extract text data in a rectangular region (called an "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text from the same location on a document, or a zone relative to an extracted text anchor or shape location on the document.

Read Zone is useful for extracting data from highly structured documents. If a document's structure is fixed, it's going to have the same fields in the same physical location from one document to the next. The Closing Disclosure forms we've been looking at in this article are themselves fairly fixed. For example, the "Loan Amount" listed on the first page is more or less in the same spot for every single Closing Disclosure. The dollar amount itself may change, but there's only so much room that amount can take up on the document.

If you can draw a rectangle around the value you want to extract, and the value falls within the boundaries of that rectangle for every single document, extraction may be as simple as just extracting the text in the rectangle's location. This is referred to as "zonal extraction". You draw a zone where the value exists on the page and return the text data falling in the zone.

Read Zone has a few different options for where the box is placed using the Location property. This can be one of four options:

  1. Fixed Region - This option is the simplest to set up. As the name implies, the extraction zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box where you want to extract data.
  2. Relative Region - Instead of setting the extraction zone in a fixed location for every document, the Relative Region mode will anchor the zone to a text label on the document. The extraction zone's position will change relative to the label's position on the document, but will still have the same drawn dimensions.
    • This option is useful to overcome issues arising during scanning printed documents. Slight variations can occur as to where a value is when printing or scanning a document, even for very structured documents. This can cause problems when drawing a single fixed region for the extraction zone. However, if you can anchor the zone off an extractable text value, the zone's position will shift according to that anchor's position.
  3. Text Region - The Text Region option creates an extraction zone using the logical boundaries of an extraction result. This can return all the text falling within the boundaries of the rectangle around the extractor's result.
    • This can also be configured to provide results in a way similar to the Relative Region option, using text anchors located by an extractor to position the extraction zone. This means both methods can be used to position the zone relative to a point from document to document. The main difference is in how the zone is drawn.
  4. Shape Region - The Shape Region option is extremely similar to the Text Region option. However, instead of using text to anchor the extraction zone, it uses a shape detected from a Shape Detection or Shape Removal IP Command.
    • This is the least common method used.
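
To make the difference between Fixed Region and Relative Region concrete, here is a minimal sketch in Python. It is not Grooper code: the `Rect` and `Word` classes, both functions, and all coordinates are hypothetical, and the anchor label is treated as a single token for simplicity. The second sample page simulates a scan that shifted slightly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other):
        """True when `other` lies entirely inside this rectangle."""
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)

    def offset_by(self, dx, dy):
        return Rect(self.left + dx, self.top + dy,
                    self.right + dx, self.bottom + dy)


@dataclass(frozen=True)
class Word:
    text: str
    box: Rect


def read_fixed_zone(words, zone):
    """Fixed Region idea: same page coordinates on every document."""
    return " ".join(w.text for w in words if zone.contains(w.box))


def read_relative_zone(words, anchor_text, zone_offset):
    """Relative Region idea: the zone is positioned from the anchor's top-left corner."""
    anchor = next((w for w in words if w.text == anchor_text), None)
    if anchor is None:
        return ""  # no anchor found, so there is nowhere to place the zone
    zone = zone_offset.offset_by(anchor.box.left, anchor.box.top)
    return " ".join(w.text for w in words if zone.contains(w.box))


# Two invented pages; page_b is the same form scanned with a slight shift.
page_a = [Word("Loan Amount", Rect(1.0, 2.0, 2.0, 2.2)),
          Word("$159,432.62", Rect(2.5, 2.0, 3.5, 2.2))]
page_b = [Word("Loan Amount", Rect(1.1, 2.3, 2.1, 2.5)),
          Word("$159,432.62", Rect(2.6, 2.3, 3.6, 2.5))]

fixed_zone = Rect(2.4, 1.9, 3.6, 2.3)        # absolute page coordinates
relative_zone = Rect(1.4, -0.1, 2.6, 0.3)    # offsets from the anchor

fixed_a = read_fixed_zone(page_a, fixed_zone)
fixed_b = read_fixed_zone(page_b, fixed_zone)
relative_a = read_relative_zone(page_a, "Loan Amount", relative_zone)
relative_b = read_relative_zone(page_b, "Loan Amount", relative_zone)
```

In this toy data the fixed zone misses the value on the shifted page, while the anchored zone follows the label and still captures it. That is the scanning-variation problem the Relative Region option is meant to address.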

The Read Zone extractor can optionally re-process the text data with an OCR Profile. This can be used to perform custom OCR on the extracted text.

The text in the zone can also be itself extracted by a Value Extractor. This allows you to break up the document into a smaller portion and run an extractor on just the zone instead of the full document. Essentially, you use the Read Zone extractor to create a smaller data instance (from the larger document data instance) and use its Value Extractor property to return data from the smaller data instance.
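
The idea of running an extractor on the zone's text instead of the whole document can be sketched as follows. This is only an illustration under stated assumptions: the currency regex, the `run_value_extractor` helper, and the sample strings (including the "Closing Costs" figure) are invented, and a plain Python regex stands in for whatever Value Extractor would actually be configured.

```python
import re

# Hypothetical currency pattern standing in for a configured Value Extractor.
CURRENCY = re.compile(r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?")


def run_value_extractor(text, pattern=CURRENCY):
    """Return every pattern match found in the supplied text."""
    return [match.group(0) for match in pattern.finditer(text)]


# Invented sample text: the full document versus the text inside the zone.
document_text = "Loan Amount $ 159,432.62 Closing Costs $9,712.10"
zone_text = "Loan Amount $ 159,432.62"

whole_document_hits = run_value_extractor(document_text)
zone_hits = run_value_extractor(zone_text)
```

Run against the whole document, the pattern returns every currency value. Run against the smaller zone instance, it returns only the one inside the zone, so a generic pattern becomes specific without being rewritten.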

In this example, a Value Reader is configured to return the "Loan Amount" value as described on the first page of a Closing Disclosure form, using the Read Zone Extractor Type.

  1. Read Zone is selected as the Extractor Type.
  2. For any Read Zone configuration, you must configure the Location property. This determines where the extraction zone is placed on each document.
    • In this case, we configured the Relative Region option. We're using the text label "Loan Amount" as the anchor for the drawn extraction zone. You fully configure whichever Location mode you choose by expanding and configuring its sub-properties.
      • The extracted anchor is seen in the "Document Viewer" outlined in blue.
      • The extraction zone is seen highlighted in green. Any text falling within that green box is returned as the result.
  3. The Output Full Region property is very handy. It doesn't change the result at all; it just shows the full size of the drawn zone on the page. This is extremely useful when testing and configuring the Read Zone extractor.
  4. If Output Full Region were set to False, only the text ($ 159,432.62) would be highlighted, not the full drawn zone seen here.

Value-reader-extractor-types-10.png

Highlight Zone

The Highlight Zone extractor is unique in that it doesn't actually extract anything at all!

So, why use it? Highlight Zone can be useful for quickly calling attention to Data Fields requiring manual validation during a Data Review activity. For example, handwritten fields on a form are unlikely to be recognized by OCR. OCR is designed to read machine-printed characters. While there are some recent advancements in handwriting detection (Microsoft Azure's OCR service is particularly adept at recognizing handwritten characters), OCR engines tend to fail at recognizing handwriting. In the case of fields written in by hand, you will likely need a human being to enter that information during Data Review.

However, if those fields are in the same spots on the document, you can use Highlight Zone to draw the data entry clerk's attention to its location on the page, saving them time and you money.

The highlighted zones are drawn using the exact same Location methods and property configurations as Read Zone.

In this example, a Value Reader is configured to highlight a signature field, using the Highlight Zone Extractor Type.

  1. Highlight Zone is selected as the Extractor Type.
  2. Just like with Read Zone, you must configure the Location property. This determines where the highlighted zone is placed on each document.
  3. In this case, we used Fixed Region. The drawn rectangle has the same coordinates (drawn using the Bounds property) for every single document.
  4. We also put a Page Filter on the zone. Since the signature page is always on page 5 of the document, we set it to 5.
  5. This could be used to draw a data entry clerk's attention to the signature page to quickly verify if the document is signed.

Value-reader-extractor-types-11.png

The Reference Extractor

The Reference extractor is simply an extractor that returns the results of another extractor object in the node tree. You can use the Reference Extractor Type to reference any of the three extractor objects: Data Types, Value Readers, and Field Classes. This keeps you from duplicating your efforts over and over again. For example, if you have a variety of different extractors needing to return a currency value, don't create a new currency extractor every time you need one. Just create a single currency extractor and use the Reference Extractor Type to re-use it wherever it is needed.

In this example, a Value Reader is configured to return the results of the currency extractor we created for the Pattern Match example (named "Pattern Match - Currency"), using the Reference Extractor Type.

  1. Reference is selected as the Extractor Type.
  2. Use the Extractor property to point to an extractor in the node tree.
    • You can reference a Data Type, Value Reader, or Field Class.
  3. When the Reference extractor executes, the results of the referenced extractor are returned.
  4. You can see the name of the extractor returning the value in the "Name" column.
  5. In this case, it is the referenced "Pattern Match - Currency" Value Reader created earlier.
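
The "define once, reference everywhere" idea behind the steps above can be sketched in a few lines of Python. Everything here is hypothetical: the `node_tree` dictionary, the `pattern_match` and `reference` helpers, and the currency regex are stand-ins used only to show the shape of the idea, not Grooper's object model.

```python
import re
from typing import Callable, Dict, List

Extractor = Callable[[str], List[str]]


def pattern_match(pattern: str) -> Extractor:
    """Build a simple regex-based extractor (a stand-in for a Value Reader)."""
    compiled = re.compile(pattern)
    return lambda text: [m.group(0) for m in compiled.finditer(text)]


# Hypothetical "node tree": one shared currency extractor, defined once.
node_tree: Dict[str, Extractor] = {
    "Pattern Match - Currency": pattern_match(r"\$\d{1,3}(?:,\d{3})*\.\d{2}"),
}


def reference(name: str) -> Extractor:
    """Return an extractor that defers to the named extractor when it runs."""
    # The lookup happens at call time, so later changes to the referenced
    # extractor are picked up by everything that references it.
    return lambda text: node_tree[name](text)


# Two different consumers re-use the same definition instead of copying it.
loan_amount = reference("Pattern Match - Currency")
closing_costs = reference("Pattern Match - Currency")

loan_hits = loan_amount("Loan Amount $159,432.62")
cost_hits = closing_costs("Total $9,712.10 and $5.00")
```

The design point is that a fix to the shared currency pattern only has to be made in one place, and every referencing extractor benefits from it.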

Value-reader-extractor-types-24.png

Value Readers vs Data Types

Before version 2021, the Data Type extractor object was considered the bread and butter of data extraction. For many Grooper users, a "data extractor" and a Data Type are synonymous. In some ways, this may remain the case. However, the introduction of the Value Reader object was (at least in part) designed to conceptually distinguish a Data Type from a "general purpose extractor". Instead, we should emphasize the Data Type's primary function in Grooper: Data Collation.

  • A Value Reader is a Grooper object designed for data extraction, returning the initial data set from the document.
  • A Data Type is a Grooper object designed for data collation, processing and returning the final data set from the document.
    • Even choosing not to collate results (using the Individual Collation Provider) is still a method of collating data. It's just the simplest way of collating data.

Because data collation is so important for a variety of extraction techniques, it's almost natural to equate collated data with extracted data. But there are really two parts to what's going on. First, values are extracted from the document's text data. Then, they are collated and finally returned by the Data Type.

While both Value Readers and Data Types are considered "extractors", they really have two different jobs as far as Grooper is concerned. One way to think of this is a Value Reader is a "data finder" while a Data Type is a "data manipulator". It is a Value Reader's job to locate and return data from a document. It's a Data Type's job to take that data and organize it, manipulate it or impose constraints on what counts as valid data.

For example, a Data Type can be configured to only return values that are stacked on top of one another in a vertical array (using the Array Collation Provider). Any values not aligned with each other vertically are tossed out of the final result. The results are collated into a vertical array. Before the results can be collated, they have to be found. Only then can it be determined if and how they are organized and manipulated. That's the true job of the Value Reader, finding and returning the initial data set. The Data Type then collates and returns the final data set.
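
The find-then-collate split can be sketched as below. This is a simplified illustration and not Grooper's Array collation algorithm: the `Hit` class, the `collate_vertical_array` function, the alignment tolerance, and the sample coordinates are all assumptions. The list of hits plays the role of the initial data set found by a Value Reader, and the function plays the role of the Data Type's collation step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    text: str
    left: float  # horizontal position of the hit on the page
    top: float   # vertical position of the hit on the page


def collate_vertical_array(hits, tolerance=0.1):
    """Keep the largest group of hits sharing (roughly) the same left edge.

    Hits outside that column are discarded, and the survivors are
    returned in top-to-bottom order.
    """
    best = []
    for seed in hits:
        column = [h for h in hits if abs(h.left - seed.left) <= tolerance]
        if len(column) > len(best):
            best = column
    return sorted(best, key=lambda h: h.top)


# Step 1 ("data finder"): an invented initial data set of currency hits.
found = [
    Hit("$10.00", 5.00, 1.0),
    Hit("$99.99", 2.00, 1.2),  # not aligned with the others
    Hit("$30.00", 4.98, 2.0),
    Hit("$20.00", 5.02, 1.5),
]

# Step 2 ("data manipulator"): collate the hits into a vertical array.
collated = [h.text for h in collate_vertical_array(found)]
```

Note that the collation step never looks at the document itself. It can only organize or discard what the finding step handed to it.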

To drive this point home, take a look at the Extractor property of the Data Type.

  1. Here, we have an unconfigured Data Type selected, returning no results.
  2. Select the Extractor property.
  3. Use the drop-down menu to expand its property options. These should look very familiar.
    • They are the exact same options as the Extractor Type for a Value Reader. It's almost as if the Data Type's Extractor property is just a local Value Reader configured on the Data Type object itself.
    • The Data Type can't do anything without being supplied results by an extractor. This could be the results the Extractor property's configuration returns, or the results of a child or referenced Value Reader, or the results of a child or referenced Data Type with its own extraction configured.
    • If you boil it down, at some point, whenever you're doing the job of finding data you want the Data Type to collate and return, you're going to be making a Value Reader or configuring a property that behaves very much like a Value Reader.
      • The Value Reader is the data extractor or "data finder".
      • The Data Type is the data collator or "data manipulator".

Value-reader-03.png

Version Differences

2021: Introducing the Value Reader Object

The Value Reader is a new object added in version 2021. It combines the functionality of other objects or properties of objects from previous versions by either supplementing them or replacing them entirely. This includes:

  • The Data Format
    The Data Format object is gone in 2021. Its functionality is replaced mostly by the Pattern Match Extractor Type, although the List Match and Word Match Extractor Types cover some of its functionality as well.
  • The Internal and Text Pattern extractor options
    These extractor options were available to various extractor properties on various objects in Grooper, such as the Value Extractor property of a Data Type. This allowed you to use a "Pattern Editor" to configure regex pattern extraction local to the object (using the same configuration UI as the Data Format object). These options are gone in 2021. Their functionality is also replaced by the Pattern Match, List Match, and Word Match Extractor Types.
  • The Fuzzy List extraction mode
    This is also gone in 2021. Previously, Fuzzy List was a regex Mode option for Data Formats and for the Internal and Text Pattern extractors. This functionality is replaced by the List Match Extractor Type.
  • The Labeled OMR, Ordered OMR, Zonal OMR, Find Barcode, Read Barcode, Highlight Zone, and Read Zone extractor options
    Previously, these extractor options were only available to a handful of objects through their property configuration (for example, the Data Field's Value Extractor property). While still available to these properties, they are also available to the Value Reader as an Extractor Type option. This means these extraction methods are available anywhere in Grooper a Value Reader can be referenced, not just to a limited few objects.

Furthermore, three brand new extraction methods were created out of whole cloth in Grooper 2021 and are available to the Value Reader as Extractor Types: Labeled Value, Field Match, and Read Substring.