Value Reader (Object)

This article is about the current version of Grooper.

Note that some content may still need to be updated.

Graphic depicting the Grooper Value Reader

Value Reader objects define a single data extraction operation. You set the Extractor Type on the Value Reader that matches the specific data you're aiming to capture. For example, you would use the Pattern Match Extractor Type to return data using regular expressions. You would use a Value Reader when you need to extract a single result or a list of simple results from a document.

Value Readers are Grooper's "one stop shop" for data extraction. They return a single result, or a list of results, from a page or document folder's text data (obtained via OCR or native text extraction from the Recognize activity). Results may be numerical or lexical.

You may download and import the file below into your own Grooper environment (version 2024). This contains a Batch with the example document(s) discussed in this article and a Project containing a Content Model configured according to its instructions.

About

The Value Reader is a new extraction object introduced in Grooper 2021. It is designed to expand on the extractor functionality of Grooper's regular expression pattern matching capabilities to include newer extraction capabilities, such as extracting values next to OMR (optical mark recognition) checkboxes and barcode values. In previous versions, this functionality was split across multiple objects (or properties of multiple objects). The Value Reader extractor combines these disparate functionalities into a single extractor object with increased functionality. This object forms the foundation for extracting information from a document, using a variety of different methods.

  • Do you need to extract a date? A Value Reader can do that!
  • Do you need to extract anything matching a list of values? A Value Reader can do that!
  • Do you need to extract English language unigrams (or bigrams, trigrams, etc.)? A Value Reader can do that!
  • Do you need to extract a value from a barcode? A Value Reader can do that!
  • Do you need to extract the label next to a checked checkbox? A Value Reader can do that!

Do you need to find any value at all? You're going to use some kind of configuration of the Value Reader to do it.

Value Readers locate results using a variety of Extractor Types. The very first thing you will do when creating a Value Reader is decide which Extractor Type suits your extraction needs.

  1. With a Value Reader created and selected in the node tree...
  2. You will see the Extractor Type property at the top of the Value Reader's UI.
  3. Use the drop down menu to select which Extractor Type you wish to use.

User Interface

Regardless of the Extractor Type selected the Value Reader UI can be divided into five sections:

  1. The Expressions Window which can be accessed via the "Tester" tab.
    • There are two main tabs you will use:
      • The "Value Reader" tab, where the type of Value Reader is chosen.
      • The "Tester" tab, where you can test your configured Value Reader on the document.
  2. The Extractor Type configuration window.
    • Here, the selected Extractor Type is configured to return data. Depending on the specific Extractor Type selected, this panel will change somewhat. Each Extractor Type has its own set of required and optional properties to extract text from a document.
  3. The "Batch Selector" window.
    • Here, a Test Batch is selected to test the extractor.
  4. The "Document Viewer" window.
    • This provides the user a visual interface with a document folder or page selected in the "Batch Selector". Notably, results are highlighted in green on the page.
    • You can also switch to the "Text Input" tab to view the text data obtained via OCR or native text extraction from the Recognize activity.
    • The "Diagnostics" tab provides additional information to the Design Studio user about the extraction. This can be a useful troubleshooting tool when configuring various Extractor Types.
  5. The "Results Panel" window.
    • This panel shows you a list of all results returned by the extractor.

Extractor Types

The Extractor Type options fall into one of six categories.

Category Extractor Types Comments

Text Parsing Extractors

  • Pattern Match
  • List Match
  • Label Match
  • Word Match
  • Labeled Value
  • Field Match

These Extractor Types primarily rely on regular expression, lists of values (such as a Lexicon of field labels) or other forms of text parsing to return values.

  • Note: This does not mean other Extractor Types do not or cannot use regular expression or parse text as part of their functionality. Far from it. In very general terms, Grooper's "data extraction" is itself a form of text parsing in one way or another. These Extractor Types just rely on it more fundamentally.

LLM Extractors

  • Ask AI

These Extractor Types use "large language models" like ChatGPT to return results via chatbot style "conversations" with documents.

OMR Extractors

  • Labeled OMR
  • Ordered OMR
  • Zonal OMR

These Extractor Types allow you to return values using optical mark recognition. These are useful for extracting values on documents that use checkboxes to detail information.

Barcode Extractors

  • Find Barcode
  • Read Barcode

These Extractor Types allow you to return a value stored in a barcode.

Zonal Extractors

  • Highlight Zone
  • Read Zone

These Extractor Types are used to draw a logical rectangle somewhere on a document and return the text falling inside. These are useful for extracting values on highly structured documents where field values are consistently located in the same position on the page for every document.

The Reference extractor

  • Reference

The Reference option allows you to return the results of another extractor, whether that is a Value Reader, a Data Type, or a Field Class.

Text Parsing Extractors

Pattern Match

The Pattern Match extractor relies on regular expression (regex) pattern matching to return values. This is truly the foundation for almost all data extraction in Grooper. A regex pattern entered in the Value Pattern will run against the selected document, page, or data instance's text data. Matching results will be returned as this extractor's values.

You can also enter Prefix and Suffix Patterns to only return data if the text matched by the Value Pattern also matches a regex pattern before or after. These are useful for anchoring the value you want to return next to some other piece of text. For example, a Prefix Pattern of \n could be used to only return results at the start of a new line because the \n character precedes every new line in the text data. Furthermore, only the data matched by the Value Pattern is returned.

The Output Format allows you to alter the output result for data cleansing or other purposes.

The "Properties" tab allows you to further configure the regex extraction. Here, you can enable Tab Marking, Fuzzy RegEx mode, filter results based on page location, determine case sensitivity, use a Lexicon to perform lookup operations, and more.


In this example, a Value Reader is configured to return currency values, using the Pattern Match Extractor Type.

  1. On the "Value Reader" tab, Pattern Match is selected as the Extractor Type
  2. The Value Pattern is entered here.
    • In this case, the regex pattern \d{1,3}(,\d{3}){0,2}\.\d{2} matches decimal values from 999,999,999.99 to 0.00.
  3. The Prefix Pattern is entered here.
    • Here, the pattern matches an optional, space-padded dollar sign.
  4. The Suffix Pattern is entered here.
    • The [^%] matches anything not a percent sign, throwing out percentage values.
  5. The Output Format is formatted here.
    • Unused in this example.
  6. Properties are configured using the "Properties" tab.
    • Unused in this example.
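
The interaction between the Value, Prefix, and Suffix Patterns can be approximated outside of Grooper. The following is a minimal Python sketch, not Grooper's implementation: it uses Python's re module rather than Grooper's regex engine, the function name extract_currency and the sample text are hypothetical, and the Prefix Pattern shown is only an assumed stand-in for "an optional space padded dollar sign" because the exact pattern is not given above.

```python
import re

# Value Pattern from the example above: decimal values from 0.00 to 999,999,999.99.
VALUE_PATTERN = r"\d{1,3}(,\d{3}){0,2}\.\d{2}"
# ASSUMPTION: a stand-in for "an optional space padded dollar sign".
PREFIX_PATTERN = r"\s?\$?\s?"
# Suffix Pattern from the example above: any character that is not a percent sign.
SUFFIX_PATTERN = r"[^%]"


def extract_currency(text):
    """Return only the text matched by the Value Pattern.

    The prefix and suffix must match for a hit to count, but they are
    not part of the returned value.
    """
    combined = re.compile(
        f"(?:{PREFIX_PATTERN})(?P<value>{VALUE_PATTERN})(?:{SUFFIX_PATTERN})"
    )
    return [m.group("value") for m in combined.finditer(text)]


# Hypothetical sample text, not taken from the example document.
sample = "Loan Amount $ 162,000.00 \nInterest Rate 3.88%\nMonthly Payment $761.78 \n"
print(extract_currency(sample))  # ['162,000.00', '761.78']
```

The percentage value 3.88% is discarded because the character after it is a percent sign. Note that in this sketch the suffix [^%] must consume one character, so a value at the very end of the text would not be returned.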

List Match

The List Match extractor returns values matching one or more items in a defined list. This could be used to match a list of field labels on a form, a list of company names, a list of document titles, or any other list of words or phrases. You can even enable the use of regular expression syntax to match a list of regex patterns.

Just like with Pattern Match, you can enter Prefix and Suffix Patterns to only return an item in the list if a regex pattern also matches before or after. These are useful for anchoring the value you want to return next to some other piece of text. For example, a Prefix Pattern of \n could be used to only return results at the start of a new line because the \n character precedes every new line in the text data. Furthermore, only the list item is returned, not the text matched by the Prefix and Suffix Patterns.

The Output Format allows you to alter the output result for data cleansing or other purposes.

The "Properties" tab allows you to further configure the list matching. Here, you can enable Tab Marking, Fuzzy RegEx mode, filter results based on page location, determine case sensitivity, reference a Lexicon for the list, and more.


In this example, a Value Reader is configured to return the field labels in the top portion of this Closing Disclosure form, using the List Match Extractor Type

  1. On the "Value Reader" tab, List Match is selected as the Extractor Type
  2. The list is entered in the Local Entries editor.
    • In this case, one label after another, line by line.
  3. The Prefix and Suffix Patterns are entered here.
    • Here, \s is used for both patterns to only return items in the list if they are surrounded by whitespace characters (\r, \n, \t, \f or a single space character).
  4. Properties are configured using the "Properties" tab.
    • Unused in this example.
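
As a rough illustration of the idea, list matching with whitespace anchors can be sketched in Python. This is an assumption-laden approximation, not how Grooper implements List Match: the function name list_match, the sample entries, and the sample text are all hypothetical, and case-insensitive matching is assumed.

```python
import re


def list_match(text, entries, prefix=r"\s", suffix=r"\s"):
    """Return list entries found in text, anchored by prefix and suffix patterns."""
    # Escape each entry so it is matched literally, and try longer entries
    # first so a long entry wins over a shorter entry it begins with.
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(entry) for entry in ordered)
    pattern = re.compile(
        f"(?:{prefix})(?P<value>{alternation})(?:{suffix})",
        re.IGNORECASE,  # ASSUMPTION: case-insensitive matching
    )
    # Only the list item is returned, not the prefix or suffix text.
    return [m.group("value") for m in pattern.finditer(text)]


# Hypothetical entries and text for illustration only.
entries = ["Closing Date", "Disbursement Date", "Settlement Agent", "Property"]
text = (
    "\nClosing Date 4/15/2013\nDisbursement Date 4/15/2013\n"
    "Settlement Agent Epsilon Title Co.\nProperty123\n"
)
print(list_match(text, entries))
# ['Closing Date', 'Disbursement Date', 'Settlement Agent']
```

"Property" is not returned because it is followed by a digit rather than whitespace. Because the prefix and suffix consume characters in this sketch, two list items separated by a single whitespace character would not both match.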


Commonly, a Lexicon is used as the list for List Match. This allows you to point to a Lexicon object rather than manually entering in the list using the Local Entries property. This is excellent when matching large lists of items or when a single list is used by multiple extractors.

  1. For example, the Lexicon highlighted here...
  2. ...has 46 entries for various line items that may or may not appear in the "Loan Cost" section of a Closing Disclosure.


We could create a Value Reader configured to return any of these items just by pointing to the Lexicon.

  1. This is done in the "Properties" tab.
  2. The Lexicon is referenced using the Vocabulary properties of the List Match extractor.
  3. Using the Included Lexicons property, you can reference one or multiple existing Lexicons.

Label Match

The Label Match extractor is extremely similar to the List Match extractor in that it matches one or more items in a defined list. However, it is designed specifically to work with the Labeling Behavior functionality (also referred to as "Label Sets"). It will use the fuzzy extraction and vertical and constrained wrapping settings defined on the Content Model if a Labeling Behavior is enabled. This way, you can have a single, unified set of fuzzy match settings for multiple extractors. Rather than configuring these settings, including the confidence score threshold and fuzzy weighting, for multiple extractors, you can configure them just once when enabling the Labeling Behavior and all Label Match extractors will use them.

  • For more information on fuzzy extraction, visit the Fuzzy RegEx article.

For the Label Match extractor to return a result, two conditions must be met.

  1. The document folder must be classified.
    • In other words, it must have a Document Type assigned to it.
  2. That Document Type must have a Labeling Behavior enabled.
    • Either on the Document Type or, more typically, its parent Content Model.


  1. The Content Model selected here has a Labeling Behavior enabled.
  2. Labeling Behavior is enabled using the Behaviors property...
  3. ...and added using the collection editor seen here.
    • For more information on the Labeling Behavior, how to enable it, its configuration, and its utility, visit the Labeling Behavior article.
  4. The Label Match extractor will use all the fuzzy extraction and text wrapping settings defined here.


In this example, a Value Reader is configured to return a small list of field labels on an invoice, using the Label Match Extractor Type

  1. On the "Value Reader" tab, Label Match is selected as the Extractor Type
  2. The list is entered in the Local Entries editor (just like you do with the List Match extractor).
    • Or, you can reference a Lexicon of list items using the "Properties" tab.
  3. The Prefix and Suffix Patterns are entered here.
    • ^|[^\w] is the default Prefix Pattern.
    • $|[^\w] is the default Suffix Pattern.
  4. The document we have selected is classified as an "Invoice" Document Type.
  5. This is a Document Type in the Content Model with the Labeling Behavior enabled.
  6. Upon execution, notice some results are returned with a confidence below 100%.
    • This is due to the fuzzy matching settings configured from the Labeling Behavior. The Label Similarity property was set to 90%. Any items in the list with a fuzzy matching similarity score above 90% are returned. Any falling below 90% (for example the list item CALLER:) are not.
    • Note this means changing the Labeling Behavior settings will impact ALL Label Match extractors for the Content Model's Document Types.
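
The effect of a similarity threshold can be sketched in Python. This is only an illustration of thresholded fuzzy matching: it uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure, which is an assumption and not Grooper's fuzzy matching algorithm. The function names and the sample OCR strings are hypothetical.

```python
from difflib import SequenceMatcher


def label_similarity(label, candidate):
    """Similarity between 0.0 and 1.0 (stand-in measure, not Grooper's scoring)."""
    return SequenceMatcher(None, label.upper(), candidate.upper()).ratio()


def fuzzy_label_match(labels, candidates, threshold=0.90):
    """Keep candidates whose best-matching label scores at or above the threshold."""
    results = []
    for candidate in candidates:
        best = max(labels, key=lambda label: label_similarity(label, candidate))
        score = label_similarity(best, candidate)
        if score >= threshold:
            results.append((best, candidate, score))
    return results


# Hypothetical labels and OCR output; "INVO1CE" and "CA11ER." simulate OCR errors.
labels = ["INVOICE NUMBER", "CALLER:"]
ocr_candidates = ["INVOICE NUMBER", "INVO1CE NUMBER", "CA11ER."]

for label, found, score in fuzzy_label_match(labels, ocr_candidates):
    print(f"{label!r} matched {found!r} at {score:.0%}")
```

A single shared threshold, like the 90% Label Similarity in the example above, decides which imperfect matches survive: one wrong character in a long label still passes, while several wrong characters in a short label do not.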

Word Match

The Word Match extractor is designed for n-gram extraction. An n-gram is "a contiguous sequence of n items from a given sample of text or speech." [1] Typically in Grooper, this refers to extracting words or phrases from a lexicon of terms. Often, this is for the purposes of feature collection for Lexical Classification. The Word Match extractor can capture 1-grams (single words) up to 5-grams (five word phrases). Lexicons are commonly used to dictate a dictionary of allowable returned words. This could be a general Lexicon of common English words or a custom Lexicon, such as one with industry specific terms.

FYI

An n-gram is often referred to by a different name depending on its n size.

1-grams (single words) - unigrams
2-grams (word pairs) - bigrams
3-grams (three word phrases) - trigrams
4-grams (four word phrases) - four-grams
5-grams (five word phrases) - five-grams

As an additional FYI, four-grams are not called "tetragrams" because the term already has usage as a single word consisting of four letters or characters. "Quadrigram" is occasionally used, but four-gram is the more common terminology. Five-grams are not called "pentagrams", because that already has common usage for a geometric figure.

Just like with Pattern Match, you can enter Prefix and Suffix Patterns to only return an n-gram if a regex pattern also matches before or after. These are useful for anchoring the n-gram you want to return next to some other piece of text. For example, a Prefix Pattern of \n could be used to only return n-grams at the start of a new line because the \n character precedes every new line in the text data. Furthermore, only the n-gram is returned, not the text matched by the Prefix and Suffix Patterns.

The Join Pattern property is unique to the Word Match extractor. This determines how terms of bigrams, trigrams, four-grams, and five-grams can be joined. Most often, terms (or grams) are simply joined by a single space, as in the bigram "first second". If you leave this property blank, Grooper will assume n-grams are always separated by a single space. However, you may want to include n-grams that are separated by other characters. For example hyphenated words, as in "first-second". The Join Pattern allows you to enter a regular expression for the allowable characters between two grams. For example, a Join Pattern of [ -] would allow for a single space or hyphen to be between each term, matching "first second" as well as "first-second".

The Output Format allows you to alter the output result for data cleansing or other purposes.

The "Properties" tab allows you to further configure the n-gram matching. Most importantly, the n-gram size is set here as well as any Lexicon used to lookup against the returned values. You can also enable Tab Marking, Fuzzy RegEx mode, filter results based on page location, determine case sensitivity, and more.


In this example, a Value Reader is configured to return bigram field labels, using the Word Match Extractor Type.

  1. On the "Value Reader" tab, Word Match is selected as the Extractor Type
  2. The Word Pattern is entered here.
    • The regex pattern entered here is used to match each single gram in the n-gram. The default pattern \p{L}+ matches any combination of letter characters in any language of any length. In most cases, this pattern will perfectly suit your n-gram extraction needs. However, you can alter this pattern if you need. For example, [a-zA-Z]+ is a very similar pattern that could be used to match English only words, as it does not include characters of foreign scripts. For example, it would not match Greek characters, such as Ω, where \p{L}+ would.
  3. The Prefix Pattern is entered here.
    • In this case, the pattern entered will only match n-grams if they are preceded by a \n \t or beginning of string ^ character.
  4. The Suffix Pattern is entered here.
    • In this case, the pattern entered will only match n-grams if they are followed by a \r \t or end of string $ character.
  5. The Join Pattern is entered here.
    • The pattern here, [ \-], will return n-grams whose grams are separated by a single space character or a hyphen (the \- is an escaped hyphen). If left blank, only n-grams whose grams are separated by a single space character are returned.
  6. The Output Format is formatted here.
    • Unused in this example.


In this case, we also used the "Properties" tab to set the n-gram size to collect bigrams, and only return grams in an English language dictionary.

  1. Navigate to the "Properties" tab.
  2. The Word Lookup property can be used to reference a Lexicon of allowable terms for each gram in the n-gram.
    • Here, we reference the "English Words" Lexicon that ships with every Grooper install in the "Essentials" folder of the Global Resources folder.
  3. The Phrase Size property allows you to specify the size of the n-gram.
    • Here, it is set to 2 to capture bigrams.

Labeled Value

As the name implies, the Labeled Value extractor is designed to return labeled values. A common feature of structured forms is to divide information across a series of fields. But it's not as if you just have a bunch of data randomly strewn throughout the document. Typically, the field's value will be identified by some kind of label. These labels provide the critical context to what the data refers to.

Labeled Value relies on the spatial relationship between the label and the value. Most often labels and their corresponding values are aligned in one of two ways.

1. The value will be to the right of the label.

2. The value will be below the label.

Labeled Value uses two extractors itself, one to find the label and another for the value. If the two extractors' results are aligned horizontally or vertically within a certain amount of space (according to how the Labeled Value extractor is configured), the value's result is returned.


In this example, a Value Reader is configured to return the "Cash to Close" amount as described on the first page of a Closing Disclosure form, using the Labeled Value Extractor Type.

  1. On the "Value Reader" tab, Labeled Value is selected as the Extractor Type.
  2. The label is returned by the Label Extractor.
    • As an extractor, this could be any of the Extractor Type options (Pattern Match, List Match, Reference, etc). Most commonly it will be either Pattern Match or a Reference to another Value Reader or Data Type.
    • In this case, a Pattern Match extractor is configured to locate the phrase "Cash to Close".
  3. The value is returned by the Value Extractor.
    • Again, this could be any of the Extractor Type options (Pattern Match, List Match, Reference, etc). Just as commonly, it will be either Pattern Match or a Reference to another Value Reader or Data Type.
    • In this case, a Reference is made to the currency extractor Value Reader example in the Pattern Match tab of this article.
  4. The Layout properties in general determine the spatial relationship between the label and the value. The Maximum Distance property is used to determine the distance the value is from the label.
    • Note, there are all kinds of currency values on the page here, but only the value aligned with the label is returned.
    • The Bottom maximum distance is also set to 2 inches in this case. This is actually unnecessary in this example. This would also capture values that are within 2 inches below a label.
  5. In the "Document Viewer" the label extracted by the Label Extractor is outlined in blue.
  6. The corresponding value extracted by the Value Extractor is highlighted in green.
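
The spatial test between a label and a value can be pictured with simple bounding-box geometry. The sketch below is a hypothetical illustration only: the Hit class, the find_labeled_value function, the coordinates, and the exact alignment rules are all assumptions made for this example and do not describe Grooper's actual layout logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """An extractor result with a bounding box in inches (hypothetical structure)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def overlaps(a_start, a_end, b_start, b_end):
    """True if two 1-D ranges share any extent."""
    return min(a_end, b_end) > max(a_start, b_start)


def find_labeled_value(label: Hit, values: List[Hit],
                       max_right: float = 2.0,
                       max_bottom: float = 0.0) -> Optional[Hit]:
    """Return the closest value to the right of, or below, the label."""
    best, best_gap = None, None
    for value in values:
        gap = None
        right_gap = value.left - label.right
        below_gap = value.top - label.bottom
        # To the right: within max_right and sharing vertical extent with the label.
        if 0 <= right_gap <= max_right and overlaps(
                label.top, label.bottom, value.top, value.bottom):
            gap = right_gap
        # Below: within max_bottom and sharing horizontal extent with the label.
        elif 0 <= below_gap <= max_bottom and overlaps(
                label.left, label.right, value.left, value.right):
            gap = below_gap
        if gap is not None and (best_gap is None or gap < best_gap):
            best, best_gap = value, gap
    return best


# Hypothetical coordinates: two currency values, only one aligned with the label.
label = Hit("Cash to Close", 1.0, 5.0, 2.2, 5.2)
values = [
    Hit("$162,000.00", 2.5, 1.0, 3.4, 1.2),
    Hit("$29,677.43", 2.5, 5.0, 3.4, 5.2),
]
match = find_labeled_value(label, values)
print(match.text if match else None)  # $29,677.43
```

Both candidates are valid currency values, but only the one sharing a line with the label passes the alignment test, which is the point made in the note above about other currency values on the page.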


The Labeled Value extractor has some increased functionality when used in combination with a Labeling Behavior. For more information on the Labeling Behavior and how Labeled Value benefits from it, visit the Labeling Behavior article.

Field Match

About

The Field Match extractor allows you to match a value stored in a previously extracted Data Field. Instead of matching a regex pattern and returning a value, this extractor will match the Data Field's text and return a value. Or, you might think of it as the "pattern" is the Data Field's value. You can even utilize capabilities of a Pattern Match extractor. For example, you can use Prefix and Suffix Patterns to anchor the Data Field's value to a specific text based location, just like you can anchor a regex pattern with Prefix and Suffix Patterns with a Pattern Match extractor. You can also parse the Data Field's result with a Parse Pattern.

This could be used for data validation purposes. We've been using Closing Disclosure forms for most of these examples. Let's say the original file imported for processing was named after the loan's borrower. So if John Doe was applying for the loan, the Closing Disclosure would be named "John Doe.pdf". You might want to know if the borrower's name according to the file name actually lines up with the borrower listed on the document. You could do just that with a Field Match extractor.


In this case, we could create a Data Field using a Default Value expression to return the name in the file name.

  1. Here, we have a Data Field named "Borrower From File Name"
  2. The Default Value expression is configured to return everything in the file name except the ".pdf" file extension.
    • Match(Folder.AttachedFileName, ".*(?=\.pdf)")


  1. The original PDF file imported into Grooper is named "Eddie Kusick and Ody Boeck.pdf"
  2. Everything in that file name but ".pdf" populates the Data Field.
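
The regex in the Default Value expression uses a lookahead so that ".pdf" must follow the match without being part of it. A short Python sketch of the same pattern is shown below. The function name borrower_from_file_name is hypothetical, and it is an assumption that Grooper's Match function behaves like Python's re.match here.

```python
import re


def borrower_from_file_name(file_name):
    """Return everything before the .pdf extension, or None if there is none."""
    # (?=\.pdf) is a lookahead: ".pdf" must follow, but it is not consumed.
    match = re.match(r".*(?=\.pdf)", file_name)
    return match.group(0) if match else None


print(borrower_from_file_name("Eddie Kusick and Ody Boeck.pdf"))
# Eddie Kusick and Ody Boeck
```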


Using a second Data Field, we're going to use a Field Match extractor to make sure the information returned by the "Borrower From File Name" Data Field is actually found on the document.

  1. Here, we have a Data Field named "File Name Borrower Validation".
  2. Its Value Extractor property is set to reference a Value Reader using the Field Match extractor.
    • We will go over how this is configured later.
  3. We've also set this Data Field's Required property to True.
    • This will flag the field if no value is populated. If the "Borrower From File Name" Data Field's value is not found on the document, the Field Match extractor will fail to produce a result. So, the field will be flagged, indicating there's some kind of mismatch with the borrower's name according to the file name and the borrower's name according to the document.

There's a specific order of operations that needs to happen before the Field Match extractor can return a result.

First, the Data Field supplying the result to be matched needs to execute. The Field Match extractor can't match the Data Field's value if it hasn't been found yet.

If we attempted to test extraction with this Data Field selected, we would not return a result. We need to go up a level in the data hierarchy to the Data Model. That way the "Borrower From File Name" Data Field will execute, returning a result. Then, the "File Name Borrower Validation" Data Field, using a Field Match extractor will be able to match the result.

Furthermore, Data Fields in a Data Model execute sequentially. You want to make sure the Data Field referenced by the Field Match extractor is listed before the Data Field executing the Field Match extractor.


  1. We can now test this result on the Data Model
  2. Navigate to the "Tester" tab and press the Play button.
  3. The "Borrower From File Name" Data Field returns the borrower's name from the file name.
  4. The "File Name Borrower Validation" Data Field uses a Field Match extractor, matching the "Borrower From File Name" Data Field's result with text found on the document.


  1. In the case of this document, the original file's name is "Bad Name.pdf"
  2. Upon testing extraction...
  3. The Field Match extractor fails to produce a result. The result "Bad Name" is not returned by the extractor.
  4. The borrower's name on the document is "Cindi Truwert and Audrey Feak"

Configuration

But how do you build a Field Match extractor? In this example, a Value Reader is configured to return the borrower's name on a Closing Disclosure form if it matches a Data Field returning the borrower's name from the native file's name, using the Field Match Extractor Type. This is the Value Reader using the Field Match extractor described in the section above.

  1. Select the "Value Reader" tab
  2. Field Match is selected as the Extractor.
    • The very first thing you want to do is reference the Data Field whose value you're matching on the document. To do this, expand the Extractor property and select the Field property.
  3. Using the dropdown menu, choose the Data Field you want to match.


As far as the example in the About section above goes, that's it. That's all we did to get the results seen in this example. If the value returned by the "Borrower From File Name" Data Field is found on the document, it will be returned. If not, no result will be returned.


However, let's look at a couple other things you may want to consider when using the Field Match extractor.

  1. Select the "Expressions" tab.
  2. The Value Pattern in this case is optional. This can act as a "fall back" pattern if the Data Field's result is not matched. Grooper will prioritize returning the Data Field's value if it matches it on the document. However, if it does not, and the regex entered in the Value Pattern does match something, the Value Pattern's result will be returned.
  3. The Parse Pattern will parse the Data Field's result using a regular expression pattern.
    • For example, the regex pattern here [A-Z][a-z]+ [A-Z][a-z]+ would cause the extractor to return "Eddie Kusick" instead of "Eddie Kusick and Ody Boeck".
    • CAUTION!!! Grooper's regex is case insensitive by default in most cases. Not so with the Parse Pattern. The Parse Pattern's regex is always case sensitive.
  4. The Prefix and Suffix Pattern can be used to anchor the result to another regex pattern (just like a Pattern Match extractor)
    • Here, we really want to make sure the name in the file name matches the borrower's name. The pattern Borrower\s will cause the Field Match extractor to only return the Data Field's value if it is preceded by this Prefix Pattern.
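
The way these optional patterns combine can be sketched in Python. This is a simplified model built from the descriptions above, not Grooper's implementation: the function name field_match and its parameters are hypothetical, the order in which the Parse Pattern and fallback are applied is an assumption, and the sample document text is invented for illustration.

```python
import re


def field_match(text, field_value, parse_pattern=None,
                prefix_pattern="", suffix_pattern="", value_pattern=None):
    """Find a Data Field's value in document text, with an optional fallback."""
    target = field_value
    if parse_pattern:
        # The Parse Pattern is always case sensitive, so no IGNORECASE flag here.
        parsed = re.search(parse_pattern, field_value)
        target = parsed.group(0) if parsed else None
    if target:
        # The field's value acts as the "pattern", matched literally.
        hit = re.search(
            f"(?:{prefix_pattern})(?P<value>{re.escape(target)})(?:{suffix_pattern})",
            text, re.IGNORECASE)
        if hit:
            return hit.group("value")
    if value_pattern:
        # ASSUMPTION: the fallback Value Pattern uses the same prefix and suffix.
        fallback = re.search(
            f"(?:{prefix_pattern})(?P<value>{value_pattern})(?:{suffix_pattern})",
            text, re.IGNORECASE)
        if fallback:
            return fallback.group("value")
    return None


# Hypothetical document text.
document = "Borrower Eddie Kusick and Ody Boeck\nLender Ficus Bank"

print(field_match(document, "Eddie Kusick and Ody Boeck",
                  prefix_pattern=r"Borrower\s"))
# Eddie Kusick and Ody Boeck
print(field_match(document, "Eddie Kusick and Ody Boeck",
                  parse_pattern=r"[A-Z][a-z]+ [A-Z][a-z]+",
                  prefix_pattern=r"Borrower\s"))
# Eddie Kusick
print(field_match(document, "Bad Name", prefix_pattern=r"Borrower\s"))
# None
```

When no result comes back, as with "Bad Name", a Data Field with Required set to True would be flagged, which is what makes this pattern useful for validation.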

Remember, there's a specific order of operations that needs to happen before the Field Match extractor can return a result.

First, the Data Field supplying the result to be matched needs to execute. The Field Match extractor can't match the Data Field's value if it hasn't been found yet.

If we attempted to test extraction using the "Test Single" button, we would not return a result. We'd have to go to the Data Model to test this extractor. That way the "Borrower From File Name" Data Field will execute, returning a result. Then, the "File Name Borrower Validation" Data Field, using this Field Match extractor will be able to match the result.

Furthermore, Data Fields in a Data Model execute sequentially. You want to make sure the Data Field referenced by the Field Match extractor is listed before the Data Field executing the Field Match extractor.

LLM Extractors

Ask AI

Ask AI is an Extractor Type that executes a chat completion using a large language model (LLM), such as OpenAI's GPT models. It uses a document's text content and user-defined instructions (a question about the document) in the chat prompt. Ask AI then returns the response as the extractor's result. Ask AI is a powerful, LLM based extraction method, that can be used anywhere in Grooper an Extractor Type is referenced. It can complete a wide array of tasks in Grooper with simple text prompts.

The general idea behind Ask AI is simple: prompt an LLM chatbot with a question about a document to return a result. With appropriate instructions, Ask AI can even parse JSON responses into a Data Model's instance hierarchy. For example, you can use Ask AI as a Row Match Table Extract Method.

Ask AI Pros and Cons

Pros

  • Returns data using a natural language prompt.
  • Less knowledge of Grooper extractors required to return data
  • Quicker time to value.
  • Easier to maintain over time.

Cons

  • LLM responses can be unpredictable.
  • LLM responses can be inaccurate.
  • LLMs answer by predicting the next best word one after another. They do not “know” anything.
  • As an extractor, “Ask AI” must be configured “per field”.
  • An API call is made every time the extractor executes.
    • This is very important to consider. Limiting the text sent to the LLM whenever possible should be a consideration. The Context Extractor property can play a big role in helping with this.
  • No result highlighting.

Properties

Model
The API Key you use will determine which GPT models are available to you. The different GPT models can affect the text generated based on their size, training data, capabilities, prompt engineering, and fine-tuning potential. GPT-3's larger size and training data, in particular, can potentially result in more sophisticated, diverse, and contextually appropriate text compared to GPT-2. However, the actual performance and quality of the generated text also depend on various other factors, such as prompt engineering, input provided, and specific use case requirements. GPT-4o is the latest version, as of this writing, and takes the GPT model even further.

Parameters
Please see the Parameters article for more information.

Instructions
The instructions or question to include in the prompt. The prompt sent to OpenAI consists of text content from the document, which provides context, plus the text entered here. This property should ask a question about the content or provide instructions for generating output. For example, "what is the effective date?", "summarize this document", or "Your task is to generate a comma-separated list of assignors".

Preprocessing
Please visit the Preprocessing article for more information.

Context Extractor
An optional extractor which filters the document content included in the prompt. All Value Extractor types are available.

Max Response Length
The maximum length of the output, in tokens. 1 token is equivalent to approximately 4 characters for English text. Increasing this value decreases the maximum size of the context.
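
The trade-off between response length and context size can be worked through with the 4-characters-per-token rule of thumb. The sketch below is a rough estimate only: real tokenizers vary, the function names are hypothetical, and the 8,192-token context window is an assumed figure for illustration, not a value taken from any particular model.

```python
import math

CHARS_PER_TOKEN = 4  # rule of thumb for English text


def estimate_tokens(text):
    """Approximate token count from character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def remaining_context_tokens(context_window, max_response_length, instructions):
    """Tokens left for document text after reserving the response and instructions."""
    return context_window - max_response_length - estimate_tokens(instructions)


instructions = "what is the effective date?"
# ASSUMPTION: a hypothetical 8,192-token context window.
print(estimate_tokens(instructions))                       # 7
print(remaining_context_tokens(8192, 500, instructions))   # 7685
print(remaining_context_tokens(8192, 2000, instructions))  # 6185
```

Raising the reserved response length from 500 to 2,000 tokens leaves roughly 1,500 fewer tokens, or about 6,000 characters, for document text.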

Parse JSON Response
If this property is enabled, JSON returned in the response will be parsed into a Data Instance hierarchy.

Use this mechanism to capture complex data and generate output instances with named children, producing output similar to using named regex groups with Pattern Match, or using Ordered Array collation. This type of Data Instance hierarchy can be consumed by the Row Match Table Extract Method or the Simple Section Extract Method.

When this property is enabled, the Instructions should ask the AI to respond with JSON, and provide instructions and examples as needed to ensure the AI understands the desired JSON format. The JSON may contain a single JSON object or an array of JSON objects.

If a single JSON object is returned, a single output instance will be generated, containing one named child for each property of the JSON object. This type of output would be appropriate for capturing a single-instance Data Section.

If a JSON array is returned, one output instance will be generated for each object in the array. Each output instance will have named children reflecting the properties of the JSON object. This type of output is appropriate for capturing a Data Table or multi-instance Data Section.

The AI must respond with JSON only, or with the JSON delimited using the prefix and suffix shown below. OpenAI models are typically trained for this out of the box, but models trained by other organizations may require special instructions.
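
The object-versus-array behavior described above can be sketched in a few lines of Python. This is only an illustration of the mapping, not Grooper's implementation: the function name is hypothetical, plain dictionaries stand in for Data Instances with named children, and the response is assumed to contain JSON only.

```python
import json
from typing import Any, Dict, List


def parse_instances(response_text: str) -> List[Dict[str, Any]]:
    """Return one dict of named children per output instance."""
    data = json.loads(response_text)
    if isinstance(data, dict):
        # A single JSON object yields a single output instance.
        return [data]
    if isinstance(data, list):
        # A JSON array yields one output instance per object in the array.
        return [item for item in data if isinstance(item, dict)]
    raise ValueError("expected a JSON object or an array of JSON objects")


if __name__ == "__main__":
    single = '{"Assignor": "Jane Doe", "EffectiveDate": "01/12/2020"}'
    table = '[{"Item": "Widget", "Qty": 2}, {"Item": "Gadget", "Qty": 5}]'
    print(parse_instances(single))
    print(parse_instances(table))
```

The first call models a single-instance Data Section, while the second models the rows of a Data Table or a multi-instance Data Section.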

OMR Extractors

OMR stands for "optical mark recognition". Many structured forms utilize checkboxes in order to detail information. Filling out a form is much quicker if you can just check a box to indicate a choice from a list of options rather than printing a response. It also makes it easy on you if you're presented the list of possible options, even if that is as simple as checking a box next to "Yes" or "No".

However, checkboxes don't translate to a text character when recognizing a document's text data through OCR or native text extraction. In order for Grooper to understand if a checkbox is checked, it must digitally recognize the box and its "check state", checked or unchecked. OMR is this digital process of recognizing checkboxes and their check states.

In previous versions of Grooper (pre 2021), checkboxes and their check states were only determined from layout data detected by a Box Detection or Box Removal IP Command and saved to the Batch Page (or in some cases Batch Folder) of a document. Layout data collection is still important for the three OMR extractors.

Improvements have been made to the Labeled OMR extractor allowing it to return labels next to checkboxes without first obtaining the document's layout data. However, it will still use detected checkbox information in the layout data, if present. In most cases, this results in Labeled OMR being the simplest and most effective OMR extractor. Furthermore, it is the only OMR extractor that is capable of extracting radio buttons and their "press state" as checkboxes (Radio buttons are not detectable by Box Detection or Box Removal as they are not boxes).

Ordered OMR and Zonal OMR exist as options where Labeled OMR fails to produce the desired result. However, they typically require more configuration and checkbox information from the layout data is required in order to return a result.

Obtaining Layout Data

Layout data is visual information on a document obtained during an image processing operation. This includes checkboxes and their check states, line locations, and barcodes. Image processing in Grooper is primarily performed to clean up a document in order to obtain better OCR results from printed text, but it's also used to obtain this layout data. This is controlled by creating an IP Profile, which is a step by step list of IP Commands, each one performing a different image processing operation. This includes layout data collecting IP Commands, such as Box Detection and Box Removal.

The IP Profile can be executed permanently, affecting the archival export of the document, or temporarily, reverting back to the original image after OCR is performed.

  • For permanent image processing, the IP Profile is executed by the Image Processing activity.
  • For temporary image processing, the IP Profile is executed by the Recognize activity.
    • The Recognize activity obtains a document's text data via OCR for scanned or image-based documents or native text extraction for digital documents with native machine readable text already present.
      • For scanned or image-based documents, the IP Profile is referenced in the OCR Profile used.
      • For digital documents, you aren't performing OCR. Machine readable text is already present as part of the document's content and Recognize extracts that native digital text. However, the Alternate IP property can be used to reference an IP Profile containing layout data collecting IP Commands.

In either case, if the IP Profile contains layout data collecting IP Commands, layout data will be stored on the processed Batch Page object's "Grooper.Layout.json" file. For example, Box Detection and Box Removal will store checkbox locations and their check states (either checked or unchecked). The OMR extractors will then use this layout data to return labels next to checked boxes.


  1. Here, we have created an IP Profile named "Layout Data".
  2. This IP Profile's second step uses the Box Removal IP Command.
    • Note: Whether using Box Detection or Box Removal, Grooper will detect checkboxes and store their locations and check states (either checked or unchecked). Box Removal will digitally remove the checkbox from the page's image whereas Box Detection will not.
  3. The box detection settings are determined by the General Settings.
  4. In the "Image Diagnostics" panel, the Execution Log of the Box Removal folder will show you the results of the Box Removal command.
  5. Here, you can see all the detected checkboxes, their locations, and their check states.
  6. The Boxes diagnostic image will show you visually where checkboxes are on a page. Detected unchecked boxes will be highlighted in red. Checked boxes are highlighted in green.


FYI: Layout Data Verification

Most often, a Box Removal command is executed by a temporary IP Profile during the Recognize activity. However, whether it is executed during Image Processing or Recognize, the checkbox information is saved to a Batch Page object's "Grooper.Layout.json" file.

You can verify this with the "Files" tab of the "Advanced" tab when selecting a Batch Page or Batch Folder processed with a Box Detection command.

  1. Select the processed Batch Page in the node tree.
  2. Navigate to the "Advanced" tab.
  3. Navigate to the "Files" tab.
  4. Select the "Grooper.Layout.json" file.
    • If this file is not present, either no layout data was detected by the steps in an IP Profile or the IP Profile was not executed yet.


  1. The detected checkbox information is stored in the JSON file in a way Grooper can quickly access.

Labeled OMR

The Labeled OMR extractor is designed to be the easiest OMR extractor to set up. In most cases, it's as simple as defining the list of labels next to the checkboxes and determining whether multiple boxes can be checked, just one can be checked, or a single checked box evaluates to a binary true/false value. The label (or labels) next to the checked boxes are then returned.

The Labeled OMR extractor has just two properties necessary for its configuration: Label Extractor and Mode.

The Label Extractor serves to return any labels next to a checkbox. In many cases, this is as simple as using a List Match extractor and entering a list of the text labels on the document.

The Mode property corresponds with how the checkboxes behave on the document. Can just one checkbox out of many be checked? Can multiple? Is there just one checkbox where it being checked means one thing and unchecked another? This can be one of three options: CheckOne, CheckMulti or Boolean

  • CheckOne will target multiple checkboxes but presumes only one may be checked. This is for when documents present a list of options, only one of which may be chosen.
  • CheckMulti will target multiple checkboxes and presumes any number of them may be checked. This is for when documents present a list of options and any number of them can be chosen.
    • All labels are returned as a single concatenated result. These results may be separated by a Separator String. For example, a comma (,) could be used to create a comma-separated list of checked values.
  • Boolean targets only a single checkbox, presuming the checkbox represents a Boolean "true or false" answer. This is for when documents present a single checkbox where checking the box indicates one thing and leaving it unchecked means another.
    • By default the result will evaluate to either "True" or "False" but this can be altered using the Value If Checked and Value If Unchecked properties.
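
To make the CheckMulti and Boolean behaviors concrete, here is a minimal sketch. It is not Grooper's code: the (label, is_checked) tuples and the function names are hypothetical, and the keyword arguments merely mirror the Separator String, Value If Checked, and Value If Unchecked properties described above.

```python
from typing import List, Tuple

Box = Tuple[str, bool]  # (label, is_checked); a hypothetical shape for this sketch


def check_multi(boxes: List[Box], separator: str = ",") -> str:
    """Join the labels of every checked box into one concatenated result."""
    return separator.join(label for label, is_checked in boxes if is_checked)


def boolean(box: Box, value_if_checked: str = "True", value_if_unchecked: str = "False") -> str:
    """Map a single checkbox's state to one of two output values."""
    _, is_checked = box
    return value_if_checked if is_checked else value_if_unchecked


if __name__ == "__main__":
    options = [("Property Taxes", True), ("Homeowner's Insurance", False), ("Other", True)]
    print(check_multi(options, separator=", "))
    print(boolean(("Escrow waived", False)))
    print(boolean(("Escrow waived", True), "Yes", "No"))
```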


In this example, a Value Reader is configured to return the type of loan applied for on a "Closing Disclosure" form, using the Labeled OMR Extractor Type.

  1. Labeled OMR is selected as the Extractor Type.
  2. The Label Extractor is configured to return the checkbox labels.
    • Here, we used a simple List Match extractor with our three loan types in its Local Entries list (We're ignoring the fill-in "Other" option for the sake of simplifying this example).
      Conventional
      FHA
      VA
    • List Match will be the most common extractor configuration. However, you have full access to every different extractor type. You could use Pattern Match or reference a Data Type, whatever works for your document set. The important thing to keep in mind is you need to return a single result for each individual label. Here, we have three checkbox labels. The Label Extractor thus returns three results, one for each label.
  3. Configure the Mode property according to how the checkboxes behave on the document.
    • Here, there can only be one type of loan for each individual Closing Disclosure. You'd never have a home loan that is both a conventional loan and an FHA loan. So, it is set to CheckOne.
  4. In the "Document Viewer", labels are outlined in blue and detected checkboxes are highlighted in green.
  5. The label next to the checked box is returned.
    • Technically, for CheckOne mode, all these labels are returned, but the checked label is returned at 100% confidence where the unchecked ones are returned at 0%. This extractor will always return the most confident value first which is ultimately what we want.
    • Note: If multiple boxes are checked with CheckOne selected, all labels will return with 0% confidence.
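
The CheckOne confidence behavior described in step 5 can be sketched as follows. This is an illustration under simplifying assumptions, not Grooper's actual scoring logic: confidences are represented as 1.0 and 0.0, and the input is a hypothetical list of (label, is_checked) tuples.

```python
from typing import List, Tuple


def check_one(boxes: List[Tuple[str, bool]]) -> List[Tuple[str, float]]:
    """Return (label, confidence) pairs, most confident first.

    Exactly one checked box gets full confidence. If more than one box is
    checked, every label is returned with zero confidence.
    """
    checked_count = sum(1 for _, is_checked in boxes if is_checked)
    results = [
        (label, 1.0 if (is_checked and checked_count == 1) else 0.0)
        for label, is_checked in boxes
    ]
    # sorted() is stable, so labels with equal confidence keep their original order.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    loan_types = [("Conventional", False), ("FHA", True), ("VA", False)]
    print(check_one(loan_types)[0])
```

Taking the first result models how the extractor returns its most confident value first.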


Layout Data and Labeled OMR

Labeled OMR is unique among the OMR extractors in that Box Detection layout data is not necessary in order for it to function. It will use that data if present, but if it is not, it will still attempt to determine whether a checkbox is next to a label.

This is exceptionally useful for non-standard checkboxes that are difficult or impossible for Box Detection to detect, such as radio buttons. Radio buttons are just circles. If the button is checked, it has a dot inside it, but otherwise it is just a circle. Box Detection only detects boxes (squares and rectangles), so a radio button's location and check state will never be stored in a document's layout data file.

Not to fret! New improvements to the Labeled OMR extractor in version 2021 allow it to detect radio buttons and other non-standard checkboxes.


Here, the Labeled OMR extractor is unchanged from the example described above.

  1. However, this document uses radio buttons instead of checkboxes to detail the Closing Disclosure's loan type.
  2. In both cases, the correctly checked label is returned.

Ordered OMR

The Ordered OMR extractor is a little more complicated to set up than Labeled OMR but can be used in cases where Labeled OMR is not producing the desired results. Furthermore, labels are not even required to be present at all (but can be optionally helpful as "anchors", positioning where the checkboxes are on the page). This extractor can prove very useful when you have structured forms with OMR checkboxes whose labels are not easily matched due to poor OCR.

However, checkbox data must be present in the document's layout data. Ordered OMR will not function without this data present before executing. Furthermore, Ordered OMR assumes the checkboxes will be ordered one after the other either vertically or horizontally along a single line. For documents with sections of checkboxes broken up into multiple columns (for vertically ordered checkboxes) or multiple lines (for horizontally ordered checkboxes), multiple Ordered OMR extractors may be necessary (one for each column or line of checkboxes).

For Ordered OMR you will indicate where boxes are on a document by drawing a rectangular zone around the checkboxes. All checkboxes must fall within the drawn zone (This distinguishes Ordered OMR from Zonal OMR. For Zonal OMR, a single zone is drawn for each individual checkbox). This zone is configured using the Location property. This can be one of four options:

  1. Fixed Region - This option is the simplest to set up. As the name implies, the rectangular zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box around the checkboxes.
  2. Relative Region - Instead of setting the zone in a fixed location for every document, the Relative Region mode will anchor the zone to a text label on the document. The zone's position will change relative to the label's position on the document, but will still have the same drawn dimensions.
    • This option is useful to overcome issues arising during scanning printed documents. Slight variations can occur as to where a value is when printing or scanning a document, even for very structured documents. This can cause problems when drawing a single fixed region for the zone. However, if you can anchor the zone off an extractable text value, the zone's position will shift according to that anchor's position.
  3. Text Region - The Text Region option creates a rectangular zone using the logical boundaries of an extraction result. This can create the zone within the boundaries of the extractor's result.
    • This can also be configured to provide results in a similar way the Relative Region option does, using text anchors located by an extractor to position the extraction zone's location. This means both methods can be used to position the zone relative to a point from document to document. The main difference is in how the zone is drawn.
  4. Shape Region - The Shape Region option is extremely similar to the Text Region option. However, instead of using text to anchor the extraction zone, it uses a shape detected from a Shape Detection or Shape Removal IP Command.
    • This is the least common method used.
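
The idea behind anchoring a zone to a label can be shown with simple coordinate arithmetic. The sketch below is an assumption-laden illustration rather than Grooper's implementation: the Rect type, the function name, and the use of the anchor's top-left corner as the reference point are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    width: float
    height: float


def relative_zone(design_anchor: Rect, design_zone: Rect, found_anchor: Rect) -> Rect:
    """Shift the drawn zone by however far the anchor moved on this document.

    The zone keeps its drawn dimensions. Only its position changes.
    """
    dx = found_anchor.left - design_anchor.left
    dy = found_anchor.top - design_anchor.top
    return Rect(design_zone.left + dx, design_zone.top + dy,
                design_zone.width, design_zone.height)


if __name__ == "__main__":
    # Units are arbitrary (inches here). The label was drawn at (1.0, 2.0) at design time.
    anchor_at_design = Rect(1.0, 2.0, 1.5, 0.25)
    zone_at_design = Rect(1.0, 2.5, 3.0, 1.0)
    # On a scanned copy, the label was found a quarter inch to the right.
    anchor_on_scan = Rect(1.25, 2.0, 1.5, 0.25)
    print(relative_zone(anchor_at_design, zone_at_design, anchor_on_scan))
```

A Fixed Region corresponds to skipping this shift entirely and using the drawn coordinates on every document.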

Ordered OMR must also have its Mode property configured. This property behaves the same as Labeled OMR. It determines how many checkboxes should be checked for the checkboxes falling within the rectangular zone. This can be one of three options:

  • CheckOne will target multiple checkboxes but presumes only one may be checked. This is for when documents present a list of options, only one of which may be chosen.
  • CheckMulti will target multiple checkboxes and presumes any number of them may be checked. This is for when documents present a list of options and any number of them can be chosen.
    • All labels are returned as a single comma-separated result.
  • Boolean targets only a single checkbox, presuming the checkbox represents a Boolean "true or false" answer. This is for when documents present a single checkbox where checking the box indicates one thing and leaving it unchecked means another.

Ordered OMR is different from Labeled OMR in that the output values are not located by a Label Extractor. Instead, you will use the Output Values property to list each checkbox's corresponding value. Ordered OMR presumes the checkboxes are either stacked on top of each other in a single column or ordered next to each other across a single horizontal line. Using the Output Values property, you will enter a comma-separated list of the checkboxes' labels from top to bottom or left to right.

  • When selecting the Boolean Mode, only two values may be entered.

The Flow Direction property determines how they are ordered, either Vertical or Horizontal. Vertical is appropriate for boxes stacked on top of each other. Horizontal is appropriate for boxes next to each other along a horizontal line.
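
The pairing of ordered checkboxes with Output Values can be sketched like this. It is a simplified model, not Grooper's implementation: the (x, y, is_checked) tuples stand in for checkboxes found inside the zone, and the function name is hypothetical.

```python
from typing import List, Tuple

DetectedBox = Tuple[float, float, bool]  # (x, y, is_checked); hypothetical shape


def ordered_omr(boxes: List[DetectedBox], output_values: str,
                flow_direction: str = "Vertical") -> List[str]:
    """Pair each box with an output value by position and return the checked ones."""
    values = [value.strip() for value in output_values.split(",")]
    if flow_direction == "Vertical":
        ordered = sorted(boxes, key=lambda box: box[1])  # top to bottom
    elif flow_direction == "Horizontal":
        ordered = sorted(boxes, key=lambda box: box[0])  # left to right
    else:
        raise ValueError("flow_direction must be 'Vertical' or 'Horizontal'")
    return [value for value, box in zip(values, ordered) if box[2]]


if __name__ == "__main__":
    # Three stacked checkboxes, listed out of order on purpose.
    detected = [(1.0, 3.0, True), (1.0, 1.0, True), (1.0, 2.0, False)]
    print(ordered_omr(detected, "Property Taxes, Homeowner's Insurance, Other"))
```

Because values are assigned purely by position, a missed or extra detected box shifts every value after it, which is one reason the zone should contain only the intended checkboxes.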


In this example, a Value Reader is configured to return the "This estimate includes" options for a "Closing Disclosure" form, using the Ordered OMR Extractor Type.

  1. Ordered OMR is selected as the Extractor Type.
  2. For any Ordered OMR configuration you must configure the Location property. This determines where the zone is placed on each document. All checkboxes should fall in this zone.
    • In this case, we've selected Fixed Region and drawn a rectangle on the page. These forms are highly structured. We can assume the checkboxes will always fall within the same rectangular coordinates.


  1. You can see the drawn zone in green in the "Document Viewer" pane.
  2. Configure the Mode property according to how the checkboxes behave on the document.
    • Here, multiple boxes can be checked. The estimate could include property taxes, homeowner's insurance, other costs, or any combination of the three. So, it is set to CheckMulti
  3. Each label is entered as a comma-separated list in the Output Values property.
  4. Whether the checkboxes are ordered vertically or horizontally is set by the Flow Direction property.
    • Here, the checkboxes are stacked on top of each other. So, it is set to Vertical.

Zonal OMR

For Zonal OMR a rectangular zone must be drawn around the location of each individual checkbox (rather than a single zone for all checkboxes as is the case for the Ordered OMR extractor). This typically makes Zonal OMR the most time consuming of the OMR extractors as far as set up goes, but may be necessary to target checkboxes on forms whose order is not targetable by Ordered OMR (for example, checkboxes in non-standard orientations next to their labels) and/or whose labels cannot be extracted by Labeled OMR (for example, due to poor OCR).

Furthermore, (just like the Ordered OMR extractor) checkbox data must be present in the document's layout data. Zonal OMR will not function without this data present before executing.

Just like Labeled OMR and Ordered OMR, Zonal OMR must also have its Mode property configured. It determines how many checkboxes should be checked for the checkboxes falling within the rectangular zones. This can be one of three options:

  • CheckOne will target multiple checkboxes but presumes only one may be checked. This is for when documents present a list of options, only one of which may be chosen.
  • CheckMulti will target multiple checkboxes and presumes any number of them may be checked. This is for when documents present a list of options and any number of them can be chosen.
    • All labels are returned as a single comma-separated result.
  • Boolean targets only a single checkbox, presuming the checkbox represents a Boolean "true or false" answer. This is for when documents present a single checkbox where checking the box indicates one thing and leaving it unchecked means another.
    • By default the result will evaluate to either "True" or "False" but this can be altered using the Value If Checked and Value If Unchecked properties.

The Anchor property allows all drawn zones for each checkbox to be anchored to an extractible text result. This is similar to how the Relative Region Location option of Ordered OMR can anchor a zone to a relative location on the page rather than a fixed position that remains the same for each and every document.

  • This option is useful to overcome issues arising during scanning printed documents. Slight variations can occur as to where a value is when printing or scanning a document, even for very structured documents. This can cause problems when drawing a single fixed region for the zone. However, if you can anchor the zone off an extractable text value, the zone's position will shift according to that anchor's position.

The rectangular zones are drawn using the OMR Boxes property. This will bring up a collection editor to draw registration zones for each checkbox. Here, you will also enter the output result for each OMR zone. This editor also allows you to anchor zones to a text anchor for each individual zone, meaning you can anchor a single zone here (as well as anchoring the collection of OMR zones using the Anchor property described above).
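
Conceptually, Zonal OMR is a containment test between each drawn zone and the checkboxes found in the layout data. The sketch below illustrates that idea only. The OmrZone type, its field names, and the use of a checkbox's center point are assumptions for this example, not Grooper's data model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class OmrZone:
    value: str      # output result entered for this zone
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def zonal_omr(zones: List[OmrZone],
              detected_boxes: List[Tuple[float, float, bool]]) -> List[str]:
    """Return the value of every zone that contains a checked box.

    detected_boxes holds (center_x, center_y, is_checked) for each checkbox
    assumed to be present in the layout data.
    """
    return [
        zone.value
        for zone in zones
        if any(checked and zone.contains(x, y) for x, y, checked in detected_boxes)
    ]


if __name__ == "__main__":
    zones = [
        OmrZone("Conventional", 1.0, 1.0, 1.3, 1.3),
        OmrZone("FHA", 2.0, 1.0, 2.3, 1.3),
        OmrZone("VA", 3.0, 1.0, 3.3, 1.3),
        OmrZone("Other", 4.0, 1.0, 4.3, 1.3),
    ]
    layout = [(1.15, 1.15, False), (2.15, 1.15, False),
              (3.15, 1.15, False), (4.15, 1.15, True)]
    print(zonal_omr(zones, layout))
```

Note that the zone's Value is returned without reading any label text, which is why a fill-in option can be handled with a fixed output value.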


In this example, a Value Reader is configured to return the type of loan applied for on a "Closing Disclosure" form, using the Zonal OMR Extractor Type.

  1. Zonal OMR is selected as the Extractor Type.
  2. Configure the Mode property according to how the checkboxes behave on the document.
    • Here, there can only be one type of loan for each individual Closing Disclosure. So, it is set to CheckOne.
  3. The Anchor property is used to anchor the collection of OMR zones to a text result on the document.
    • Note: Using an Anchor is optional. In many cases, using one will increase the accuracy of your results, but it is not required. Furthermore, using the Anchor sub-properties, you can choose whether the anchor is required, that is, whether extraction still runs when the anchor fails to produce a result.
  4. We chose to anchor the collection of OMR zones to the text "Loan Type" in this case.
  5. One rectangular zone is drawn around each checkbox using the OMR Zones collection editor.


  1. Add one zone for each checkbox using the "Add" button.
  2. Use the Value property to enter the output result.
  3. Use the Bounds property to enter the zone's coordinates.
    • Select this property and press the ellipsis button at the end to lasso the zone with your cursor.

In this case, we have four checkboxes. So, we have added four items to our collection list here, one for each type of application.


Remember how we ignored the "Other" application type for our Labeled OMR example? The text entered for the "Other" application type could be anything under the sun, which makes it more difficult to extract a label.

  1. With Zonal OMR, we didn't have to enter the label. We just drew a zone around the corresponding checkbox.
  2. Its output result is what we entered in this OMR box's Value property, "Other".

Note: This doesn't mean you couldn't use Labeled OMR and still successfully extract the text label here (even though it could be anything!). It would just require a more complex extractor than we used for our earlier example.

Barcode Extractors

Both the Find Barcode and Read Barcode extractors will return a barcode's encoded value as its result. Barcodes encode information according to the "symbology" used. Grooper has the ability to detect and read 29 different symbologies, including Code128, Code39, US postal barcodes, QR codes, and UPC barcodes.

For both the Find Barcode and Read Barcode extractors, you must specify which barcode symbology you're looking for. The difference between the two Extractor Types is when Grooper's barcode detection runs: before the extractor executes (Find Barcode) or as part of the extractor's execution (Read Barcode).

  • Find Barcode - The barcode's value must be obtained by a Barcode Detection or Barcode Removal IP Command in an IP Profile before the extractor executes.
    • When the Find Barcode extractor executes on a document folder, it looks for the barcode value in the Batch Folder's layout data file or its child Batch Pages' layout data files. If present, the barcode's value is returned.
  • Read Barcode - This extractor executes barcode detection every time the extractor executes. This means a layout data file is not necessary for the extractor to return a barcode value.
    • No layout data required. No previous Barcode Detection or Barcode Removal IP Command is necessary.
    • The extractor is configured to locate a barcode in nearly the same way you configure the Barcode Detection IP Command to detect the barcode.

Your decision as to which one you choose to use will largely be based on when you want to expend the processing time required to detect the barcode. Read Barcode performs barcode detection when the extractor executes. For Find Barcode, the extractor presumes the barcode was already detected and present in the document folder's layout data. Layout data is stored in a file named "Grooper.Layout.json" when an IP Command such as Barcode Detection detects layout data on a Batch Page (or in certain cases a Batch Folder).

Find Barcode simply uses the "Grooper.Layout.json" file to return the barcode value. Barcode detection takes time. Reading a document's "Grooper.Layout.json" file might take 3 milliseconds. Detecting a barcode on a document takes significantly longer. Let's say 300 milliseconds. Find Barcode just finds the barcode value in the "Grooper.Layout.json" file, taking 3 ms to return a value. On the other hand, Read Barcode must first detect and read the barcode before returning the value, taking 300 ms. Therefore, Find Barcode executes faster at runtime than Read Barcode.

However, this presumes the barcode's value is already present in the "Grooper.Layout.json" file. So, you're still going to have to find the barcode at some point in the Batch Process. You just have to ask yourself if you want to spend that processing time during extraction or ahead of time in some kind of image processing operation (either the Image Processing activity for permanent image processing or the Recognize activity for temporary image processing).
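
The trade-off can be put into numbers using the illustrative 3 ms and 300 ms figures from above. These are made-up round numbers for the sake of argument, not measurements, and the function names are hypothetical.

```python
LAYOUT_READ_MS = 3    # illustrative time to read the layout data file
DETECTION_MS = 300    # illustrative time to detect a barcode on a page


def find_barcode_total_ms(extractions: int) -> int:
    """Detection runs once in an IP Profile. Each extraction only reads layout data."""
    return DETECTION_MS + extractions * LAYOUT_READ_MS


def read_barcode_total_ms(extractions: int) -> int:
    """Detection runs again every time the extractor executes."""
    return extractions * DETECTION_MS


if __name__ == "__main__":
    for runs in (1, 3):
        print(runs, find_barcode_total_ms(runs), read_barcode_total_ms(runs))
```

With a single execution, the total time is nearly the same either way and only the point in the Batch Process where it is spent differs. If the extractor executes more than once against the same document, the up-front detection is paid only once.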

Read Barcode

In this example, a Value Reader is configured to return the date encoded in a barcode, using the Read Barcode Extractor Type.

  1. Read Barcode is selected as the Extractor Type.
  2. For any configuration, you must define which barcode symbologies are used in your document. This is configured using the Detection Settings properties.
  3. You must choose which barcode reader is used to detect the barcode symbology. There are four Reader properties available: Standard Reader, 1D Reader, 2D Reader, and Postal Reader.
    • The 1D Reader, 2D Reader, and Postal Reader work faster and generally provide better results than the Standard Reader but are limited in the barcode symbologies they support.
    • In this case, the barcode uses the "Code 39" symbology, which is supported by the 1D Reader. Thus, the 1D Reader is Enabled.
  4. Use the Barcode Symbologies property to select which barcode symbology you wish to detect.
    • In this case, we have selected Code39.
    • Note: You may select multiple barcode symbologies. However, be careful as this can provide false positive results. Certain barcodes use only slightly different symbologies than others.
  5. Optionally, you may configure the Value Pattern property to write a regular expression pattern to validate the barcode's value.
    • In this case a simple regex matching a date format \d{1,2}/\d{1,2}/(\d{4}|\d{2}) is used to validate the returned value is a date.
    • Note: The regex pattern written here will not parse the value at all, just validate it. If the regex matches any portion of the barcode's value, the whole value is returned.
  6. When the Value Reader executes, the barcode is detected and its encoded value is returned.
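
The validate-but-do-not-parse behavior of Value Pattern described in step 5 can be mimicked with Python's re module. This is a sketch of the described behavior only. Grooper uses its own regular expression engine, and the function name here is hypothetical.

```python
import re
from typing import Optional

# The date pattern from the example above.
DATE_PATTERN = re.compile(r"\d{1,2}/\d{1,2}/(\d{4}|\d{2})")


def validate_barcode_value(value: str) -> Optional[str]:
    """Return the whole value if the pattern matches any portion of it, else None."""
    # search() succeeds on a partial match, and the value itself is returned
    # unchanged rather than the matched text.
    return value if DATE_PATTERN.search(value) else None


if __name__ == "__main__":
    print(validate_barcode_value("01/12/2020"))
    print(validate_barcode_value("DOC-01/12/2020-A"))
    print(validate_barcode_value("ABC123"))
```

The second call shows the point of the note in step 5: the match covers only part of the value, yet the entire value is what comes back.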

Find Barcode

Configuration Prereqs - Obtain Layout Data

In the case of Find Barcode, the barcode must previously be detected by a Barcode Detection or Barcode Removal IP Command with its value stored in the document's layout data.

  1. Here, we have created an IP Profile named "Layout Data".
  2. This IP Profile's first step uses the Barcode Removal IP Command.
    • Note: Whether using Barcode Detection or Barcode Removal, Grooper will detect the barcode and store its value. Barcode Removal will digitally remove the barcode from the page's image whereas Barcode Detection will not.

From this point, barcode detection is configured exactly the same way as seen in the Read Barcode example.

  1. For any configuration, you must define which barcode symbologies are used in your document. This is configured using the Detection Settings properties.
  2. You must choose which barcode reader is used to detect the barcode symbology. There are four Reader properties available: Standard Reader, 1D Reader, 2D Reader, and Postal Reader.
    • The 1D Reader, 2D Reader, and Postal Reader work faster and generally provide better results than the Standard Reader but are limited in the barcode symbologies they support.
    • In this case, the barcode uses the "Code 39" symbology, which is supported by the 1D Reader. Thus, the 1D Reader is Enabled.
  3. Use the Barcode Symbologies property to select which barcode symbology you wish to detect.
    • In this case, we have selected Code39.
    • Note: You may select multiple barcode symbologies. However, be careful as this can provide false positive results. Certain barcodes use only slightly different symbologies than others.
  4. In the "Image Diagnostics" panel, the Execution Log of the Barcode Detection folder will show you the results of the Barcode Removal command.
  5. Here, you can see a "Code39" barcode was found, its positional boundaries, and its encoded value ("01/12/2020").


Before executing the Find Barcode extractor, the IP Profile with the Barcode Removal or Barcode Detection IP Command must be executed to save the barcode value in the document's layout data.

An IP Profile can be executed in one of two ways:

  1. During the Image Processing activity for permanent image processing.
  2. During the Recognize activity for temporary image processing.
    • For OCR of image-based documents, the IP Profile is assigned to the OCR Profile assigned to the Recognize activity.
    • For native text extraction of digital documents, an IP Profile can be assigned to the Alternate IP property of the Recognize activity.

FYI: Layout Data Verification

Most often, a Barcode Removal command is executed by a temporary IP Profile during the Recognize activity. However, whether it is executed during Image Processing or Recognize, the barcode value is saved to a Batch Page object's "Grooper.Layout.json" file.

You can verify this with the "Files" tab of the "Advanced" tab when selecting a Batch Page or Batch Folder processed with a Barcode Detection command.

  1. Select the processed Batch Page in the node tree.
  2. Navigate to the "Advanced" tab.
  3. Navigate to the "Files" tab.
  4. Select the "Grooper.Layout.json" file.
    • If this file is not present, either no layout data was detected by the steps in an IP Profile or the IP Profile was not executed yet.


  1. The detected barcode information is stored in the JSON file in a way Grooper can quickly access.

Find Barcode

In this example, a Value Reader is configured to return the date encoded in a barcode, using the Find Barcode Extractor Type. This document's layout data was previously obtained with the IP Profile described above during the Recognize activity.

  1. Find Barcode is selected as the Extractor Type.
  2. Since the barcode's value is already stored in the document's layout data file, all you need to do is define what barcode symbology you're looking for.
    • In this case, we're looking for a Code39 barcode value.
    • Note: You may also choose All to return any and all barcode values in the "Grooper.Layout.json" file.


  1. Because the barcode was detected before the extractor executes, it runs very quickly.
  2. Rather than digitally scanning the page for a barcode, it simply returns the already obtained information stored in the layout data file.

Zonal Extractors

Read Zone

The Read Zone extractor allows you to extract text data in a rectangular region (called an "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text from the same location on every document, or a zone positioned relative to an extracted text anchor or shape location on the document.

Read Zone is useful for extracting data from highly structured documents. If a document's structure is fixed, it's going to have the same fields in the same physical location from one document to the next. The Closing Disclosure forms we've been looking at in this article are themselves fairly fixed. For example, the "Loan Amount" listed on the first page is more or less in the same spot for every single Closing Disclosure. The dollar amount itself may change, but there's only so much room that amount can take up on the document.

If you can draw a rectangle around the value you want to extract, and the value falls within the boundaries of that rectangle for every single document, extraction may be as simple as just extracting the text in the rectangle's location. This is referred to as "zonal extraction". You draw a zone where the value exists on the page and return the text data falling in the zone.
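
As a rough illustration of zonal extraction, the Python sketch below returns the text of every word whose bounding box falls inside a drawn rectangle. The `Word` and `Zone` classes, the pixel coordinates, and the rule that a word must sit fully inside the zone are assumptions made for this example, not a description of how Grooper implements Read Zone.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    left: int
    top: int
    right: int
    bottom: int

@dataclass
class Zone:
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, word):
        """True when the word's bounding box lies fully inside the zone."""
        return (self.left <= word.left and word.right <= self.right
                and self.top <= word.top and word.bottom <= self.bottom)

def read_zone(words, zone):
    """Join the text of every word inside the zone, in reading order."""
    inside = [w for w in words if zone.contains(w)]
    inside.sort(key=lambda w: (w.top, w.left))
    return " ".join(w.text for w in inside)

# Made-up word positions (in pixels) for one line of a page.
words = [
    Word("Loan", 150, 600, 270, 660),
    Word("Amount", 300, 600, 480, 660),
    Word("$", 620, 600, 650, 660),
    Word("159,432.62", 665, 600, 900, 660),
]
loan_amount_zone = Zone(600, 570, 1050, 690)
result = read_zone(words, loan_amount_zone)
```

Here `result` contains only the dollar amount, because the label words start to the left of the zone.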

Read Zone has a few different options for where the box is placed using the Location property. This can be one of four options:

  1. Fixed Region - This option is the simplest to set up. As the name implies, the extraction zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box where you want to extract data.
  2. Relative Region - Instead of setting the extraction zone in a fixed location for every document, the Relative Region mode will anchor the zone to a text label on the document. The extraction zone's position will change relative to the label's position on the document, but will still have the same drawn dimensions.
    • This option is useful to overcome issues arising during scanning printed documents. Slight variations can occur as to where a value is when printing or scanning a document, even for very structured documents. This can cause problems when drawing a single fixed region for the extraction zone. However, if you can anchor the zone off an extractable text value, the zone's position will shift according to that anchor's position.
  3. Text Region - The Text Region option creates an extraction zone using the logical boundaries of an extraction result. This can return all the text falling within the boundaries of the rectangle around the extractor's result.
    • This can also be configured to position the zone much as the Relative Region option does, using text anchors located by an extractor. Both methods can therefore place the zone relative to a point that moves from document to document. The main difference is in how the zone is drawn.
  4. Shape Region - The Shape Region option is extremely similar to the Text Region option. However, instead of using text to anchor the extraction zone, it uses a shape detected from a Shape Detection or Shape Removal IP Command.
    • This is the least common method used.
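
The difference between a fixed and an anchored zone comes down to simple coordinate arithmetic. The sketch below assumes a hypothetical `relative_zone` helper and made-up pixel coordinates. The zone keeps its drawn size and its offset from the anchor, so it moves when the anchor label moves.

```python
def relative_zone(anchor, offset, size):
    """Position a zone of fixed size relative to an anchor's top-left corner.

    anchor: (left, top) of the located text label
    offset: (dx, dy) from the anchor to the zone's top-left corner
    size:   (width, height) of the drawn zone
    Returns (left, top, right, bottom).
    """
    left = anchor[0] + offset[0]
    top = anchor[1] + offset[1]
    return (left, top, left + size[0], top + size[1])

offset = (450, -30)   # where the zone was drawn relative to the label
size = (450, 120)     # drawn dimensions never change

zone_a = relative_zone((150, 600), offset, size)  # label found at (150, 600)
zone_b = relative_zone((162, 618), offset, size)  # same form, slightly shifted scan
```

In this example the second scan is shifted 12 pixels right and 18 pixels down, and the zone shifts by the same amount while keeping its dimensions.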

The Read Zone extractor can optionally re-process the text data with an OCR Profile. This can be used to perform custom OCR on the extracted text.

The text in the zone can also be itself extracted by a Value Extractor. This allows you to break up the document into a smaller portion and run an extractor on just the zone instead of the full document. Essentially, you use the Read Zone extractor to create a smaller data instance (from the larger document data instance) and use its Value Extractor property to return data from the smaller data instance.
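
To illustrate why narrowing the search to a zone helps, the sketch below runs the same regular expression against a full page's text and against a zone's text. The sample strings, the currency pattern, and the `extract_from_zone` helper are assumptions for this example only.

```python
import re

# A simple currency pattern written for this example.
CURRENCY = r"\$\s?\d{1,3}(?:,\d{3})*\.\d{2}"

def extract_from_zone(zone_text, pattern):
    """Run a regular expression against the zone's text only."""
    return [m.group(0) for m in re.finditer(pattern, zone_text)]

full_text = "Loan Amount $ 159,432.62 Closing Costs $8,054.61"
zone_text = "$ 159,432.62"  # text that fell inside the drawn zone

full_matches = extract_from_zone(full_text, CURRENCY)  # two candidates
zone_matches = extract_from_zone(zone_text, CURRENCY)  # one candidate
```

Against the full text the pattern returns two amounts that would still need to be told apart. Against the zone's smaller data instance it returns only the one in the zone.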


In this example, a Value Reader is configured to return the "Loan Amount" value as described on the first page of a Closing Disclosure form, using the Read Zone Extractor Type.

  1. Read Zone is selected as the Extractor Type.
  2. For any Read Zone configuration you must configure the Location property. This determines where the extraction zone is placed on each document.
    • In this case, we configured the Relative Region option. We're using the text label "Loan Amount" as the anchor for the drawn extraction zone. You fully configure whichever Location mode you choose by expanding and configuring its sub-properties.
      • The extracted anchor is seen in the "Document Viewer" outlined in blue.
      • The extraction zone is seen highlighted in green. Any text falling within that green box is returned as the result.
  3. The Output Full Region property is very handy. It doesn't change the result at all. It just shows the full size of the drawn zone on the page, which is extremely useful when testing and configuring the Read Zone extractor.
  4. If Output Full Region were set to False, only the text ($ 159,432.62) would be highlighted, not the full drawn zone seen here.

Highlight Zone

The Highlight Zone extractor is unique in that it doesn't actually extract anything at all!

So, why use it? Highlight Zone can be useful for quickly calling attention to Data Fields requiring manual validation during a Data Review activity. For example, handwritten fields on a form are unlikely to be recognized by OCR. OCR is designed to read machine printed characters. While there are some recent advancements in handwriting detection (Microsoft Azure's OCR service is particularly adept at recognizing handwritten characters), OCR engines tend to fail at recognizing handwriting. In the case of fields written in by hand, you will likely need a human being to enter that information during Data Review.

However, if those fields are in the same spots on the document, you can use Highlight Zone to draw the data entry clerk's attention to its location on the page, saving them time and you money.

The highlighted zones are drawn using the exact same Location methods and property configurations as Read Zone.


In this example, a Value Reader is configured to highlight a signature field, using the Highlight Zone Extractor Type.

  1. Highlight Zone is selected as the Extractor Type.
  2. Just like with Read Zone, you must configure the Location property. This determines where the highlighted zone is placed on each document.
  3. In this case, we used Fixed Region. The drawn rectangle has the same coordinates (drawn using the Bounds property) for every single document.
  4. We also put a Page Filter on the zone. Since the signature page is always on page 5 of the document, we set it to 5.
  5. This could be used to draw a data entry clerk's attention to the signature page to quickly verify if the document is signed.

Detect Signature

While we're on the topic of signatures, there is a type of zonal extractor specifically designed to detect if a signature is present or not. This is the Detect Signature Extractor Type option. It's very similar to the Read Zone extractor in that you use one of the four Location options (Fixed Region, Relative Region, Shape Region or Text Region) to draw an extraction zone on a geographic region of the page.

However, rather than returning the OCR or native text data within the zone, an OMR-style extraction is performed. Think about a signature line. If you drew a box around where you expect someone to sign, nothing would be in the box if it was not signed. But whatever the signature looks like, some of the box would be filled in if it were signed.

The same basic concept applies for the Detect Signature extractor. The extractor counts the black pixels in the extraction zone and calculates what percentage of the zone they fill. If that percentage meets or exceeds a configured threshold, the extractor returns a value of "Signed". If it falls below the threshold, it returns a value of "Not Signed".
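
The pixel-count logic can be sketched as follows. This is a minimal illustration, not Grooper's implementation. The zone is modeled as a small grid of 1s (black) and 0s (white), the function and parameter names are made up to echo the property names, and the 25% threshold is simply an illustrative setting.

```python
def fill_percentage(zone_pixels):
    """Percentage of black pixels (1 = black, 0 = white) in the zone."""
    total = sum(len(row) for row in zone_pixels)
    black = sum(sum(row) for row in zone_pixels)
    return 100.0 * black / total if total else 0.0

def detect_signature(zone_pixels, threshold=25.0,
                     value_if_filled="Signed",
                     value_if_not_filled="Not Signed"):
    """Return one of two values depending on how full the zone is."""
    if fill_percentage(zone_pixels) >= threshold:
        return value_if_filled
    return value_if_not_filled

signed_zone = [
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
]  # 5 of 12 pixels are black
blank_zone = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]  # 1 of 12 pixels is black (a speck of noise)

signed_result = detect_signature(signed_zone)
blank_result = detect_signature(blank_zone)
```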


In this example, a Value Reader is configured to return whether or not the "Applicant Signature" is present on the Closing Disclosure form, using the Detect Signature Extractor Type.

  1. Detect Signature is selected as the Extractor Type.
  2. For any Detect Signature configuration you must configure the Location property. This determines where the extraction zone is placed on each document.
    • In this case, we configured the Fixed Region option, just like we did in the previous example for the Highlight Zone extractor.
  3. Optionally, you may pre-process the image with an IP Profile using the IP Profile property.
    • This can be useful to aid the signature detection process.
    • Many signatures may be small or faint. Commonly, the IP Profile referenced here will use the Dilate Erode command to dilate (or "bloat") the signature's pixels to more easily detect the signature.
  4. The Fill Percentage property determines how many black pixels must be present in order for the extractor to consider the zone signed.
    • In this case, if at least 25% of the pixels in the extraction zone are black, the extractor will consider the zone as "Signed".
  5. Optionally, you may change the returned value using the Value If Filled and Value If Not Filled properties.
    • By default, the extractor returns Signed if the fill percentage threshold is met and Not Signed if it is not.
  6. Here, more than 25% of the pixels in the extraction zone are black (or "filled"). Therefore, the Detect Signature extractor returns a value of "Signed".


FYI

Keep in mind the Detect Signature extractor will examine the pre-processed image (not the image seen in the Document Viewer) if an IP Profile has been referenced using the IP Profile property.

Furthermore, this type of operation requires a black and white image to work. Grooper knows a pixel is "filled" because it is black and not white. If you do not use an IP Profile to pre-process the document's image and the document is color or grayscale, Grooper will pre-process the image on its own.

If you want control over how Grooper turns the image black and white, this is another reason you may want to use an IP Profile to customize how this is done, using a Threshold or Binarize command.
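
To see why the black and white conversion matters, consider a simple global threshold, sketched below. This is not how Grooper's Threshold or Binarize commands are implemented. The `binarize` function, the 0 to 255 grayscale scale, and the sample values are assumptions for illustration.

```python
def binarize(gray_pixels, threshold=128):
    """Map grayscale values (0 = black, 255 = white) to 1 (black) or 0 (white)."""
    return [[1 if value < threshold else 0 for value in row]
            for row in gray_pixels]

# A tiny grayscale zone containing dark ink, faint ink, and white paper.
gray_zone = [
    [250, 90, 200],
    [40, 130, 245],
]

strict = binarize(gray_zone, threshold=100)  # only dark ink counts as black
loose = binarize(gray_zone, threshold=210)   # faint ink counts as black too

strict_black = sum(map(sum, strict))
loose_black = sum(map(sum, loose))
```

The same zone yields two black pixels under the strict threshold and four under the loose one, so the conversion step directly changes the fill percentage a signature check would see.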

The Reference Extractor

The Reference extractor simply returns the results of another extractor object in the node tree. You can use the Reference Extractor Type to reference any of the three extractor objects: Data Types, Value Readers, and Field Classes. This can keep you from duplicating your efforts over and over again. For example, if you have a variety of different extractors needing to return a currency value, don't create a new currency extractor every time you need one. Just create a single currency extractor and use the Reference Extractor Type to re-use it over and over.


In this example, a Value Reader is configured to return the results of the currency extractor we created for the Pattern Match example (named "Pattern Match - Currency"), using the Reference Extractor Type.

  1. Reference is selected as the Extractor Type.
  2. Use the Extractor property to point to an extractor in the node tree.
    • You can reference a Data Type, Value Reader, or Field Class.
  3. When the Reference extractor executes, the results of the referenced extractor are returned.
  4. You can see the name of the extractor returning the value in the "Name" column.
  5. In this case, it is the referenced "Pattern Match - Currency" Value Reader created earlier.
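
The reuse idea behind the Reference extractor resembles defining a function once and pointing to it from several places. The sketch below is an analogy only. The `Reference` class, the `currency_extractor` function, its pattern, and the sample text are made up for this example.

```python
import re

def currency_extractor(text):
    """A single shared extractor: return every currency amount in the text."""
    return re.findall(r"\$\s?\d{1,3}(?:,\d{3})*\.\d{2}", text)

class Reference:
    """Returns whatever the referenced extractor returns."""

    def __init__(self, extractor):
        self.extractor = extractor

    def __call__(self, text):
        return self.extractor(text)

# Two readers point at the same extractor instead of duplicating its pattern.
first_reader = Reference(currency_extractor)
second_reader = Reference(currency_extractor)

text = "Loan Amount $ 159,432.62 Closing Costs $8,054.61"
first_results = first_reader(text)
second_results = second_reader(text)
```

If the shared pattern ever needs a fix, it is changed in one place and every reference picks up the change.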

Value Readers vs Data Types

Before version 2021, the Data Type extractor object was considered the bread and butter of data extraction. For many Grooper users, a "data extractor" and a Data Type are synonymous. In some ways, this may remain the case. However, the Value Reader object was introduced (at least in part) to conceptually distinguish a Data Type from a "general purpose extractor". Instead, we should emphasize the Data Type's primary function in Grooper: Data Collation.

  • A Value Reader is a Grooper object designed for data extraction, returning the initial data set from the document.
  • A Data Type is a Grooper object designed for data collation, processing and returning the final data set from the document.
    • Even choosing not to collate results (using the Individual Collation Provider) is still a method of collating data. It's just the simplest way of collating data.

Because data collation is so important for a variety of extraction techniques, it's almost natural to equate collated data with extracted data. But there are really two parts to what's going on. First, values are extracted from the document's text data. Then they are collated and finally returned by the Data Type.

While both Value Readers and Data Types are considered "extractors", they really have two different jobs as far as Grooper is concerned. One way to think of this is a Value Reader is a "data finder" while a Data Type is a "data manipulator". It is a Value Reader's job to locate and return data from a document. It's a Data Type's job to take that data and organize it, manipulate it or impose constraints on what counts as valid data.

For example, a Data Type can be configured to only return values that are stacked on top of one another in a vertical array (using the Array Collation Provider). Any values not aligned with each other vertically are tossed out of the final result. The results are collated into a vertical array. Before the results can be collated, they have to be found. Only then can it be determined if and how they are organized and manipulated. That's the true job of the Value Reader, finding and returning the initial data set. The Data Type then collates and returns the final data set.
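
The two-step "find, then collate" idea can be sketched like this. It is a simplified illustration, not the Array Collation Provider's actual algorithm. The function names, the left-edge alignment rule, the pixel tolerance, and the sample word positions are all assumptions.

```python
def find_values(words):
    """'Value Reader' step: return every candidate value with its position."""
    return [w for w in words if w["text"].startswith("$")]

def collate_vertical_array(values, tolerance=5):
    """'Data Type' step: keep the largest group of values sharing a left edge."""
    groups = []
    for value in sorted(values, key=lambda v: v["top"]):
        for group in groups:
            # Compare against the first member of each existing column.
            if abs(group[0]["left"] - value["left"]) <= tolerance:
                group.append(value)
                break
        else:
            groups.append([value])
    best = max(groups, key=len, default=[])
    # A single value is not an array, so it is discarded.
    return [v["text"] for v in best] if len(best) > 1 else []

words = [
    {"text": "Total", "left": 100, "top": 90},
    {"text": "$10.00", "left": 400, "top": 100},
    {"text": "$25.50", "left": 402, "top": 130},
    {"text": "$7.25", "left": 120, "top": 145},   # not aligned with the column
    {"text": "$3.99", "left": 399, "top": 160},
]

found = find_values(words)                 # initial data set: four values
collated = collate_vertical_array(found)   # final data set: the aligned three
```

The finder returns all four dollar amounts. The collator then discards the one that is not stacked with the others.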

To drive this point home, take a look at the Extractor property of the Data Type.

  1. Here, we have an unconfigured Data Type selected, returning no results.
  2. Select the Extractor property.
  3. Use the drop down menu to expand its property options. These should look very familiar.
    • They are the exact same options as the Extractor Type for a Value Reader. It's almost as if the Data Type's Extractor property is just a local Value Reader configured on the Data Type object itself.
    • The Data Type can't do anything without being supplied results by an extractor. This could be the results the Extractor property's configuration returns, or the results of a child or referenced Value Reader, or the results of a child or referenced Data Type with its own extraction configured.
    • If you boil it down, at some point, whenever you're doing the job of finding data you want the Data Type to collate and return, you're going to be making a Value Reader or configuring a property that behaves very much like a Value Reader.
      • The Value Reader is the data extractor or "data finder".
      • The Data Type is the data collator or "data manipulator".