Value Reader (Node Type)

From Grooper Wiki

This article was migrated from an older version and has not been updated for the current version of Grooper.

This tag will be removed upon article review and update.

This article is about the current version of Grooper.

Note that some content may still need to be updated.

Graphic depicting the Grooper Value Reader

Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions.)

  • Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.

Value Readers are Grooper's "one stop shop" for data extraction. They return a single result or a list of numerical or lexical results from a page or document folder's text data (obtained via OCR or native text extraction from the Recognize activity).

You may download and import the file below into your own Grooper environment (version 2024). This contains a Batch with the example document(s) discussed in this article and a Project containing a Content Model configured according to its instructions.

About

Value Reader nodes form the foundation for extracting information from a document, using a variety of different methods.

  • Do you need to extract a date? A Value Reader can do that!
  • Do you need to extract anything matching a list of values? A Value Reader can do that!
  • Do you need to extract English language unigrams (or bigrams etc)? A Value Reader can do that!
  • Do you need to extract a value from a barcode? A Value Reader can do that!
  • Do you need to extract the label next to a checked checkbox? A Value Reader can do that!

Do you need to find any value at all? You're going to use some kind of configuration of the Value Reader to do it.

Value Readers and Value Extractors

Value Readers locate results using a variety of "Value Extractors". Value Extractors are the primitive operators that perform data extraction in Grooper. Each different Value Extractor uses a different method to return a list of data values from the text (and sometimes visual) content of a page or document.

  • Value Extractors are often just called "extractors" for shorthand.

As a node, Value Readers are effectively a referenceable encapsulation of a single Value Extractor configuration.

Value Readers have only one configurable property, "Extractor". This property lets you choose and then configure which Value Extractor you want it to use.

Example:

  1. This is a Value Reader in the node tree.
  2. This is its "Extractor" property.
  3. Use the dropdown menu (menu) to select which Value Extractor you wish to use.


Value Reader Tester

The Value Reader's "Tester" tab allows you to both configure the selected Value Extractor and test it against documents/pages in a test Batch.

  • This UI is inherited from the Extractor editor. You can also open this UI by opening the Extractor property's editor (press the "..." button) from the Value Reader tab.

While each Value Extractor's configuration is different, the Tester UI can be divided into four sections.

  • The "Configuration" panel
  • The "Test Source" panel
  • The "Document Viewer" panel
  • The "Results List" panel


Configuration
This is where you configure the Value Extractor.
  • This panel may have multiple tabs.
    • All Value Extractors have a "Properties" tab.
    • Many Value Extractors have an "Expressions" tab as well.
  • By default, configurations are tested automatically every time you select a page/document. You can change this by adjusting the Automatic Testing slider (toggle_on). When off, you can manually test the extractor with the Test button (play_circle).
  • After testing, the Diagnostics button (insert_chart_outlined) will light up. Press this button to review logs and other diagnostics about the extractor's operation. This is useful for troubleshooting.
Test Source
This control allows you to select a test Batch in Grooper. Use the Browse Tree button (account_tree) to select a Batch in the Grooper Repository's "Batches > Test" branch.
  • If you have documents in an AI Search index, you can use the Search button (search) to search for documents as well.
Document Viewer
The Document Viewer is used throughout Grooper to visually inspect documents/pages in a Batch.
  • This is where you will see the Value Extractor's results highlighted on the page/document.
  • The "Renditions" button allows users to select different document renditions. This includes a Batch Page's image, a Batch Folder's attachment file, a Batch Folder's children, and the "Text" rendition displaying OCR or native text obtained by the Recognize activity. Press the drop down arrow (arrow_drop_down) in the upper right corner of the Document Viewer to show all available renditions.
Results List
After testing extraction, results appear in this panel.
  • Select a result to highlight it on the page/document in the Document Viewer.
  • With a result selected, you can press the Instance Inspector button (flashlight_on) to launch the Data Inspector.

Value Extractor options

There are over 20 different types of Value Extractors.

Value Extractors are grouped into the categories below. Each category lists its Value Extractors, followed by comments.

Text parsing extractors

  • Pattern Match
  • List Match
  • Label Match
  • Word Match
  • Labeled Value
  • Field Match

These Value Extractors primarily rely on regular expressions, lists of values (such as a Lexicon of field labels) or other forms of text parsing to return values.

  • Please note, regular expressions and other forms of text parsing are the "bread and butter" of how Grooper data extraction works. Other extractors may also utilize regex or other forms of text parsing as part of their configuration. These extractors just rely on it more heavily.

LLM-based extractors

  • Ask AI

These Value Extractors use generative AI to return results. The document text and other prompts defined by the user are fed to a large language model (LLM) for analysis. To utilize these extractors, you must add an "LLM Connector" Repository Option to the Grooper Root.

OMR extractors

  • Labeled OMR
  • Ordered OMR
  • Zonal OMR

These Value Extractors allow you to return values using optical mark recognition. These are useful for extracting values on documents that use checkboxes to detail information.

Barcode extractors

  • Find Barcode
  • Read Barcode

These Value Extractors allow you to return a value stored in a barcode.

Zonal extractors

  • Highlight Zone
  • Read Zone
  • Detect Signature

These Value Extractors are used to draw a logical rectangle somewhere on a document and return the text falling inside. These are useful for extracting values on highly structured documents where field values are consistently located in the same position on the page for every document.

Text analysis extractors (experimental)

  • Entity Recognition
  • Key Phrase Recognition
  • PII Entity Recognition

These extractors use the Azure AI Language cloud service to analyze document text. To utilize these extractors, you must add a "Text Analysis" Repository Option to the Grooper Root.

  • BE AWARE: These features are still in development and should be considered "experimental". They have not been extensively tested or implemented.

Miscellaneous extractors

  • Query HTML
  • Read Metadata
  • Select Page

These extractors have specialized uses and don't fit well into the other categories.

The Reference extractor

  • Reference

The Reference extractor is unique among the Value Extractors. It allows users to reference the results of an extractor node (such as a Data Type or Value Reader).

Value Extractors: Text parsing extractors

Pattern Match

The Pattern Match extractor relies on regular expression (regex) pattern matching to return values. This is truly the foundation for almost all data extraction in Grooper. A regex pattern entered in the "Value Pattern" will run against the selected document, page, or data instance's text data. Matching results will be returned as this extractor's values.

You can also enter "Prefix" and "Suffix Pattern" expressions. The text matched by the Value Pattern is only returned if the Prefix and/or Suffix Pattern also matches. These are useful for anchoring the value you want to return next to some other piece of text.

Example: Set the "Prefix Pattern" to \n to only return results at the start of a new line.

The "Output Format" allows you to alter the output result for data cleansing or other purposes.

The "Properties" tab allows you to further configure the regex extraction. Here, you can enable Tab Marking, Fuzzy RegEx mode, filter results based on page location, determine case sensitivity, use a Lexicon to perform lookup operations, and more.


In this example, a Value Reader is configured to return currency values, using the Pattern Match extractor.

  1. The Value Reader's "Extractor" property is set to Pattern Match.
  2. The "Value Pattern" is entered here.
    In this case, the regex pattern \d{1,3}(,\d{3}){0,2}\.\d{2} matches decimal values from 0.00 to 999,999,999.99.
  3. The "Prefix Pattern" is entered here.
    Here, the pattern matches an optional, space-padded dollar sign.
  4. The "Suffix Pattern" is entered here.
    The [^%] matches anything not a percent sign, throwing out percentage values.
  5. The "Output Format" is formatted here.
    Unused in this example.
  6. Properties are configured using the "Properties" tab.
    Unused in this example.
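
The behavior described above can be approximated outside of Grooper. The following is a minimal sketch in Python, not Grooper's implementation. It assumes the Prefix, Value, and Suffix Patterns are simply matched one after another, and that only the Value Pattern's text is returned. The exact Prefix Pattern is not shown in this article, so PREFIX_PATTERN below is an assumption. Python's "re" module is also a different regex engine than the one Grooper uses.

```python
import re

# Value Pattern from the example above (the group is written as non-capturing here).
VALUE_PATTERN = r"\d{1,3}(?:,\d{3}){0,2}\.\d{2}"
# Assumed Prefix Pattern: an optional, space-padded dollar sign.
PREFIX_PATTERN = r"\s?\$?\s?"
# Suffix Pattern from the example above: any character except a percent sign.
SUFFIX_PATTERN = r"[^%]"

def pattern_match(text):
    """Return only the Value Pattern portion of each prefix + value + suffix match."""
    combined = re.compile(
        f"(?:{PREFIX_PATTERN})(?P<value>{VALUE_PATTERN})(?:{SUFFIX_PATTERN})"
    )
    return [m.group("value") for m in combined.finditer(text)]

sample = "Total: $1,234.56 due. Rate 12.50% applies. Fee $ 40.00\n"
print(pattern_match(sample))  # ['1,234.56', '40.00']
```

Note that "12.50%" is thrown out because the character after the value is a percent sign. In this sketch the Suffix Pattern must match an actual character, so a value at the very end of the text would not be returned.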

List Match

The List Match extractor returns values matching one or more items in a defined list. This could be used to match a list of field labels on a form, a list of company names, a list of document titles, or any other list of words or phrases. You can even enable the use of regular expression syntax to match a list of regex patterns.

You can enter "Prefix" and "Suffix Patterns" to only return an item in the list if a regex pattern also matches before or after. These are useful for anchoring the value you want to return next to some other piece of text.

Example: Set the "Prefix Pattern" to \n to only return results at the start of a new line.

The "Output Format" allows you to alter the output result for data cleansing or other purposes.

The "Properties" tab allows you to further configure the list matching. Here, you can enable Tab Marking, Fuzzy RegEx mode, filter results based on page location, determine case sensitivity, reference a dictionary Lexicon for the list, and more.


In this example, a Value Reader is configured to return the field labels in the top portion of this Closing Disclosure form, using the List Match extractor.

  1. The Value Reader's "Extractor" property is set to List Match.
  2. The list is entered in the "Local Entries" editor.
    • In this case, one label after another, line by line.
  3. The "Prefix" and "Suffix Patterns" are entered here.
    • Here, \s is used for the Prefix and Suffix Pattern. A list item will only match if there is a whitespace character (\r, \n, \t, \f or a single space character) before and after it.
  4. Properties are configured using the "Properties" tab.
    • Unused in this example.
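
As a rough illustration of the idea, the sketch below matches a short list of entries only when whitespace appears before and after them. It is written in Python and is not how Grooper implements List Match. The labels and sample text are hypothetical, and treating the Prefix and Suffix Patterns as lookarounds (so the whitespace is checked but not consumed) is an assumption.

```python
import re

def list_match(text, entries, prefix=r"\s", suffix=r"\s"):
    """Return list entries found in text that have the prefix before and suffix after."""
    # Try longer entries first so a long entry is preferred over a shorter one it contains.
    ordered = sorted(entries, key=len, reverse=True)
    alternation = "|".join(re.escape(entry) for entry in ordered)
    # Lookarounds test the prefix/suffix without including them in the result.
    pattern = re.compile(f"(?<={prefix})(?:{alternation})(?={suffix})")
    return [m.group() for m in pattern.finditer(text)]

entries = ["Date Issued", "Closing Date", "Loan Term"]  # hypothetical labels
text = "\nDate Issued 4/15/2013\nClosing Date 4/15/2013\nLoan Terms vary\n"
print(list_match(text, entries))  # ['Date Issued', 'Closing Date']
```

"Loan Term" is not returned because it is followed by the letter "s" rather than a whitespace character.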


For larger lists, a Lexicon is often used instead of (or along with) the Local Entries list. This allows you to point to a Lexicon node instead of manually entering items in the Local Entries editor. This is excellent when matching large lists of items or when a single list is used by multiple extractors.

Example:

  1. The Lexicon highlighted here...
  2. ...has 46 entries for various line items that may or may not appear in the "Loan Cost" section of a Closing Disclosure.


We could create a Value Reader configured to return any of these items just by pointing to the Lexicon.

  1. This is done in the "Properties" tab.
  2. The Lexicon is referenced using the List Match extractor's "Vocabulary" properties.
  3. Using the "Included Lexicons" property, you can reference one or more Lexicons.

Label Match

  • The Label Match extractor is a very niche extractor. It only has use when enabling Label Sets. This section will assume you already have some familiarity with Grooper's Label Set functionality.

The Label Match extractor is extremely similar to the List Match extractor in that it matches one or more items in a defined list. The only difference is in its fuzzy matching settings.

When a Labeling Behavior is enabled, it defines fuzzy matching settings for labels in the label sets. The Label Match extractor will simply use those fuzzy match settings when matching list items. This lets you utilize a single set of fuzzy match settings across the labels matched through Label Set-enabled features and any Label Match extractor.

  • For more information on fuzzy matching, visit the Fuzzy RegEx article.

For the Label Match extractor to return a result, two conditions must be met.

  1. The document folder must be classified.
    In other words, it must have a Document Type assigned to it.
  2. Its Document Type must have a Labeling Behavior enabled.
    Either on the Document Type or, most typically, its parent Content Model.


  1. The Content Model selected here, has enabled the Labeling Behavior.
  2. The Labeling Behavior is enabled using the Behaviors property...
  3. ...and added using the collection editor seen here.
    For more information on the Labeling Behavior, how to enable it, its configuration, and its utility, visit the Labeling Behavior article.
  4. The Label Match extractor will use all the fuzzy matching and text wrapping settings defined here.


In this example, a Value Reader is configured to return a small list of field labels on an invoice, using the Label Match extractor.

  1. The Value Reader's "Extractor" property is set to Label Match.
  2. The list is entered in the "Local Entries" editor (just like you do with the List Match extractor).
    • Or, you can reference a Grooper Lexicon using the "Properties" tab.
  3. The "Prefix" and "Suffix Patterns" are entered here.
    • ^|[^\w] is the default Prefix Pattern.
    • $|[^\w] is the default Suffix Pattern.
  4. The document we have selected is classified as an "Invoice" Document Type.
  5. This is a Document Type in the Content Model with the Labeling Behavior enabled.
  6. Upon execution, notice some results are returned with a confidence below 100%.
    • This is due to the fuzzy matching settings inherited from the Labeling Behavior. Its "Label Similarity" was set to 90%. Any items in the list with a fuzzy matching score above 90% are returned. Any falling below 90% are not returned.
    • Note: This means changing the Labeling Behavior settings will impact ALL Label Match extractors in scope.
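
The effect of a similarity threshold can be sketched in a few lines of Python. This is only an illustration: it uses the standard library's difflib.SequenceMatcher as a stand-in similarity measure, which is an assumption and not Grooper's Fuzzy RegEx algorithm, so the scores will differ from what Grooper reports. The labels and OCR phrases are hypothetical.

```python
from difflib import SequenceMatcher

def fuzzy_label_matches(candidates, labels, min_similarity=0.90):
    """Return (candidate, label, score) for every pair meeting the similarity threshold."""
    results = []
    for candidate in candidates:
        for label in labels:
            # ratio() is 1.0 for identical strings and drops as the strings diverge.
            score = SequenceMatcher(None, label.lower(), candidate.lower()).ratio()
            if score >= min_similarity:
                results.append((candidate, label, round(score, 2)))
    return results

labels = ["Invoice Number", "Invoice Date"]
ocr_phrases = ["Invoice Number", "Invoice Numbar", "Invoice Total"]
for hit in fuzzy_label_matches(ocr_phrases, labels):
    print(hit)
```

An exact match scores 100%, the OCR error "Invoice Numbar" still clears the 90% threshold, and "Invoice Total" falls below it for both labels. Lowering min_similarity would let more near misses through, which mirrors how one shared setting affects every match that uses it.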

Word Match

The Word Match extractor is designed for n-gram extraction.

An n-gram is "a contiguous sequence of n items from a given sample of text or speech". The Word Match extractor can capture 1-grams (single words) and up to 5-grams (five word phrases). Lexicons are commonly used to dictate a dictionary of allowable returned words. This could be a general Lexicon of common English words or a custom Lexicon, such as one with industry specific terms.

FYI

An n-gram is often referred to by a different name depending on its n size.

1-grams (single words) - unigrams
2-grams (word pairs) - bigrams
3-grams (three word phrases) - trigrams
4-grams (four word phrases) - four-grams
5-grams (five word phrases) - five-grams

As an additional FYI, four-grams are not called "tetragrams" because the term already has usage as a single word consisting of four letters or characters. "Quadrigram" is occasionally used, but four-gram is the more common terminology. Five-grams are not called "pentagrams", because that already has common usage for a geometric figure.

You can enter "Prefix" and "Suffix Patterns" to only return an n-gram if a regex pattern also matches before or after. These are useful for anchoring the n-gram you want to return next to some other piece of text.

Example: Set the "Prefix Pattern" to \n to only return n-grams at the start of a new line.

The "Join Pattern" property is unique to the Word Match extractor. This determines how terms of n-grams can be joined. Most often, terms (or grams) are simply joined by a single space, as in the bigram "first second".

  • Leave "Join Pattern" unconfigured and Grooper will only match terms separated by a single space character.
  • Configure the "Join Pattern" to match terms joined by other characters.
    Example: Set "Join Pattern" to [ -] to match terms separated by spaces and hyphens. This would allow Word Match to match "first second" as well as "first-second".

The "Output Format" allows you to alter the output result for data cleansing or other purposes.

The "Properties" tab allows you to further configure the n-gram matching. Most importantly, the n-gram size is set here as well as any Lexicon used to lookup against the returned values. You can also enable Tab Marking, Fuzzy RegEx mode, filter results based on page location, determine case sensitivity, and more.


In this example, a Value Reader is configured to return bigram field labels, using the Word Match extractor.

  1. The Value Reader's "Extractor" property is set to Word Match.
  2. The "Word Pattern" is entered here.
    • The regex pattern entered here is used to match each single gram in the n-gram.
    • The default pattern is \p{L}+. This matches any combination of letter characters in any language of any length.
    • In most cases, this pattern will perfectly suit your n-gram extraction needs. However, you can alter this pattern if you need to. For example, [a-zA-Z]+ is a very similar pattern that could be used to match English-only words, as it does not include characters from other scripts. It would not match Greek characters, such as Ω, where \p{L}+ would.
  3. The "Prefix Pattern" is entered here.
    • In this case, the pattern entered will only match n-grams if they are preceded by a newline (\n), a tab (\t), or the beginning of the string (^).
  4. The "Suffix Pattern" is entered here.
    • In this case, the pattern entered will only match n-grams if they are followed by a carriage return (\r), a tab (\t), or the end of the string ($).
  5. The "Join Pattern" is entered here.
    • The pattern here, [ \-], will return n-grams whose grams are separated by a single space character or a hyphen (the backslash only escapes the hyphen). If left blank, only n-grams whose grams are separated by a single space character are returned.
  6. The "Output Format" is formatted here.
    • Unused in this example.


In this case, we also used the "Properties" tab to set the n-gram size to collect bigrams, and only return grams in an English language dictionary.

  1. Navigate to the "Properties" tab.
  2. The "Word Lookup" property can be used to reference a Lexicon of allowable terms for each gram in the n-gram.
    • Here, we've referenced the "English Words" Lexicon that ships with every Grooper install in the "Essentials" Project.
  3. The "Phrase Size" property allows you to specify the size of the n-gram.
    • Here, it is set to 2 to capture bigrams.
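
The pieces above (Word Pattern, Join Pattern, Phrase Size and Word Lookup) can be sketched together in Python. This is a simplified illustration, not Grooper's implementation. Python's "re" module does not support \p{L}, so [^\W\d_]+ is used as a stand-in that also matches letters in any script. The tiny lexicon and the sample text are hypothetical, and Prefix/Suffix Patterns are left out.

```python
import re

WORD_PATTERN = r"[^\W\d_]+"  # stand-in for \p{L}+ (assumption, see above)
JOIN_PATTERN = r"[ \-]"      # a single space or a hyphen

def word_match(text, phrase_size=2, lexicon=None):
    """Return n-grams of phrase_size words whose words all appear in the lexicon."""
    # One n-gram is a word followed by (join + word) repeated phrase_size - 1 times.
    ngram = f"{WORD_PATTERN}(?:{JOIN_PATTERN}{WORD_PATTERN}){{{phrase_size - 1}}}"
    # Start only at the beginning of a word; the lookahead lets n-grams overlap.
    pattern = re.compile(r"(?<![^\W\d_])(?=(" + ngram + r"))")
    results = []
    for m in pattern.finditer(text):
        phrase = m.group(1)
        words = re.findall(WORD_PATTERN, phrase)
        if lexicon is None or all(w.lower() in lexicon for w in words):
            results.append(phrase)
    return results

lexicon = {"loan", "amount", "interest", "rate", "principal"}  # hypothetical word list
text = "Loan Amount\tInterest Rate\nPrincipal-Interest Zzyx"
print(word_match(text, phrase_size=2, lexicon=lexicon))
# ['Loan Amount', 'Interest Rate', 'Principal-Interest']
```

"Amount" and "Interest" are separated by a tab, which the Join Pattern does not allow, so they are not joined into a bigram. "Interest Zzyx" is joined by a space but rejected because "Zzyx" is not in the lexicon.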

Labeled Value

As the name implies, the Labeled Value extractor is designed to return labeled values. A common feature of structured forms is to divide information across a series of fields. But it's not as if you just have a bunch of data randomly strewn throughout the document. Typically, the field's value will be identified by some kind of label. These labels provide the critical context to what the data refers to.

Labeled Value relies on the spatial relationship between the label and the value. Most often labels and their corresponding values are aligned in one of two ways.

1. The value will be to the right of the label.

2. The value will be below the label.

Labeled Value uses two extractors itself, one to find the label and another for the value. If the two extractors' results are aligned horizontally or vertically within a certain amount of space (according to how the Labeled Value extractor is configured), the value's result is returned.


In this example, a Value Reader is configured to return the "Cash to Close" amount as described on the first page of a Closing Disclosure form, using the Labeled Value extractor.

  1. The Value Reader's "Extractor" property is set to Labeled Value.
  2. The label is returned by the "Label Extractor".
    • Commonly, List Match or Pattern Match extractors are used here.
    • Here, a List Match extractor is used to return "Cash to Close".
  3. The value is returned by the "Value Extractor" configuration.
    • "Value Extractor" here is a property name (for which one of the 20+ Value Extractor types in Grooper is selected and configured).
    • Here, a Reference extractor is referencing a Value Reader that returns currency values.
  4. The "Layout" properties in general determine the spatial relationship between the label and the value.
    • The "Maximum Distance" property determines the distance the value can be from the label. Values that fall outside this distance will not be returned.
    • The "Maximum Noise" property is used to throw out results with text between the label and value.
  5. In the Document Viewer, the label extracted by the "Label Extractor" is outlined in blue.
  6. In the Document Viewer, the corresponding value extracted by the "Value Extractor" is highlighted in green.
    • Grooper only returns the value in the result list. The label instance only exists to help highlight the result in the Document Viewer.
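
The spatial logic can be pictured with a small Python sketch. It is a simplified model under stated assumptions, not Grooper's algorithm: results are represented as bounding boxes in inches, "aligned" means the boxes overlap on the shared axis, and the closest value within max_distance wins. The Hit class, the coordinates, the amounts and the 2.0 default are all hypothetical, and "Maximum Noise" is not modeled.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    text: str
    left: float
    top: float
    right: float
    bottom: float

def overlaps(a_min, a_max, b_min, b_max):
    """True when two 1-D ranges share some extent."""
    return min(a_max, b_max) > max(a_min, b_min)

def labeled_value(label, candidates, max_distance=2.0):
    """Return the text of the nearest value to the right of, or below, the label."""
    best = None
    for value in candidates:
        if overlaps(label.top, label.bottom, value.top, value.bottom) and value.left >= label.right:
            gap = value.left - label.right   # value is to the right of the label
        elif overlaps(label.left, label.right, value.left, value.right) and value.top >= label.bottom:
            gap = value.top - label.bottom   # value is below the label
        else:
            continue                         # not aligned with the label
        if gap <= max_distance and (best is None or gap < best[0]):
            best = (gap, value)
    return best[1].text if best else None

label = Hit("Cash to Close", 1.0, 5.0, 2.2, 5.2)
values = [
    Hit("162,000.00", 2.6, 2.0, 3.5, 2.2),  # different row, not aligned
    Hit("16,054.00", 2.6, 5.0, 3.4, 5.2),   # same row, 0.4 to the right
    Hit("9,712.10", 6.5, 5.0, 7.2, 5.2),    # same row, but too far away
]
print(labeled_value(label, values))  # 16,054.00
```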


The Labeled Value extractor has some increased functionality when used in combination with a Labeling Behavior. For more information on the Labeling Behavior and how Labeled Value benefits from it, visit the Labeling Behavior article.

Field Match

About

The Field Match extractor allows you to match a value stored in a previously extracted Data Field.

Instead of matching a regex pattern and returning a value, this extractor will match the Data Field's text and return a value. Think of the Data Field's value as the "pattern" matched against the document.

This could be used for data validation purposes. We've been using Closing Disclosure forms for most of these examples. Let's say the original file we imported for processing was named after the loan's borrower. So if John Doe was applying for the loan, the Closing Disclosure would be named "John Doe.pdf". You might want to know if the borrower's name according to the file name actually lines up with the borrower listed on the document. You could do just that with a Field Match extractor and a Data Field that captures the document's attached file name.


Here, we've created a Data Field that uses a "Default Value Expression" to return the name in the attached PDF's file name.

  1. This is the Data Field. We named it "Borrower From File Name".
  2. The "Default Value" expression is configured to return everything in the file name except the file extension.
    • Match(Folder.AttachedFileName, ".*(?=\.pdf)")


  1. The original PDF file imported into Grooper is named "Eddie Kusick and Ody Boeck.pdf"
  2. The expression populates the Data Field with everything in that file name but ".pdf".
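
To see what the regex in that expression does, here is the same pattern run in Python. This is only an illustration of the regex itself. Grooper's Match expression function is not being called, and the helper name below is hypothetical.

```python
import re

def name_without_extension(file_name):
    """Return everything before a trailing '.pdf', or an empty string if there is none."""
    # .* grabs as much as possible; (?=\.pdf) requires ".pdf" to follow without capturing it.
    match = re.search(r".*(?=\.pdf)", file_name)
    return match.group() if match else ""

print(name_without_extension("Eddie Kusick and Ody Boeck.pdf"))  # Eddie Kusick and Ody Boeck
```

The lookahead (?=\.pdf) is what keeps the extension out of the returned value.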


Using a second Data Field, we're going to use a Field Match extractor to make sure the information returned by the "Borrower From File Name" Data Field is actually found on the document.

  1. This is that Data Field. We named it "File Name Borrower Validation".
  2. Its "Value Extractor" property is set to reference a Value Reader using the Field Match extractor.
    • We will go over how this is configured later.
  3. We also set this Data Field's "Required" property to True.
    • This will flag the field if no value is populated.
    • If the "Borrower From File Name" field's value is not found on the document, the Field Match extractor will fail to produce a result. So, the field will be flagged. In this dummy scenario, this indicates the name on the file name doesn't match the borrower's name on the document.

BE AWARE: Testing Field Match in isolation will never yield a successful result. You must test it on the Data Model.

There's a specific order of operations that needs to happen before the Field Match extractor can return a result.

  1. The Data Field supplying the result to be matched needs to execute.
  2. The Data Field executing Field Match can execute at any point after.

The Field Match extractor can't match the Data Field's value if it hasn't been found yet!


  1. We can now test this result on the Data Model.
  2. Navigate to the "Tester" tab and press the Test button (play_circle).
  3. The "Borrower From File Name" Data Field returns the borrower's name from the file name.
  4. The "File Name Borrower Validation" Data Field uses a Field Match extractor, matching the "Borrower From File Name" Data Field's result with text found on the document.


  1. In the case of this document, the original file's name is "Bad Name.pdf"
  2. Upon testing extraction...
  3. The Field Match extractor fails to produce a result. The result "Bad Name" is not returned by the extractor.
  4. The borrower's name on the document is "Cindi Truwert and Audrey Feak"

Configuration

This is the Value Reader using the Field Match extractor described in the section above.

  1. Select the "Value Reader" tab
  2. Set the "Extractor" property to Field Match.
  3. Expand the "Extractor" property and select the "Field" property to reference the Data Field whose value you're matching on the document.
  4. Using the dropdown menu, choose the Data Field you want to match.


As far as the example above goes, that's it. That's all we did to get the results seen in this example. If the value returned by the "Borrower From File Name" Data Field is found on the document, it will be returned. If not, no result will be returned.


However, let's look at a couple other things you may want to consider when using the Field Match extractor.

  1. Select the "Expressions" tab.
  2. The "Value Pattern" in this case is optional. This can act as a "fall back" pattern if the Data Field's result is not matched. Grooper will prioritize returning the Data Field's value if it matches it on the document. However, if it does not, and the regex entered in the Value Pattern does match something, the Value Pattern's result will be returned.
  3. The "Parse Pattern" will parse the Data Field's result using a regular expression pattern.
    • For example, the regex pattern here [A-Z][a-z]+ [A-Z][a-z]+ would cause the extractor to return "Eddie Kusick" instead of "Eddie Kusick and Ody Boeck".
    • CAUTION!!! Grooper's regex is case insensitive by default in most cases. Not so with the "Parse Pattern". The Parse Pattern's regex is always case sensitive.
  4. The "Prefix" and "Suffix Pattern" can be used to anchor the result to another regex pattern (just like a Pattern Match extractor).
    • Here, we really want to make sure the name in the file name matches the borrower's name. The pattern Borrower\s will cause the Field Match extractor to only return the Data Field's value if it is preceded by this Prefix Pattern.
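
The options above can be sketched as a small Python function. This is a loose model of the described behavior, not Grooper's implementation. The function name, the parameter names and the sample document text are hypothetical. How Grooper behaves when the Parse Pattern finds nothing is not stated in this article, so the sketch simply keeps the unparsed field value in that case.

```python
import re

def field_match(document_text, field_value, parse_pattern=None,
                prefix_pattern="", value_pattern=None):
    """Look for a previously extracted field value in the document text."""
    if field_value:
        target = field_value
        if parse_pattern:
            # The Parse Pattern is applied case sensitively to the field's value.
            parsed = re.search(parse_pattern, field_value)
            if parsed:
                target = parsed.group()
        # The field's value acts as the "pattern"; escape it so it is matched literally.
        hit = re.search(f"(?:{prefix_pattern})({re.escape(target)})",
                        document_text, re.IGNORECASE)
        if hit:
            return hit.group(1)
    if value_pattern:
        # Fall back to the Value Pattern when the field's value is not found.
        fallback = re.search(value_pattern, document_text, re.IGNORECASE)
        if fallback:
            return fallback.group()
    return None

doc = "Borrower Eddie Kusick and Ody Boeck\nLender Ficus Bank"
print(field_match(doc, "Eddie Kusick and Ody Boeck",
                  parse_pattern=r"[A-Z][a-z]+ [A-Z][a-z]+",
                  prefix_pattern=r"Borrower\s"))       # Eddie Kusick
print(field_match(doc, "Bad Name", prefix_pattern=r"Borrower\s"))  # None
```

An empty field_value stands in for a Data Field that has not executed yet. In that case the sketch can only return the fallback, which reflects why the supplying Data Field has to run first.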

Value Extractors: LLM extractors

Ask AI

Ask AI is a Value Extractor that executes a chat completion using a large language model (LLM), such as OpenAI's GPT models. It uses a document's text content and user-defined instructions (a question about the document) in the chat prompt. Ask AI then returns the response as the extractor's result. Ask AI is a powerful, LLM-based extraction method that can be used anywhere in Grooper a Value Extractor is referenced. It can complete a wide array of tasks in Grooper with simple text prompts.

The general idea behind Ask AI is simple: prompt an LLM chatbot with a question about a document to return a result. With appropriate instructions, Ask AI can even parse JSON responses into a Data Model's instance hierarchy (For example, a Data Table and Data Column instance hierarchy).

Ask AI Pros and Cons

Pros

  • Returns data using a natural language prompt.
  • Less knowledge of Grooper extractors required to return data.
  • Quicker time to value.
  • Easier to maintain over time.

Cons

  • LLM responses can be unpredictable.
  • LLM responses can be inaccurate.
  • LLMs answer by predicting the next best word one after another. They do not “know” anything.
  • As an extractor, "Ask AI" must be configured “per field”.
  • An API call is made every time the extractor executes.
    • This is an important consideration. Limit the text sent to the LLM whenever possible. The "Context Extractor" property can play a big role in helping with this.
  • No result highlighting (at least for typical configurations).

Properties

Model
The API Key you use will determine which GPT models are available to you. The different GPT models can affect the text generated based on their size, training data, capabilities, prompt engineering, and fine-tuning potential.
Parameters
Please see the Parameters article for more information.
Instructions
The instructions or question to include in the prompt. The prompt sent to OpenAI consists of text content from the document, which provides context, plus the text entered here. This property should ask a question about the content or provide instructions for generating output. For example, "what is the effective date?", "summarize this document", or "Your task is to generate a comma-separated list of assignors".
Preprocessing
Please visit the Text Preprocessor article for more information.
Context Extractor
An optional extractor which filters the document content included in the prompt.
Max Response Length
The maximum length of the output, in tokens. 1 token is equivalent to approximately 4 characters for English text. Increasing this value decreases the maximum size of the context.
Parse JSON Response
If this property is enabled, JSON returned in the response will be parsed into a Data Instance hierarchy.
  • Use this mechanism to capture complex data and generate output instances with named children. This produces output instances similar to a "Pattern Match" extractor using named groups, or a Data Type using "Ordered Array" collation. This type of Data Instance hierarchy can be consumed by the "Row Match" Table Extract Method or the "Simple" Section Extract Method.
  • When this property is enabled, the instructions should ask the AI to respond with JSON, and provide instructions and examples as needed to ensure the AI understands the desired JSON format. The JSON may contain a single JSON object or an array of JSON objects.
  • If a single JSON object is returned, a single output instance will be generated, containing one named child for each property of the JSON object. This type of output would be appropriate for capturing a single-instance Data Section.
  • If a JSON array is returned, one output instance will be generated for each object in the array. Each output instance will have named children reflecting the properties of the JSON object. This type of output is appropriate for capturing a Data Table or multi-instance Data Section.
  • The AI must respond with JSON only, or with the JSON delimited using the prefix and suffix shown below. OpenAI models are typically trained for this out of the box, but models trained by other organizations may require special instructions.
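
The object-versus-array rule for "Parse JSON Response" can be illustrated with a short Python sketch. It models each output instance as a plain dictionary of named children, which is an assumption made for illustration and not Grooper's Data Instance type. The field names and values are hypothetical, and the sketch only handles a response that contains JSON and nothing else.

```python
import json

def parse_json_response(response_text):
    """Turn a JSON object or array into a list of instances with named children."""
    data = json.loads(response_text)
    # A single object yields one instance; an array yields one instance per object.
    objects = data if isinstance(data, list) else [data]
    return [{name: str(value) for name, value in obj.items()} for obj in objects]

single = '{"Assignor": "Jane Roe", "Effective Date": "2021-03-01"}'
table = ('[{"Item": "Appraisal Fee", "Amount": "405.00"},'
         ' {"Item": "Credit Report Fee", "Amount": "30.00"}]')

print(len(parse_json_response(single)))  # 1
print(len(parse_json_response(table)))   # 2
```

The first response would suit a single-instance Data Section, while the second has one instance per row, as you would want for a Data Table.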

Value Extractors: OMR extractors

OMR stands for "optical mark recognition". Many structured forms utilize checkboxes in order to detail information. Filling out a form is much quicker if you can just check a box to indicate a choice from a list of options rather than printing a response. It also makes it easy on you if you're presented the list of possible options, even if that is as simple as checking a box next to "Yes" or "No".

However, checkboxes don't translate to a text character when recognizing a document's text data through OCR or native text extraction. In order for Grooper to understand if a checkbox is checked, it must digitally recognize the box and its "check state", checked or unchecked. OMR is this digital process of recognizing checkboxes and their check states.

In previous versions of Grooper (pre 2021), checkboxes and their check states were only determined from layout data detected by a "Box Detection" or "Box Removal" IP Command. Layout data collection is still important for the three OMR extractors. However, Labeled OMR is able to determine check states without layout data (including for circular "radio button" style OMR fields).

Because it can detect circular checkboxes and has a relatively simple setup, Labeled OMR is generally preferred over Ordered OMR and Zonal OMR.

  • Ordered OMR and Zonal OMR still exist as fallback options when Labeled OMR cannot produce a desired result.
  • Both Ordered OMR and Zonal OMR require more configuration than Labeled OMR.
  • Zonal OMR requires the most setup of all three OMR extractors.
  • Both Ordered OMR and Zonal OMR require the presence of layout data to function.

About obtaining "layout data"

Layout Data is visual information on a document obtained during an image processing operation. This includes checkboxes and their check states, line locations, and barcodes.

Layout data is collected by IP Profiles when executed by the Image Processing or Recognize activity. There are several IP Commands that obtain layout data pertinent to data extraction.

An IP Profile will either permanently alter an image or only temporarily adjust it to improve OCR accuracy, depending on which activity executes it.

  • For permanent image processing, the IP Profile is executed by the Image Processing activity.
  • For temporary image processing, the IP Profile is executed by the Recognize activity.
    • Recognize will execute IP Profiles in one of two ways:
    • By an IP Profile referenced in an OCR Profile (useful for collecting layout data from scanned and other image-based pages)
    • By one assigned to its "Alternate IP" property (useful for collecting layout data from digitally native pages).

In either case, if the IP Profile contains layout data collecting IP Commands, layout data is stored in the Batch Page's "Grooper.Layout.json" file. For example, Box Detection and Box Removal will store checkbox locations and their check states (either checked or unchecked). The OMR extractors will then use this layout data to return labels next to checked boxes.

  1. Here, we have created an IP Profile named "Layout Data".
  2. This IP Profile's second step uses the Box Removal command.
    • Whether using Box Detection or Box Removal, Grooper will detect checkboxes and save their layout data. Box Removal will digitally remove the checkbox from the page's image whereas Box Detection will not.
  3. The box detection settings are determined by the "General Settings".
  4. In the "Image Diagnostics" panel, the Box Removal's "Execution Log" summarizes the command's results.
  5. Box Removal's "Detected Box Info" diagnostic shows each box's locations and check states.
  6. The "Boxes" diagnostic image will show you visually where checkboxes are on a page. Detected unchecked boxes will be highlighted in red. Checked boxes are highlighted in green.


FYI: Layout Data Verification

How can you verify layout data is saved if you're not testing an IP Profile/IP Step? Where does this information live in Grooper?

Layout data (including checkbox locations and check states) is saved in the Batch Page's "Grooper.Layout.json" file. You can verify this in the Batch Page's "Advanced" tab.

  1. Select the processed Batch Page in the node tree.
  2. Navigate to the "Advanced" tab.
  3. Navigate to the "Files" tab.
  4. Double click the "Grooper.Layout.json" file to open it.
    • If this file is not present, either an IP Profile has not been executed, it has no layout data collecting commands, or no layout data was detected.


  1. The detected checkbox information is stored in the JSON file in a way Grooper can quickly access.

Labeled OMR

The Labeled OMR extractor is designed to be the easiest OMR extractor to set up. In most cases, setup is as simple as defining the list of labels next to the checkboxes and choosing whether one box can be checked, multiple boxes can be checked, or a single checked box evaluates to a binary true/false value. The label (or labels) next to the checked boxes are then returned.

Labeled OMR has just two properties necessary for its configuration: "Label Extractor" and "Mode".

The "Label Extractor" serves to return any labels next to a checkbox. In many cases, this is as simple as using a List Match extractor and entering a list of the text labels on the document.

The "Mode" property corresponds with how the checkboxes behave on the document. Can just one checkbox out of many be checked? Can multiple? Is there just one checkbox where it being checked means one thing and unchecked another? This can be one of three options: CheckOne, CheckMulti or Boolean

  • CheckOne will target multiple checkboxes but presumes only one may be checked. This is for when documents present a list of options, only one of which may be chosen.
  • CheckMulti will target multiple checkboxes and presumes any number of them may be checked. This is for when documents present a list of options and any number of them can be chosen.
    • All labels are returned as a single concatenated result. These results may be separated by a "Separator String". For example, a comma (,) could be used to create a comma-separated list of checked values.
  • Boolean targets only a single checkbox, presuming the checkbox represents a Boolean "true or false" answer. This is for when documents present a single checkbox where checking the box indicates one thing and leaving it unchecked means another.
    • By default the result will evaluate to either True or False but this can be altered using the "Value If Checked" and "Value If Unchecked" properties.


In this example, a Value Reader is configured to return the type of loan applied for on a "Closing Disclosure" form, using the Labeled OMR extractor.

  1. The Value Reader's "Extractor" property is set to Labeled OMR.
  2. The "Label Extractor" is configured to return the checkbox labels.
    • Here, we used a simple List Match extractor with our three loan types in its Local Entries list (We're ignoring the fill-in "Other" option for the sake of simplifying this example).
      Conventional
      FHA
      VA
    • List Match is commonly used for Label Extractor configurations. However, any Value Extractor can be used as long as it returns one individual result for each checkbox label.
  3. Configure the "Mode" property according to how the checkboxes behave on the document.
    • Here, there can only be one type of loan for each individual Closing Disclosure. You'd never have a home loan that is both a conventional loan and an FHA loan. So, it is set to CheckOne.
  4. In the "Document Viewer", labels are outlined in blue and detected checkboxes are highlighted in green.
  5. The label next to the checked box is returned.
    • Technically, for CheckOne mode, all these labels are returned, but the checked label is returned at 100% confidence where the unchecked ones are returned at 0%. This extractor will always return the most confident value first which is ultimately what we want.
    • Note: If multiple boxes are checked with CheckOne selected, all labels will return with 0% confidence.
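The behavior of the three modes can be sketched in Python. This is a simplified model of what is described above, not Grooper's implementation. The `LabeledBox` class and the function names are hypothetical, and returning 0% for every label when no box is checked is an assumption.

```python
from dataclasses import dataclass


@dataclass
class LabeledBox:
    label: str     # text returned by the Label Extractor
    checked: bool  # check state of the box next to the label


def check_one(boxes: list[LabeledBox]) -> list[tuple[str, float]]:
    """Return (label, confidence) pairs, most confident first."""
    checked_count = sum(box.checked for box in boxes)
    results = [
        # Only a lone checked box gets full confidence. If several boxes
        # are checked, every label falls back to 0.0.
        (box.label, 1.0 if box.checked and checked_count == 1 else 0.0)
        for box in boxes
    ]
    # sorted() is stable, so labels with equal confidence keep document order.
    return sorted(results, key=lambda pair: pair[1], reverse=True)


def check_multi(boxes: list[LabeledBox], separator: str = ",") -> str:
    """Join the labels of all checked boxes into one result."""
    return separator.join(box.label for box in boxes if box.checked)


def boolean(box: LabeledBox, value_if_checked: str = "True",
            value_if_unchecked: str = "False") -> str:
    """Map a single checkbox to one of two output values."""
    return value_if_checked if box.checked else value_if_unchecked


loan_types = [
    LabeledBox("Conventional", False),
    LabeledBox("FHA", True),
    LabeledBox("VA", False),
]
print(check_one(loan_types)[0])  # ('FHA', 1.0)
```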


Layout Data and Labeled OMR

Labeled OMR is unique among the OMR extractors in that layout data is not necessary in order for it to function. It will use layout data if present. If not, it will run its own box detection pass to determine if a checkbox is next to a label.

This is exceptionally useful for non-standard checkboxes difficult (or impossible) for Box Detection to detect.

Example: Box Detection cannot detect radio buttons.
  • Radio buttons are circles, not boxes.
  • Box Detection only detects boxes. There's no way for it to detect circular "checkboxes" like radio buttons.

Do not worry. Labeled OMR has detection capabilities that allow it to detect radio buttons and other non-standard checkboxes.


Here, the Labeled OMR extractor is unchanged from the example described above.

  1. However, this document uses radio buttons instead of checkboxes to detail the Closing Disclosure's loan type.
  2. In both cases, the correctly checked label is returned.

Ordered OMR

Ordered OMR returns information for multiple check boxes within a defined zone based on their order and layout. The zone may be optionally fixed on the page or anchored to a static text value (such as a label).

The Ordered OMR extractor is a little more complicated to set up than Labeled OMR but can be used in cases where Labeled OMR is not producing the desired results. However, checkbox data must be present in the document's layout data. Ordered OMR will not function without this data present before executing.

Furthermore, Ordered OMR assumes the checkboxes will be ordered one after the other either vertically or horizontally along a single line. For documents with sections of checkboxes broken up into multiple columns (for vertically ordered checkboxes) or multiple lines (for horizontally ordered checkboxes), multiple Ordered OMR extractors may be necessary (one for each column or line of checkboxes).

For Ordered OMR, you will indicate where boxes are on a document by drawing a rectangular zone around the checkboxes. All checkboxes must fall within the drawn zone. This zone is configured using the "Location" property. This can be one of four options:

  • Fixed Region - This option is the simplest to set up. As the name implies, the rectangular zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box around the checkboxes.
  • Relative Region - Instead of setting the zone in a fixed location for every document, the Relative Region mode will anchor the zone to a text label on the document. The zone's position will change relative to the label's position on the document, but will still have the same drawn dimensions.
    • This option is useful to overcome issues arising during scanning printed documents. Slight variations can occur as to where a value is when printing or scanning a document, even for very structured documents. This can cause problems when drawing a single fixed region for the zone. However, if you can anchor the zone off an extractable text value, the zone's position will shift according to that anchor's position.
  • Text Region - The Text Region option creates a rectangular zone using the logical boundaries of an extraction result. This can create the zone within the boundaries of the extractor's result.
    • This can also be configured to provide results in a similar way the Relative Region option does, using text anchors located by an extractor to position the extraction zone's location. This means both methods can be used to position the zone relative to a point from document to document. The main difference is in how the zone is drawn.
  • Shape Region - The Shape Region option is extremely similar to the Text Region option. However, instead of using text to anchor the extraction zone, it uses a shape detected from a Shape Detection or Shape Removal command.
    • This is the least common method used.
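How a Relative Region follows its anchor can be sketched in a few lines of Python. This only illustrates the idea described above and is not Grooper's implementation. `Rect`, `relative_zone`, and the pixel coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    width: float
    height: float


def relative_zone(anchor: Rect, dx: float, dy: float,
                  width: float, height: float) -> Rect:
    """Place a zone of fixed size at a fixed offset from the anchor's top-left corner."""
    return Rect(anchor.left + dx, anchor.top + dy, width, height)


# The same label found on two scans of the same form, shifted slightly by the scanner.
anchor_scan_a = Rect(left=100, top=200, width=80, height=12)
anchor_scan_b = Rect(left=106, top=197, width=80, height=12)

zone_a = relative_zone(anchor_scan_a, dx=90, dy=-4, width=150, height=20)
zone_b = relative_zone(anchor_scan_b, dx=90, dy=-4, width=150, height=20)

print(zone_a)  # Rect(left=190, top=196, width=150, height=20)
print(zone_b)  # Rect(left=196, top=193, width=150, height=20)
```

The zone keeps its drawn dimensions in both cases. Only its position changes, by exactly the amount the anchor moved.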

Ordered OMR must also have its "Mode" property configured. This property behaves the same as Labeled OMR. It determines how many checkboxes should be checked for the checkboxes falling within the rectangular zone. This can be one of three options:

  • CheckOne will target multiple checkboxes but presumes only one may be checked. This is for when documents present a list of options, only one of which may be chosen.
  • CheckMulti will target multiple checkboxes and presumes any number of them may be checked. This is for when documents present a list of options and any number of them can be chosen.
    • All labels are returned as a single comma-separated result.
  • Boolean targets only a single checkbox, presuming the checkbox represents a Boolean "true or false" answer. This is for when documents present a single checkbox where checking the box indicates one thing and leaving it unchecked means another.

Ordered OMR is different from Labeled OMR in that the output values are not located by a Label Extractor. Instead, you will use the "Output Values" property to list each checkbox's corresponding value. Ordered OMR presumes the checkboxes are either stacked on top of each other in a single column or ordered next to each other across a single horizontal line. Using the Output Values property, you will enter a comma-separated list of the checkboxes' labels from top to bottom or left to right.

  • When selecting the Boolean mode, only two values may be entered.

The "Flow Direction" property determines how they are ordered, either Vertical or Horizontal.

  • "Vertical" is appropriate for boxes stacked on top of each other.
  • "Horizontal" is appropriate for boxes next to each other along a horizontal line.


In this example, a Value Reader is configured to return the "This estimate includes" options for a "Closing Disclosure" form, using the Ordered OMR extractor.

  1. The Value Reader's "Extractor" property is set to Ordered OMR.
  2. You must configure the "Location" property. This determines where the zone is placed on each document. All checkboxes should fall in this zone.
    • In this case, we've selected "Fixed Region" and drawn a rectangle on the page. These forms are highly structured. We can assume the checkboxes will always fall within the same rectangular coordinates.


  1. You can see the drawn zone in green in the "Document Viewer" pane.
  2. Configure the "Mode" property according to how the checkboxes behave on the document.
    • Here, multiple boxes can be checked. The estimate could include property taxes, homeowner's insurance, other costs, or any combination of the three. So, it is set to CheckMulti.
  3. Each label is entered as a comma-separated list in the "Output Values" property.
  4. Whether the checkboxes are ordered vertically or horizontally is set by the "Flow Direction" property.
    • Here, the checkboxes are stacked on top of each other. So, it is set to Vertical.
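The way "Output Values" and "Flow Direction" combine can be sketched in Python. This is a simplified model of the CheckMulti case described above, not Grooper's implementation. The `Box` class, the `ordered_omr` function, and the coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float       # left edge of the detected checkbox
    y: float       # top edge of the detected checkbox
    checked: bool  # check state from the layout data


def ordered_omr(boxes, zone, output_values, flow="Vertical", separator=","):
    """Return the output values of the checked boxes inside the zone.

    zone is (left, top, right, bottom).
    output_values is a comma-separated string, listed in reading order.
    """
    left, top, right, bottom = zone
    in_zone = [b for b in boxes if left <= b.x <= right and top <= b.y <= bottom]
    # Vertical: order top to bottom. Horizontal: order left to right.
    in_zone.sort(key=(lambda b: b.y) if flow == "Vertical" else (lambda b: b.x))
    values = [v.strip() for v in output_values.split(",")]
    if len(values) != len(in_zone):
        raise ValueError("number of output values must match the boxes in the zone")
    return separator.join(v for v, b in zip(values, in_zone) if b.checked)


detected = [
    Box(10, 300, True),   # third box in the column
    Box(10, 100, True),   # first box in the column
    Box(10, 200, False),  # second box in the column
    Box(500, 100, True),  # outside the drawn zone, ignored
]
result = ordered_omr(detected, (0, 0, 50, 400),
                     "Property Taxes, Homeowner's Insurance, Other")
print(result)  # Property Taxes,Other
```

Because the values are matched purely by position, a box that is missed or falsely detected inside the zone shifts every value after it. The sketch raises an error in that case rather than guessing.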

Zonal OMR

For Zonal OMR a rectangular zone must be drawn around the location of each individual checkbox (rather than a single zone for all checkboxes as is the case for the Ordered OMR extractor). This typically makes Zonal OMR the most time consuming of the OMR extractors as far as set up goes, but may be necessary to target checkboxes on forms whose order is not targetable by Ordered OMR (for example, checkboxes in non-standard orientations next to their labels) and/or whose labels cannot be extracted by Labeled OMR (for example, due to poor OCR).

Furthermore, checkbox data must be present in the document's layout data. Zonal OMR will not function without this data present.

Like all OMR extractors, Zonal OMR must have its "Mode" property configured. This determines how many checkboxes can be checked for the checkboxes falling within the rectangular zones. This can be one of three options:

  • CheckOne will target multiple checkboxes but presumes only one may be checked. This is for when documents present a list of options, only one of which may be chosen.
  • CheckMulti will target multiple checkboxes and presumes any number of them may be checked. This is for when documents present a list of options and any number of them can be chosen.
    • All labels are returned as a single comma-separated result.
  • Boolean targets only a single checkbox, presuming the checkbox represents a Boolean "true or false" answer. This is for when documents present a single checkbox where checking the box indicates one thing and leaving it unchecked means another.
    • By default the result will evaluate to either True or False but this can be altered using the "Value If Checked" and "Value If Unchecked" properties.

The "Anchor" property allows all drawn zones for each checkbox to be anchored to an extractible text result. This is useful to overcome issues arising during scanning printed documents. Slight variations can occur as to where a value is when printing or scanning a document, even for very structured documents. This can cause problems when drawing a single fixed region for the zone. However, if you can anchor the zone off an extractable text value, the zone's position will shift according to that anchor's position.

The rectangular zones are drawn using the "OMR Boxes" property. This will bring up a collection editor to draw registration zones for each checkbox. Here, you will also enter the output result for each OMR zone. This editor also allows you to anchor zones to a text anchor for each individual zone, meaning you can anchor a single zone here (as well as anchoring the collection of OMR zones using the Anchor property described above).
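The idea of one zone per checkbox, each with its own output value, can be sketched in Python. This is an illustration under stated assumptions, not Grooper's implementation. The `OmrZone` class, the `zonal_omr` function, the use of box centers, and the coordinates are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OmrZone:
    value: str                                 # output result for this checkbox
    bounds: tuple[float, float, float, float]  # (left, top, right, bottom) as drawn


def zonal_omr(zones, checked_centers, anchor_shift=(0.0, 0.0)):
    """Return the values of zones that contain the center of a checked box.

    checked_centers: (x, y) centers of boxes detected as checked.
    anchor_shift: how far the anchor moved from where the zones were drawn.
    """
    dx, dy = anchor_shift
    hits = []
    for zone in zones:
        left, top, right, bottom = zone.bounds
        # Move the drawn zone by the same amount the anchor moved.
        left, right = left + dx, right + dx
        top, bottom = top + dy, bottom + dy
        if any(left <= x <= right and top <= y <= bottom for x, y in checked_centers):
            hits.append(zone.value)
    return hits


zones = [
    OmrZone("Conventional", (0, 0, 20, 20)),
    OmrZone("FHA", (30, 0, 50, 20)),
    OmrZone("VA", (60, 0, 80, 20)),
    OmrZone("Other", (90, 0, 110, 20)),
]
print(zonal_omr(zones, [(100, 10)]))  # ['Other']
```

No label text is needed for the "Other" zone. Its output comes entirely from the value entered for that zone.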


In this example, a Value Reader is configured to return the type of loan applied for on a "Closing Disclosure" form, using the Zonal OMR extractor.

  1. The Value Reader's "Extractor" property is set to Zonal OMR.
  2. Configure the "Mode" property according to how the checkboxes behave on the document.
    • Here, there can only be one type of loan for each individual Closing Disclosure. So, it is set to CheckOne.
  3. The "Anchor" property is used to anchor the collection of OMR zones to a text result on the document.
    • Note: Configuring "Anchor" is optional. You can choose to configure an anchor extractor or not. In many cases, using one will increase the accuracy of your results, but it is not required. Furthermore, using the Anchor sub properties you can choose to make the anchor required to perform extraction or not if it fails to produce a result.
  4. We chose to anchor the collection of OMR zones to the text "Loan Type" in this case.
  5. One rectangular zone is drawn around each checkbox using the "OMR Zones" collection editor.


  1. Add one zone for each checkbox using the "Add" button.
  2. Use the "Value" property to enter the output result.
  3. Use the "Bounds" property to enter the zone's coordinates.
    • Select this property and press the ellipsis button at the end to lasso the zone with your cursor.

In this case, we have four checkboxes. So, we have added four items to our collection list here, one for each type of application.


Remember how we ignored the "Other" application type for our Labeled OMR example? The text entered for the "Other" application type could be anything, which makes it more difficult to extract a label. The text could be anything under the sun.

  1. With Zonal OMR, we didn't have to enter the label. We just drew a zone around the corresponding checkbox.
  2. Its output result is what we entered in this OMR box's Value property: Other.

Note: This doesn't mean you couldn't use Labeled OMR and still successfully extract the text label here (even though it could be anything!). It would just require a more complex extractor than we used for our earlier example.

Value Extractors: Barcode Extractors

Both the Find Barcode and Read Barcode extractors will return a barcode's encoded value as its result. Barcodes encode information according to the "symbology" used. Grooper has the ability to detect and read 29 different symbologies, including Code128, Code39, US postal barcodes, QR codes, and UPC barcodes.

Find Barcode and Read Barcode differ in terms of when Grooper's barcode detection occurs in the document processing pipeline.

Find Barcode - Barcode detection must occur before extraction.

  • Find Barcode just reads the document's or page's layout data file to locate the barcode and return its value. The barcode's value must be obtained by a "Barcode Detection" or "Barcode Removal" step in an IP Profile before the extractor executes.
  • Find Barcode executes very quickly because the "heavy lift" of detection has already been done.

Read Barcode - Barcode detection occurs during extraction.

  • No layout data is required. No previous "Barcode Detection" or "Barcode Removal" step is necessary.
  • This extractor's configuration is very similar to how "Barcode Detection" is configured. They both use the same underlying technology to locate barcodes.

Which one is right for you?

Barcode detection takes time. The choice boils down to when you want that additional processing time to take place.

  • Find Barcode - Detection must occur during an Image Processing or Recognize step. This will slow down those steps but speed up the Extract step (or whatever step is executing Find Barcode).
  • Read Barcode - Detection occurs during the Extract step (or whatever step is executing Read Barcode). This will slow down that step but free up compute power for Image Processing or Recognize.

Read Barcode

In this example, a Value Reader is configured to return the date encoded in a barcode, using the Read Barcode extractor.

  1. The Value Reader's "Extractor" property is set to Read Barcode.
  2. For any configuration, you must define which barcode symbologies are used in your document. This is configured using the "Detection Settings" properties.
  3. You must configure at least one "Reader".
    • Grooper has four barcode readers available: "Standard Reader", "1D Reader", "2D Reader", and "Postal Reader"
    • The "1D Reader", "2D Reader", and "Postal Reader" options work faster and generally provide better results than the "Standard Reader", but they are limited in the barcode symbologies they support.
    • In this case, "1D Reader" is enabled. The desired barcode uses the Code 39 symbology, which is supported by the 1D Reader.
  4. Use the Barcode Symbologies property to select which barcode symbology you wish to detect.
    • In this case, we have selected Code39.
    • You may select multiple barcode symbologies. However, this will slow down processing and may provide false positive results.
  5. Optionally, you may configure the "Value Pattern" property to write a regular expression pattern to validate the barcode's value.
    • In this case a simple regex matching a date format \d{1,2}/\d{1,2}/(\d{4}|\d{2}) is used to validate the returned value is a date.
    • The regex pattern written here does not parse the value. It just validates it. If the regex matches any portion of the barcode's value, the whole value is returned.
  6. When the Value Reader executes, the barcode is detected and its encoded value is returned.
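The "validate, don't parse" behavior of the Value Pattern can be sketched with Python's `re` module, using the same date pattern as the example above. The function name `validate_barcode_value` and the sample barcode values are hypothetical, and Python's regex flavor may differ in small ways from the one Grooper uses.

```python
import re
from typing import Optional

# The date pattern from the example above.
DATE_PATTERN = re.compile(r"\d{1,2}/\d{1,2}/(\d{4}|\d{2})")


def validate_barcode_value(value: str,
                           pattern: re.Pattern = DATE_PATTERN) -> Optional[str]:
    """Return the whole value if the pattern matches any part of it, else None."""
    # search() succeeds on a match anywhere in the string. The matched
    # portion is not used: the pattern only decides whether to keep the value.
    return value if pattern.search(value) else None


print(validate_barcode_value("01/12/2020"))        # 01/12/2020
print(validate_barcode_value("DOC 01/12/2020 A"))  # DOC 01/12/2020 A
print(validate_barcode_value("ABC123"))            # None
```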

Find Barcode

Configuration Prereqs - Obtain Layout Data

Layout Data is visual information on a document obtained during an image processing operation. This includes checkboxes and their check states, line locations, and barcodes.

Layout data is collected by IP Profiles when executed by the Image Processing or Recognize activity. There are several IP Commands that obtain layout data pertinent to data extraction.

In the case of Find Barcode, the barcode must previously be detected by the "Barcode Detection" or "Barcode Removal" IP Command. The barcode's data must be stored in the page's layout data.

  1. Here, we have created an IP Profile named "Layout Data". This IP Profile's first step uses the "Barcode Removal" command.
    • Whether using Barcode Detection or Barcode Removal, Grooper will detect barcodes and save their layout data. Barcode Removal will digitally remove the barcode from the page's image whereas Barcode Detection will not.
  2. You must define which barcode symbologies are used in your document. This is configured using the "Detection Settings" properties.
    • Read Barcode and Barcode Detection/Removal use the same technology to detect barcodes.
  3. You must configure at least one "Reader".
    • Grooper has four barcode readers available: "Standard Reader", "1D Reader", "2D Reader", and "Postal Reader"
    • The "1D Reader", "2D Reader", and "Postal Reader" options work faster and generally provide better results than the "Standard Reader", but they are limited in the barcode symbologies they support.
    • In this case, "1D Reader" is enabled. The desired barcode uses the Code 39 symbology, which is supported by the 1D Reader.
  4. Use the "Barcode Symbologies" property to select which barcode symbology you wish to detect.
    • In this case, we have selected Code39.
    • You may select multiple barcode symbologies. However, this will slow down processing and may provide false positive results.
  5. In the "Image Diagnostics" panel, the Barcode Detection/Removal command's "Execution Log" will show you the results of the command.
  6. Here, you can see a "Code39" barcode was found, its positional boundaries, and its encoded value ("01/12/2020").

An IP Profile will either permanently alter an image or only temporarily adjust it to improve OCR accuracy, depending on which activity executes it. An IP Profile can be executed in one of two ways:

  • For permanent image processing, the IP Profile is executed by the Image Processing activity.
  • For temporary image processing, the IP Profile is executed by the Recognize activity.
    • Recognize will execute IP Profiles in one of two ways:
    • By an IP Profile referenced in an OCR Profile (useful for collecting layout data from scanned and other image-based pages)
    • By one assigned to its "Alternate IP" property (useful for collecting layout data from digitally native pages).

FYI: Layout Data Verification

How can you verify layout data is saved if you're not testing an IP Profile/IP Step? Where does this information live in Grooper?

Layout data (including barcode locations and their decoded values) is saved in the Batch Page's "Grooper.Layout.json" file. You can verify this in the Batch Page's "Advanced" tab.

  1. Select the processed Batch Page in the node tree.
  2. Navigate to the "Advanced" tab.
  3. Navigate to the "Files" tab.
  4. Double click the "Grooper.Layout.json" file to open it.
    • If this file is not present, either an IP Profile has not been executed, it has no layout data collecting commands, or no layout data was detected.


  1. The detected barcode information is stored in the JSON file in a way Grooper can quickly access.

Find Barcode

In this example, a Value Reader is configured to return the date encoded in a barcode, using the Find Barcode extractor. This document's layout data was previously obtained with the IP Profile described above during the Recognize activity.

  1. The Value Reader's "Extractor" property is set to Find Barcode.
  2. Since the barcode's value is already stored in the document's layout data file, all you need to do is define what barcode symbology you're looking for.
    • In this case, we're looking for a Code39 barcode value.
    • Note: You may also choose All to return any and all barcode values in the "Grooper.Layout.json" file.


  1. Because the barcode was detected before the extractor executes, it runs very quickly.
  2. Rather than digitally scanning the page for a barcode, it simply returns the already obtained information stored in the layout data file.
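Why this lookup is fast can be sketched in Python: the extractor only filters values that were already decoded and stored. This is an illustration, not Grooper's implementation. The record shape (dictionaries with "symbology" and "value" keys) is a hypothetical stand-in and is not the actual structure of the "Grooper.Layout.json" file.

```python
def find_barcodes(stored_barcodes, symbology="All"):
    """Return the values of already-detected barcodes matching a symbology.

    stored_barcodes: list of dicts with hypothetical keys "symbology" and "value".
    symbology: a symbology name, or "All" to return every stored value.
    """
    return [
        barcode["value"]
        for barcode in stored_barcodes
        if symbology == "All" or barcode["symbology"] == symbology
    ]


# Hypothetical records, standing in for barcodes saved during Recognize.
stored = [
    {"symbology": "Code39", "value": "01/12/2020"},
    {"symbology": "QR", "value": "example-qr-value"},
]

print(find_barcodes(stored, "Code39"))  # ['01/12/2020']
print(find_barcodes(stored))            # ['01/12/2020', 'example-qr-value']
```

No image is read at any point in this sketch. All of the costly detection work happened earlier, when the records were created.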

Value Extractors: Zonal extractors

Read Zone

The Read Zone extractor allows you to extract text data in a rectangular region (called an "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text from the same location on a document, or a zone relative to an extracted text anchor or shape location on the document.

Read Zone is useful for extracting data from highly structured documents. If a document's structure is fixed, it's going to have the same fields in the same physical location from one document to the next. The Closing Disclosure forms we've been looking at in this article are themselves fairly fixed. For example, the "Loan Amount" listed on the first page is more or less in the same spot for every single Closing Disclosure. The dollar amount itself may change, but there's only so much room that amount can take up on the document.

If you can draw a rectangle around the value you want to extract, and the value falls within the boundaries of that rectangle for every single document, extraction may be as simple as just extracting the text in the rectangle's location. This is referred to as "zonal extraction". You draw a zone where the value exists on the page and return the text data falling in the zone.
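Zonal extraction can be sketched in Python as a filter over words and their positions. This is a simplified model, not Grooper's implementation. The `Word` class, the `read_zone` function, the coordinates, and the rule that a word must lie entirely inside the zone are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    left: float
    top: float
    right: float
    bottom: float


def read_zone(words, zone):
    """Return the text of the words that lie entirely inside the zone.

    zone is (left, top, right, bottom).
    """
    z_left, z_top, z_right, z_bottom = zone
    inside = [
        w for w in words
        if w.left >= z_left and w.right <= z_right
        and w.top >= z_top and w.bottom <= z_bottom
    ]
    # Put the words in reading order: top to bottom, then left to right.
    inside.sort(key=lambda w: (w.top, w.left))
    return " ".join(w.text for w in inside)


page_words = [
    Word("Loan", 0, 0, 40, 10),
    Word("Amount", 45, 0, 100, 10),
    Word("$", 120, 0, 125, 10),
    Word("159,432.62", 130, 0, 200, 10),
]
print(read_zone(page_words, (110, -5, 210, 15)))  # $ 159,432.62
```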

Read Zone has a few different options for where the box is placed using the "Location" property. This can be one of four options:

  • Fixed Region - This option is the simplest to set up. As the name implies, the rectangular zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box where you want to extract data.
  • Relative Region - Instead of setting the zone in a fixed location for every document, the Relative Region mode will anchor the zone to a text label on the document. The zone's position will change relative to the label's position on the document, but will still have the same drawn dimensions.
    • This option is useful to overcome issues arising during scanning printed documents. Slight variations can occur as to where a value is when printing or scanning a document, even for very structured documents. This can cause problems when drawing a single fixed region for the zone. However, if you can anchor the zone off an extractable text value, the zone's position will shift according to that anchor's position.
  • Text Region - The Text Region option creates a rectangular zone using the logical boundaries of an extraction result. This can create the zone within the boundaries of the extractor's result.
    • This can also be configured to provide results in a similar way the Relative Region option does, using text anchors located by an extractor to position the extraction zone's location. This means both methods can be used to position the zone relative to a point from document to document. The main difference is in how the zone is drawn.
  • Shape Region - The Shape Region option is extremely similar to the Text Region option. However, instead of using text to anchor the extraction zone, it uses a shape detected from a Shape Detection or Shape Removal command.
    • This is the least common method used.

The Read Zone extractor can optionally re-process the text data with an OCR Profile. This can be used to perform custom OCR on the zone being extracted.

The text in the zone can also be itself extracted by configuring its "Value Extractor" property. This allows you to break up the document into a smaller portion and run an extractor on just the zone instead of the full document. Essentially, you use the Read Zone extractor to create a smaller data instance (from the larger document data instance) and use its Value Extractor property to return data from the smaller data instance.


In this example, a Value Reader is configured to return the "Loan Amount" value as described on the first page of a Closing Disclosure form, using the Read Zone extractor.

  1. The Value Reader's "Extractor" property is set to Read Zone.
  2. You must configure the "Location" property. This determines where the extraction zone is placed on each document.
    • In this case, we configured the "Relative Region" option. We're using the text label "Loan Amount" as the anchor for the drawn extraction zone. You fully configure whichever Location mode you choose by expanding and configuring its sub-properties.
      • The extracted anchor is seen in the "Document Viewer" outlined in blue.
      • The extraction zone is seen highlighted in green. Any text falling within that green box is returned as the result.
  3. The "Output Full Region" property is very handy. It doesn't change the result at all; it just shows the full size of the drawn zone on the page. This is extremely useful when testing and configuring the Read Zone extractor.
  4. If "Output Full Region" were set to False, only the text ($ 159,432.62) would be highlighted, not the full drawn zone seen here.

Highlight Zone

The Highlight Zone extractor is unique in that it doesn't actually extract anything at all!

So, why use it? Highlight Zone can be useful for quickly calling attention to Data Fields requiring manual validation during a Review step.

Example: Handwritten fields on a form are unlikely to be recognized by OCR. OCR is designed to read machine-printed characters. While there have been some recent advancements in handwriting detection (such as Azure OCR), OCR engines traditionally tend to fail at recognizing handwriting. In the case of fields written in by hand, you will likely need a human being to enter that information during Review.

However, if those fields are in the same spots on the document (or same spot relative to some text label), you can use Highlight Zone to draw the data entry clerk's attention to its location on the page. This will save them time and your organization money.

The highlighted zones are drawn using the exact same "Location" methods and property configurations as Read Zone.


In this example, a Value Reader is configured to highlight a signature field, using the Highlight Zone extractor.

  1. The Value Reader's "Extractor" property is set to Highlight Zone.
  2. You must configure the "Location" property. This determines where the highlighted zone is placed on each document.
  3. In this case, we used "Fixed Region". The drawn rectangle has the same coordinates (drawn using the "Bounds" editor) for every single document.
  4. We also put a "Page Filter" on the zone. Since the signature page is always on page 5 of the document, we set it to 5.
  5. This could be used to draw a data entry clerk's attention to the signature page to quickly verify if the document is signed.

Detect Signature

Detect Signature is a zonal extractor specifically designed to detect if a signature is present or not. It uses one of the four "Location" options (Fixed Region, Relative Region, Shape Region or Text Region) to draw an extraction zone on a geographic region of the page. Instead of extracting text from that box, it analyzes the image to determine if a signature is present.

Think about a signature line. If you drew a box around where you expect someone to sign, nothing would be in the box if it was not signed. If it were signed, some of the box would be filled in, regardless of what the signature looks like.

This is how Detect Signature works. Detect Signature determines if the zone is signed using a simple pixel count. It turns the image black and white and counts the number of black pixels. If the percentage of black pixels meets a certain threshold, the extractor returns a value of "Signed". If it falls below that threshold, the extractor returns a value of "Not Signed".


In this example, a Value Reader is configured to return whether or not the "Applicant Signature" is present on the Closing Disclosure form, using the Detect Signature extractor.

  1. The Value Reader's "Extractor" property is set to Detect Signature.
  2. You must configure the "Location" property. This determines where the extraction zone is placed on each document.
    • In this case, we used "Fixed Region".
    • The drawn rectangle has the same coordinates (drawn using the "Bounds" editor) for every single document.
  3. Optionally, you may pre-process the image with an IP Profile using the "IP Profile" property.
    • This can aid the signature detection process.
    • Many signatures may be small or faint. A common technique is to use an IP Profile with a "Dilate Erode" command to dilate (or "bloat") the signature's pixels, allowing the extractor to more easily detect the signature.
  4. The "Fill Percentage" property determines how many black pixels must be present in order for the extractor to consider the zone signed.
    • In this case, if at least 25% of the pixels in the extraction zone are black, the extractor will consider the zone "Signed".
  5. Optionally, you may change the returned value using the "Value If Filled" and "Value If Not Filled" properties.
  6. Here, more than 25% of the pixels in the extraction zone are black (or "filled"), and Detect Signature returns a value of "Signed".
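
The pixel-count logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Grooper's code: the `detect_signature` function and its parameter names are hypothetical (loosely mirroring the "Fill Percentage", "Value If Filled" and "Value If Not Filled" properties), the zone is assumed to be already black and white, and treating exactly 25% as "Signed" follows the "at least 25%" wording in the example.

```python
def detect_signature(zone_pixels, fill_percentage=25.0,
                     value_if_filled="Signed", value_if_not_filled="Not Signed"):
    """Classify a zone by the share of black pixels.

    zone_pixels: rows of 0/1 values, where 1 means a black pixel.
    """
    total = sum(len(row) for row in zone_pixels)
    if total == 0:
        # An empty zone has nothing to count, so treat it as not filled.
        return value_if_not_filled
    black = sum(sum(row) for row in zone_pixels)
    filled = 100.0 * black / total
    return value_if_filled if filled >= fill_percentage else value_if_not_filled


# A blank signature box: 0 of 40 pixels are black.
blank_zone = [[0] * 10 for _ in range(4)]

# A box with pen strokes in it: 15 of 40 pixels are black (37.5%).
signed_zone = [
    [0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
]
```

With the default of 25, `detect_signature(signed_zone)` reports the zone as signed. Raising `fill_percentage` to 50 would flip the same zone to not signed, which shows why the threshold needs tuning against real documents.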


FYI

Keep in mind the Detect Signature extractor always examines a pre-processed image (not the image seen in the Document Viewer) even when the "IP Profile" property is not configured.

Detect Signature requires a black and white image to work. Grooper knows a pixel is "filled" because it is black and not white. If the image is color or grayscale, Grooper will temporarily convert it to black and white on its own, even if you do not configure an IP Profile.

If you want control over how Grooper turns the image black and white, this is another reason to use an IP Profile, which lets you customize the conversion with either the "Threshold" or "Binarize" command.
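
To see why control over the black and white conversion matters, consider the generic global-threshold sketch below. It is not Grooper's "Threshold" or "Binarize" command; the `binarize` function, the cutoff values, and the sample pixel data are assumptions chosen only to show how the cutoff changes which pixels count as "filled".

```python
def binarize(gray_rows, threshold=128):
    """Map grayscale values (0 = black, 255 = white) to 1 for black, 0 for white."""
    return [[1 if value < threshold else 0 for value in row] for row in gray_rows]


# A faint pen stroke: mid-gray pixels on a near-white background.
faint_signature = [
    [250, 180, 170, 245],
    [240, 160, 175, 250],
]

strict = binarize(faint_signature, threshold=128)   # faint strokes are lost
lenient = binarize(faint_signature, threshold=200)  # faint strokes count as black
```

With the strict cutoff, every pixel becomes white and a pixel count would see an empty zone. With the lenient cutoff, half of the pixels become black.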

Value Extractors: The Reference Extractor

The Reference extractor simply returns the results of another extractor node in the node tree.

You can use the Reference extractor to reference any of the three extractor node types:

  • Data Types
  • Value Readers
  • Field Classes

This can be useful to keep you from duplicating your efforts over and over again. For example, if you have a variety of different extractors needing to return a currency value, don't create a new extractor every time you need to return that data. Just create a single currency extractor (Value Reader or Data Type). Then, use the Reference extractor to use it and re-use it over and over.

In this example, we will create a Value Reader that references another Value Reader, using the Reference extractor.

  • This may seem like a somewhat silly example. Usually, a Data Type or an extractor property such as a Data Field's "Value Extractor" property will utilize the "Reference" option.
  1. The Value Reader's "Extractor" property is set to Reference.
  2. Use the Reference extractor's "Extractor" property to point to an extractor in the node tree.
    • You can reference a Data Type, Value Reader, or Field Class.
  3. When the Reference extractor executes, the results of the referenced extractor are returned.
  4. You can see the name of the extractor returning the value in the "Name" column.
  5. In this case, a Value Reader named "Pattern Match - Currency".
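
The reuse pattern behind the Reference extractor can be sketched in plain Python. This is an analogy only: the `extractors` registry, the `pattern_match` and `reference` helpers, the node names other than "Pattern Match - Currency", and the sample text are hypothetical and do not reflect how Grooper stores or resolves nodes.

```python
import re
from typing import Callable, Dict, List

Extractor = Callable[[str], List[str]]

# Hypothetical registry standing in for extractor nodes in the node tree.
extractors: Dict[str, Extractor] = {}


def pattern_match(pattern: str) -> Extractor:
    """Build an extractor that returns every regex match in the text."""
    compiled = re.compile(pattern)
    return lambda text: compiled.findall(text)


def reference(name: str) -> Extractor:
    """Look the target up at call time, so edits to it are picked up everywhere."""
    return lambda text: extractors[name](text)


# Define the currency logic once...
extractors["Pattern Match - Currency"] = pattern_match(r"\$\d{1,3}(?:,\d{3})*\.\d{2}")

# ...and point other extractors at it instead of copying the pattern.
extractors["Loan Amount"] = reference("Pattern Match - Currency")
extractors["Closing Costs"] = reference("Pattern Match - Currency")

# Made-up sample text for illustration.
text = "Loan Amount $159,432.62 Closing Costs $8,054.61"
results = extractors["Loan Amount"](text)
```

Because `reference` resolves its target when it runs, correcting the pattern in one place changes the results of every extractor that points to it.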

Value Readers vs Data Types

Before version 2021, the Data Type extractor node was considered the bread and butter of data extraction. For many older Grooper users, a "data extractor" and a Data Type are synonymous. However, the Value Reader node was (in part) designed to be the "general purpose extractor" in Grooper. What, then, is a Data Type's primary function in Grooper? Data collation.

  • A Value Reader is a Grooper node designed for simple data extraction. It returns the initial data set from the document.
  • A Data Type is a Grooper node designed for complex data collation. It collects results from multiple extraction sources, processes them further as needed, and returns the final data set from the document.

Because data collation is so important for a variety of extraction techniques, it's almost natural to equate collated data with extracted data. But there are really two parts to what's going on. First, values are extracted from the document's text data. Then, they are collated and finally returned by the Data Type.

While both Value Readers and Data Types are considered "extractors", they really have two different jobs as far as Grooper is concerned. One way to think of this is a Value Reader is a "data finder" while a Data Type is a "data manipulator".

  • It is a Value Reader's job to locate and return data from a document.
  • It's a Data Type's job to take that data and organize it, manipulate it or impose constraints on what counts as valid data.

Example: You want to make an extractor that returns date values stacked on top of each other.

  • A Value Reader can be made to find the date values. Set it to use a Pattern Match extractor with a regex that matches dates.
  • A Data Type can be made to manipulate those date values into an array. Set it to (1) reference the Value Reader and (2) collate it using the Array method.
  • The Data Type collates the Value Reader's initial results and returns the final result.
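
The split between "data finder" and "data manipulator" can be sketched as two separate functions. This is a loose analogy, not a description of how Grooper's Array collation works: the `find_dates` and `collate_stacked` functions, the date regex, the sample page, and the rule that "stacked" means the same starting column on consecutive lines are all assumptions for illustration.

```python
import re
from itertools import groupby

DATE = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")


def find_dates(lines):
    """Stage 1 ('data finder'): return (line_index, column, text) for each date."""
    return [(i, m.start(), m.group(0))
            for i, line in enumerate(lines)
            for m in DATE.finditer(line)]


def collate_stacked(hits):
    """Stage 2 ('data manipulator'): group hits sharing a column on consecutive lines."""
    stacks = []
    by_column = sorted(hits, key=lambda h: (h[1], h[0]))
    for _, column_hits in groupby(by_column, key=lambda h: h[1]):
        current = []
        for hit in column_hits:
            # Start a new stack when the run of consecutive lines is broken.
            if current and hit[0] != current[-1][0] + 1:
                stacks.append(current)
                current = []
            current.append(hit)
        stacks.append(current)
    # A lone date is not a stack, so only keep groups of two or more.
    return [[h[2] for h in stack] for stack in stacks if len(stack) > 1]


page = [
    "Issued 01/15/2024 by the lender",
    "Payment dates:",
    "    02/01/2024",
    "    03/01/2024",
    "    04/01/2024",
]

initial_results = find_dates(page)               # every date on the page
final_results = collate_stacked(initial_results)  # only the stacked dates, as one list
```

The first stage knows nothing about layout and returns all four dates. The second stage adds the constraint, discarding the lone date in the first line and returning the three stacked dates as a single grouped result.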

FYI

A Data Type with only a "Local Extractor" configuration is identical in function to a Value Reader with the same "Extractor" configuration.

This lets Design page users use two conversion commands:

  • Convert To Data Type - Converts a Value Reader to a Data Type. The Value Reader's "Extractor" configuration becomes the Data Type's "Local Extractor" configuration.
  • Convert To Value Reader - Converts a Data Type to a Value Reader. The Data Type's "Local Extractor" configuration becomes the Value Reader's "Extractor" configuration.
    • This command only works if the Data Type's "Local Extractor" is the only thing configured and the Data Type has no child extractors. The command will not show up in the context menu if these conditions are not met.

You can access these commands by right-clicking the Value Reader or Data Type in the Node Tree.