Value Reader - 2021
The Value Reader is a data extraction tool in Grooper. This object allows users to return values from a document in a variety of ways, including regular expression pattern matching, optical mark recognition, and barcode detection.
Value Readers are Grooper's "one stop shop" for data extraction. They return a single result or a list of results, numerical or lexical, from the text data of a page or document folder (obtained via OCR or native text extraction during the Recognize activity).
About
The Value Reader is a new extraction object introduced in Grooper 2021. It expands on Grooper's regular expression pattern matching to include newer extraction capabilities, such as extracting values next to OMR (optical mark recognition) checkboxes and barcode values. In previous versions, this functionality was split across multiple objects (or properties of multiple objects). The Value Reader combines these disparate functionalities into a single extractor object. This object forms the foundation for extracting information from a document, using a variety of different methods.
- Do you need to extract a date? A Value Reader can do that!
- Do you need to extract anything matching a list of values? A Value Reader can do that!
- Do you need to extract English language unigrams (or bigrams, etc.)? A Value Reader can do that!
- Do you need to extract a value from a barcode? A Value Reader can do that!
- Do you need to extract the label next to a checked checkbox? A Value Reader can do that!
Do you need to find any value at all? You're going to use some kind of configuration of the Value Reader to do it.
Value Readers locate results using a variety of Extractor Types. The very first thing you will do when creating a Value Reader is decide which Extractor Type suits your extraction needs.
User Interface
Extractor Types
The Extractor Type options fall into one of five categories.
Category | Extractor Types | Comments |
Text Parsing Extractors | Pattern Match, List Match, Label Match, Word Match, Labeled Value, Field Match | These Extractor Types primarily rely on regular expressions, lists of values (such as a Lexicon of field labels), or other forms of text parsing to return values. |
OMR Extractors | Labeled OMR, Ordered OMR, Zonal OMR | These Extractor Types allow you to return values using optical mark recognition. These are useful for extracting values on documents that use checkboxes to detail information. |
Barcode Extractors | Find Barcode, Read Barcode | These Extractor Types allow you to return a value stored in a barcode. |
Zonal Extractors | | These Extractor Types are used to draw a logical rectangle somewhere on a document and return the text falling inside. These are useful for extracting values on highly structured documents where field values are consistently located in the same position on the page for every document. |
The Reference extractor | Reference | The Reference option allows you to return the results of another extractor, whether that is a Value Reader, a Data Type, or a Field Class. |
Text Parsing Extractors
Pattern Match
The Pattern Match extractor relies on regular expression (regex) pattern matching to return values. This is truly the foundation for almost all data extraction in Grooper. A regex pattern entered in the Value Pattern will run against the selected document, page, or data instance's text data. Matching results will be returned as this extractor's values.
You can also enter Prefix and Suffix Patterns so that a value is only returned if a regex pattern also matches the text before or after it. These are useful for anchoring the value you want to return next to some other piece of text. For example, a Prefix Pattern of \n could be used to only return results at the start of a new line because the \n character precedes every new line in the text data. Furthermore, only the data matched by the Value Pattern is returned, not the text matched by the Prefix and Suffix Patterns.
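To make the Prefix Pattern idea concrete, here is a minimal sketch in Python's re module (not Grooper itself). The function name match_with_anchors, the sample text, and the currency pattern are all made up for illustration. It only shows the general technique: the prefix and suffix must match, but just the value is returned.

```python
import re

def match_with_anchors(text, value_pattern, prefix_pattern="", suffix_pattern=""):
    """Return only the text matched by value_pattern.

    The prefix and suffix patterns must match directly before and
    after the value, but their text is left out of the result.
    """
    combined = re.compile(
        f"(?:{prefix_pattern})(?P<value>{value_pattern})(?:{suffix_pattern})"
    )
    return [m.group("value") for m in combined.finditer(text)]

# Hypothetical text data, as it might look after OCR.
sample = "Total due\n$1,250.00\nPaid: $300.00 on account\n$75.10"
currency = r"\$\d{1,3}(?:,\d{3})*\.\d{2}"

all_values = match_with_anchors(sample, currency)
# A Prefix Pattern of \n keeps only values that start a new line.
line_start_values = match_with_anchors(sample, currency, prefix_pattern=r"\n")

print(all_values)         # ['$1,250.00', '$300.00', '$75.10']
print(line_start_values)  # ['$1,250.00', '$75.10']
```

Note that "$300.00" drops out of the second result because it is preceded by "Paid: " rather than a new line.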
The Output Format allows you to alter the output result for data cleansing or other purposes.
The "Properties" tab allows you to further configure the regex extraction. Here, you can enable Tab Marking, Fuzzy RegEx mode, filter results based on page location, determine case sensitivity, use a Lexicon to perform lookup operations, and more.
In this example, a Value Reader is configured to return currency values, using the Pattern Match Extractor Type.
FYI | Prior to Grooper version 2021, the Pattern Match extractor's functionality was delivered in one of two ways:
Each of these methods used a "Pattern Editor" UI screen to configure a regular expression. In version 2021, the Data Format object and the Internal and Text Pattern extractor types are gone. The Pattern Match extractor replaces their functionality to return results matching a regular expression pattern. |
List Match
The List Match extractor returns values matching one or more items in a defined list. This could be used to match a list of field labels on a form, a list of company names, a list of document titles, or any other list of words or phrases. You can even enable the use of regular expression syntax to match a list of regex patterns.
Just like with Pattern Match, you can enter Prefix and Suffix Patterns to only return an item in the list if a regex pattern also matches before or after. These are useful for anchoring the value you want to return next to some other piece of text. For example, a Prefix Pattern of \n could be used to only return results at the start of a new line because the \n character precedes every new line in the text data. Furthermore, only the list item is returned, not the text matched by the Prefix and Suffix Patterns.
The Output Format allows you to alter the output result for data cleansing or other purposes.
The "Properties" tab allows you to further configure the list matching. Here, you can enable Tab Marking, Fuzzy RegEx mode, filter results based on page location, determine case sensitivity, reference a Lexicon for the list, and more.
In this example, a Value Reader is configured to return the field labels in the top portion of this Closing Disclosure form, using the List Match Extractor Type.
Commonly, a Lexicon is used as the list for List Match. This allows you to point to a Lexicon object rather than manually entering the list using the Local Entries property. This is excellent when matching large lists of items or when a single list is used by multiple extractors.
We could create a Value Reader configured to return any of these items just by pointing to the Lexicon.
FYI | Prior to Grooper version 2021, the List Match extractor's functionality was accomplished using the FuzzyList Mode option when configuring a regular expression. As with the Pattern Match extractor, this was delivered in one of two ways:
Each of these methods used a "Pattern Editor" UI screen to configure a regular expression. FuzzyList was an option for the regex Mode in the "Properties" tab. In version 2021, the Data Format object and the Internal and Text Pattern extractor types are gone. The List Match extractor replaces their functionality to return results matching a local list or reference lexicon of values. |
Label Match
The Label Match extractor is extremely similar to the List Match extractor in that it matches one or more items in a defined list. However, it is designed specifically to work with the Labeling Behavior functionality (also referred to as "Label Sets"). It will use the fuzzy extraction and vertical and constrained wrapping settings defined on the Content Model if a Labeling Behavior is enabled. This way, you can have a single, unified set of fuzzy match settings for multiple extractors. Rather than configuring these settings, including the confidence score threshold and fuzzy weighting, for multiple extractors, you can configure them just once when enabling the Labeling Behavior and all Label Match extractors will use them.
- For more information on fuzzy extraction, visit the Fuzzy RegEx article.
For the Label Match extractor to return a result, two conditions must be met.
- The document folder must be classified.
  - In other words, it must have a Document Type assigned to it.
- That Document Type must have a Labeling Behavior enabled.
  - Either on the Document Type or, more typically, its parent Content Model.
In this example, a Value Reader is configured to return a small list of field labels on an invoice, using the Label Match Extractor Type.
Word Match
The Word Match extractor is designed for n-gram extraction. An n-gram is "a contiguous sequence of n items from a given sample of text or speech." [1] Typically in Grooper, this refers to extracting words or phrases from a lexicon of terms. Often, this is for the purposes of feature collection for Lexical Classification. The Word Match extractor can capture 1-grams (single words) up to 5-grams (five word phrases). Lexicons are commonly used to dictate a dictionary of allowable returned words. This could be a general Lexicon of common English words or a custom Lexicon, such as one with industry-specific terms.
FYI |
An n-gram is often referred to by a different name depending on its n size.
As an additional FYI, four-grams are not called "tetragrams" because the term already has usage as a single word consisting of four letters or characters. "Quadrigram" is occasionally used, but four-gram is the more common terminology. Five-grams are not called "pentagrams", because that already has common usage for a geometric figure. |
Just like with Pattern Match, you can enter Prefix and Suffix Patterns to only return an n-gram if a regex pattern also matches before or after. These are useful for anchoring the n-gram you want to return next to some other piece of text. For example, a Prefix Pattern of \n could be used to only return n-grams at the start of a new line because the \n character precedes every new line in the text data. Furthermore, only the n-gram is returned, not the text matched by the Prefix and Suffix Patterns.
The Join Pattern property is unique to the Word Match extractor. This determines how the terms of bigrams, trigrams, four-grams, and five-grams can be joined. Most often, terms (or grams) are simply joined by a single space, as in the bigram "first second". If you leave this property blank, Grooper will assume n-grams are always separated by a single space. However, you may want to include n-grams that are separated by other characters, such as hyphenated words like "first-second". The Join Pattern allows you to enter a regular expression for the allowable characters between two grams. For example, a Join Pattern of [ -] would allow for a single space or hyphen to be between each term, matching "first second" as well as "first-second".
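The effect of a Join Pattern can be sketched outside of Grooper with a few lines of Python. Everything here is an illustrative assumption: the function name word_ngrams, the choice to treat a "word" as a run of letters, and the tiny lexicon. It is not how Grooper implements Word Match, only a demonstration of n-grams whose terms are joined by a configurable regex.

```python
import re

def word_ngrams(text, n, join_pattern=" ", lexicon=None):
    """Collect n-grams whose terms are separated by join_pattern.

    If a lexicon (a set of lowercase words) is given, every term in
    the n-gram must appear in it.
    """
    word = r"[A-Za-z]+"
    # The lookahead lets overlapping n-grams be found:
    # "a b c" yields both "a b" and "b c" for n=2.
    pattern = re.compile(
        rf"(?<![A-Za-z])(?=((?:{word})(?:(?:{join_pattern})(?:{word})){{{n - 1}}})(?![A-Za-z]))"
    )
    results = []
    for m in pattern.finditer(text):
        gram = m.group(1)
        terms = re.findall(word, gram)
        if lexicon is None or all(t.lower() in lexicon for t in terms):
            results.append(gram)
    return results

text = "first second and first-second"

space_only = word_ngrams(text, 2)                       # default: single space
space_or_hyphen = word_ngrams(text, 2, join_pattern="[ -]")
in_lexicon = word_ngrams(text, 2, "[ -]", lexicon={"first", "second"})

print(space_only)       # ['first second', 'second and', 'and first']
print(space_or_hyphen)  # ['first second', 'second and', 'and first', 'first-second']
print(in_lexicon)       # ['first second', 'first-second']
```

With the default join, "first-second" is not returned as a bigram. Widening the join to [ -] picks it up, and the lexicon filter then narrows the results to bigrams made only of allowed terms.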
The Output Format allows you to alter the output result for data cleansing or other purposes.
The "Properties" tab allows you to further configure the n-gram matching. Most importantly, the n-gram size is set here as well as any Lexicon used to lookup against the returned values. You can also enable Tab Marking, Fuzzy RegEx mode, filter results based on page location, determine case sensitivity, and more.
In this example, a Value Reader is configured to return bigram field labels, using the Word Match Extractor Type.
In this case, we also used the "Properties" tab to set the n-gram size to collect bigrams, and only return grams in an English language dictionary.
FYI | Prior to Grooper version 2021, n-gram extraction configuration was lumped into other regular expression pattern configurations. As with the Pattern Match extractor, this was delivered in one of two ways:
Each of these methods used a "Pattern Editor" UI screen to configure a regular expression. The n-gram size and referenced term lexicons were set in the "Properties" tab. In version 2021, the Data Format object and the Internal and Text Pattern extractor types are gone. The Word Match extractor replaces their functionality to return n-grams in an effort to simplify n-gram extraction setup and distinguish it from general regex pattern matching. |
Labeled Value
As the name implies, the Labeled Value extractor is designed to return labeled values. A common feature of structured forms is to divide information across a series of fields. But it's not as if you just have a bunch of data randomly strewn throughout the document. Typically, the field's value will be identified by some kind of label. These labels provide the critical context for what the data refers to.
Labeled Value relies on the spatial relationship between the label and the value. Most often labels and their corresponding values are aligned in one of two ways.
1. The value will be to the right of the label.
2. The value will be below the label.
Labeled Value uses two extractors itself, one to find the label and another for the value. If the two extractors' results are aligned horizontally or vertically within a certain amount of space (according to how the Labeled Value extractor is configured), the value's result is returned.
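The spatial check described above can be sketched roughly as follows. This is Python, not Grooper configuration, and the Hit class, the max_gap threshold, and the coordinates (in inches) are hypothetical. Grooper's actual alignment rules and property names may differ. The sketch only shows the idea of pairing a label with the nearest value to its right or below it.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """An extractor result with its bounding box (units are arbitrary)."""
    text: str
    left: float
    top: float
    right: float
    bottom: float

def overlaps(a_start, a_end, b_start, b_end):
    """True if two 1-D ranges share any extent."""
    return min(a_end, b_end) > max(a_start, b_start)

def labeled_value(label, candidates, max_gap=2.0):
    """Return the text of the nearest candidate right of or below the label."""
    best = None
    for c in candidates:
        if overlaps(label.top, label.bottom, c.top, c.bottom) and c.left >= label.right:
            gap = c.left - label.right      # value sits to the right of the label
        elif overlaps(label.left, label.right, c.left, c.right) and c.top >= label.bottom:
            gap = c.top - label.bottom      # value sits below the label
        else:
            continue                        # not aligned with the label at all
        if gap <= max_gap and (best is None or gap < best[0]):
            best = (gap, c)
    return best[1].text if best else None

label = Hit("Cash to Close", 1.0, 1.0, 2.2, 1.2)
values = [
    Hit("$100.00", 2.5, 1.0, 3.4, 1.2),   # same line, right of the label
    Hit("$999.00", 2.5, 3.0, 3.4, 3.2),   # neither right of nor below the label
]
result = labeled_value(label, values)
print(result)  # $100.00
```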
In this example, a Value Reader is configured to return the "Cash to Close" amount as described on the first page of a Closing Disclosure form, using the Labeled Value Extractor Type.
⚠ | The Labeled Value extractor has some increased functionality when used in combination with a Labeling Behavior. For more information on the Labeling Behavior and how Labeled Value benefits from it, visit the Labeling Behavior article. |
FYI | The Labeled Value extractor is extremely similar to the Key-Value Pair Collation Provider for Data Type extractors. Key-Value Pair collated Data Types also use two extractors, one to locate the "key" (the field's label) and one to locate the "value" (the field's value). If those two extractor results are aligned horizontally or vertically (according to the settings configured on the parent Data Type), the value is returned.
Labeled Value can be considered an alternate method to Key-Value Pair extraction. In many ways, it provides a simpler setup to produce the same result (plus added functionality when combined with Label Sets). But it is not a replacement. Key-Value Pair is still a Collation option for Data Type extractors. |
Field Match - About
The Field Match extractor allows you to match a value stored in a previously extracted Data Field. Instead of matching a regex pattern and returning a value, this extractor will match the Data Field's text and return a value. You might think of the Data Field's value as the "pattern". You can even utilize capabilities of a Pattern Match extractor. For example, you can use Prefix and Suffix Patterns to anchor the Data Field's value to a specific text-based location, just like you can anchor a regex pattern with Prefix and Suffix Patterns with a Pattern Match extractor. You can also parse the Data Field's result with a Parse Pattern.
This could be used for data validation purposes. We've been using Closing Disclosure forms for most of these examples. Let's say the original file we imported for processing was named after the loan's borrower. So if John Doe was applying for the loan, the Closing Disclosure would be named "John Doe.pdf". You might want to know if the borrower's name according to the filename actually lines up with the borrower listed on the document. You could do just that with a Field Match extractor.
In this case, we could create a Data Field using a Default Value expression to return the name in the file name.
Using a second Data Field, we're going to use a Field Match extractor to make sure the information returned by the "Borrower From File Name" Data Field lines up with the borrower listed on the document.
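The validation idea can be sketched in Python to show the logic. The function names, the sample document text, and the "Borrower:" prefix are assumptions made for illustration, not Grooper's expression syntax or API. The key point is that the first field's value is used literally as the thing to match.

```python
import re
from pathlib import PurePath

def borrower_from_file_name(file_name):
    """Stand-in for the first Data Field: the file name without its extension."""
    return PurePath(file_name).stem

def field_match(field_value, document_text, prefix_pattern=""):
    """Return the field's value if it appears in the text, otherwise None.

    re.escape makes the field's text match literally, so the field value
    plays the role a regex would play in a Pattern Match extractor.
    """
    pattern = re.compile(
        f"(?:{prefix_pattern})({re.escape(field_value)})", re.IGNORECASE
    )
    m = pattern.search(document_text)
    return m.group(1) if m else None

# Hypothetical text data from a document named "John Doe.pdf".
document_text = "Closing Disclosure\nBorrower: John Doe\nLender: Example Bank"

from_file = borrower_from_file_name("John Doe.pdf")
on_document = field_match(from_file, document_text, prefix_pattern=r"Borrower:\s*")
mismatch = field_match("Jane Roe", document_text, prefix_pattern=r"Borrower:\s*")

print(from_file)    # John Doe
print(on_document)  # John Doe
print(mismatch)     # None
```

An empty result for the second field is the signal that the file name and the document disagree.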
Field Match - Configuration
But how do you build a Field Match extractor? In this example, a Value Reader is configured to return the borrower's name on a Closing Disclosure form if it matches a Data Field returning the borrower's name from the native file's name, using the Field Match Extractor Type. This is the Value Reader using the Field Match extractor described in the About section above.
However, let's look at a couple other things you may want to consider when using the Field Match extractor.
OMR Extractors
OMR stands for "optical mark recognition". Many structured forms utilize checkboxes in order to detail information. Filling out a form is much quicker if you can just check a box to indicate a choice from a list of options rather than printing a response. It also makes it easy on you if you're presented the list of possible options, even if that is as simple as checking a box next to "Yes" or "No".
However, checkboxes don't translate to a text character when recognizing a document's text data through OCR or native text extraction. In order for Grooper to understand if a checkbox is checked, it must digitally recognize the box and its "check state", checked or unchecked. OMR is this digital process of recognizing checkboxes and their check states.
In previous versions of Grooper (pre 2021), checkboxes and their check states were only determined from layout data detected by a Box Detection or Box Removal IP Command and saved to the Batch Page (or in some cases Batch Folder) of a document. Layout data collection is still important for the three OMR extractors.
Improvements have been made to the Labeled OMR extractor allowing it to return labels next to checkboxes without first obtaining the document's layout data. However, it will still use detected checkbox information in the layout data, if present. In most cases, this results in Labeled OMR being the simplest and most effective OMR extractor. Furthermore, it is the only OMR extractor that is capable of extracting radio buttons and their "press state" as checkboxes. (Radio buttons are not detectable by Box Detection or Box Removal, as they are not boxes.)
Ordered OMR and Zonal OMR exist as options where Labeled OMR fails to produce the desired result. However, they typically require more configuration and checkbox information from the layout data is required in order to return a result.
Obtaining Layout Data
Layout data is visual information on a document obtained during an image processing operation. This includes checkboxes and their check states, line locations, and barcodes. Image processing in Grooper is primarily performed to clean up a document in order to obtain better OCR results from printed text, but it's also used to obtain this layout data. This is controlled by creating an IP Profile, which is a step by step list of IP Commands, each one performing a different image processing operation. This includes layout data collecting IP Commands, such as Box Detection and Box Removal.
The IP Profile can be executed permanently, affecting the archival export of the document, or temporarily, reverting back to the original image after OCR is performed.
- For permanent image processing, the IP Profile is executed by the Image Processing activity.
- For temporary image processing, the IP Profile is executed by the Recognize activity.
  - The Recognize activity obtains a document's text data via OCR for scanned or image-based documents or native text extraction for digital documents with native machine readable text already present.
    - For scanned or image-based documents, the IP Profile is referenced in the OCR Profile used.
    - For digital documents, you aren't performing OCR. Machine readable text is already present as part of the document's content and Recognize extracts that native digital text. However, the Alternate IP property can be used to reference an IP Profile containing layout data collecting IP Commands.
In either case, if the IP Profile contains layout data collecting IP Commands, layout data will be stored on the processed Batch Page object's "Grooper.Layout.json" file. For example, Box Detection and Box Removal will store checkbox locations and their check states (either checked or unchecked). The OMR extractors will then use this layout data to return labels next to checked boxes.
FYI: Layout Data Verification
Most often, a Box Removal command is executed by a temporary IP Profile during the Recognize activity. However, whether it is executed during Image Processing or Recognize, the checkbox information is saved to a Batch Page object's "Grooper.Layout.json" file. You can verify this with the "Files" tab of the "Advanced" tab when selecting a Batch Page or Batch Folder processed with a Box Detection or Box Removal command.
Labeled OMR
The Labeled OMR extractor is designed to be the easiest OMR extractor to set up. In most cases, it's as simple as defining the list of labels next to the checkboxes and determining whether multiple boxes can be checked, just one, or whether a checked box evaluates to a binary true/false value. The label (or labels) next to the checked boxes are then returned.
The Labeled OMR extractor has just two properties necessary for its configuration: Label Extractor and Mode.
The Label Extractor serves to return any labels next to a checkbox. In many cases, this is as simple as using a List Match extractor and entering a list of the text labels on the document.
The Mode property corresponds with how the checkboxes behave on the document. Can just one checkbox out of many be checked? Can multiple? Is there just one checkbox where it being checked means one thing and unchecked another? This can be one of three options: CheckOne, CheckMulti, or Boolean.
- CheckOne will target multiple checkboxes but presumes only one may be checked. This is for when documents present a list of options, only one of which may be chosen.
- CheckMulti will target multiple checkboxes and presumes any number of them may be checked. This is for when documents present a list of options, any number of which may be chosen.
  - All labels are returned as a single concatenated result. The labels may be separated by a Separator String. For example, a comma (,) could be used to create a comma-separated list of checked values.
- Boolean targets only a single checkbox, presuming the checkbox represents a Boolean "true or false" answer. This is for when documents present a single checkbox where checking the box indicates one thing and leaving it unchecked means another.
  - By default the result will evaluate to either "True" or "False" but this can be altered using the Value If Checked and Value If Unchecked properties.
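The three Mode options can be summarized with a small Python sketch. The function resolve_omr and its parameter names are hypothetical stand-ins that loosely mirror the property names above. Returning nothing when CheckOne finds zero or several checked boxes is an assumption made for the sketch, not documented Grooper behavior.

```python
def resolve_omr(mode, boxes, separator=", ",
                value_if_checked="True", value_if_unchecked="False"):
    """Turn (label, is_checked) pairs into a single result string.

    mode is one of "CheckOne", "CheckMulti" or "Boolean".
    """
    checked = [label for label, is_checked in boxes if is_checked]
    if mode == "CheckOne":
        # Assumption: exactly one checked box is required for a result.
        return checked[0] if len(checked) == 1 else None
    if mode == "CheckMulti":
        # All checked labels are concatenated with the separator string.
        return separator.join(checked) if checked else None
    if mode == "Boolean":
        if len(boxes) != 1:
            raise ValueError("Boolean mode targets a single checkbox")
        return value_if_checked if boxes[0][1] else value_if_unchecked
    raise ValueError(f"Unknown mode: {mode}")

# Hypothetical checkbox states for a loan type section.
loan_boxes = [("Conventional", True), ("FHA", False), ("VA", False)]
multi_boxes = [("Conventional", False), ("FHA", True), ("VA", True)]

print(resolve_omr("CheckOne", loan_boxes))                   # Conventional
print(resolve_omr("CheckMulti", multi_boxes, separator=","))  # FHA,VA
print(resolve_omr("Boolean", [("Escrow", False)]))           # False
print(resolve_omr("Boolean", [("Escrow", False)],
                  value_if_unchecked="No"))                  # No
```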
In this example, a Value Reader is configured to return the type of loan applied for on a "Closing Disclosure" form, using the Labeled OMR Extractor Type.
FYI | In previous versions of Grooper, you were required to configure which direction the checkbox was from the label (North, South, East or West).
As of version 2021, this is no longer required. Labeled OMR will now look for the closest box within an inch of the label in any direction. |
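The "closest box within an inch" rule can be pictured with a short Python sketch. How Grooper actually measures the distance (from centers, edges, or something else) is not stated here, so the center-to-center measurement, the closest_box name, and the 300 dpi resolution are all assumptions for illustration.

```python
import math

def closest_box(label_center, box_centers, dpi=300, max_inches=1.0):
    """Return the box center nearest the label, or None if none is in range.

    Coordinates are in pixels; the one inch limit is converted using dpi.
    """
    max_distance = max_inches * dpi
    best, best_distance = None, None
    for box in box_centers:
        distance = math.dist(label_center, box)
        if distance <= max_distance and (best_distance is None or distance < best_distance):
            best, best_distance = box, distance
    return best

label = (600, 300)
boxes = [(540, 300), (600, 900), (1000, 300)]

nearest = closest_box(label, boxes)          # the box 60 px to the left
too_far = closest_box(label, [(1000, 300)])  # 400 px exceeds one inch at 300 dpi

print(nearest)  # (540, 300)
print(too_far)  # None
```

Because direction is ignored, a box to the left, right, above, or below the label is treated the same way, which is what removes the need for a North/South/East/West setting.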
On Layout Data and Labeled OMR
Labeled OMR is unique among the OMR extractors in that Box Detection layout data is not necessary in order for it to function. It will use that data if present. If it is not present, it will attempt to determine for itself whether a checkbox is next to a label.
This is exceptionally useful for non-standard checkboxes that are difficult or impossible for Box Detection to detect, such as radio buttons. Radio buttons are just circles. If the button is checked, it has a dot inside it, but otherwise it is just a circle. Box Detection only detects boxes (squares and rectangles), so a radio button's location and check state will not be stored in a document's layout data file.
Not to fret! New improvements to the Labeled OMR extractor in version 2021 allow it to detect radio buttons and other non-standard checkboxes.
Here, the Labeled OMR extractor is unchanged from the example described above.
Ordered OMR
The Ordered OMR extractor is a little more complicated to set up than Labeled OMR but can be used in cases where Labeled OMR is not producing the desired results. Furthermore, labels are not even required to be present at all (but can be optionally helpful as "anchors", positioning where the checkboxes are on the page). This extractor can prove very useful when you have structured forms with OMR checkboxes whose labels are not easily matched due to poor OCR.
However, checkbox data must be present in the document's layout data. Ordered OMR will not function without this data present before executing. Furthermore, Ordered OMR assumes the checkboxes will be ordered one after the other either vertically or horizontally along a single line. For documents with sections of checkboxes broken up into multiple columns (for vertically ordered checkboxes) or multiple lines (for horizontally ordered checkboxes), multiple Ordered OMR extractors may be necessary (one for each column or line of checkboxes).
For Ordered OMR you will indicate where boxes are on a document by drawing a rectangular zone around the checkboxes. All checkboxes must fall within the drawn zone (This distinguishes Ordered OMR from Zonal OMR. For Zonal OMR, a single zone is drawn for each individual checkbox). This zone is configured using the Location property. This can be one of four options:
- Fixed Region - This option is the simplest to set up. As the name implies, the rectangular zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box around the checkboxes.
- Relative Region - Instead of setting the zone in a fixed location for every document, the Relative Region mode will anchor the zone to a text label on the document. The zone's position will change relative to the label's position on the document, but will still have the same drawn dimensions.
  - This option is useful to overcome issues arising during scanning printed documents. Slight variations can occur as to where a value is when printing or scanning a document, even for very structured documents. This can cause problems when drawing a single fixed region for the zone. However, if you can anchor the zone off an extractable text value, the zone's position will shift according to that anchor's position.
- Text Region - The Text Region option creates a rectangular zone using the logical boundaries of an extraction result. This can create the zone within the boundaries of the extractor's result.
  - This can also be configured to provide results in a similar way the Relative Region option does, using text anchors located by an extractor to position the extraction zone's location. This means both methods can be used to position the zone relative to a point from document to document. The main difference is in how the zone is drawn.
- Shape Region - The Shape Region option is extremely similar to the Text Region option. However, instead of using text to anchor the extraction zone, it uses a shape detected from a Shape Detection or Shape Removal IP Command.
  - This is the least common method used.
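The difference between a Fixed Region and a Relative Region comes down to a simple offset calculation, sketched below in Python. The Rect class and the relative_region function are hypothetical, and the units (inches) are assumed. The sketch just shows a zone keeping its drawn size while shifting by however far the anchor moved.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    width: float
    height: float

def relative_region(drawn_zone, drawn_anchor, found_anchor):
    """Shift the drawn zone by the anchor's movement between documents.

    drawn_anchor is where the anchor text sat when the zone was drawn;
    found_anchor is where the same text was found on the current document.
    """
    dx = found_anchor.left - drawn_anchor.left
    dy = found_anchor.top - drawn_anchor.top
    return Rect(drawn_zone.left + dx, drawn_zone.top + dy,
                drawn_zone.width, drawn_zone.height)

zone = Rect(2.0, 3.0, 4.0, 1.0)
anchor_when_drawn = Rect(1.0, 3.0, 0.8, 0.2)
anchor_on_this_scan = Rect(1.25, 3.5, 0.8, 0.2)   # the scan shifted slightly

shifted = relative_region(zone, anchor_when_drawn, anchor_on_this_scan)
print(shifted)  # Rect(left=2.25, top=3.5, width=4.0, height=1.0)
```

A Fixed Region would be the special case where the zone is used as drawn, with no shift applied.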
Ordered OMR must also have its Mode property configured. This property behaves the same as Labeled OMR. It determines how many checkboxes should be checked for the checkboxes falling within the rectangular zone. This can be one of three options:
- CheckOne will target multiple checkboxes but presumes only one may be checked. This is for when documents present a list of options, only one of which may be chosen.
- CheckMulti will target multiple checkboxes and presumes any number of them may be checked. This is for when documents present a list of options, any number of which may be chosen.
  - All labels are returned as a single comma-separated result.
- Boolean targets only a single checkbox, presuming the checkbox represents a Boolean "true or false" answer. This is for when documents present a single checkbox where checking the box indicates one thing and leaving it unchecked means another.
Ordered OMR is different from Labeled OMR in that the output values are not located by a Label Extractor. Instead, you will use the Output Values property to list each checkbox's corresponding value. Ordered OMR presumes the checkboxes will either be stacked on top of each other in a single column or ordered next to each other across a single horizontal line. Using the Output Values property, you will enter a comma-separated list of the checkboxes' labels from top to bottom or left to right.
- When selecting the Boolean Mode, only two values may be entered.
The Flow Direction property determines how they are ordered, either Vertical or Horizontal. Vertical is appropriate for boxes stacked on top of each other. Horizontal is appropriate for boxes next to each other along a horizontal line.
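How Output Values and Flow Direction work together can be sketched in Python. The ordered_omr function, the dictionary shape used for the detected boxes, and the sample values are assumptions for illustration only. The sketch covers the CheckOne and CheckMulti style of mapping, where there is one output value per checkbox.

```python
def ordered_omr(boxes, output_values, flow_direction="Vertical"):
    """Map checkboxes to output values by their order on the page.

    boxes: list of dicts with "x", "y" and "checked" keys (hypothetical shape).
    output_values: comma-separated values, top to bottom or left to right.
    """
    # Vertical flow orders boxes top to bottom; Horizontal orders left to right.
    axis = "y" if flow_direction == "Vertical" else "x"
    ordered = sorted(boxes, key=lambda box: box[axis])
    values = [value.strip() for value in output_values.split(",")]
    if len(values) != len(ordered):
        raise ValueError("Expected one output value per checkbox")
    return [value for value, box in zip(values, ordered) if box["checked"]]

# Boxes as they might be detected, in no particular order.
detected = [
    {"x": 1.0, "y": 5.0, "checked": True},
    {"x": 1.0, "y": 2.0, "checked": False},
    {"x": 1.0, "y": 8.0, "checked": True},
]

checked_values = ordered_omr(detected, "First, Second, Third", "Vertical")
print(checked_values)  # ['Second', 'Third']
```

Since the values are assigned purely by position, no label text has to be read from the page, which is why this approach can help when OCR of the labels is poor.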
In this example, a Value Reader is configured to return the "This estimate includes" options for a "Closing Disclosure" form, using the Ordered OMR Extractor Type.
Zonal OMR
For Zonal OMR a rectangular zone must be drawn around the location of each individual checkbox (rather than a single zone for all checkboxes as is the case for the Ordered OMR extractor). This typically makes Zonal OMR the most time consuming of the OMR extractors as far as set up goes, but may be necessary to target checkboxes on forms whose order is not targetable by Ordered OMR (for example, checkboxes in non-standard orientations next to their labels) and/or whose labels cannot be extracted by Labeled OMR (for example, due to poor OCR).
Furthermore, just as with the Ordered OMR extractor, checkbox data must be present in the document's layout data. Zonal OMR will not function without this data present before executing.
Just like Labeled OMR and Ordered OMR, Zonal OMR must also have its Mode property configured. It determines how many checkboxes should be checked for the checkboxes falling within the rectangular zones. This can be one of three options:
- CheckOne will target multiple checkboxes but presumes only one may be checked. This is for when documents present a list of options, only one of which may be chosen.
- CheckMulti will target multiple checkboxes and presumes any number of them may be checked. This is for when documents present a list of options, any number of which may be chosen.
  - All labels are returned as a single comma-separated result.
- Boolean targets only a single checkbox, presuming the checkbox represents a Boolean "true or false" answer. This is for when documents present a single checkbox where checking the box indicates one thing and leaving it unchecked means another.
  - By default the result will evaluate to either "True" or "False" but this can be altered using the Value If Checked and Value If Unchecked properties.
The Anchor property allows all drawn zones for each checkbox to be anchored to an extractable text result. This is similar to how the Relative Region Location option of Ordered OMR can anchor a zone to a relative location on the page rather than a fixed position that remains the same for each and every document.
- This option is useful to overcome issues arising during scanning printed documents. Slight variations can occur as to where a value is when printing or scanning a document, even for very structured documents. This can cause problems when drawing a single fixed region for the zone. However, if you can anchor the zone off an extractable text value, the zone's position will shift according to that anchor's position.
The rectangular zones are drawn using the OMR Boxes property. This will bring up a collection editor to draw registration zones for each checkbox. Here, you will also enter the output result for each OMR zone. This editor also allows you to anchor zones to a text anchor for each individual zone, meaning you can anchor a single zone here (as well as anchoring the collection of OMR zones using the Anchor property described above).
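As a rough picture of the one-zone-per-checkbox idea, here is a Python sketch. The zonal_omr function, the dictionary layout of each zone, and the sample coordinates are hypothetical, and anchoring is left out to keep it short. It only shows each zone carrying its own output value and being tested against the detected checkboxes.

```python
def zonal_omr(zones, boxes):
    """Return the output value of every zone that contains a checked box.

    zones: list of dicts with "rect" (left, top, right, bottom) and "output".
    boxes: list of (x, y, checked) tuples for detected checkbox centers.
    """
    results = []
    for zone in zones:
        left, top, right, bottom = zone["rect"]
        for x, y, checked in boxes:
            inside = left <= x <= right and top <= y <= bottom
            if inside and checked:
                results.append(zone["output"])
                break  # one result per zone is enough
    return results

# Four zones, one drawn around each checkbox (hypothetical coordinates).
zones = [
    {"rect": (1.0, 1.0, 1.3, 1.3), "output": "A"},
    {"rect": (2.0, 1.0, 2.3, 1.3), "output": "B"},
    {"rect": (3.0, 1.0, 3.3, 1.3), "output": "C"},
    {"rect": (4.0, 1.0, 4.3, 1.3), "output": "Other"},
]
detected = [(1.15, 1.15, False), (2.15, 1.15, True),
            (3.15, 1.15, False), (4.15, 1.15, True)]

checked_outputs = zonal_omr(zones, detected)
print(checked_outputs)  # ['B', 'Other']
```

Because the output value is entered per zone, a zone such as "Other" can return a fixed value even when the text next to the checkbox varies from document to document.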
In this example, a Value Reader is configured to return the type of loan applied for on a "Closing Disclosure" form, using the Zonal OMR Extractor Type.
In this case, we have four checkboxes. So, we have added four items to our collection list here, one for each type of application.
Remember how we ignored the "Other" application type for our Labeled OMR example? The text entered for the "Other" application type could be anything under the sun, which makes it more difficult to extract a label. With Zonal OMR, you enter the output result for each zone yourself, so the "Other" checkbox can be handled like any other.
Note: This doesn't mean you couldn't use Labeled OMR and still successfully extract the text label here (even though it could be anything!). It would just require a more complex extractor than we used for our earlier example.
Barcode Extractors
Both the Find Barcode and Read Barcode extractors will return a barcode's encoded value as its result. Barcodes encode information according to the "symbology" used. Grooper has the ability to detect and read 29 different symbologies, including Code128, Code39, US postal barcodes, QR codes, and UPC barcodes.
For both the Find Barcode and Read Barcode extractors, you must specify which barcode symbology you're looking for. The difference between the two Extractor Types is when Grooper's barcode detection runs: before the extractor executes (Find Barcode) or as part of the extractor's execution (Read Barcode).
- Find Barcode - The barcode's value must be obtained by a Barcode Detection or Barcode Removal IP Command in an IP Profile before the extractor executes.
- When the Find Barcode extractor executes on a document folder, it looks for the barcode value in the Batch Folder's layout data file or its child Batch Pages' layout data files. If present, the barcode's value is returned.
- Read Barcode - This extractor executes barcode detection every time the extractor executes. This means a layout data file is not necessary for the extractor to return a barcode value.
- No layout data required. No previous Barcode Detection or Barcode Removal IP Command is necessary.
- The extractor is configured to locate a barcode in nearly the same way you configure the Barcode Detection IP Command to detect the barcode.
Your decision as to which one you choose to use will largely be based on when you want to expend the processing time required to detect the barcode. Read Barcode performs barcode detection when the extractor executes. For Find Barcode, the extractor presumes the barcode was already detected and present in the document folder's layout data. Layout data is stored in a file named "Grooper.Layout.json" when an IP Command such as Barcode Detection detects layout data on a Batch Page (or in certain cases a Batch Folder).
Find Barcode simply uses the "Grooper.Layout.json" file to return the barcode value. Barcode detection takes time. Reading a document's "Grooper.Layout.json" file might take 3 milliseconds. Detecting a barcode on a document takes significantly longer. Let's say 300 milliseconds. Find Barcode just finds the barcode value in the "Grooper.Layout.json" file, taking 3 ms to return a value. On the other hand, Read Barcode must first detect and read the barcode before returning the value, taking 300 ms to return a value. Therefore, Find Barcode has a faster runtime execution than Read Barcode.
However, this presumes the barcode's value is already present in the "Grooper.Layout.json" file. So, you're still going to have to find the barcode at some point in the Batch Process. You just have to ask yourself if you want to spend that processing time during extraction or ahead of time in some kind of image processing operation (either the Image Processing activity for permanent image processing or the Recognize activity for temporary image processing).
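The trade-off can be worked through with the illustrative 3 ms and 300 ms figures from the discussion above. The sketch below is plain Python arithmetic, not Grooper code. The function name and the simple cost model (one upstream detection for Find Barcode, one detection per execution for Read Barcode) are assumptions made for illustration, and the timings are the made-up figures from the text, not measurements.

```python
DETECT_MS = 300  # illustrative cost of detecting a barcode on a document
LOOKUP_MS = 3    # illustrative cost of reading a stored value from layout data


def pipeline_cost_ms(extractor: str, executions: int) -> dict:
    """Estimate where processing time is spent for one document."""
    if extractor == "Read Barcode":
        # Detection runs every time the extractor executes.
        upstream = 0
        extraction = DETECT_MS * executions
    elif extractor == "Find Barcode":
        # Detection ran once beforehand in an IP Profile; extraction is a lookup.
        upstream = DETECT_MS
        extraction = LOOKUP_MS * executions
    else:
        raise ValueError(f"unknown extractor: {extractor}")
    return {
        "upstream": upstream,
        "extraction": extraction,
        "total": upstream + extraction,
    }
```

With a single execution the totals are nearly identical (300 ms versus 303 ms). The detection cost has only moved from extraction to the earlier image processing step. Under this model the gap only widens if the extractor executes more than once against the same document.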
Read Barcode
In this example, a Value Reader is configured to return the date encoded in a barcode, using the Read Barcode Extractor Type.
Configuration Prereqs - Obtain Layout Data
In the case of Find Barcode, the barcode must previously be detected by a Barcode Detection or Barcode Removal IP Command with its value stored in the document's layout data.
From this point, barcode detection is configured exactly the same way as seen in the Read Barcode example.
Before executing the Find Barcode extractor, the IP Profile with the Barcode Removal or Barcode Detection IP Command must be executed to save the barcode value in the document's layout data.
An IP Profile can be executed in one of two ways:
- During the Image Processing activity for permanent image processing.
- During the Recognize activity for temporary image processing.
- For OCR of image-based documents, the IP Profile is assigned to the OCR Profile assigned to the Recognize activity.
- For native text extraction of digital documents, an IP Profile can be assigned to the Alternate IP property of the Recognize activity.
FYI: Layout Data Verification
Most often, a Barcode Removal command is executed by a temporary IP Profile during the Recognize activity. However, whether it is executed during Image Processing or Recognize, the barcode value is saved to the Batch Page object's "Grooper.Layout.json" file. You can verify this in the "Files" tab of the "Advanced" tab when selecting a Batch Page or Batch Folder processed with a Barcode Detection command.
Find Barcode
In this example, a Value Reader is configured to return the date encoded in a barcode, using the Find Barcode Extractor Type. This document's layout data was previously obtained with the IP Profile described above during the Recognize activity.
Zonal Extractors
Read Zone
The Read Zone extractor allows you to extract text data in a rectangular region (called an "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text from the same location on a document, or a zone relative to an extracted text anchor or shape location on the document.
Read Zone is useful for extracting data from highly structured documents. If a document's structure is fixed, it's going to have the same fields in the same physical location from one document to the next. The Closing Disclosure forms we've been looking at in this article are themselves fairly fixed. For example, the "Loan Amount" listed on the first page is more or less in the same spot for every single Closing Disclosure. The dollar amount itself may change, but there's only so much room that amount can take up on the document.
If you can draw a rectangle around the value you want to extract, and the value falls within the boundaries of that rectangle for every single document, extraction may be as simple as just extracting the text in the rectangle's location. This is referred to as "zonal extraction". You draw a zone where the value exists on the page and return the text data falling in the zone.
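The idea of zonal extraction can be sketched in a few lines. This is not how Grooper implements Read Zone. It is a minimal Python model that assumes each recognized word carries a bounding box, and that a word counts as "in the zone" only when its whole box falls inside the rectangle. The `Rect`, `Word`, and `read_zone` names, the coordinates, and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Rect") -> bool:
        """True when `other` lies entirely inside this rectangle."""
        return (
            self.left <= other.left
            and self.top <= other.top
            and self.right >= other.right
            and self.bottom >= other.bottom
        )


@dataclass(frozen=True)
class Word:
    text: str
    box: Rect


def read_zone(words: List[Word], zone: Rect) -> str:
    """Return the text of every word whose box falls inside the zone."""
    inside = [w for w in words if zone.contains(w.box)]
    # Keep reading order: top to bottom, then left to right.
    inside.sort(key=lambda w: (w.box.top, w.box.left))
    return " ".join(w.text for w in inside)
```

Given words for "Loan", "Amount" and a dollar value on one line, a zone drawn only around the dollar value returns just that value, whatever the amount happens to be on a given document.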
Read Zone has a few different options for where the box is placed using the Location property. This can be one of four options:
- Fixed Region - This option is the simplest to set up. As the name implies, the extraction zone will be fixed on the page. It will stay in the same coordinates for every document. All you need to do is draw the box where you want to extract data.
- Relative Region - Instead of setting the extraction zone in a fixed location for every document, the Relative Region mode will anchor the zone to a text label on the document. The extraction zone's position will change relative to the label's position on the document, but will still have the same drawn dimensions.
- This option is useful to overcome issues arising during scanning printed documents. Slight variations can occur as to where a value is when printing or scanning a document, even for very structured documents. This can cause problems when drawing a single fixed region for the extraction zone. However, if you can anchor the zone off an extractable text value, the zone's position will shift according to that anchor's position.
- Text Region - The Text Region option creates an extraction zone using the logical boundaries of an extraction result. This can return all the text falling within the boundaries of the rectangle around the extractor's result.
- This can also be configured to provide results in a way similar to the Relative Region option, using text anchors located by an extractor to position the extraction zone's location. This means both methods can be used to position the zone relative to a point from document to document. The main difference is in how the zone is drawn.
- Shape Region - The Shape Region option is extremely similar to the Text Region option. However, instead of using text to anchor the extraction zone, it uses a shape detected from a Shape Detection or Shape Removal IP Command.
- This is the least common method used.
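To make the Relative Region idea concrete, here is a minimal sketch of a zone that keeps its drawn dimensions but moves with its anchor. It is an illustration only, not Grooper's implementation. The function name, the tuple layout, and the choice to measure the offset from the anchor's top-left corner are assumptions, and the pixel coordinates are made up.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def place_relative_zone(
    anchor: Box, offset: Tuple[int, int], size: Tuple[int, int]
) -> Box:
    """Position a fixed-size zone at a fixed offset from the anchor's top-left corner."""
    left = anchor[0] + offset[0]
    top = anchor[1] + offset[1]
    return (left, top, left + size[0], top + size[1])
```

If the anchor label is found at `(100, 200, 180, 220)` on one scan and at `(108, 205, 188, 225)` on another, the same offset `(100, 0)` and size `(150, 25)` produce zones at `(200, 200, 350, 225)` and `(208, 205, 358, 230)`. The zone shifts by exactly the amount the label shifted, which is what a Fixed Region cannot do.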
The Read Zone extractor can optionally re-process the text data with an OCR Profile. This can be used to perform custom OCR on the extracted text.
The text in the zone can also be itself extracted by a Value Extractor. This allows you to break up the document into a smaller portion and run an extractor on just the zone instead of the full document. Essentially, you use the Read Zone extractor to create a smaller data instance (from the larger document data instance) and use its Value Extractor property to return data from the smaller data instance.
In this example, a Value Reader is configured to return the "Loan Amount" value as described on the first page of a Closing Disclosure form, using the Read Zone Extractor Type.
Highlight Zone
The Highlight Zone extractor is unique in that it doesn't actually extract anything at all!
So, why use it? Highlight Zone can be useful for quickly calling attention to Data Fields requiring manual validation during a Data Review activity. For example, handwritten fields on a form are unlikely to be recognized by OCR. OCR is designed to read machine printed characters. While there are some recent advancements in handwriting detection (Microsoft Azure's OCR service is particularly adept at recognizing handwritten characters), OCR engines tend to fail at recognizing handwriting. In the case of fields written in by hand, you will likely need a human being to enter that information during Data Review.
However, if those fields are in the same spots on the document, you can use Highlight Zone to draw the data entry clerk's attention to their location on the page, saving them time and you money.
The highlighted zones are drawn using the exact same Location methods and property configurations as Read Zone.
In this example, a Value Reader is configured to highlight a signature field, using the Highlight Zone Extractor Type.
Detect Signature
While we're on the topic of signatures, there is a type of zonal extractor specifically designed to detect if a signature is present or not. This is the Detect Signature Extractor Type option. It's very similar to the Read Zone extractor in that you use one of the four Location options (Fixed Region, Relative Region, Shape Region or Text Region) to draw an extraction zone on a geographic region of the page.
However, rather than returning the OCR or native text data within the zone, an OMR-style extraction is performed. Think about a signature line. If you drew a box around where you expect someone to sign, nothing would be in the box if it was not signed. But if it was signed, some of the box would be filled in, regardless of what the signature looks like.
The same basic concept applies for the Detect Signature extractor. The extractor counts the black pixels in the extraction zone and calculates what percentage of the zone they fill. If that percentage is above a certain threshold, the extractor returns a value of "Signed". If it is below the threshold, it returns a value of "Not Signed".
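The pixel-count test described above can be sketched as follows. This is a simplified Python model, not Grooper's implementation. It assumes the zone has already been converted to black and white and is supplied as rows of 0 (white) and 1 (black). The function name and the default threshold of 2 percent are hypothetical values chosen for illustration.

```python
from typing import Sequence


def detect_signature(
    zone_pixels: Sequence[Sequence[int]], threshold_pct: float = 2.0
) -> str:
    """Classify a zone as signed or not by its percentage of black pixels."""
    total = sum(len(row) for row in zone_pixels)
    if total == 0:
        raise ValueError("zone is empty")
    black = sum(sum(row) for row in zone_pixels)
    fill_pct = 100.0 * black / total
    # Above the threshold counts as signed; anything else does not.
    return "Signed" if fill_pct > threshold_pct else "Not Signed"
```

A blank 10 by 10 zone returns "Not Signed". The same zone with five black pixels (5 percent fill) returns "Signed". Note that a test like this cannot tell a signature from any other mark, such as a printed line or scanner noise, which is one reason the choice of threshold and the quality of the black and white image matter.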
In this example, a Value Reader is configured to return whether or not the "Applicant Signature" is present on the Closing Disclosure form, using the Detect Signature Extractor Type.
FYI
Keep in mind the Detect Signature extractor will examine the pre-processed image (not the image seen in the Document Viewer) if an IP Profile has been referenced using the IP Profile property. Furthermore, this type of operation requires a black and white image to work. Grooper knows a pixel is "filled" because it is black and not white. If you do not use an IP Profile to pre-process the document's image and the document is color or grayscale, Grooper will pre-process the image on its own. If you want control over how Grooper turns the image black and white, this is another reason you may want to use an IP Profile to customize how this is done, using a Threshold or Binarize command.
The Reference Extractor
The Reference extractor is just an extractor that returns the results of another extractor object in the node tree. You can use the Reference Extractor Type to reference any of the three extractor objects: Data Types, Value Readers, and Field Classes. This can keep you from duplicating your efforts over and over again. For example, if you have a variety of different extractors needing to return a currency value, don't create a new currency extractor every time you need to return a currency value. Just create a single currency extractor and use the Reference Extractor Type to use it and re-use it over and over.
In this example, a Value Reader is configured to return the results of the currency extractor we created for the Pattern Match example (named "Pattern Match - Currency"), using the Reference Extractor Type.
Value Readers vs Data Types
Before version 2021, the Data Type extractor object was considered the bread and butter of data extraction. For many Grooper users, a "data extractor" and a Data Type are synonymous. In some ways, this may remain the case. However, the introduction of the Value Reader object was (at least in part) designed to conceptually distinguish a Data Type from a "general purpose extractor". Instead, we should emphasize the Data Type's primary function in Grooper: Data Collation.
- A Value Reader is a Grooper object designed for data extraction, returning the initial data set from the document.
- A Data Type is a Grooper object designed for data collation, processing and returning the final data set from the document.
- Even choosing not to collate results (using the Individual Collation Provider) is still a method of collating data. It's just the simplest way of collating data.
Because data collation is so important for a variety of extraction techniques, it's almost natural to equate collated data with extracted data. But there are really two parts to what's going on. First, values are extracted from the document's text data. Then, they are collated and finally returned by the Data Type.
While both Value Readers and Data Types are considered "extractors", they really have two different jobs as far as Grooper is concerned. One way to think of this is a Value Reader is a "data finder" while a Data Type is a "data manipulator". It is a Value Reader's job to locate and return data from a document. It's a Data Type's job to take that data and organize it, manipulate it or impose constraints on what counts as valid data.
For example, a Data Type can be configured to only return values that are stacked on top of one another in a vertical array (using the Array Collation Provider). Any values not aligned with each other vertically are tossed out of the final result. The results are collated into a vertical array. Before the results can be collated, they have to be found. Only then can it be determined if and how they are organized and manipulated. That's the true job of the Value Reader, finding and returning the initial data set. The Data Type then collates and returns the final data set.
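The two-stage idea (find first, collate second) can be illustrated with a toy model of vertical collation. This is not the Array Collation Provider's actual algorithm. It is a hedged Python sketch that assumes each found value carries a left and top position in pixels, treats values as "stacked" when their left edges agree within a tolerance, and keeps only the largest such group. The function name, the tuple layout, and the tolerance are all hypothetical.

```python
from typing import List, Tuple

Hit = Tuple[str, int, int]  # (text, left, top) in pixels


def collate_vertical(hits: List[Hit], tolerance: int = 5) -> List[str]:
    """Keep only the values that line up in the largest vertical column."""
    best: List[Hit] = []
    for _, left, _ in hits:
        # Every hit whose left edge is close to this one forms a candidate column.
        column = [h for h in hits if abs(h[1] - left) <= tolerance]
        if len(column) > len(best):
            best = column
    if len(best) < 2:
        # A single value is not a vertical array.
        return []
    # Return the surviving values from top to bottom.
    return [text for text, _, _ in sorted(best, key=lambda h: h[2])]
```

Here the list of hits plays the role of the Value Reader's initial data set, and `collate_vertical` plays the role of the Data Type. Given four currency hits where three share a left edge near 400 and one sits off at 150, the stray value is tossed out and the three aligned values are returned in top-to-bottom order.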
To drive this point home, take a look at the Extractor property of the Data Type.
Version Differences
2021: Introducing the Value Reader Object
The Value Reader is a new object added in version 2021. It combines the functionality of other objects or properties of objects from previous versions by either supplementing them or replacing them entirely. This includes:
- The Data Format
- The Data Format object is gone in 2021. Its functionality is replaced mostly by the Pattern Match Extractor Type, although the List Match and Word Match Extractor Types cover some of its functionality as well.
- The Internal and Text Pattern extractor options
- These extractor options were available to various extractor properties on various objects in Grooper, such as the Value Extractor property of a Data Type. This allowed you to use a "Pattern Editor" to configure regex pattern extraction local to the object (using the same configuration UI as the Data Format object). These options are gone in 2021. Their functionality is also replaced by the Pattern Match, List Match, and Word Match Extractor Types.
- The Fuzzy List extraction mode
- This is also gone in 2021. Previously Fuzzy List was a regex Mode option for Data Formats and Internal and Text Pattern extractors. This functionality is replaced by the List Match Extractor Type.
- The Labeled OMR, Ordered OMR, Zonal OMR, Find Barcode, Read Barcode, Highlight Zone, and Read Zone extractor options
- Previously, these extractor options were only available to a handful of objects through their property configuration, such as the Data Field's Value Extractor property. While still available to these properties, they are also available to the Value Reader as an Extractor Type option. This means these extraction methods are available anywhere in Grooper a Value Reader can be referenced, not just to a limited few objects.
Furthermore, three brand new extraction methods were created out of whole cloth in Grooper 2021 and are available to the Value Reader as Extractor Types: Labeled Value, Field Match and Read Substring.