2023:Labeled OMR (Value Extractor): Difference between revisions
No edit summary |
No edit summary |
||
| Line 14: | Line 14: | ||
</blockquote> | </blockquote> | ||
{| | == About == | ||
{|cellpadding="10" cellspacing="5" | |||
|- | |- | ||
| | |style="font-size:14pt; color:#f89420; border: 2px solid #f89420; width:40px"|[[File:Asset 22@4x.png]] | ||
[[Labeled OMR - | |style="border: 2px solid #f89420"| | ||
You may download and import the file below into your own Grooper environment (version 2021). This contains a '''Batch''' with the example document(s) discussed in this tutorial and a '''Content Model''' configured according to its instructions. | |||
* [[Media:Wiki - Labeled OMR - v2021.zip]] | |||
* [[Media:Wiki - Labeled OMR (Label Sets) - v2021.zip]] | |||
** This contains a '''Batch''' and '''Content Model''' that use Label Sets. Import this if your license has Label Sets enabled and want to review '''''Labeled OMR's''''' Label Set aware functionality. | |||
|} | |} | ||
Documents use checkboxes to make our life easier. They are particularly prevalent on structured forms. It gives the person filling out the form the ability to just check a box next to a series of options rather than typing in the information. | Documents use checkboxes to make our life easier. They are particularly prevalent on structured forms. It gives the person filling out the form the ability to just check a box next to a series of options rather than typing in the information. | ||
| Line 354: | Line 356: | ||
# We've navigated to the "Labels" tab to collect labels for our '''Data Field'''. | # We've navigated to the "Labels" tab to collect labels for our '''Data Field'''. | ||
# The "13. Type of Cow" '''Data Field''' is as of yet unconfigured. | # The "13. Type of Cow" '''Data Field''' is as of yet unconfigured. | ||
Notice that we only have one area to add labels, but we have three options for the OMR. Instead of collecting a label, we're going to configure the '''Data Field'''. | |||
|valign=top| | |valign=top| | ||
[[File: | [[File:2023 Labeled OMR - 2023 02 How To 04 Using Label Sets 01.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 370: | Line 368: | ||
# Select the '''Data Field'''. | # Select the '''Data Field'''. | ||
# Scroll to the bottom of the '''Data Field's''' property grid and expand the '''''List Values''''' property. | # Scroll to the bottom of the '''Data Field's''' property grid and expand the '''''List Values''''' property. | ||
# Select the '''''Local Entries''''' property and | # Select the '''''Local Entries''''' property and click the ellipsis button at the end. | ||
|valign=top| | |||
[[File:2023 Labeled OMR - 2023 02 How To 04 Using Label Sets 02.png]] | |||
|- | |||
|valign=top| | |||
# The difference between using Label Sets and using '''''List Values''''' alone is you can enter whatever you want for the OMR labels' values here. | # The difference between using Label Sets and using '''''List Values''''' alone is you can enter whatever you want for the OMR labels' values here. | ||
#* Imagine we have a requirement in whatever backend system stores these values. Each result can only be "DMST" "FARM" or "SHOW". Well, "DMST" doesn't even appear on the document. But, it doesn't matter. Grooper isn't going to use these values to match against the document. They're merely providing a list of values for which we will capture a label later. | #* Imagine we have a requirement in whatever backend system stores these values. Each result can only be "DMST" "FARM" or "SHOW". Well, "DMST" doesn't even appear on the document. But, it doesn't matter. Grooper isn't going to use these values to match against the document. They're merely providing a list of values for which we will capture a label later. | ||
# Click "OK" to save your changes. | |||
|valign=top| | |valign=top| | ||
[[File: | [[File:2023 Labeled OMR - 2023 02 How To 04 Using Label Sets 03.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 381: | Line 383: | ||
Now, we will be able to collect labels for each item we've added to the list. | Now, we will be able to collect labels for each item we've added to the list. | ||
# Navigate back to the '''Content Model''' | # Navigate back to the '''Content Model'''. | ||
# Navigate to the "Labels" tab. | # Navigate to the "Labels" tab. | ||
# Select the document assigned the '''Document Type''' whose labels you want to collect. | # Select the document assigned the '''Document Type''' whose labels you want to collect. | ||
# Select the '''Data Field''' whose OMR values you've added to its '''''List Values''''' list. | # Select the '''Data Field''' whose OMR values you've added to its '''''List Values''''' list. We have now labels we can collect. You are able to capture labels for each value added to the list. | ||
# Capture the full label for the corresponding OMR value. | # Capture the full label for the corresponding OMR value. | ||
#* Grooper will use the captured label to locate the checkbox and determine its state, but ''output'' the value in the '''''List Values''''' list. | #* Grooper will use the captured label to locate the checkbox and determine its state, but ''output'' the value in the '''''List Values''''' list. | ||
|valign=top| | |valign=top| | ||
[[File: | [[File:2023 Labeled OMR - 2023 02 How To 04 Using Label Sets 04.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 397: | Line 398: | ||
# Select the '''Data Field''' in the Node Tree. | # Select the '''Data Field''' in the Node Tree. | ||
# Change the '''''Value Extractor''''' property to ''Labeled OMR''. | # Change the '''''Value Extractor''''' property to ''Labeled OMR''. | ||
|valign=top| | |||
[[File:2023 Labeled OMR - 2023 02 How To 04 Using Label Sets 05.png]] | |||
|- | |||
|valign=top| | |||
# Navigate to the "Tester" tab. | |||
# Click on the play button to test the extraction. | |||
# The extractor will use the collected labels in the Label Set to match the OMR labels, find nearby checkboxes, and determine their check states. | # The extractor will use the collected labels in the Label Set to match the OMR labels, find nearby checkboxes, and determine their check states. | ||
# The output value will be the corresponding value we entered in the '''''Value List'''' for the label. | # The output value will be the corresponding value we entered in the '''''Value List''''' for the label. | ||
|valign=top| | |valign=top| | ||
[[File: | [[File:2023 Labeled OMR - 2023 02 How To 04 Using Label Sets 06.png]] | ||
|} | |} | ||
</tab> | </tab> | ||
Revision as of 10:34, 28 November 2023
WIP: This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly. This tag will be removed upon draft completion.

Labeled OMR is an extractor used to output OMR checkbox labels. It determines whether labeled checkboxes are checked or not. If checked, it outputs the label(s) as the result.
About
You may download and import the file below into your own Grooper environment (version 2021). This contains a Batch with the example document(s) discussed in this tutorial and a Content Model configured according to its instructions.
Documents use checkboxes to make our lives easier. They are particularly prevalent on structured forms. They give the person filling out the form the ability to just check a box next to one of a series of options rather than typing in the information.
However, most of Grooper's extraction centers around regular expressions: matching text patterns and returning the result. There isn't necessarily a character to match for a checked checkbox. Regular expressions aren't going to cut it when determining whether a box is checked.
This is where OMR comes into play. OMR stands for "Optical Mark Recognition". OMR determines checkbox states. The basic idea behind it is very simple. First, find a box. A box is just four lines connected to each other in a square-like fashion. If that box has a mark of some kind inside it, it is checked. If not, it is unchecked. Checked (or marked) boxes, whether marked with an "x" (☒), a checkmark (☑), or a check block (▣), will have more black pixels inside the box than an unchecked (or unmarked) one (☐). If the proportion of black pixels inside the detected box exceeds a threshold, it's checked (or marked). If not, it's unchecked (or unmarked).
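The pixel-counting idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Grooper's implementation: the function names, the 0/1 pixel grid, and the 0.2 threshold are all assumptions made for the example.

```python
def dark_pixel_ratio(box):
    """Return the fraction of black pixels inside a box.

    `box` is a list of rows; each pixel is 1 (black) or 0 (white).
    This representation is an assumption made for the sketch.
    """
    total = sum(len(row) for row in box)
    if total == 0:
        return 0.0
    dark = sum(sum(row) for row in box)
    return dark / total


def is_checked(box, threshold=0.2):
    # The threshold is an illustrative value, not a Grooper default.
    return dark_pixel_ratio(box) >= threshold


# An unmarked box interior and one marked with an "x".
empty_box = [[0, 0, 0, 0] for _ in range(4)]
x_marked_box = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
]
```

Here `is_checked(x_marked_box)` is `True` because half of its pixels are black, while `is_checked(empty_box)` is `False`.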
A simple example would be a document asking a question and giving two boxes to check “Yes” or “No.” For example, see the portion of the document below asking if the applicant is a U.S. Citizen. “Yes” or “No” would be the labels. Either “Yes” or “No” would be the field's final result, depending on which box is checked. In this case, "Yes".
In general, what you want to extract is the text of the checked label. The Labeled OMR extractor allows you to do just that.
First, you will set up an extractor to locate the text labels.
Then, Grooper's OMR detection will determine if there is a box next to the label, and whether or not that box is checked.
Last, if the label is checked, the label is returned as the extractor's result.
FYI: Labeled OMR has multiple extraction modes depending on how checkboxes behave on the document. There is also a Boolean mode to simply output "True" or "False" if a single checkbox is checked or not. We will discuss the different extraction modes further in the How To section of this article.
How To
Assign the Extractor
The Labeled OMR extractor can be utilized in two ways:
- As a Value Reader's extractor type.
- As an object's extractor property configuration. For example:
  - As a Data Field's Value Extractor property's extractor configuration.
  - As a Data Type's Local Extractor property's extractor configuration.
  - As a Document Type's Positive Extractor property's extractor configuration.
  - And more!
Value Reader
The Labeled OMR extractor is one of the extractor types available to the Value Reader extractor object.
Extractor Property
You may also configure a Labeled OMR extractor when configuring an extractor property. Many Grooper objects have some kind of extractor property in their property grids. Labeled OMR is one of the options that can be selected as the extractor type.
For example, Data Field objects have a Value Extractor property, which collects a result when the Data Model is extracted during the Extract activity.
Configure the Extractor Part 1: OMR Labels
The first part of the Labeled OMR extractor's configuration is label extraction. Labels can be collected in one of three ways:
- Using the Label Extractor property.
- Using the List Values settings of a Data Field.
- Collecting labels for the OMR labels when using Label Sets.
  - When we get to this point, this article will presume you have some familiarity with Label Sets and the Labeling Behavior functionality. For more information on Label Sets please visit the Label Sets article.
At this point, the Labeled OMR extractor is totally unconfigured. Next, we will detail each of the three different ways to extract OMR labels. While the configuration is slightly different, the goal is the same: Locate text labels next to checkboxes. Each method has its own strengths and weaknesses, giving you flexibility in how you locate the OMR labels based on your documents' circumstances.
Using the Label Extractor
Moderate to high level of work up front. High flexibility in configuration options.
One way to locate OMR labels is by configuring the Labeled OMR extractor's Label Extractor property. In some ways, this is the most "effort intensive" of the three options. It will require you to configure an extractor to return each of the labels for the set of OMR checkboxes. This means a lot of manual configuration of property grids and/or external extractor objects, depending on the complexity of your documents.
However, it is also extremely reliable with a huge amount of flexibility. Since you configure an extractor to return the labels, you have all the extraction tools available to Grooper's suite of extraction types and extraction logic.
When other methods can't get the job done, configuring the Label Extractor property will be your go-to method to locate OMR labels.
Tips and Tricks: Translating Output
Often, the label on the document is not exactly what you want to collect for your data set. You may want to adjust the output value in one way or another. For example, you may want to collect the value "FARM" instead of the full text "FARM (Agriculture)".
This is easily done when using the List Match extractor for your Label Extractor.
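Conceptually, translating output is a lookup from the label text found on the page to the value you want to store. The Python sketch below only illustrates that idea and does not represent how List Match is configured in Grooper. Only the "FARM (Agriculture)" to "FARM" pair comes from this article; the second entry and the function name are made up for the example.

```python
# Lookup from label text on the document to the value to output.
# Only the first pair comes from the article; the second is hypothetical.
LABEL_TO_OUTPUT = {
    "FARM (Agriculture)": "FARM",
    "SHOW (Exhibition)": "SHOW",  # hypothetical label text
}


def translate_label(label, mapping=LABEL_TO_OUTPUT):
    """Return the translated output for a label, or the label itself if unmapped."""
    return mapping.get(label, label)
```

With this mapping, `translate_label("FARM (Agriculture)")` returns `"FARM"`, and a label with no entry is passed through unchanged.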
Using a Data Field's List Values
A simple solution for the most simple cases.
This next method is fantastic... if it works for you. It is extremely simple to set up, but has the most limitations. However, for straightforward OMR extraction, it is highly effective with little setup involved.
This method uses a Data Field's List Values settings to function. Typically, the List Values property is configured to aid in human review during a Review step in a Batch Process. It allows you to enter a list of values the user can pick from during review. Labeled OMR has special interactivity with the List Values property. If you do not configure the Label Extractor, Grooper will check to see if any List Values have been entered. If so, it will attempt to match the items in the List Values list as the OMR labels.
- This could be a knock-on benefit in that you might want to configure a List Values list for OMR fields regardless to make your document reviewer's work easier (and potentially more accurate).
- What is a group of checkboxes but a list of values to select from on a document? If human review is part of your process, you might be using List Values to give your document reviewers a selection list of checkbox options anyway. If it turns out those List Values match the OMR labels anyway, great! There's no need to configure a Label Extractor for Labeled OMR in that case. Two birds. One property configuration.
⚠ This method's strength and limitation both lie in its simplicity. It will not work for every situation.
Using Label Sets
Harness the power of Label Sets. Simple set up. Easy output translation.
Grooper's Label Set functionality provides powerful document extraction and classification capabilities by leveraging the prevalence and utility of field labels. Labeled OMR is a "Label Set aware" extractor. OMR labels can be collected for a Data Field's set of labels and used at time of extraction in place of a Label Extractor.
If you are using Label Sets in your solution, this approach will most likely be the one for you. The setup is fairly simple, and translating/formatting your output is a breeze.
- This article presumes you have some awareness of Label Sets and the Labeling Behavior. For more information, please visit the Label Sets article.
Notice that we only have one area to add labels, but we have three options for the OMR. Instead of collecting a label, we're going to configure the Data Field.
⚠ There is an order of operations if you configure multiple OMR label extraction/collection methods.
Maximum Noise
The concept of character noise is important to how Grooper isolates and filters out OMR label groups. A noise character is any alphanumeric character (not punctuation characters) that falls between OMR labels. Typically, OMR labels are grouped close together on a document with little to no other text between the labels. Grooper will filter out label matches with large numbers of characters between them.
For example, take these checkboxes using the labels "True" and "False". Yes, the labels are nearby checkboxes, but those same labels exist in the sentences to the right. How does Grooper distinguish between the OMR labels and those same words otherwise popping up on the document? Noise.
First, Grooper will draw a box around the boundaries of each OMR label.
Then, Grooper will drop out the labels and count the characters remaining within each boundary. These remaining characters are "noise". Grooper will count each character (alphanumeric only) to establish a noise count between potential OMR labels in a group.
In cases where there are multiple label hits on the document, Grooper will use whichever OMR label group has the least amount of noise.
The Maximum Noise property dictates how much noise is allowable between OMR labels in a group. By default, only 5 noise characters are allowed. This means if there are more than 5 noise characters between your OMR labels, Grooper will always toss out your OMR labels. However, this can be adjusted.
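The noise logic described above can be approximated in Python. This is a simplified, one-dimensional sketch under stated assumptions: it works on a single line of text with character spans rather than on page regions, and the function names are hypothetical. Only the default limit of 5 and the "alphanumeric only" rule come from this article.

```python
def count_noise(text, label_spans):
    """Count alphanumeric characters between labels that are not part of a label.

    `label_spans` is a list of (start, end) character positions, one per label.
    """
    spans = sorted(label_spans)
    group_start, group_end = spans[0][0], spans[-1][1]
    noise = 0
    for i in range(group_start, group_end):
        inside_label = any(start <= i < end for start, end in spans)
        if not inside_label and text[i].isalnum():
            noise += 1
    return noise


def pick_label_group(text, candidate_groups, maximum_noise=5):
    """Drop groups over the noise limit, then return the least noisy one (or None)."""
    scored = [(count_noise(text, group), group) for group in candidate_groups]
    allowed = [pair for pair in scored if pair[0] <= maximum_noise]
    if not allowed:
        return None
    return min(allowed, key=lambda pair: pair[0])[1]


text = "[ ] True [ ] False  Mark True if you agree or mark False."
t1 = text.find("True")
f1 = text.find("False")
t2 = text.find("True", t1 + 1)
f2 = text.find("False", f1 + 1)
checkbox_group = [(t1, t1 + 4), (f1, f1 + 5)]  # labels next to the boxes
sentence_group = [(t2, t2 + 4), (f2, f2 + 5)]  # same words inside the sentence
```

The checkbox group has 0 noise characters, since only brackets and spaces sit between its labels. The sentence group has 16, so it exceeds the limit and `pick_label_group(text, [sentence_group, checkbox_group])` returns the checkbox group.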
Header Labels
A "Header label" is a label that helps identify fields, sections or columns of information on a document. Header labels help distinguish the type of information listed on a document. They can be a huge help in identifying OMR checkbox labels that are very generic.
For example, "yes or no" style checkboxes are common on many forms, often appearing next to each question in a series. On several (if not all) voter registration applications in the United States, you will see two questions.
- Are you a US Citizen?
- Will you be over 18 at the time of the election?
Often these are styled as OMR checkboxes with "yes or no" checkboxes next to them. The problem is the OMR labels are the same for both fields ("YES" and "NO"). How is Grooper going to disambiguate between the fields? How will it know if you want to collect a value for one question or the other?
A Header Label will do the trick! In cases where you have multiple OMR label hits, Grooper will use the labels with the least amount of noise. So, if we gave the Labeled OMR extractor a Header Label matching the text
You can establish a Header Label in one of two ways:
- Using the Labeled OMR extractor's Header Extractor property.
- Collecting an OMR Data Field's Header label when using Label Sets.
Using the Header Extractor Property
Using Label Sets
If you're using Label Sets, all you need to do is collect a Data Field's Header label for your Document Types.
Configure the Extractor Part 2: OMR Modes
After you locate the OMR labels, you must determine how checkboxes behave on your document and choose an OMR mode.
Checkboxes detail information in one of three ways, either:
- You will have several checkboxes next to several label options, giving you a list of choices. Of these choices, you may choose only one.
- You will have several checkboxes next to several label options and you may choose multiple.
- You will have a single checkbox and it really just matters whether the checkbox is checked or not.
Labeled OMR has three corresponding Modes to account for this:
- CheckOne
- CheckMulti
- Boolean
How your document is formatted informs how the checkboxes behave, which will inform which mode you choose.
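The difference between the three modes can be sketched as a function over detected check states. This is an illustration only: the function name and input shape are assumptions, and how Grooper itself behaves when several boxes are checked in CheckOne mode is not covered here, so the sketch simply returns no result in that case.

```python
def labeled_omr_result(mode, checkbox_states):
    """Return a result for a group of checkboxes.

    `checkbox_states` is a list of (label, is_checked) pairs in document order.
    """
    checked = [label for label, is_checked in checkbox_states if is_checked]
    if mode == "CheckOne":
        # Assumption for this sketch: a result only when exactly one box is checked.
        return checked[0] if len(checked) == 1 else None
    if mode == "CheckMulti":
        # Any number of labels may come back, including none.
        return checked
    if mode == "Boolean":
        # The label is ignored; only the check state matters.
        return "True" if checked else "False"
    raise ValueError(f"Unknown mode: {mode}")
```

For example, `labeled_omr_result("CheckOne", [("Yes", True), ("No", False)])` returns `"Yes"`, while the same call in Boolean mode with a single unchecked box returns `"False"`.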
CheckOne
The CheckOne mode is the default OMR mode. This mode presumes that, however many checkboxes are present on the document, only one box may be checked. Or, in other words, only one choice may be selected on the document.
CheckMulti
The CheckMulti mode presumes that, given multiple checkboxes, any number of them may be checked. Or, in other words, multiple choices may be selected on the document. For example, this portion of the "Application for Cow Application" document has a "Cow Familiarity" section. The applicant would check each box that applies to their familiarity with cows (and cow-like lifeforms). So, multiple selections may be made. One box could be checked. Five boxes could be checked. Maybe even none.
Boolean
The Boolean mode presumes there is a single checkbox and you're trying to determine if it is checked or not. For the CheckOne and CheckMulti modes, the goal is to extract information about which of a set of boxes are checked, returning OMR labels as the value. For the Boolean mode, you don't care about the label. Instead you want a "True/False" value depending on if the box is checked (True) or not (False). For example, on our "Application for Cow Application" there is an "Electronic Correspondence Option" field on the back page. Checking the box indicates the applicant wants to receive electronic mail. Leaving it unchecked indicates they want to receive paper mail. The Boolean mode will allow us to extract the value "True" if the box is checked or "False" if not.
Additional Information
The Labeled OMR extractor will detect checkboxes in one of two ways:
- Using Layout Data
- Using circular box detection
Layout Data Box Detection
Prior to Grooper version 2021, Layout Data was always required. This means checkbox locations and their check states would need to first be detected by a box detecting IP Step (either Box Detection or Box Removal) in an IP Profile. When the IP Profile is executed, this data is stored in a page's "Grooper.LayoutData.json" file and used at time of extraction to find checkboxes near OMR labels and determine whether or not the box is checked.
FYI: Layout Data is saved when certain IP Commands execute in an IP Profile, such as Box Removal or Box Detection. This occurs when the IP Profile is applied during the following Grooper Activities:
In version 2021, the Labeled OMR extractor's functionality was expanded to allow for circular box detection at extraction time without Layout Data. Labeled OMR performs a box detection pass during extraction, analyzing "box-like" shapes near the OMR labels. Due to this, it is now also possible to use Labeled OMR to extract from rectangular boxes at extraction time as well (we will see this in the next section).
However, this is not preferred.
If you can collect Layout Data for rectangular checkboxes, you should. Grooper will always prioritize Layout Data over "circular box detection". Furthermore, you have much more control and customizability as to how rectangular checkboxes are detected using the Box Detection (or Box Removal) property grid.
Circular Box Detection (Radio Buttons)
Prior to version 2021, Grooper could only perform OMR using square or rectangular checkboxes. Circular checkboxes, also known as "radio buttons", could not be detected. This is because only square checkboxes can be detected and saved to a document's Layout Data.
- There is only Box Detection and Box Removal. There is no "Circle Detection" image processing command that can be applied globally to a page. What looks an awful lot like an unchecked radio button? The letter "O". You'd end up with a lot of false positive hits on O's and zeroes and other circular artifacts that are not circular checkboxes.
‼ Labeled OMR is the only OMR-based extractor type that will function without Layout Data. Both Ordered OMR and Zonal OMR must have Layout Data present to function.





























































