2021:Labeled OMR (Extractor Type)

Labeled OMR is an extractor used to output OMR checkbox labels. It determines whether labeled checkboxes are checked or not. If checked, it outputs the label(s) as the result.
About
Documents use checkboxes to make our lives easier. They are particularly prevalent on structured forms. Checkboxes give the person filling out the form the ability to just check a box next to one of a series of options rather than typing in the information.
However, most of Grooper's extraction centers around regular expressions: matching text patterns and returning the result. There isn't necessarily a character to match for a checked checkbox. Regular expressions aren't going to cut it to determine if a box is checked or not.
This is where OMR comes into play. OMR stands for "Optical Mark Recognition". OMR determines checkbox states. The basic idea behind it is very simple. First, find a box. A box is just four lines connected to each other in a square-like fashion. If that box has a mark of some kind inside it, it is checked. If not, it's not. Checked (or marked) boxes, whether marked with an "x" (☒), a checkmark (☑), or a check block (▣), will have more black pixels inside the box than an unchecked (or unmarked) one (☐). If the amount of black pixels inside the detected box exceeds a threshold, it's checked (or marked). If not, it's unchecked (or unmarked).
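The black-pixel idea can be illustrated with a minimal Python sketch. This is not Grooper's actual detection code. The function names, the 0/1 pixel grid and the 0.2 threshold are all assumptions made for illustration only.

```python
# Illustrative only: a binarized image is a list of rows, 1 = black, 0 = white.

def fill_ratio(image, box):
    """Return the fraction of black pixels inside a box's interior.

    box is (left, top, right, bottom); right and bottom are exclusive.
    """
    left, top, right, bottom = box
    pixels = [image[y][x] for y in range(top, bottom) for x in range(left, right)]
    return sum(pixels) / len(pixels) if pixels else 0.0


def is_checked(image, box, threshold=0.2):
    """Treat the box as checked when enough of its interior is black.

    The 0.2 threshold is a made-up value, not a Grooper default.
    """
    return fill_ratio(image, box) >= threshold


# A 5x5 empty box interior, and one with an "x" drawn corner to corner.
EMPTY = [[0] * 5 for _ in range(5)]
CROSSED = [[1 if x == y or x + y == 4 else 0 for x in range(5)] for y in range(5)]

print(is_checked(EMPTY, (0, 0, 5, 5)))    # False
print(is_checked(CROSSED, (0, 0, 5, 5)))  # True (9 of 25 pixels are black)
```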
A simple example would be a document asking a question and giving two boxes to check “Yes” or “No.” For example, see the portion of the document below asking if the applicant is a U.S. Citizen. “Yes” or “No” would be the labels. Either “Yes” or “No” would be the field's final result, depending on which box is checked. In this case, "Yes".
In general, what you want to extract is the text of the checked label. The Labeled OMR extractor allows you to do just that.
First, you will set up an extractor to locate the text labels.
Then, Grooper's OMR detection will determine if there is a box next to the label, and whether or not that box is checked.
Last, if the label is checked, the label is returned as the extractor's result.
FYI: Labeled OMR has multiple extraction modes depending on how checkboxes behave on the document. There is also a Boolean mode to simply output "True" or "False" if a single checkbox is checked or not. We will discuss the different extraction modes further in the How To section of this article.
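The three steps above (locate labels, find the box next to each label, return the checked label) can be sketched in Python. This is a conceptual illustration, not Grooper's implementation. The `Label` and `Box` classes, the "nearest box within 50 units" rule and all coordinates are assumptions, since this article does not describe how Grooper decides which box belongs to which label.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Label:
    text: str
    x: float
    y: float


@dataclass
class Box:
    x: float
    y: float
    checked: bool


def nearest_box(label, boxes, max_distance=50.0):
    """Return the closest box within max_distance of the label, or None."""
    in_range = [
        (hypot(box.x - label.x, box.y - label.y), box)
        for box in boxes
    ]
    in_range = [(dist, box) for dist, box in in_range if dist <= max_distance]
    if not in_range:
        return None
    return min(in_range, key=lambda pair: pair[0])[1]


def checked_labels(labels, boxes):
    """Return the text of every label whose nearest box is checked."""
    result = []
    for label in labels:
        box = nearest_box(label, boxes)
        if box is not None and box.checked:
            result.append(label.text)
    return result


# Hypothetical "U.S. Citizen?" field: the "Yes" box is checked, "No" is not.
labels = [Label("Yes", 120, 100), Label("No", 220, 100)]
boxes = [Box(100, 100, checked=True), Box(200, 100, checked=False)]
print(checked_labels(labels, boxes))  # ['Yes']
```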
How To
Assign the Extractor
The Labeled OMR extractor can be utilized in two ways:
- As a Value Reader's extractor type.
- As an object's extractor property configuration. For example:
- As a Data Field's Value Extractor property's extractor configuration.
- As a Data Type's Local Extractor property's extractor configuration.
- As a Document Type's Positive Extractor property's extractor configuration.
- And more!
Value Reader
The Labeled OMR extractor is one of the extractor types available to the Value Reader extractor object.
Extractor Property
You may also configure a Labeled OMR extractor when configuring an extractor property. Many Grooper objects have some kind of extractor property in their property grids. Labeled OMR is one of the options that can be selected as the extractor type.
For example, Data Field objects have a Value Extractor property, which collects a result when the Data Model is extracted during the Extract activity.
Configure the Extractor Part 1: OMR Labels
The first part of the Labeled OMR extractor's configuration is label extraction. Labels can be collected in one of three ways:
- Using the Label Extractor property.
- Using the List Values settings of a Data Field.
- Collecting the OMR labels as part of a Label Set.
- When we get to this point, this article will presume you have some familiarity with Label Sets and the Labeling Behavior functionality. For more information on Label Sets please visit the Label Sets article.
At this point, the Labeled OMR extractor is totally unconfigured. Next, we will detail each of the three different ways to extract OMR labels. While the configuration is slightly different, the goal is the same: Locate text labels next to checkboxes. Each method has its own strengths and weaknesses, giving you flexibility in how you locate the OMR labels based on your documents' circumstances.
Using the Label Extractor
Moderate to high level of work up front. High flexibility in configuration options.
One way to locate OMR labels is by configuring the Labeled OMR extractor's Label Extractor property. In some ways, this is the most "effort intensive" of the three options. It will require you to configure an extractor to return each of the labels for the set of OMR checkboxes. This means a lot of manual configuration of property grids and/or external extractor objects, depending on the complexity of your documents.
However, it is also extremely reliable with a huge amount of flexibility. Since you configure an extractor to return the labels, you have Grooper's full suite of extractor types and extraction logic at your disposal.
When other methods can't get the job done, configuring the Label Extractor property will be your go-to method to locate OMR labels.
Tips and Tricks: Translating Output
Often, the label on the document is not exactly what you want to collect for your data set. You may want to adjust the output value in one way or another. For example, you may want to collect the value "FARM" instead of the full text "FARM (Agriculture)".
This is easily done when using the List Match extractor for your Label Extractor.
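Conceptually, translating output is just a lookup from the text found on the page to the value you want to store. The Python sketch below shows only that idea. It does not show how List Match is configured in Grooper, and the `translate` function and its table are hypothetical.

```python
# Hypothetical lookup table: label text on the document -> value to output.
TRANSLATIONS = {
    "FARM (Agriculture)": "FARM",
}


def translate(label, translations=TRANSLATIONS):
    """Return the translated value, or the label unchanged if none is defined."""
    return translations.get(label, label)


print(translate("FARM (Agriculture)"))  # FARM
print(translate("Other"))               # Other (no translation defined)
```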
Using a Data Field's List Values
A simple solution for the most simple cases.
This next method is fantastic... if it works for you. It is extremely simple to set up, but has the most limitations. However, for straightforward OMR extraction, it is highly effective with little setup involved.
This method uses a Data Field's List Values settings to function. Typically, the List Values property is configured to aid in human review during a Review step in a Batch Process. It allows you to enter a list of values the user can pick from during review. Labeled OMR has special interactivity with the List Values property. If you do not configure the Label Extractor, Grooper will check to see if any List Values have been entered. If so, it will attempt to match the items in the List Values list as the OMR labels.
- This could be a knock-on benefit in that you might want to configure a List Values list for OMR fields regardless to make your document reviewer's work easier (and potentially more accurate).
- What is a group of checkboxes but a list of values to select from on a document? If human review is part of your process, you might be using List Values to give your document reviewers a selection list of checkbox options anyway. If it turns out those List Values match the OMR labels, great! There's no need to configure a Label Extractor for Labeled OMR in that case. Two birds. One property configuration.
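The fallback described above can be pictured with a short Python sketch: use the Label Extractor if one is configured, otherwise try the List Values. This is an illustration of the described behavior only. The `find_labels` function, its parameters and the case-insensitive text match are assumptions, not Grooper's API.

```python
def find_labels(page_text, label_extractor=None, list_values=()):
    """Return OMR label candidates for a page.

    label_extractor: optional callable taking the page text and returning labels.
    list_values: the Data Field's List Values, used only when no extractor is set.
    """
    if label_extractor is not None:
        return label_extractor(page_text)
    # Assumption: a simple case-insensitive search stands in for label matching.
    lowered = page_text.lower()
    return [value for value in list_values if value.lower() in lowered]


page = "Are you a U.S. Citizen?   Yes   No"
print(find_labels(page, list_values=["Yes", "No", "Maybe"]))  # ['Yes', 'No']
```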
⚠ This method's strength and limitation lie in its simplicity. It will not work for every situation.
Using Label Sets
Harness the power of Label Sets. Simple set up. Easy output translation.
Grooper's Label Set functionality provides powerful document extraction and classification capabilities by leveraging the prevalence and utility of field labels. Labeled OMR is a "Label Set aware" extractor. OMR labels can be collected for a Data Field's set of labels and used at time of extraction in place of a Label Extractor.
If you are using Label Sets in your solution, this approach will most likely be the one for you. The setup is fairly simple, and translating/formatting your output is a breeze.
- This article presumes you have some awareness of Label Sets and the Labeling Behavior. For more information, please visit the Label Sets article.
⚠ There is an order of operations if you configure multiple OMR label extraction/collection methods.
Maximum Noise
The concept of character noise is important to how Grooper isolates and filters out OMR label groups. A noise character is any alphanumeric character (not punctuation characters) that falls between OMR labels. Typically, OMR labels are grouped close together on a document with little to no other text between the labels. Grooper will filter out label matches with large numbers of characters between them.
For example, take these checkboxes using the labels "True" and "False". Yes, the labels are near checkboxes, but those same labels exist in the sentences to the right. How does Grooper distinguish between the OMR labels and those same words otherwise popping up on the document? Noise.
First, Grooper will draw a box around the boundaries of each OMR label.
Then, Grooper will drop out the labels and count the characters remaining within each boundary. These remaining characters are "noise". Grooper will count each character (alphanumeric only) to establish a noise count between potential OMR labels in a group.
In cases where there are multiple label hits on the document, Grooper will use whichever OMR label group has the least amount of noise.
The Maximum Noise property dictates how much noise is allowable between OMR labels in a group. By default, only 5 noise characters are allowed. This means if there are more than 5 noise characters between your OMR labels, Grooper will always toss out your OMR labels. However, this can be adjusted.
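Noise counting can be sketched in Python on a single line of text. This is a simplified illustration, not Grooper's implementation: Grooper works on label boundaries on the page, while this sketch uses character offsets in a string. The function names and the sample text are assumptions. Only the "alphanumeric characters only" rule and the default of 5 come from this article.

```python
def noise_count(text, spans):
    """Count alphanumeric characters between labels in one candidate group.

    spans is a list of (start, end) character offsets, one per label match.
    Characters belonging to the labels themselves are dropped out first.
    """
    start = min(s for s, _ in spans)
    end = max(e for _, e in spans)
    label_chars = set()
    for s, e in spans:
        label_chars.update(range(s, e))
    return sum(
        1 for i in range(start, end)
        if i not in label_chars and text[i].isalnum()
    )


def best_group(text, groups, maximum_noise=5):
    """Return the group with the least noise, ignoring groups over the limit."""
    scored = [(noise_count(text, group), group) for group in groups]
    allowed = [(noise, group) for noise, group in scored if noise <= maximum_noise]
    return min(allowed, key=lambda pair: pair[0])[1] if allowed else None


text = "True  False   Answer True unless the claim is False."
t1 = text.index("True")
f1 = text.index("False")
t2 = text.index("True", t1 + 1)
f2 = text.index("False", f1 + 1)
checkbox_group = [(t1, t1 + 4), (f1, f1 + 5)]   # labels next to the checkboxes
sentence_group = [(t2, t2 + 4), (f2, f2 + 5)]   # same words inside the sentence

print(noise_count(text, checkbox_group))  # 0
print(noise_count(text, sentence_group))  # 16 ("unless", "the", "claim", "is")
print(best_group(text, [sentence_group, checkbox_group]) == checkbox_group)  # True
```

With the default limit of 5, the sentence group is tossed out entirely. Raising `maximum_noise` in the sketch mirrors raising the Maximum Noise property.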
Header Labels
A "Header label" is a label that helps identify fields, sections or columns of information on a document. Header labels help distinguish the type of information listed on a document. They can be a huge help in identifying OMR checkbox labels that are very generic.
For example, "yes or no" style checkboxes are common on many forms, often with a "Yes" box and a "No" box following each of a series of questions. On several (if not all) voter registration applications in the United States, you will see two questions.
- Are you a US Citizen?
- Will you be over 18 at the time of the election?
Often these are styled as OMR checkboxes with "yes or no" checkboxes next to them. The problem is the OMR labels are the same for both fields ("YES" and "NO"). How is Grooper going to disambiguate between the fields? How will it know if you want to collect a value for one question or the other?
A Header Label will do the trick! In cases where you have multiple OMR label hits, Grooper will use the labels with the least amount of noise. So, if we gave the Labeled OMR extractor a Header Label matching the text |
You can establish a Header Label in one of two ways:
- Using the Labeled OMR extractor's Header Extractor property.
- Collecting an OMR Data Field's Header label when using Label Sets.
Using the Header Extractor Property
Using Label Sets
If you're using Label Sets, all you need to do is collect a Data Field's Header label for your Document Types.
Configure the Extractor Part 2: OMR Modes
After you locate the OMR labels, you must determine how checkboxes behave on your document and choose an OMR mode.
Checkboxes detail information in one of three ways:
- You will have several checkboxes next to several label options, giving you a list of choices. Of these choices, you may choose only one.
- You will have several checkboxes next to several label options and you may choose multiple.
- You will have a single checkbox and it really just matters whether the checkbox is checked or not.
Labeled OMR has three corresponding Modes to account for this:
- CheckOne
- CheckMulti
- Boolean
How your document is formatted informs how the checkboxes behave, which will inform which mode you choose.
CheckOne
The CheckOne mode is the default OMR mode. This mode presumes that, however many checkboxes are present on the document, only one box may be checked. Or, in other words, only one choice may be selected on the document.
CheckMulti
The CheckMulti mode presumes that, given multiple checkboxes, any number of them may be checked. Or, in other words, multiple choices may be selected on the document. For example, this portion of the "Application for Cow Application" document has a "Cow Familiarity" section. The applicant would check each box that applies to their familiarity with cows (and cow-like lifeforms). So, multiple selections may be made. One box could be checked. Five boxes could be checked. Maybe even none.
Boolean
The Boolean mode presumes there is a single checkbox and you're trying to determine if it is checked or not. For the CheckOne and CheckMulti modes, the goal is to extract information about which of a set of boxes are checked, returning OMR labels as the value. For the Boolean mode, you don't care about the label. Instead you want a "True/False" value depending on if the box is checked (True) or not (False). For example, on our "Application for Cow Application" there is an "Electronic Correspondence Option" field on the back page. Checking the box indicates the applicant wants to receive electronic mail. Leaving it unchecked indicates they want to receive paper mail. The Boolean mode will allow us to extract the value "True" if the box is checked or "False" if not.
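The difference between the three modes comes down to what is returned for the same set of checkbox states. Here is a minimal Python sketch of that difference. It is not Grooper code. The `Mode` enum, the `omr_result` function and the "Cow Familiarity" labels are hypothetical. In particular, this article does not say what CheckOne returns when zero or several boxes are checked, so returning `None` there is purely an assumption of the sketch.

```python
from enum import Enum


class Mode(Enum):
    CHECK_ONE = "CheckOne"
    CHECK_MULTI = "CheckMulti"
    BOOLEAN = "Boolean"


def omr_result(states, mode):
    """Turn checkbox states into a result for the given mode.

    states maps each OMR label to True (checked) or False (unchecked).
    """
    checked = [label for label, is_checked in states.items() if is_checked]
    if mode is Mode.BOOLEAN:
        # The label is ignored; only the check state matters.
        return "True" if checked else "False"
    if mode is Mode.CHECK_MULTI:
        # Any number of labels may come back, including none.
        return checked
    # CheckOne. Assumption: anything other than exactly one checked box
    # yields no result in this sketch.
    return checked[0] if len(checked) == 1 else None


print(omr_result({"Yes": True, "No": False}, Mode.CHECK_ONE))
# Yes
print(omr_result({"Owner": True, "Rancher": False, "Admirer": True}, Mode.CHECK_MULTI))
# ['Owner', 'Admirer']
print(omr_result({"Electronic Correspondence Option": False}, Mode.BOOLEAN))
# False
```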
Additional Information
The Labeled OMR extractor will detect checkboxes in one of two ways:
- Using Layout Data
- Using circular box detection
Layout Data Box Detection
Prior to Grooper version 2021, Layout Data was always required. This means checkbox locations and their check states would need to first be detected by a box detecting IP Step (either Box Detection or Box Removal) in an IP Profile. When the IP Profile is executed, this data is stored in a page's "Grooper.LayoutData.json" file and used at time of extraction to find checkboxes near OMR labels and determine whether or not the box is checked.
FYI: Layout Data is saved when certain IP Commands execute in an IP Profile, such as Box Removal or Box Detection. This occurs when the IP Profile is applied during the following Grooper Activities:
In version 2021, the Labeled OMR extractor's functionality was expanded to allow for circular box detection at extraction time without Layout Data. Labeled OMR performs a box detection pass during extraction, analyzing "box-like" shapes near the OMR labels. Due to this, it is now also possible to use Labeled OMR to extract from rectangular boxes at extraction time as well (we will see this in the next section).
However, this is not preferred.
If you can collect Layout Data for rectangular checkboxes, you should. Grooper will always prioritize Layout Data over "circular box detection". Furthermore, you have much more control and customizability as to how rectangular checkboxes are detected using the Box Detection (or Box Removal) property grid.
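The priority described above can be pictured as a simple fallback: use Layout Data when it exists, and only detect boxes at extraction time when it does not. The Python sketch below illustrates just that ordering. The `choose_boxes` function and its arguments are hypothetical, and whether Grooper makes this choice per page or per label is not stated in this article.

```python
def choose_boxes(layout_data_boxes, detect_at_extraction):
    """Return (boxes, source), preferring boxes already stored in Layout Data.

    detect_at_extraction is a callable that is only invoked when no
    Layout Data boxes are available.
    """
    if layout_data_boxes:
        return layout_data_boxes, "Layout Data"
    return detect_at_extraction(), "extraction-time detection"


stored = [{"checked": True}]
boxes, source = choose_boxes(stored, lambda: [{"checked": False}])
print(source)  # Layout Data
```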
Circular Box Detection (Radio Buttons)
Prior to version 2021, Grooper could only perform OMR using square or rectangular checkboxes. Circular checkboxes, also known as "radio buttons", could not be detected. This is because only square checkboxes can be detected and saved to a document's Layout Data.
- There is only Box Detection and Box Removal. There is no "Circle Detection" image processing command that can be applied globally to a page. What looks an awful lot like an unchecked radio button? The letter "O". You'd end up with a lot of false positive hits on O's and zeroes and other circular artifacts that are not circular checkboxes.
‼ Labeled OMR is the only OMR-based extractor type that will function without Layout Data. Both Ordered OMR and Zonal OMR must have Layout Data present to function.
Version Differences
Radio Button Detection (2021)
In version 2021, the Labeled OMR extractor's functionality was expanded to allow for radio button extraction. Prior to 2021, Grooper could only perform OMR using square or rectangular checkboxes. In 2021, Labeled OMR is able to analyze checkboxes at the time of extraction, granting it the ability to detect radio buttons.
- This also increased the extractor's functionality in that Layout Data is no longer strictly required for the extractor to function. It will use that Layout Data if present (which is preferred when possible), but it can now analyze the image at time of extraction if Layout Data is not present.
Extractor Expansion (2021)
Prior to version 2021, the Labeled OMR extractor type was only accessible using the Value Extractor property of certain objects, such as Data Fields.
In version 2021, all extractor types were expanded to the various extractor properties of all objects. This allows the Labeled OMR extractor to be utilized in ways never before possible.
- For example, a Data Type extractor can now use Labeled OMR as its Local Extractor's extractor type. Prior to 2021, you could not use Labeled OMR to extract OMR labels using an extractor object.
Labeled OMR Introduction (2.90)
In version 2.80, Labeled OMR is referred to as Anchored OMR. The two features are configured and function similarly.
Prior to version 2.80, this functionality would have been performed using the "Data Element Profiles" tab of a Document Type and drawing "OMR Zones" around the checkboxes to read their check states. Grooper has moved away from "Data Element Profiles" in favor of configuring the functionality directly on Data Elements in a Data Model, using extractor types such as Labeled OMR.