2023:Labeled OMR (Value Extractor): Difference between revisions
No edit summary |
No edit summary |
||
| Line 462: | Line 462: | ||
# Remember, we said it was best practice to capture the full label when possible. | # Remember, we said it was best practice to capture the full label when possible. | ||
# Let's go ''against'' our own best practice advice and only capture part of the OMR labels. | # Let's go ''against'' our own best practice advice and only capture part of the OMR labels. | ||
#* i.e. <code> | #* i.e. <code>DOMESTIC</code> instead of <code>DOMESTIC(AGRICULTURE)</code> | ||
|valign=top| | |valign=top| | ||
[[File: | [[File:2023 Labeled OMR - 2023 03 How To 01 Maximum Noise 01.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 470: | Line 470: | ||
# If we try to extract the field at this point, we will fail. No value is extracted. | # If we try to extract the field at this point, we will fail. No value is extracted. | ||
#* Why? Noise! | #* Why? Noise! | ||
# We have 15 noise characters between "FARM" and "SHOW". | |||
|valgin=top| | |||
[[File:2023 Labeled OMR - 2023 03 How To 01 Maximum Noise 02.png]] | |||
|- | |||
|valign=top| | |||
# The '''''Maximum Noise''''' property's default value is ''5''. | # The '''''Maximum Noise''''' property's default value is ''5''. | ||
#* This means there can be only 5 noise characters between the labels before Grooper tosses out the match. | #* This means there can be only 5 noise characters between the labels before Grooper tosses out the match. | ||
|valgin=top| | |valgin=top| | ||
[[File: | [[File:2023 Labeled OMR - 2023 03 How To 01 Maximum Noise 03.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 481: | Line 485: | ||
# We have 15 noise characters. So, if we change the '''''Maximum Noise''''' property to ''15''... | # We have 15 noise characters. So, if we change the '''''Maximum Noise''''' property to ''15''... | ||
|valgin=top| | |||
[[File:2023 Labeled OMR - 2023 03 How To 01 Maximum Noise 04.png]] | |||
|- | |||
|valign=top| | |||
# The OMR value extracts. | # The OMR value extracts. | ||
|valing=top| | |valing=top| | ||
[[File: | [[File:2023 Labeled OMR - 2023 03 How To 01 Maximum Noise 05.png]] | ||
|} | |} | ||
| Line 536: | Line 544: | ||
#* <code>a. Has the applicant completed the Cow Operating Training Program accredited by the National Bovine Accrediting Board?</code> | #* <code>a. Has the applicant completed the Cow Operating Training Program accredited by the National Bovine Accrediting Board?</code> | ||
|valign=top| | |valign=top| | ||
[[File: | [[File:2023 Labeled OMR - 2023 04 How To 01 Using the Header Extractor Property 01.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 542: | Line 550: | ||
# Without a Header Label, we're not collecting the correct value. | # Without a Header Label, we're not collecting the correct value. | ||
#* This "Yes/No" checkbox pertains to some different value. | #* This "Yes/No" checkbox pertains to some different value. | ||
|valign=top| | |||
[[File:2023 Labeled OMR - 2023 04 How To 01 Using the Header Extractor Property 02.png]] | |||
|- | |||
|valign=top| | |||
# We will configure the '''''Header Extractor''''' to filter out false positive label groups. | # We will configure the '''''Header Extractor''''' to filter out false positive label groups. | ||
# You can use whatever extraction method works best for your document set. We will use the '''''List Match''''' extractor (which is generally the most common for matching labels like this). | # You can use whatever extraction method works best for your document set. We will use the '''''List Match''''' extractor (which is generally the most common for matching labels like this). | ||
|valign=top| | |valign=top| | ||
[[File: | [[File:2023 Labeled OMR - 2023 04 How To 01 Using the Header Extractor Property 03.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 552: | Line 564: | ||
# Now the '''''Labeled OMR''''' extractor will have some additional context and use this as an anchor for the OMR label groups. | # Now the '''''Labeled OMR''''' extractor will have some additional context and use this as an anchor for the OMR label groups. | ||
|valign=top| | |valign=top| | ||
[[File: | [[File:2023 Labeled OMR - 2023 04 How To 01 Using the Header Extractor Property 04.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 561: | Line 573: | ||
# Ultimately, this gets us the result we want, the "Yes" or "No" OMR value labeled by our Header Label. | # Ultimately, this gets us the result we want, the "Yes" or "No" OMR value labeled by our Header Label. | ||
|valign=top| | |valign=top| | ||
[[File: | [[File:2023 Labeled OMR - 2023 04 How To 01 Using the Header Extractor Property 05.png]] | ||
|} | |} | ||
</tab> | </tab> | ||
Revision as of 09:18, 23 October 2023
|
WIP |
This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly. This tag will be removed upon draft completion. |

Labeled OMR is an extractor used to output OMR checkbox labels. It determines whether labeled checkboxes are checked or not. If checked, it outputs the label(s) as the result.
About
|
You may download and import the file below into your own Grooper environment (version 2021). This contains a Batch with the example document(s) discussed in this tutorial and a Content Model configured according to its instructions.
|
Documents use checkboxes to make our lives easier. They are particularly prevalent on structured forms. They give the person filling out the form the ability to just check a box next to a series of options rather than typing in the information.
However, most of Grooper's extraction centers around regular expressions, matching text patterns and returning the result. There isn't necessarily a character to match for a checked checkbox, so a regular expression isn't going to cut it when you need to determine if a box is checked or not.
This is where OMR comes into play. OMR stands for "Optical Mark Recognition". OMR determines checkbox states. The basic idea behind it is very simple. First, find a box. A box is just four lines connected to each other in a square-like fashion. If that box has a mark of some kind inside it, it is checked. If not, it's not. Checked (or marked) boxes, whether a checked "x" (☒), a checkmark (☑), or a check block (▣), will have more black pixels inside the box than an unchecked (or unmarked) one (☐). If the share of black pixels inside the detected box is above a threshold, it's checked (or marked). If not, it's unchecked (or unmarked).
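The black-pixel idea above can be sketched in a few lines of Python. This is only an illustration of the concept, not Grooper's implementation: the function names, the tiny 5x5 pixel grids, and the 0.15 threshold are all assumptions made for the example.

```python
def black_pixel_ratio(box):
    """Return the fraction of black pixels inside a box interior.

    `box` is a list of rows, where 1 means a black pixel and 0 a white one.
    """
    total = sum(len(row) for row in box)
    if total == 0:
        return 0.0
    return sum(sum(row) for row in box) / total


def is_checked(box, threshold=0.15):
    """Treat the box as checked when enough of its interior is black.

    The 0.15 threshold is a hypothetical value chosen for this sketch.
    """
    return black_pixel_ratio(box) >= threshold


# Interior of an unmarked box: no black pixels at all.
UNCHECKED = [[0, 0, 0, 0, 0] for _ in range(5)]

# Interior of a box marked with an "x": 9 of 25 pixels are black.
CHECKED_X = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]

print(is_checked(UNCHECKED))  # False
print(is_checked(CHECKED_X))  # True
```

A real scan has stray specks and faint marks, which is why a threshold is used rather than testing for "any black pixel at all".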
A simple example would be a document asking a question and giving two boxes to check “Yes” or “No.” For example, see the portion of the document below asking if the applicant is a U.S. Citizen. “Yes” or “No” would be the labels. Either “Yes” or “No” would be the field's final result, depending on which box is checked. In this case, "Yes".
In general, what you want to extract is the text of the checked label. The Labeled OMR extractor allows you to do just that.
|
First, you will set up an extractor to locate the text labels.
|
|
|
Then, Grooper's OMR detection will determine if there is a box next to the label, and whether or not that box is checked. |
|
|
Last, if the label is checked, the label is returned as the extractor's result. |
| FYI |
Labeled OMR has multiple extraction modes depending on how checkboxes behave on the document. There is also a Boolean mode that simply outputs "True" or "False" depending on whether a single checkbox is checked. We will discuss the different extraction modes further in the How To section of this article. |
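To make the last step concrete, here is a minimal sketch of what "return the checked label" and the Boolean mode amount to once each label's checkbox state is known. The function names and the dictionary input are hypothetical and exist only for this illustration. They are not part of Grooper.

```python
def checked_labels(states):
    """Return the labels whose checkbox is checked, in document order.

    `states` maps each label's text to True (checked) or False (unchecked).
    """
    return [label for label, checked in states.items() if checked]


def boolean_result(checked):
    """Mimic the Boolean mode: output "True" or "False" for a single checkbox."""
    return "True" if checked else "False"


# The U.S. Citizen example: the "Yes" box is checked, the "No" box is not.
citizen = {"Yes": True, "No": False}

print(checked_labels(citizen))  # ['Yes']
print(boolean_result(False))    # False
```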
How To
Assign the Extractor
The Labeled OMR extractor can be utilized in two ways:
- As a Value Reader's extractor type.
- As an object's extractor property configuration. For example:
- As a Data Field's Value Extractor property's extractor configuration.
- As a Data Type's Local Extractor property's extractor configuration.
- As a Document Type's Positive Extractor property's extractor configuration.
- And more!
Value Reader
|
The Labeled OMR extractor is one of the extractor types available to the Value Reader extractor object.
|
|
|
Extractor Property
You may also configure a Labeled OMR extractor when configuring an extractor property. Many Grooper objects have some kind of extractor property in their property grids. Labeled OMR is one of the options that can be selected as the extractor type.
|
For example, Data Field objects have a Value Extractor property, which collects a result when the Data Model is extracted during the Extract activity.
|
|
|
Configure the Extractor Part 1: OMR Labels
The first part of the Labeled OMR extractor's configuration is label extraction. Labels can be collected in one of three ways:
- Using the Label Extractor property.
- Using the List Values settings of a Data Field.
- Collecting the OMR labels when using Label Sets.
- When we get to this point, this article will presume you have some familiarity with Label Sets and the Labeling Behavior functionality. For more information on Label Sets please visit the Label Sets article.
|
|
At this point, the Labeled OMR extractor is totally unconfigured. Next, we will detail each of the three different ways to extract OMR labels. While the configuration is slightly different, the goal is the same: Locate text labels next to checkboxes. Each method has its own strengths and weaknesses, giving you flexibility in how you locate the OMR labels based on your documents' circumstances.
Using the Label Extractor
Moderate to high level of work up front. High flexibility in configuration options.
One way to locate OMR labels is by configuring the Labeled OMR extractor's Label Extractor property. In some ways, this is the most "effort intensive" of the three options. It will require you to configure an extractor to return each of the labels for the set of OMR checkboxes. This means a lot of manual configuration of property grids and/or external extractor objects, depending on the complexity of your documents.
However, it is also extremely reliable and highly flexible. Since you configure an extractor to return the labels, you have Grooper's full suite of extractor types and extraction logic at your disposal.
When other methods can't get the job done, configuring the Label Extractor property will be your go-to method to locate OMR labels.
|
|
|||
|
|
|||
|
|||
|
|
|||
|
|
Tips and Tricks: Translating Output
Often, the label on the document is not exactly what you want to collect for your data set. You may want to adjust the output value in one way or another. For example, you may want to collect the value "FARM" instead of the full text "FARM (Agriculture)".
This is easily done when using the List Match extractor for your Label Extractor.
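Conceptually, translating output is just a lookup from the label text found on the document to the value you want to store. The Python sketch below shows that idea only. It does not reproduce the List Match extractor's configuration syntax, and the `TRANSLATIONS` table and `translate` function are hypothetical names.

```python
# Hypothetical lookup table: label text on the document -> value to output.
TRANSLATIONS = {
    "FARM (Agriculture)": "FARM",
}


def translate(label, translations):
    """Return the translated value for a label, or the label itself if none is defined."""
    return translations.get(label, label)


print(translate("FARM (Agriculture)", TRANSLATIONS))  # FARM
print(translate("OTHER", TRANSLATIONS))               # OTHER
```

Labels without an entry pass through unchanged, so you only need to list the labels whose output should differ from the printed text.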
|
|
|
|
|
|
|
|
Using a Data Field's List Values
A simple solution for the most simple cases.
This next method is fantastic... if it works for you. It is extremely simple to set up, but has the most limitations. However, for straightforward OMR extraction, it is highly effective with little setup involved.
This method relies on a Data Field's List Values settings. Typically, the List Values property is configured to aid human review during a Review step in a Batch Process. It allows you to enter a list of values the user can pick from during review. Labeled OMR has special interactivity with the List Values property. If you do not configure the Label Extractor, Grooper will check to see if any List Values have been entered. If so, it will attempt to match the items in the List Values list as the OMR labels.
- This could be a knock-on benefit in that you might want to configure a List Values list for OMR fields regardless to make your document reviewer's work easier (and potentially more accurate).
- What is a group of checkboxes but a list of values to select from on a document? If human review is part of your process, you might already be using List Values to give your document reviewers a selection list of checkbox options. If those List Values happen to match the OMR labels, great! There's no need to configure a Label Extractor for Labeled OMR in that case. Two birds. One property configuration.
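The fallback described above (use the Label Extractor if it is configured, otherwise try the List Values) can be pictured as a simple decision. The sketch below is an assumption-laden illustration: `resolve_omr_labels` is a hypothetical function, both inputs are plain lists of strings, and Label Sets are deliberately left out because this section does not describe where they fall in the order.

```python
def resolve_omr_labels(label_extractor_labels, list_values):
    """Pick which list of strings to treat as the OMR labels.

    Models only the behavior described in this section: a configured
    Label Extractor is used first; if there is none, any List Values
    entered on the Data Field are tried as the labels.
    """
    if label_extractor_labels:
        return list(label_extractor_labels)
    if list_values:
        return list(list_values)
    return []  # nothing configured, so there are no labels to match


# No Label Extractor configured, but the Data Field has List Values.
print(resolve_omr_labels([], ["Yes", "No"]))  # ['Yes', 'No']
```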
|
|
|
|
|
|
|
|
|
|
|
|
|
| ⚠ |
This method's strength and its limitation both lie in its simplicity. It will not work for every situation.
|
Using Label Sets
Harness the power of Label Sets. Simple setup. Easy output translation.
Grooper's Label Set functionality provides powerful document extraction and classification capabilities by leveraging the prevalence and utility of field labels. Labeled OMR is a "Label Set aware" extractor. OMR labels can be collected for a Data Field's set of labels and used at time of extraction in place of a Label Extractor.
If you are using Label Sets in your solution, this approach will most likely be the one for you. The setup is fairly simple, and translating/formatting your output is a breeze.
- This article presumes you have some awareness of Label Sets and the Labeling Behavior. For more information, please visit the Label Sets article.
|
|
|
|
|
|
|
|
|
|
|
| ⚠ |
There is an order of operations if you configure multiple OMR label extraction/collection methods.
|
Maximum Noise
The concept of character noise is important to how Grooper isolates and filters out OMR label groups. A noise character is any alphanumeric character (not punctuation characters) that falls between OMR labels. Typically, OMR labels are grouped close together on a document with little to no other text between the labels. Grooper will filter out label matches with large numbers of characters between them.
|
For example, take these checkboxes using the labels "True" and "False". Yes, the labels are nearby checkboxes, but those same labels exist in the sentences to the right. How does Grooper distinguish between the OMR labels and those same words otherwise popping up on the document? Noise. |
|
|
First, Grooper will draw a box around the boundaries of each OMR label. |
|
|
Then, Grooper will drop out the labels and count the characters remaining within each boundary. These remaining characters are "noise". Grooper will count each character (alphanumeric only) to establish a noise count between potential OMR labels in a group. |
|
|
In cases where there are multiple label hits on the document, Grooper will use whichever OMR label group has the least amount of noise. |
The Maximum Noise property dictates how much noise is allowable between OMR labels in a group. By default, only 5 noise characters are allowed. This means if there are more than 5 noise characters between your OMR labels, Grooper will always toss out your OMR labels. However, this can be adjusted.
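The noise logic can be approximated on plain text. The sketch below is a one-dimensional simplification made for illustration: Grooper works with label boundaries on the page, whereas this example only counts alphanumeric characters between labels in a string. The function names are hypothetical, and only the default of 5 comes from the Maximum Noise property described above.

```python
def noise_count(text, labels):
    """Count alphanumeric characters that fall between consecutive labels.

    Labels are searched for in order. Returns None if any label is missing.
    """
    spans = []
    position = 0
    for label in labels:
        start = text.find(label, position)
        if start == -1:
            return None
        end = start + len(label)
        spans.append((start, end))
        position = end

    noise = 0
    # Look only at the gaps between labels (the labels are "dropped out").
    for (_, previous_end), (next_start, _) in zip(spans, spans[1:]):
        gap = text[previous_end:next_start]
        noise += sum(ch.isalnum() for ch in gap)  # punctuation is ignored
    return noise


def passes_maximum_noise(noise, maximum_noise=5):
    """Keep a label group only if its noise does not exceed the limit."""
    return noise is not None and noise <= maximum_noise


checkboxes = "☐ True ☒ False"
sentence = "It is True that cows moo, never False"

print(noise_count(checkboxes, ["True", "False"]))  # 0
print(noise_count(sentence, ["True", "False"]))    # 16
print(passes_maximum_noise(16))                    # False
print(passes_maximum_noise(16, maximum_noise=16))  # True
```

The same words appear in both strings, but only the checkbox group stays under the default limit. Raising the limit accepts noisier groups, at the cost of letting more false positives through.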
|
|
|
|
|
|
|
|
|
|
|
|
Header Labels
A "Header label" is a label that helps identify fields, sections or columns of information on a document. Header labels help distinguish the type of information listed on a document. They can be a huge help in identifying OMR checkbox labels that are very generic.
For example, "yes or no" style checkboxes are common on many forms, often with a whole series of questions that each expect a "Yes" or "No" response. On several (if not all) voter registration applications in the United States, you will see two questions.
- Are you a US Citizen?
- Will you be over 18 at the time of the election?
|
Often these are styled as OMR checkboxes with "yes or no" checkboxes next to them. The problem is the OMR labels are the same for both fields ("YES" and "NO"). How is Grooper going to disambiguate between the fields? How will it know if you want to collect a value for one question or the other? |
|||
|
A Header Label will do the trick! In cases where you have multiple OMR label hits, Grooper will use the labels with the least amount of noise. So, if we gave the Labeled OMR extractor a Header Label matching the text |
|||
|
You can establish a Header Label in one of two ways:
- Using the Labeled OMR extractor's Header Extractor property.
- Collecting an OMR Data Field's Header label when using Label Sets.
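One way to picture how a Header Label disambiguates repeated "Yes/No" groups is to measure the noise between the header text and each candidate group, then keep the group with the least. The sketch below does this on a plain string. It is an assumption about the general idea, not a description of Grooper's internals, and `pick_group` and its helpers are hypothetical names.

```python
def count_noise(fragment):
    """Count alphanumeric characters in a fragment of text."""
    return sum(ch.isalnum() for ch in fragment)


def find_all(text, needle):
    """Return the start index of every occurrence of `needle` in `text`."""
    positions = []
    start = text.find(needle)
    while start != -1:
        positions.append(start)
        start = text.find(needle, start + len(needle))
    return positions


def pick_group(text, header, first_label):
    """Return the 0-based index of the label occurrence closest (by noise) to the header."""
    header_start = text.find(header)
    if header_start == -1:
        return None
    header_end = header_start + len(header)

    best_index, best_noise = None, None
    for index, position in enumerate(find_all(text, first_label)):
        if position >= header_end:
            between = text[header_end:position]
        else:
            between = text[position + len(first_label):header_start]
        noise = count_noise(between)
        if best_noise is None or noise < best_noise:
            best_index, best_noise = index, noise
    return best_index


form = (
    "Are you a US Citizen? ☒ Yes ☐ No\n"
    "Will you be over 18 at the time of the election? ☐ Yes ☒ No"
)

print(pick_group(form, "Are you a US Citizen?", "Yes"))  # 0
print(pick_group(form, "Will you be over 18 at the time of the election?", "Yes"))  # 1
```

Each header selects the "Yes/No" group sitting right next to it, because every other group has at least some text between it and the header.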
Using the Header Extractor Property
|
|
|
|
|
|
|
|
|
|
|
|
|
Using Label Sets
|
If you're using Label Sets, all you need to do is collect a Data Field's Header label for your Document Types.
|
|
|
|















































