2023:Labeled OMR (Value Extractor): Difference between revisions

Revision as of 08:18, 23 October 2023

WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Labeled OMR is an extractor used to output OMR checkbox labels. It determines whether labeled checkboxes are checked or not. If checked, it outputs the label(s) as the result.

About

You may download and import the file below into your own Grooper environment (version 2021). This contains a Batch with the example document(s) discussed in this tutorial and a Content Model configured according to its instructions.

Media:Wiki - Labeled OMR - v2021.zip
Media:Wiki - Labeled OMR (Label Sets) - v2021.zip
- This contains a Batch and Content Model that use Label Sets. Import this if your license has Label Sets enabled and you want to review Labeled OMR's Label Set aware functionality.

Documents use checkboxes to make our lives easier. They are particularly prevalent on structured forms. They give the person filling out the form the ability to just check a box next to a series of options rather than typing in the information.

However, most of Grooper's extraction centers around regular expression, matching text patterns and returning the result. There isn't necessarily a character to match a checked checkbox. Regular expression isn't going to cut it to determine if a box is checked or not.

This is where OMR comes into play. OMR stands for "Optical Mark Recognition". OMR determines checkbox states. The basic idea behind it is very simple. First, find a box. A box is just four lines connected to each other in a square-like fashion. If that box has a mark of some kind inside it, it is checked. If not, it's not. Checked (or marked) boxes, whether a checked "x" (☒), a checkmark (☑), or a check block (▣), will have more black pixels inside the box than an unchecked (or unmarked) one (☐). If the proportion of black pixels inside the detected box exceeds a threshold, it's checked (or marked). If not, it's unchecked (or unmarked).
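The pixel-counting idea can be sketched in a few lines of Python. This is an illustration only, not Grooper's implementation: the function names, the 0/1 pixel grid, and the 20% fill threshold are all assumptions chosen for the example.

```python
def fill_ratio(box):
    """Return the fraction of black pixels (1s) in a 2D grid of 0/1 values."""
    total = sum(len(row) for row in box)
    black = sum(sum(row) for row in box)
    return black / total if total else 0.0


def is_checked(box, threshold=0.2):
    """Treat the box interior as checked when enough of its pixels are black.

    The 0.2 threshold is a hypothetical value picked for this sketch.
    """
    return fill_ratio(box) >= threshold


# Interior of an empty box: no black pixels at all.
empty_box = [[0] * 5 for _ in range(5)]

# Interior of a box marked with an "x": black pixels on both diagonals.
x_box = [[1 if c == r or c == 4 - r else 0 for c in range(5)] for r in range(5)]

print(is_checked(empty_box))
print(is_checked(x_box))
```

A real page image would also need the box itself located first and its border excluded from the count; the sketch starts from the interior pixels to keep the thresholding step visible.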

A simple example would be a document asking a question and giving two boxes to check “Yes” or “No.” For example, see the portion of the document below asking if the applicant is a U.S. Citizen. “Yes” or “No” would be the labels. Either “Yes” or “No” would be the field's final result, depending on which box is checked. In this case, "Yes".

In general, what you want to extract is the text of the checked label. The Labeled OMR extractor allows you to do just that.

First, you will set up an extractor to locate the text labels.

Or, use Label Sets to locate the text labels.

Then, Grooper's OMR detection will determine if there is a box next to the label, and whether or not that box is checked.

Last, if the label is checked, the label is returned as the extractor's result.
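As a rough sketch of that flow, assuming the labels have already been located and each neighboring checkbox state has already been determined: the `LabelHit` type and `checked_labels` function below are hypothetical names for illustration, and the checkbox states are hard-coded where Grooper would detect them from the image.

```python
from dataclasses import dataclass


@dataclass
class LabelHit:
    text: str          # label text located on the page
    box_checked: bool  # state of the checkbox found next to the label


def checked_labels(hits):
    """Return the text of every label whose neighboring box is checked."""
    return [hit.text for hit in hits if hit.box_checked]


# The "U.S. Citizen" example: the "Yes" box is checked, the "No" box is not.
hits = [LabelHit("Yes", True), LabelHit("No", False)]
print(checked_labels(hits))
```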

FYI

Labeled OMR has multiple extraction modes depending on how checkboxes behave on the document. There is also a Boolean mode to simply output "True" or "False" if a single checkbox is checked or not. We will discuss the different extraction modes further in the #How To section of this article.

How To

Assign the Extractor

The Labeled OMR extractor can be utilized in two ways:

As a Value Reader's extractor type.
As an object's extractor property configuration. For example:
- As a Data Field's Value Extractor property's extractor configuration.
- As a Data Type's Local Extractor property's extractor configuration.
- As a Document Type's Positive Extractor property's extractor configuration.
- And more!


Value Reader

The Labeled OMR extractor is one of the extractor types available to the Value Reader extractor object.

This is a Value Reader object.
For its Extractor Type property, Labeled OMR is selected.
The Labeled OMR extractor's configuration is set up in this property grid.

This Labeled OMR extractor is configured to determine which box is checked of three options on the document (DOMESTIC, FARM, or SHOW).
The Value Reader returns the label with a checked box next to it (DOMESTIC on this document).

Extractor Property

You may also configure a Labeled OMR extractor when configuring an extractor property. Many Grooper objects have some kind of extractor property in their property grids. Labeled OMR is one of the options that can be selected as the extractor type.

For example, Data Field objects have a Value Extractor property, which collects a result when the Data Model is extracted during the Extract activity.

This is a Data Field object.
For its Value Extractor property, Labeled OMR is selected.
The Labeled OMR extractor's configuration is set up in the Value Extractor property's sub-property grid, OR can be configured using the Extractor Editor window by pressing the ellipsis button by the Value Extractor property.

This Labeled OMR extractor is configured to determine which box is checked of three options on the document (DOMESTIC, FARM, or SHOW).
When the Data Field is collected during the Extract activity, the label with a checked box next to it (DOMESTIC on this document) is returned.

Configure the Extractor Part 1: OMR Labels

The first part of the Labeled OMR extractor's configuration is label extraction. Labels can be collected in one of three ways:

Using the Label Extractor property.
Using the List Values settings of a Data Field.
Collecting labels for the OMR labels when using Label Sets.
- When we get to this point, this article will presume you have some familiarity with Label Sets and the Labeling Behavior functionality. For more information on Label Sets please visit the Label Sets article.

To illustrate this, we will configure extraction for a single Data Field, detailing how each of these three different methods get a result.

Our example document is an "Application For Cow Ownership" form.
This form lists the "Type of Cow Applied for" using checkboxes
- Either "DOMESTIC", "FARM", or "SHOW".
We will use this Data Field named "13. Type of Cow" to collect the choice indicated on the document.
- We have assigned the Data Field's Value Extractor to Labeled OMR.

FYI

Be aware the Mode property is very important to Labeled OMR's configuration.

For the time being, we will use the default CheckOne mode. This mode presumes that, of multiple labeled boxes, only one may be checked. We will discuss the different OMR modes in the #Configure the Extractor Part 2: OMR Modes section of this article.

At this point, the Labeled OMR extractor is totally unconfigured. Next, we will detail each of the three different ways to extract OMR labels. While the configuration is slightly different, the goal is the same: Locate text labels next to checkboxes. Each method has its own strengths and weaknesses, giving you flexibility in how you locate the OMR labels based on your documents' circumstances.


Using the Label Extractor

Moderate to high level of work up front. High flexibility in configuration options.

One way to locate OMR labels is by configuring the Labeled OMR extractor's Label Extractor property. In some ways, this is the most "effort intensive" of the three options. It will require you to configure an extractor to return each of the labels for the set of OMR checkboxes. This means a lot of manual configuration of property grids and/or external extractor objects, depending on the complexity of your documents.

However, it is also extremely reliable with a huge amount of flexibility. Since you configure an extractor to return the labels, you have all the extraction tools available to Grooper's suite of extraction types and extraction logic.

When other methods can't get the job done, configuring the Label Extractor property will be your go-to method to locate OMR labels.

For this method OMR labels are located using an extractor's results.

This extractor is configured using the Label Extractor property.
Select the extractor type you wish to configure using the dropdown list.
- Most often, you will use the List Match extractor to return labels next to OMR checkboxes. We will select List Match for this exercise.
- However, you can use whatever extractor types and techniques you choose, as long as the extractor's end results are your OMR labels.

We will configure the extractor in the Extractor Editor by pressing the ellipsis button at the end of the Label Extractor property.

Regardless of the specific extractor you choose to configure, your goal will be the same. Return one result for each individual label in the group of checkboxes.
- These will supply Labeled OMR with data instances that should have checkboxes nearby. Then, Grooper will look for checkboxes around the data instances, determine which ones are checked, and return whichever data instance has a checked box next to it.
- In our case, we're wanting to return the labels "DOMESTIC (Home)" "FARM (Agriculture)" and "SHOW (Beauty)".

Essentially, we want to return a list of OMR labels. The List Match extractor is well-suited for this task.

Using the Local Entries list, type the list of OMR labels.
- DOMESTIC (Home)
- FARM (Agriculture)
- SHOW (Beauty)
Ensure the extractor returns data instances next to the OMR checkboxes on the document.
The extractor should return one result for each individual label.

FYI

It is generally preferable to capture the full label when possible.

For example, we would want to collect "DOMESTIC (Home)" rather than "DOMESTIC". We can always translate our result's output later (which we will do shortly).

Why is this important? It has to do with how Grooper isolates and filters label groups using "noise characters". We will discuss this further in the #Maximum Noise section of this article.

That's it! Grooper will now analyze the pixels around the labels' data instances, determine if anything around them is a checkbox, determine the checkbox states (checked or not checked), and return the data instance (in other words, the OMR label) next to a checked box.

Now that the Label Extractor is configured, the Labeled OMR extractor can find OMR labels. Grooper determined there was a checked box next to the label DOMESTIC (Home).
That label is returned, populating the Data Field.

Tips and Tricks: Translating Output

Often, it is the case the label on the document is not exactly what you want to collect for your data set. You may want to adjust the output value in one way or another. For example, you may want to collect the value "FARM" instead of the full text "FARM (Agriculture)".

This is easily done when using the List Match extractor for your Label Extractor.

We're going to accomplish this by editing our Label Extractor's List Match configuration, using the Extractor Editor window.
Navigate to the "Properties" tab.
Under the Output properties, change the Translate property to True.

List translations are made with the equals sign (=), using the following syntax:

Extracted Value=Translated Output

We've translated our OMR labels to omit the text in parentheses.
- DOMESTIC (Home)=DOMESTIC
- FARM (Agriculture)=FARM
- SHOW (Beauty)=SHOW
We still match the full result, meaning we capture the full label for the data instance.
- This is ideal for OMR detection. Grooper will have an easier time "figuring out" where checkboxes are in context to a label if it has the full label to work with.
However, the output is translated to the format we want.

This gives us the best of both worlds!

Grooper finds the label next to the checked box, and returns the formatted value we want.
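To make the match-then-translate behavior concrete, here is a small Python sketch. It is not how Grooper parses its list; the function names are hypothetical, and splitting each entry at the first equals sign is an assumption made for the example.

```python
def parse_translations(entries):
    """Build a lookup from 'Extracted Value=Translated Output' entries.

    Assumption: each entry is split at its first '='. An entry with no '='
    simply translates to itself.
    """
    table = {}
    for entry in entries:
        matched, sep, output = entry.partition("=")
        table[matched.strip()] = output.strip() if sep else matched.strip()
    return table


def translate(matched_label, table):
    """Return the translated output for a matched label, or the label itself."""
    return table.get(matched_label, matched_label)


entries = [
    "DOMESTIC (Home)=DOMESTIC",
    "FARM (Agriculture)=FARM",
    "SHOW (Beauty)=SHOW",
]
table = parse_translations(entries)

# The full label is what gets matched on the page; the short form is the output.
print(translate("FARM (Agriculture)", table))
```

The point of the two-step shape is that the full label stays available for locating the checkbox, while only the final output is shortened.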

Using a Data Field's List Values

A simple solution for the most simple cases.

This next method is fantastic... if it works for you. It is extremely simple to set up, but has the most limitations. However, for straightforward OMR extraction, it is highly effective with little setup involved.

This method uses a Data Field's List Values settings to function. Typically, the List Values property is configured to aid in human review during a Review step in a Batch Process. It allows you to enter a list of values the user can pick from during review. Labeled OMR has special interactivity with the List Values property. If you do not configure the Label Extractor, Grooper will check to see if any List Values have been entered. If so, it will attempt to match the items in the List Values list as the OMR labels.

This could be a knock-on benefit in that you might want to configure a List Values list for OMR fields regardless to make your document reviewer's work easier (and potentially more accurate).
What is a group of checkboxes but a list of values to select from on a document? If human review is part of your process, you might be using List Values to give your document reviewers a selection list of checkbox options anyway. If it turns out those List Values match the OMR labels anyway, great! There's no need to configure a Label Extractor for Labeled OMR in that case. Two birds. One property configuration.

For this method, OMR labels are collected using the List Values settings of a Data Field.

First, select the Data Field.
Using the Value Extractor property, select Labeled OMR.
Scroll down to the bottom of the Data Field's property grid and expand the List Values property.
Select the Local Entries property and press the ellipsis button at the end.
- This will bring up a List Editor window.
The goal here is to make a list of the OMR labels. Type each OMR label on its own line in the List Editor.
With the Labeled OMR extractor's Label Extractor property left unconfigured, Grooper will use the List Values in its place.
Grooper determined there was a checked box next to the label `FARM (Agriculture)`.
- That label is returned, populating the Data Field.
As an added benefit for a document review step, the normal functionality of the List Values property is also implemented.
- The review user will be presented with a dropdown selection list containing the OMR labels you added to the List Values list.

⚠

This method's strength and limitation lies in its simplicity. It will not work for every situation.

If you cannot extract an OMR label by matching it to a simple list item (and instead must rely on more advanced extractors or techniques to match an OMR label), the List Values method will not work.
If you need to translate an OMR label to a different output value, the List Values method is inferior. A Label Extractor or Label Sets is better suited for formatting OMR labels to the value you want to output.

Using Label Sets

Harness the power of Label Sets. Simple set up. Easy output translation.

Grooper's Label Set functionality provides powerful document extraction and classification capabilities by leveraging the prevalence and utility of field labels. Labeled OMR is a "Label Set aware" extractor. OMR labels can be collected for a Data Field's set of labels and used at time of extraction in place of a Label Extractor.

If you are using Label Sets in your solution, this approach will most likely be the one for you. The setup is fairly simple, and translating/formatting your output is a breeze.

This article presumes you have some awareness of Label Sets and the Labeling Behavior. For more information, please visit the Label Sets article.

For this method, OMR labels are located using labels collected in a Document Type's Label Set. The only "trick" to this method is giving Grooper some additional information so that you can collect labels for the OMR values.

We've already enabled the Labeling Behavior on our Content Model.
We've navigated to the "Labels" tab to collect labels for our Data Field.
The "13. Type of Cow" Data Field is as of yet unconfigured.
Notice we only have three label types available for capture.
- The Header label.
- The Footer label.
- The Static label.

Which of these do we use to capture our three OMR labels ("DOMESTIC (Home)", "FARM (Agriculture)" and "SHOW (Beauty)")? None of them! Keep in mind OMR fields are going to behave differently on a document. We have three labels here. Another document might have ten. Or twenty. Or just a single yes/no style checkbox. The label types we have here just aren't going to cut it.

Grooper doesn't know how many checkboxes and corresponding labels you have until you give it some more information. We need to collect individual labels for each checkbox. We must define each OMR label first using the Data Field's List Values property.

Select the Data Field.
Scroll to the bottom of the Data Field's property grid and expand the List Values property.
Select the Local Entries property and press the ellipsis button at the end.
- This will bring up a List Editor window.

The difference between using Label Sets and using List Values alone is you can enter whatever you want for the OMR labels' values here. Imagine we have a requirement in whatever backend system stores these values. Each result can only be "DMST", "FARM" or "SHOW". Well, "DMST" doesn't even appear on the document. But it doesn't matter. Grooper isn't going to use these values to match against the document. They merely provide a list of values for which we will capture a label later.

Now, we will be able to collect labels for each item we've added to the list.

Navigate back to the Content Model.
Navigate to the "Labels" tab.
Select the document assigned the Document Type whose labels you want to collect.
Select the Data Field whose OMR values you've added to its List Values list.
Check it out! We now have labels we can collect.
- You are able to capture labels for each value added to the list.
Capture the full label for the corresponding OMR value.
- Grooper will use the captured label to locate the checkbox and determine its state, but output the value in the List Values list.

All that's left is to assign the Data Field's extractor.

Select the Data Field in the Node Tree.
Change the Value Extractor property to Labeled OMR.
- The extractor will use the collected labels in the Label Set to match the OMR labels, find nearby checkboxes, and determine their check states.
- The output value will be the corresponding value we entered in the List Values list for the label.


⚠

There is an order of operations if you configure multiple OMR label extraction/collection methods.

Label Extractor
- If the Label Extractor property is configured, it always takes priority and will be used to locate OMR labels.
Label Sets
- If the Label Extractor property is not configured, collected labels in the Document Type's Label Set will take priority.
List Values
- If the Label Extractor property is not configured, and no labels are collected, the List Values list items will be used to locate OMR labels.
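The priority order above amounts to a simple fall-through. A minimal Python sketch of it follows; the function name, its parameters, and the returned strings are hypothetical and exist only to show the order of the checks.

```python
def choose_label_source(label_extractor_configured, label_set_labels, list_values):
    """Pick which source supplies the OMR labels, following the order above."""
    if label_extractor_configured:
        return "Label Extractor"   # always wins when configured
    if label_set_labels:
        return "Label Sets"        # used when no Label Extractor is configured
    if list_values:
        return "List Values"       # last resort
    return None                    # nothing configured: no OMR labels to locate


# A Data Field with List Values entered and labels collected for them,
# but no Label Extractor configured.
print(choose_label_source(False, ["FARM (Agriculture)"], ["FARM"]))
```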

Maximum Noise

The concept of character noise is important to how Grooper isolates and filters out OMR label groups. A noise character is any alphanumeric character (not punctuation characters) that falls between OMR labels. Typically, OMR labels are grouped close together on a document with little to no other text between the labels. Grooper will filter out label matches with large numbers of characters between them.

For example, take these checkboxes using the labels "True" and "False". Yes, the labels are near checkboxes, but those same labels exist in the sentences to the right. How does Grooper distinguish between the OMR labels and those same words otherwise popping up on the document? Noise.
First, Grooper will draw a box around the boundaries of each OMR label.
Then, Grooper will drop out the labels and count the characters remaining within each boundary. These remaining characters are "noise". Grooper will count each character (alphanumeric only) to establish a noise count between potential OMR labels in a group.
In cases where there are multiple label hits on the document, Grooper will use whichever OMR label group has the least amount of noise.

The Maximum Noise property dictates how much noise is allowable between OMR labels in a group. By default, only 5 noise characters are allowed. This means if there are more than 5 noise characters between your OMR labels, Grooper will always toss out your OMR labels. However, this can be adjusted.

Let's go back to our previous example and take another look at the Label Extractor we used to locate OMR labels for the "Type of Cow Applied for" field. We used a List Match extractor to match these OMR labels. Remember, we said it was best practice to capture the full label when possible. Let's go against our own best practice advice and only capture part of the OMR labels, i.e. `DOMESTIC` instead of `DOMESTIC (Home)`.
If we try to extract the field at this point, we will fail. No value is extracted. Why? Noise! We have 15 noise characters between "FARM" and "SHOW".
The Maximum Noise property's default value is 5. This means there can be only 5 noise characters between the labels before Grooper tosses out the match.
This property is, however, adjustable to allow for a higher noise threshold. We have 15 noise characters. So, if we change the Maximum Noise property to 15...
The OMR value extracts.
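The effect of capturing partial labels can be approximated in Python. This is a deliberately simplified, one-dimensional sketch: it treats the label group as a single line of text and counts alphanumeric characters in the gaps between labels, whereas Grooper works with label boundaries on the page. The function names and the sample line are assumptions for illustration.

```python
def noise_count(text, labels):
    """Count alphanumeric characters left in the gaps between the labels."""
    spans = []
    for label in labels:
        start = text.find(label)
        if start == -1:
            return None  # a label is missing, so there is no group to measure
        spans.append((start, start + len(label)))
    spans.sort()
    noise = 0
    # Only the text between consecutive labels counts; punctuation is ignored.
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        gap = text[prev_end:next_start]
        noise += sum(ch.isalnum() for ch in gap)
    return noise


def within_maximum_noise(text, labels, maximum_noise=5):
    """Apply a Maximum Noise style cutoff (5 mirrors the default named above)."""
    count = noise_count(text, labels)
    return count is not None and count <= maximum_noise


line = "DOMESTIC (Home) FARM (Agriculture) SHOW (Beauty)"
full = ["DOMESTIC (Home)", "FARM (Agriculture)", "SHOW (Beauty)"]
partial = ["DOMESTIC", "FARM", "SHOW"]

print(noise_count(line, full))     # full labels leave nothing between them
print(noise_count(line, partial))  # "Home" and "Agriculture" are left as noise
print(within_maximum_noise(line, partial))
print(within_maximum_noise(line, partial, maximum_noise=15))
```

In this simplified model, the partial labels fail the default cutoff and pass once the cutoff is raised, which is the same trade-off the Maximum Noise property exposes.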

Header Labels

A "Header label" is a label that helps identify fields, sections or columns of information on a document. Header labels help distinguish the type of information listed on a document. They can be a huge help in identifying OMR checkbox labels that are very generic.

For example, "yes or no" style checkboxes are common on many forms, often with multiple questions asking "Yes" or "No" in response to a series of questions. On several (if not all) voter registration applications in the United States, you will see two questions.

Are you a US Citizen?
Will you be over 18 at the time of the election?

Often these are styled as OMR checkboxes with "yes or no" checkboxes next to them.

The problem is the OMR labels are the same for both fields ("YES" and "NO"). How is Grooper going to disambiguate between the fields? How will it know if you want to collect a value for one question or the other?

A Header Label will do the trick! In cases where you have multiple OMR label hits, Grooper will use the labels with the least amount of noise.

So, if we gave the Labeled OMR extractor a Header Label matching the text Over 18?, it would utilize the second set of "YES/NO" OMR labels.

FYI

Header labels can be vertically aligned or horizontally aligned with the OMR labels, depending on your document's structure.

You can establish a Header Label in one of two ways:

Using the Labeled OMR extractor's Header Extractor property.
Collecting an OMR Data Field's Header label when using Label Sets.


Using the Header Extractor Property

Imagine we want to collect the question answered here. This is one of many "yes or no" style checkboxes on this document. We can use the following Header Label to help narrow down which group of "Yes/No" checkboxes we want to return. `a. Has the applicant completed the Cow Operating Training Program accredited by the National Bovine Accrediting Board?`
Without a Header Label, we're not collecting the correct value. This "Yes/No" checkbox pertains to some different value.
We will configure the Header Extractor to filter out false positive label groups. You can use whatever extraction method works best for your document set. We will use the List Match extractor (which is generally the most common for matching labels like this).
We've entered the full label identifying the "Yes/No" field we want to extract. Now the Labeled OMR extractor will have some additional context and use this as an anchor for the OMR label groups.
The Header Extractor's result will be required to find the OMR Labels. This will always be outlined in blue on the document. The nearest group of OMR labels will be used to return a value. Ultimately, this gets us the result we want, the "Yes" or "No" OMR value labeled by our Header Label.
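One way to picture the "nearest group" step is as a distance comparison between the header's position and each candidate label group. The Python sketch below assumes each item is reduced to a single (x, y) point and uses straight-line distance; both are simplifications for illustration, and the names and coordinates are hypothetical rather than taken from Grooper.

```python
import math


def nearest_group(header_pos, groups):
    """Return the label group whose position is closest to the header."""
    return min(groups, key=lambda g: math.dist(header_pos, g["pos"]))


# Two "Yes/No" groups on the same page, each tied to a different question.
groups = [
    {"question": "other question", "pos": (500, 120), "checked": "No"},
    {"question": "training program", "pos": (500, 300), "checked": "Yes"},
]

# Position of the matched Header Label for the training program question.
header_pos = (60, 300)

winner = nearest_group(header_pos, groups)
print(winner["checked"])
```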

Using Label Sets

If you're using Label Sets, all you need to do is collect a Data Field's Header label for your Document Types.

Navigate to the "Labels" UI.
Select a sample of the Document Type whose labels you want to collect.
Select the OMR Data Field.
Select the Header tab.
Collect the label from the document.
You must also create OMR values using the Data Field's List Values property and collect labels for them before proceeding.

The collected Header labels will be used in place of the Header Extractor.
The Header label will be required to find the OMR labels.
- This will always be outlined in blue on the document.
The nearest group of OMR labels will be used to return a value.
Ultimately, this gets us the result we want, the "Yes" or "No" OMR value labeled by our Header Label.

@@ Line 462: / Line 462: @@
 # Remember, we said it was best practice to capture the full label when possible.
 # Let's go ''against'' our own best practice advice and only capture part of the OMR labels.
-#* i.e. <code>FARM</code> instead of <code>FARM (Agriculture)</code>
+#* i.e. <code>DOMESTIC</code> instead of <code>DOMESTIC(AGRICULTURE)</code>
 |valign=top|
-[[File:2021-labeled-omr-how-to-22.png]]
+[[File:2023 Labeled OMR - 2023 03 How To 01 Maximum Noise 01.png]]
 |-
 |valign=top|
@@ Line 470: / Line 470: @@
 # If we try to extract the field at this point, we will fail.  No value is extracted.
 #* Why?  Noise!
+# We have 15 noise characters between "FARM" and "SHOW".
+|valgin=top|
+[[File:2023 Labeled OMR - 2023 03 How To 01 Maximum Noise 02.png]]
+|-
+|valign=top|
 # The '''''Maximum Noise''''' property's default value is ''5''.
 #* This means there can be only 5 noise characters between the labels before Grooper tosses out the match.
-# We have more than 5 noise characters.  We have 15.
 |valgin=top|
-[[File:2021-labeled-omr-how-to-23.png]]
+[[File:2023 Labeled OMR - 2023 03 How To 01 Maximum Noise 03.png]]
 |-
 |valign=top|
@@ Line 481: / Line 485: @@
 # We have 15 noise characters.  So, if we change the '''''Maximum Noise''''' property to ''15''...
+|valgin=top|
+[[File:2023 Labeled OMR - 2023 03 How To 01 Maximum Noise 04.png]]
+|-
+|valign=top|
 # The OMR value extracts.
 |valing=top|
-[[File:2021-labeled-omr-how-to-24.png]]
+[[File:2023 Labeled OMR - 2023 03 How To 01 Maximum Noise 05.png]]
 |}
@@ Line 536: / Line 544: @@
 #* <code>a. Has the applicant completed the Cow Operating Training Program accredited by the National Bovine Accrediting Board?</code>
 |valign=top|
-[[File:2021-labeled-omr-how-to-25.png]]
+[[File:2023 Labeled OMR - 2023 04 How To 01 Using the Header Extractor Property 01.png]]
 |-
 |valign=top|
@@ Line 542: / Line 550: @@
 # Without a Header Label, we're not collecting the correct value.
 #* This "Yes/No" checkbox pertains to some different value.
+|valign=top|
+[[File:2023 Labeled OMR - 2023 04 How To 01 Using the Header Extractor Property 02.png]]
+|-
+|valign=top|
 # We will configure the '''''Header Extractor''''' to filter out false positive label groups.
 # You can use whatever extraction method works best for your document set.  We will use the '''''List Match''''' extractor (which is generally the most common for matching labels like this).
 |valign=top|
-[[File:2021-labeled-omr-how-to-26.png]]
+[[File:2023 Labeled OMR - 2023 04 How To 01 Using the Header Extractor Property 03.png]]
 |-
 |valign=top|
@@ Line 552: / Line 564: @@
 # Now the '''''Labeled OMR''''' extractor will have some additional context and use this as an anchor for the OMR label groups.
 |valign=top|
-[[File:2021-labeled-omr-how-to-27.png]]
+[[File:2023 Labeled OMR - 2023 04 How To 01 Using the Header Extractor Property 04.png]]
 |-
 |valign=top|
@@ Line 561: / Line 573: @@
 # Ultimately, this gets us the result we want, the "Yes" or "No" OMR value labeled by our Header Label.
 |valign=top|
-[[File:2021-labeled-omr-how-to-28.png]]
+[[File:2023 Labeled OMR - 2023 04 How To 01 Using the Header Extractor Property 05.png]]
 |}
 </tab>