Labeled OMR - 2021

From Grooper Wiki
An example of checkboxes.

Labeled OMR is an extractor used to output OMR checkbox labels. It determines whether labeled checkboxes are checked or not. If checked, it outputs the label(s) as the result.

About


You may download and import the file below into your own Grooper environment (version 2021). This contains a Batch with the example document(s) discussed in this tutorial and a Content Model configured according to its instructions.


Documents use checkboxes to make our lives easier. They are particularly prevalent on structured forms. They give the person filling out the form the ability to just check a box next to a series of options rather than typing in the information.

However, most of Grooper's extraction centers around regular expressions, matching text patterns and returning the result. There isn't necessarily a character to match a checked checkbox. Regular expressions aren't going to cut it to determine if a box is checked or not.

This is where OMR comes into play. OMR stands for "Optical Mark Recognition". OMR determines checkbox states. The basic idea behind it is very simple. First, find a box. A box is just four lines connected to each other in a square-like fashion. If that box has a mark of some kind inside it, it is checked. If not, it's not. Checked (or marked) boxes, whether marked with an "x", a checkmark, or a filled block, will have more black pixels inside the box than an unchecked (or unmarked) one. If the number of black pixels inside the detected box exceeds a threshold, it's checked (or marked). If not, it's unchecked (or unmarked).
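The pixel-counting idea can be sketched in a few lines of Python. This is only an illustration, not Grooper's implementation: the `fill_ratio` and `is_checked` helpers, the 0/1 pixel grid, and the 0.15 threshold are all assumptions chosen for the example.

```python
# Illustrative sketch only; not Grooper's actual OMR implementation.
# A box interior is modeled as a grid of pixels: 1 = black, 0 = white.

def fill_ratio(box_pixels):
    """Return the fraction of black pixels inside the box interior."""
    total = sum(len(row) for row in box_pixels)
    if total == 0:
        return 0.0
    black = sum(sum(row) for row in box_pixels)
    return black / total


def is_checked(box_pixels, threshold=0.15):
    """Treat the box as checked when enough of its interior is black.

    The 0.15 threshold is a hypothetical value for this example.
    """
    return fill_ratio(box_pixels) >= threshold


empty_box = [[0] * 5 for _ in range(5)]
x_marked_box = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]

print(is_checked(empty_box))     # False
print(is_checked(x_marked_box))  # True
```

An "x", a checkmark, and a filled block all raise the ratio of black pixels, which is why a single threshold can cover every mark style.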

A simple example would be a document asking a question and giving two boxes to check, “Yes” or “No”. For example, see the portion of the document below asking if the applicant is a U.S. Citizen. “Yes” and “No” would be the labels. Either “Yes” or “No” would be the field's final result, depending on which box is checked. In this case, the result is "Yes".

1573055869908-200.png
Labeled-omr-about-02.png

In general, what you want to extract is the text of the checked label. The Labeled OMR extractor allows you to do just that.

First, you will set up an extractor to locate the text labels.

2021-labeled-omr-about-03.png

Then, Grooper's OMR detection will determine if there is a box next to the label, and whether or not that box is checked.

2021-labeled-omr-about-04.png

Last, if the label is checked, the label is returned as the extractor's result.

2021-labeled-omr-about-05.png
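The three steps above (locate labels, find and read the nearby box, return the checked label) can be sketched as follows. The `(label_text, box_state)` pairs and the `checked_label` function are hypothetical stand-ins for illustration; they are not Grooper objects or APIs.

```python
# Illustrative sketch only; the data shapes here are hypothetical.

def checked_label(label_hits):
    """Return the first label whose nearby box is checked, or None.

    Each hit is a (label_text, box_state) pair, where box_state is
    True (checked), False (unchecked), or None (no box found nearby).
    """
    for label_text, box_state in label_hits:
        if box_state is True:
            return label_text
    return None


# Step 1: labels located by an extractor. Step 2: box states from OMR.
hits = [("Yes", True), ("No", False)]

# Step 3: the checked label becomes the result.
print(checked_label(hits))  # Yes
```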

FYI

Labeled OMR has multiple extraction modes depending on how checkboxes behave on the document. There is also a Boolean mode to simply output "True" or "False" depending on whether a single checkbox is checked. We will discuss the different extraction modes further in the "How To" section of this article.

How To

Assign the Extractor

The Labeled OMR extractor can be utilized in two ways:

  1. As a Value Reader's extractor type.
  2. As an object's extractor property configuration. For example:
    • As a Data Field's Value Extractor property's extractor configuration.
    • As a Data Type's Local Extractor property's extractor configuration.
    • As a Document Type's Positive Extractor property's extractor configuration.
    • And more!

Value Reader

The Labeled OMR extractor is one of the extractor types available to the Value Reader extractor object.

  1. This is a Value Reader object.
  2. For its Extractor Type property, Labeled OMR is selected.
  3. The Labeled OMR extractor's configuration is set up in this property grid.
  4. This Labeled OMR extractor is configured to determine which box is checked of three options on the document (DOMESTIC, FARM, or SHOW).
  5. The Value Reader returns the label with a checked box next to it (FARM on this document).

2021-labeled-omr-how-to-01.png

Extractor Property

You may also configure a Labeled OMR extractor when configuring an extractor property. Many Grooper objects have some kind of extractor property in their property grids. Labeled OMR is one of the options that can be selected as the extractor type.

For example, Data Field objects have a Value Extractor property, which collects a result when the Data Model is extracted during the Extract activity.

  1. This is a Data Field object.
  2. For its Value Extractor property, Labeled OMR is selected.
  3. The Labeled OMR extractor's configuration is set up in the Value Extractor property's sub-property grid, OR can be configured using the Extractor Editor window by pressing this ellipsis button.
  4. This Labeled OMR extractor is configured to determine which box is checked of three options on the document (DOMESTIC, FARM, or SHOW).
  5. When the Data Field is collected during the Extract activity, the label with a checked box next to it (FARM on this document) is returned.

2021-labeled-omr-how-to-02.png

Configure the Extractor Part 1: OMR Labels

The first part of the Labeled OMR extractor's configuration is label extraction. Labels can be collected in one of three ways:

  • Using the Label Extractor property.
  • Using the List Values settings of a Data Field.
  • Collecting labels for the OMR labels when using Label Sets.
    • When we get to this point, this article will presume you have some familiarity with Label Sets and the Labeling Behavior functionality. For more information on Label Sets please visit the Label Sets article.


To illustrate this, we will configure extraction for a single Data Field, detailing how each of these three methods gets a result.

  1. Our example document is an "Application For Cow Ownership" form.
  2. This form lists the "Type of Cow Applied for" using checkboxes
    • Either "DOMESTIC", "FARM", or "SHOW".
  3. We will use this Data Field named "13. Type of Cow" to collect the choice indicated on the document.
  4. We have assigned the Data Field's Value Extractor to Labeled OMR.


FYI

Be aware the Mode property is very important to Labeled OMR's configuration.

For the time being, we will use the default CheckOne mode. This mode presumes that, of multiple labeled boxes, only one may be checked. We will discuss the different OMR modes in the "Configure the Extractor Part 2: OMR Modes" section of this article.

2021-labeled-omr-how-to-mode-fyi.png

2021-labeled-omr-how-to-03.png

At this point, the Labeled OMR extractor is totally unconfigured. Next, we will detail each of the three different ways to extract OMR labels. While the configuration is slightly different, the goal is the same: Locate text labels next to checkboxes. Each method has its own strengths and weaknesses, giving you flexibility in how you locate the OMR labels based on your documents' circumstances.

Using the Label Extractor

Moderate to high level of work up front. High flexibility in configuration options.

One way to locate OMR labels is by configuring the Labeled OMR extractor's Label Extractor property. In some ways, this is the most "effort intensive" of the three options. It will require you to configure an extractor to return each of the labels for the set of OMR checkboxes. This means a lot of manual configuration of property grids and/or external extractor objects, depending on the complexity of your documents.

However, it is also extremely reliable with a huge amount of flexibility. Since you configure an extractor to return the labels, you have all the extraction tools available to Grooper's suite of extraction types and extraction logic.

When other methods can't get the job done, configuring the Label Extractor property will be your go-to method to locate OMR labels.


For this method OMR labels are located using an extractor's results.

  1. This extractor is configured using the Label Extractor property.
  2. Select the extractor type you wish to configure using the dropdown list.
    • Most often, you will use the List Match extractor to return labels next to OMR checkboxes. We will select List Match for this exercise.
    • However, you can use whatever extractor types and techniques you choose, as long as the extractor's end results are your OMR labels.

2021-labeled-omr-how-to-04.png


  1. We will configure the extractor in the Extractor Editor by pressing the ellipsis button at the end of the Label Extractor property.
  2. Regardless of the specific extractor you choose to configure, your goal will be the same. Return one result for each individual label in the group of checkboxes.
    • These will supply Labeled OMR with data instances that should have checkboxes nearby. Then, Grooper will look for checkboxes around the data instances, determine which ones are checked, and return whichever data instance has a checked box next to it.
    • In our case, we're wanting to return the labels "DOMESTIC (Home)" "FARM (Agriculture)" and "SHOW (Beauty)".

2021-labeled-omr-how-to-05.png


Essentially, we want to return a list of OMR labels. The List Match extractor is well-suited for this task.

  1. Using the Local Entries list, type the list of OMR labels.
    • DOMESTIC (Home)
    • FARM (Agriculture)
    • SHOW (Beauty)
  2. Ensure the extractor returns data instances next to the OMR checkboxes on the document.
  3. The extractor should return one result for each individual label.


FYI

It is generally preferable to capture the full label when possible.

For example, we would want to collect "DOMESTIC (Home)" rather than "DOMESTIC". We can always translate our result's output later (which we will do shortly).

  • Why is this important? It has to do with how Grooper isolates and filters label groups using "noise characters". We will discuss this further in the "Maximum Noise" section of this article.

2021-labeled-omr-how-to-06.png


That's it! Grooper will now analyze the pixels around the labels' data instances, determine if anything around them is a checkbox, determine each checkbox's state (checked or not checked), and return the data instance (in other words, the OMR label) next to a checked box.

  1. Now that the Label Extractor is configured, the Labeled OMR extractor can find OMR labels.
  2. Grooper determined there was a checked box next to the label FARM (Agriculture).
  3. That label is returned, populating the Data Field.

2021-labeled-omr-how-to-07.png

Tips and Tricks: Translating Output

Often, the label on the document is not exactly what you want to collect for your data set. You may want to adjust the output value in one way or another. For example, you may want to collect the value "FARM" instead of the full text "FARM (Agriculture)".

This is easily done when using the List Match extractor for your Label Extractor.


  1. We're going to accomplish this by editing our Label Extractor's List Match configuration, using the Extractor Editor window.
  2. Navigate to the "Properties" tab.
  3. Under the Output properties, change the Translate property to True.

2021-labeled-omr-how-to-08.png


List translations use the equals sign (=) with the following syntax:

Extracted Value=Translated Output


  1. We've translated our OMR labels to omit the text in parentheses.
    • DOMESTIC (Home)=DOMESTIC
    • FARM (Agriculture)=FARM
    • SHOW (Beauty)=SHOW
  2. We still match the full result, meaning we capture the full label for the data instance.
    • This is ideal for OMR detection. Grooper will have an easier time "figuring out" where checkboxes are in context to a label if it has the full label to work with.
  3. However, the output is translated to the format we want.

2021-labeled-omr-how-to-09.png


This gives us the best of both worlds!

  1. Grooper finds the label next to the checked box, and returns the formatted value we want.

2021-labeled-omr-how-to-10.png
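To make the "Extracted Value=Translated Output" syntax concrete, here is a minimal Python sketch of how such a list could be read into a lookup table. It is not Grooper's parser. The `parse_translations` function is hypothetical, and treating an entry without an equals sign as translating to itself is an assumption of this sketch.

```python
# Illustrative sketch only; not how Grooper parses List Match entries.

def parse_translations(entries):
    """Map each matched value to its translated output.

    Assumption for this sketch: an entry without '=' outputs itself.
    """
    table = {}
    for entry in entries:
        matched, sep, output = entry.partition("=")
        matched = matched.strip()
        table[matched] = output.strip() if sep else matched
    return table


entries = [
    "DOMESTIC (Home)=DOMESTIC",
    "FARM (Agriculture)=FARM",
    "SHOW (Beauty)=SHOW",
]
table = parse_translations(entries)

# The full label is what gets matched on the document...
matched_label = "FARM (Agriculture)"
# ...and the translated value is what gets output.
print(table[matched_label])  # FARM
```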

Using a Data Field's List Values

A simple solution for the most simple cases.

This next method is fantastic... if it works for you. It is extremely simple to set up, but has the most limitations. However, for straightforward OMR extraction, it is highly effective with little setup involved.

This method uses a Data Field's List Values settings to function. Typically, the List Values property is configured to aid in human review during a Review step in a Batch Process. It allows you to enter a list of values the user can pick from during review. Labeled OMR has special interactivity with the List Values property. If you do not configure the Label Extractor, Grooper will check to see if any List Values have been entered. If so, it will attempt to match the items in the List Values list as the OMR labels.

  • This could be a knock-on benefit in that you might want to configure a List Values list for OMR fields regardless to make your document reviewer's work easier (and potentially more accurate).
  • What is a group of checkboxes but a list of values to select from on a document? If human review is part of your process, you might be using List Values to give your document reviewers a selection list of checkbox options anyway. If it turns out those List Values match the OMR labels anyway, great! There's no need to configure a Label Extractor for Labeled OMR in that case. Two birds. One property configuration.


For this method, OMR labels are collected using the List Values settings of a Data Field.

  1. First, select the Data Field.
  2. Using the Value Extractor property, select Labeled OMR.

2021-labeled-omr-how-to-11.png


  1. Scroll down to the bottom of the Data Field's property grid and expand the List Values property.
  2. Select the Local Entries property and press the ellipsis button at the end.
  3. This will bring up a List Editor window.
  4. The goal here is to make a list of the OMR labels.
  5. Type each OMR label on one line in the List Editor.

2021-labeled-omr-how-to-12.png


With the Labeled OMR extractor's Label Extractor property left unconfigured, Grooper will use the List Values in its place.

  1. We've added the three possible OMR labels to the Local Entries list of the List Values property.
  2. Grooper determined there was a checked box next to the label FARM (Agriculture).
  3. That label is returned, populating the Data Field.

2021-labeled-omr-how-to-13.png


As an added benefit for a document review step, the normal functionality of the List Values property is also implemented.

  1. The review user will be presented with a dropdown selection list containing the OMR labels you added to the List Values list.

2021-labeled-omr-how-to-14.png

This method's strength and limitation lies in its simplicity. It will not work for every situation.

  • If you cannot extract an OMR label by matching it to a simple list item (and instead must rely on more advanced extractors or techniques to match an OMR label), the List Values method will not work.
  • If you need to translate an OMR label to a different output value, the List Values method is inferior. A Label Extractor or Label Sets are better suited for formatting OMR labels to the value you want to output.

Using Label Sets

Harness the power of Label Sets. Simple set up. Easy output translation.

Grooper's Label Set functionality provides powerful document extraction and classification capabilities by leveraging the prevalence and utility of field labels. Labeled OMR is a "Label Set aware" extractor. OMR labels can be collected for a Data Field's set of labels and used at time of extraction in place of a Label Extractor.

If you are using Label Sets in your solution, this approach will most likely be the one for you. The setup is fairly simple, and translating/formatting your output is a breeze.

  • This article presumes you have some awareness of Label Sets and the Labeling Behavior. For more information, please visit the Label Sets article.


For this method, OMR labels are located using labels collected in a Document Type's Label Set. The only "trick" to this method is giving Grooper some additional information so that you can collect labels for the OMR values.

  1. We've already enabled the Labeling Behavior on our Content Model.
  2. We've navigated to the "Labels" tab to collect labels for our Data Field.
  3. The "13. Type of Cow" Data Field is as of yet unconfigured.
  4. Notice we only have three label types available for capture.
    • The Header label.
    • The Footer label.
    • The Static label.
  5. Which of these do we use to capture our three OMR labels ("DOMESTIC (Home)", "FARM (Agriculture)" and "SHOW (Beauty)")?
    • None of them!
    • Keep in mind OMR fields are going to behave differently on a document. We have three labels here. Another document might have ten. Or twenty. Or just a single yes/no style checkbox. The label types we have here just aren't going to cut it.

2021-labeled-omr-how-to-15.png


Grooper doesn't know how many checkboxes and corresponding labels you have until you give it some more information. We need to collect individual labels for each checkbox. We must define each OMR label first using the Data Field's List Values property.

  1. Select the Data Field.
  2. Scroll to the bottom of the Data Field's property grid and expand the List Values property.
  3. Select the Local Entries property and press the ellipsis button at the end.
  4. This will bring up a List Editor window.
  5. The difference between using Label Sets and using List Values alone is that you can enter whatever you want for the OMR labels' values here.
    • Imagine we have a requirement in whatever backend system stores these values. Each result can only be "DMST" "FARM" or "SHOW". Well, "DMST" doesn't even appear on the document. But, it doesn't matter. Grooper isn't going to use these values to match against the document. They're merely providing a list of values for which we will capture a label later.

2021-labeled-omr-how-to-16.png


Now, we will be able to collect labels for each item we've added to the list.

  1. Navigate back to the Content Model
  2. Navigate to the "Labels" tab.
  3. Select the document assigned the Document Type whose labels you want to collect.
  4. Select the Data Field whose OMR values you've added to its List Values list.
  5. Check it out! We now have labels we can collect. You are able to capture labels for each value added to the list.
  6. Capture the full label for the corresponding OMR value.
    • Grooper will use the captured label to locate the checkbox and determine its state, but output the value in the List Values list.

2021-labeled-omr-how-to-17.png


All that's left is to assign the Data Field's extractor.

  1. Select the Data Field in the Node Tree.
  2. Change the Value Extractor property to Labeled OMR.
  3. The extractor will use the collected labels in the Label Set to match the OMR labels, find nearby checkboxes, and determine their check states.
  4. The output value will be the corresponding value we entered in the List Values list for the label.

2021-labeled-omr-how-to-18.png
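The relationship between List Values entries and captured labels can be pictured as a simple lookup: the captured label is what gets matched on the document, and the List Values entry is what gets output. The sketch below is a hypothetical model of that relationship in Python, not Grooper's Label Set storage format. The `captured_labels` dictionary and `output_for_checked` function are names invented for the example.

```python
# Illustrative sketch only; a hypothetical model of the Label Set mapping.

# Keys are the List Values entries (the output values).
# Values are the labels captured for them on the "Labels" tab.
captured_labels = {
    "DMST": "DOMESTIC (Home)",
    "FARM": "FARM (Agriculture)",
    "SHOW": "SHOW (Beauty)",
}


def output_for_checked(captured, checked_label_text):
    """Return the List Values entry whose captured label was checked."""
    for output_value, label_text in captured.items():
        if label_text == checked_label_text:
            return output_value
    return None


# "DMST" never appears on the document, but it is still the output.
print(output_for_checked(captured_labels, "DOMESTIC (Home)"))  # DMST
```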


There is an order of operations if you configure multiple OMR label extraction/collection methods.

  1. Label Extractor
    • If the Label Extractor property is configured, it always takes priority and will be used to locate OMR labels.
  2. Label Sets
    • If the Label Extractor property is not configured, collected labels in the Document Type's Label Set will take priority.
  3. List Values
    • If the Label Extractor property is not configured, and no labels are collected, the List Values list items will be used to locate OMR labels.
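The order of operations above amounts to a simple fallback chain. A minimal Python sketch follows. The `pick_label_source` function and its arguments are hypothetical and only mirror the priority described here.

```python
# Illustrative sketch only; mirrors the documented priority order.

def pick_label_source(label_extractor_configured, label_set_labels, list_values):
    """Return which source would supply the OMR labels, or None."""
    if label_extractor_configured:
        return "Label Extractor"
    if label_set_labels:
        return "Label Sets"
    if list_values:
        return "List Values"
    return None


# A configured Label Extractor wins even when the other sources exist.
print(pick_label_source(True, ["FARM (Agriculture)"], ["FARM"]))   # Label Extractor
# With no Label Extractor, collected labels take priority.
print(pick_label_source(False, ["FARM (Agriculture)"], ["FARM"]))  # Label Sets
# With neither, the List Values entries are used.
print(pick_label_source(False, [], ["FARM"]))                      # List Values
```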

Maximum Noise

The concept of character noise is important to how Grooper isolates and filters out OMR label groups. A noise character is any alphanumeric character (not punctuation characters) that falls between OMR labels. Typically, OMR labels are grouped close together on a document with little to no other text between the labels. Grooper will filter out label matches with large numbers of characters between them.

For example, take these checkboxes using the labels "True" and "False".

Yes, the labels are nearby checkboxes, but those same labels exist in the sentences to the right. How does Grooper distinguish between the OMR labels and those same words otherwise popping up on the document? Noise.

2021-labeled-omr-noise-01.png

First, Grooper will draw a box around the boundaries of each OMR label.

2021-labeled-omr-noise-02.png

Then, Grooper will drop out the labels and count the characters remaining within each boundary.

These remaining characters are "noise". Grooper will count each character (alphanumeric only) to establish a noise count between potential OMR labels in a group.

2021-labeled-omr-noise-03.png

In cases where there are multiple label hits on the document, Grooper will use whichever OMR label group has the least amount of noise.

2021-labeled-omr-noise-04.png

The Maximum Noise property dictates how much noise is allowable between OMR labels in a group. By default, only 5 noise characters are allowed. This means if there are more than 5 noise characters between your OMR labels, Grooper will always toss out your OMR labels. However, this can be adjusted.


Let's go back to our previous example and take another look at the Label Extractor we used to locate OMR labels for the "Type of Cow Applied for" field.

  1. We used a List Match extractor to match these OMR labels.
  2. Remember, we said it was best practice to capture the full label when possible.
  3. Let's go against our own best practice advice and only capture part of the OMR labels.
    • i.e. FARM instead of FARM (Agriculture)

2021-labeled-omr-how-to-22.png


  1. If we try to extract the field at this point, we will fail. No value is extracted.
    • Why? Noise!
  2. The Maximum Noise property's default value is 5.
    • This means there can be only 5 noise characters between the labels before Grooper tosses out the match.
  3. We have more than 5 noise characters. We have 15.

2021-labeled-omr-how-to-23.png


This property is, however, adjustable to allow for a higher noise threshold.

  1. We have 15 noise characters. So, if we change the Maximum Noise property to 15...
  2. The OMR value extracts.

2021-labeled-omr-how-to-24.png
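The noise count in this example can be reproduced with a small Python sketch. It is a deliberate simplification: Grooper works with label boundaries on the page, while this sketch treats the label group as a single line of text. The `count_noise` and `within_noise_limit` functions are hypothetical names for illustration.

```python
# Illustrative sketch only; a one-line simplification of the noise idea.

def count_noise(line_text, labels):
    """Count alphanumeric characters between labels, ignoring the labels."""
    spans = []
    for label in labels:
        start = line_text.find(label)
        if start == -1:
            return None  # a label is missing, so there is no group to score
        spans.append((start, start + len(label)))
    if not spans:
        return None
    spans.sort()

    noise = 0
    cursor = spans[0][0]  # the group starts at the first label
    for start, end in spans:
        # Punctuation and spaces are skipped; only letters and digits count.
        noise += sum(ch.isalnum() for ch in line_text[cursor:start])
        cursor = max(cursor, end)
    return noise


def within_noise_limit(line_text, labels, maximum_noise=5):
    """Accept the label group only if its noise does not exceed the limit."""
    noise = count_noise(line_text, labels)
    return noise is not None and noise <= maximum_noise


line = "DOMESTIC (Home) FARM (Agriculture) SHOW (Beauty)"
partial = ["DOMESTIC", "FARM", "SHOW"]
full = ["DOMESTIC (Home)", "FARM (Agriculture)", "SHOW (Beauty)"]

print(count_noise(line, partial))             # 15
print(within_noise_limit(line, partial))      # False
print(within_noise_limit(line, partial, 15))  # True
print(count_noise(line, full))                # 0
```

The 15 comes from "Home" (4 characters) plus "Agriculture" (11 characters), which sit between the partial labels. "Beauty" falls after the last label, so it is outside the group and is not counted.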

Header Labels

A "Header label" is a label that helps identify fields, sections or columns of information on a document. Header labels help distinguish the type of information listed on a document. They can be a huge help in identifying OMR checkbox labels that are very generic.

For example, "yes or no" style checkboxes are common on many forms, often with multiple questions asking "Yes" or "No" in response to a series of questions. On several (if not all) voter registration applications in the United States, you will see two questions.

  • Are you a US Citizen?
  • Will you be over 18 at the time of the election?

Often these are styled as OMR checkboxes with "yes or no" checkboxes next to them.

The problem is the OMR labels are the same for both fields ("YES" and "NO"). How is Grooper going to disambiguate between the fields? How will it know if you want to collect a value for one question or the other?

2021-labeled-omr-header-ex-01.png

A Header Label will do the trick! In cases where you have multiple OMR label hits, Grooper will use the labels with the least amount of noise.

So, if we gave the Labeled OMR extractor a Header Label matching the text Over 18?, it would utilize the second set of "YES/NO" OMR labels.

2021-labeled-omr-header-ex-02.png

FYI

Header labels can be vertically aligned or horizontally aligned with the OMR labels, depending on your document's structure.

2021-labeled-omr-header-ex-03.png

You can establish a Header Label in one of two ways:

  1. Using the Labeled OMR extractor's Header Extractor property.
  2. Collecting an OMR Data Field's Header label when using Label Sets.

Using the Header Extractor Property


  1. Imagine we want to collect the answer to the question shown here.
    • This is one of many "yes or no" style checkboxes on this document.
  2. We can use the following Header Label to help narrow down which group of "Yes/No" checkboxes we want to return.
    • a. Has the applicant completed the Cow Operating Training Program accredited by the National Bovine Accrediting Board?

2021-labeled-omr-how-to-25.png


  1. Without a Header Label, we're not collecting the correct value.
    • This "Yes/No" checkbox pertains to some different value.
  2. We will configure the Header Extractor to filter out false positive label groups.
  3. You can use whatever extraction method works best for your document set. We will use the List Match extractor (which is generally the most common for matching labels like this).

2021-labeled-omr-how-to-26.png


  1. We've entered the full label identifying the "Yes/No" field we want to extract.
  2. Now the Labeled OMR extractor will have some additional context and use this as an anchor for the OMR label groups.

2021-labeled-omr-how-to-27.png


  1. The Header Extractor's result will be required to find the OMR Labels.
    • This will always be outlined in blue on the document.
  2. The nearest group of OMR labels will be used to return a value.
  3. Ultimately, this gets us the result we want, the "Yes" or "No" OMR value labeled by our Header Label.

2021-labeled-omr-how-to-28.png
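The "nearest group" idea can be sketched with a little geometry. This Python example is an assumption-heavy illustration: the rectangle format `(left, top, width, height)`, the center-to-center distance measure, and the `nearest_group` function are all hypothetical, and Grooper's actual proximity logic may differ.

```python
# Illustrative sketch only; Grooper's proximity logic may differ.
import math


def center(rect):
    """Return the center point of a (left, top, width, height) rectangle."""
    left, top, width, height = rect
    return (left + width / 2, top + height / 2)


def nearest_group(header_rect, groups):
    """Return the name of the label group closest to the header."""
    hx, hy = center(header_rect)

    def distance(item):
        gx, gy = center(item[1])
        return math.hypot(gx - hx, gy - hy)

    return min(groups.items(), key=distance)[0]


# Hypothetical page coordinates for a header and two "Yes/No" groups.
header = (50, 300, 400, 20)
groups = {
    "first_yes_no": (500, 200, 120, 20),
    "second_yes_no": (500, 300, 120, 20),
}

print(nearest_group(header, groups))  # second_yes_no
```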

Using Label Sets

If you're using Label Sets, all you need to do is collect a Data Field's Header label for your Document Types.

  1. Navigate to the "Labels" UI.
  2. Select a sample of the Document Type whose labels you want to collect.
  3. Select the OMR Data Field.
  4. Select the Header tab.
  5. Collect the label from the document.
  6. You must also create OMR values using the Data Field's List Values property and collect labels for them before proceeding.

2021-labeled-omr-how-to-29.png


  1. The collected Header labels will be used in place of the Header Extractor.
  2. The Header label will be required to find the OMR labels.
    • This will always be outlined in blue on the document.
  3. The nearest group of OMR labels will be used to return a value.
  4. Ultimately, this gets us the result we want, the "Yes" or "No" OMR value labeled by our Header Label.

2021-labeled-omr-how-to-30.png

Configure the Extractor Part 2: OMR Modes

After you locate the OMR labels you must determine how checkboxes behave on your document and choose an OMR mode.

Checkboxes detail information in one of three ways, either:

  • You will have several checkboxes next to several label options, giving you a list of choices. Of these choices, you may choose only one.
  • You will have several checkboxes next to several label options and you may choose multiple.
  • You will have a single checkbox and it really just matters whether the checkbox is checked or not.

Labeled OMR has three corresponding Modes to account for this:

  • CheckOne
  • CheckMulti
  • Boolean

How your document is formatted informs how the checkboxes behave, which will inform which mode you choose.

CheckOne

The CheckOne mode is the default OMR mode. This mode presumes that, however many checkboxes are present on the document, only one box may be checked. In other words, only one choice may be selected on the document.


The checkbox values we extracted in the "Configure the Extractor Part 1: OMR Labels" section of this article all used the default CheckOne mode.

  1. The Mode property is set to CheckOne.
  2. Only one of our three OMR labels can be selected on the document.
    • Either the "Type of Cow Applied for" is going to be "DOMESTIC" "FARM" or "SHOW"
  3. A single value is returned for the single box checked on the document.

2021-labeled-omr-how-to-19.png

CheckMulti

The CheckMulti mode presumes that, of multiple checkboxes, any number may be checked. In other words, multiple choices may be selected on the document.

For example, this portion of the "Application For Cow Ownership" document has a "Cow Familiarity" section. The applicant would check each box that applies to their familiarity with cows (and cow-like lifeforms). So, multiple selections may be made. One box could be checked. Five boxes could be checked. Maybe even none.

2021-labeled-omr-checkmulti-example.png


  1. To use CheckMulti mode, change the Mode property to CheckMulti.
  2. Any number of our OMR labels may be selected.
    • In the case of this document, five are selected.
  3. This will return a concatenated string of every OMR label with a checked box next to it.
    • FYI: We used a List Match extractor for the Label Extractor and translated each label to its bullet character (i.e. "A. Know what a mammal is" was translated to "A").
  4. Each label result is separated with a separator string, defined by the Separator String property.
    • This is a single space character by default. However, you can change it to any character you want.
    • For example, if you entered a comma (,), this would create a comma-separated list of OMR labels (i.e. A,B,C,G,H).

2021-labeled-omr-how-to-20.png
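The CheckMulti output is essentially a join over the checked labels. Here is a minimal Python sketch. The `check_multi` function and the `(label, is_checked)` pairs are hypothetical, with the default separator set to a single space to match the default described above.

```python
# Illustrative sketch only; mimics the CheckMulti output format.

def check_multi(label_states, separator=" "):
    """Join every checked label with the separator string."""
    checked = [label for label, is_checked in label_states if is_checked]
    return separator.join(checked)


# Translated labels A through H, five of which are checked.
states = [
    ("A", True), ("B", True), ("C", True), ("D", False),
    ("E", False), ("F", False), ("G", True), ("H", True),
]

print(check_multi(states))       # A B C G H
print(check_multi(states, ","))  # A,B,C,G,H
```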

Boolean

The Boolean mode presumes there is a single checkbox and you're trying to determine if it is checked or not. For the CheckOne and CheckMulti modes, the goal is to extract information about which of a set of boxes are checked, returning OMR labels as the value. For the Boolean mode, you don't care about the label. Instead you want a "True/False" value depending on if the box is checked (True) or not (False).

For example, on our "Application For Cow Ownership" form there is an "Electronic Correspondence Option" field on the back page. Checking the box indicates the applicant wants to receive electronic mail. Leaving it unchecked indicates they want to receive paper mail. The Boolean mode will allow us to extract the value "True" if the box is checked or "False" if not.

2021-labeled-omr-boolean-example.png


  1. To use Boolean mode, change the Mode property to Boolean.
  2. We want to determine if a single box next to an OMR label is checked or unchecked.
    • In the case of this document, it is unchecked.
  3. The Labeled OMR extractor will return "True" or "False" depending on the box's check state.
    • If the box is checked, the Labeled OMR extractor will return True.
    • If the box is unchecked, the Labeled OMR extractor will return False.
  4. Optionally, you may change the output value using the Value If Checked and/or Value If Unchecked properties.

2021-labeled-omr-how-to-21.png
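Boolean mode boils down to choosing between two output values. The Python sketch below illustrates this. The `boolean_mode` function is hypothetical, and its two keyword arguments are stand-ins for the Value If Checked and Value If Unchecked properties. The "Electronic" and "Paper" outputs are made-up examples of custom values.

```python
# Illustrative sketch only; mimics the Boolean mode output.

def boolean_mode(is_checked, value_if_checked="True", value_if_unchecked="False"):
    """Return the configured output for a single checkbox's state."""
    return value_if_checked if is_checked else value_if_unchecked


# Default outputs.
print(boolean_mode(True))   # True
print(boolean_mode(False))  # False

# Hypothetical custom outputs for the correspondence option.
print(boolean_mode(False, "Electronic", "Paper"))  # Paper
```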

Additional Information

The Labeled OMR extractor will detect checkboxes in one of two ways:

  1. Using Layout Data
  2. Using circular box detection

Layout Data Box Detection

Prior to Grooper version 2021, Layout Data was always required. This means checkbox locations and their check states would need to first be detected by a box detecting IP Step (either Box Detection or Box Removal) in an IP Profile. When the IP Profile is executed, this data is stored in a page's "Grooper.LayoutData.json" file and used at time of extraction to find checkboxes near OMR labels and determine whether or not the box is checked.

FYI

Layout Data is saved when certain IP Commands execute in an IP Profile, such as Box Removal or Box Detection. This occurs when the IP Profile is applied during the following Grooper Activities:

  • Image Processing when an IP Profile is applied for permanent image cleanup.
  • Recognize when an IP Profile is assigned to an OCR Profile for temporary image cleanup prior to OCR.
  • Recognize when an IP Profile is assigned to the Alternate IP property to obtain Layout Data from native-text PDF documents.


  1. Box Detection is a box detecting IP Step.
  2. Its property grid is configured to define characteristics of the checkboxes on our documents, such as their size and aspect ratio, and any image preprocessing done before detecting boxes.
  3. When the box detecting step executes, checkbox physical locations are stored in the document/page's Layout Data file, along with their check states.
    • Seen in the "Boxes" diagnostic image. Checked boxes are green. Unchecked boxes are red.

2021-labeled-omr-how-to-31.png

In version 2021, the Labeled OMR extractor's functionality was expanded to allow for circular box detection at extraction time without Layout Data. Labeled OMR performs a box detection pass during extraction, analyzing "box-like" shapes near the OMR labels. Because of this, Labeled OMR can now also detect rectangular boxes at extraction time (we will see this in the next section).

However, this is not preferred.

If you can collect Layout Data for rectangular checkboxes, you should. Grooper will always prioritize Layout Data over "circular box detection". Furthermore, you have much more control and customizability as to how rectangular checkboxes are detected using the Box Detection (or Box Removal) property grid.

Circular Box Detection (Radio Buttons)

Prior to version 2021, Grooper could only perform OMR using square or rectangular checkboxes. Circular checkboxes, also known as "radio buttons", could not be detected. This is because only square or rectangular checkboxes can be detected and saved to a document's Layout Data.

  • There is only Box Detection and Box Removal. There is no "Circle Detection" image processing command that can be applied globally to a page. What looks an awful lot like an unchecked radio button? The letter "O". You'd end up with a lot of false positive hits on O's and zeroes and other circular artifacts that are not circular checkboxes.


In version 2021, if no checkboxes are present in the document/page's Layout Data near the OMR labels, the Labeled OMR extractor will perform circular box detection at extraction time.

  1. Even though these are radio buttons, Grooper is still able to detect which button is selected and return a result.
  2. The Labeled OMR extractor analyzes the pixels nearby extracted labels to determine if something is a checkbox, and whether or not it is checked.
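
The order of operations described above (Layout Data first, extraction-time detection only as a fallback) can be sketched in Python as follows. Everything here is hypothetical and for illustration: `find_checkbox` and its arguments are not part of Grooper, and the detection pass itself is represented by a placeholder callable.

```python
def find_checkbox(layout_boxes_near_label, detect_at_extraction):
    """Pick the source of a checkbox for one OMR label.

    layout_boxes_near_label: boxes already stored in Layout Data near the label.
    detect_at_extraction: callable standing in for the extraction-time
        (circular) box detection pass; only called when no stored box exists.
    """
    if layout_boxes_near_label:
        # Layout Data is always preferred when it is available.
        return layout_boxes_near_label[0], "layout data"
    return detect_at_extraction(), "extraction-time detection"


# A page with Layout Data: the stored box is used and no detection pass runs.
box, source = find_checkbox([{"checked": True}], lambda: {"checked": False})
print(source)

# A page without Layout Data near the label: the fallback detection runs.
box, source = find_checkbox([], lambda: {"checked": True})
print(source)
```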

2021-labeled-omr-how-to-32.png


FYI

If you're troubleshooting Labeled OMR's box detection, you may find the "Diagnostics" tab helpful.

This tab contains an execution log as well as various diagnostic images detailing how Grooper filtered out the OMR labels, processed the nearby area where checkboxes are expected to be, and ultimately detected the OMR checkboxes.

2021-labeled-omr-how-to-33.png


Because Labeled OMR has this circular checkbox detection functionality, it can and will also detect rectangular checkboxes at time of extraction if Layout Data is not present. (However, if you can detect checkboxes using the traditional Layout Data method, it is still generally preferable to do so.)

  1. This document does not have Layout Data collected.
  2. It does return the correct result.
  3. However, note the "Confidence" score is quite low, 50%.
    • This is because the extractor found the box using circular box detection. In other words, the checkbox is roughly 50% similar to a perfect circle.
    • The "at extraction" box detection seen here is really geared towards circular box detection. While it can detect rectangular boxes, that is not its primary function.
    • Box Detection image processing commands, on the other hand, are designed to detect rectangular boxes and should be used as the primary means of locating rectangular boxes, when possible.

2021-labeled-omr-how-to-34.png

Labeled OMR is the only OMR based extractor type that will function without Layout Data.

Both Ordered OMR and Zonal OMR must have Layout Data present to function.

Version Differences

Radio Button Detection (2021)

In version 2021, the Labeled OMR extractor's functionality was expanded to allow for radio button extraction. Prior to 2021, Grooper could only perform OMR using square or rectangular checkboxes. In 2021, Labeled OMR is able to analyze checkboxes at the time of extraction, granting it the ability to detect radio buttons.

  • This also increased the extractor's functionality in that Layout Data is no longer strictly required for the extractor to function. It will use that Layout Data if present (which is preferred when possible), but it can now analyze the image at time of extraction if Layout Data is not present.

Extractor Expansion (2021)

Prior to version 2021, the Labeled OMR extractor type was only accessible using the Value Extractor property of certain objects, such as Data Fields.

In version 2021, all extractor types were expanded to the various extractor properties of all objects. This allows the Labeled OMR extractor to be utilized in ways never before possible.

  • For example, a Data Type extractor can now use Labeled OMR as its Local Extractor's extractor type. Prior to 2021, you could not use Labeled OMR to extract OMR labels using an extractor object.

Labeled OMR Introduction (2.90)

In version 2.80, Labeled OMR is referred to as Anchored OMR. The two features are configured and function similarly.

Prior to version 2.80, this functionality would have been performed using the "Data Element Profiles" tab of a Document Type and drawing "OMR Zones" around the checkboxes to read their check states. Grooper has moved away from "Data Element Profiles" in favor of configuring the functionality directly on Data Elements in a Data Model, using extractor types such as Labeled OMR.