2.80:OMR Reader (Result Post Processor)

From Grooper Wiki
Revision as of 10:02, 22 December 2023 by Dgreenwood (talk | contribs)

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


LEGACY TECHNOLOGY DETECTED!!

The OMR Reader post processor is still configurable for Data Type extractors in Grooper. However, this is a largely outdated way of doing things as of version 2021.

Now, it is more likely you would use the Labeled OMR extractor type to accomplish the same end goal.

The OMR Reader post processor selected on a Data Type's property panel.

OMR Reader is a Post Processing option for Data Type extractors. It determines whether labeled checkboxes are checked or not and, if checked, outputs the label as its result.

About

Documents use checkboxes to make our lives easier. They are particularly prevalent on structured forms. They give the person filling out the form the ability to just check a box next to one of a series of options rather than typing in the information.

However, most of Grooper's extraction centers around regular expressions: matching text patterns and returning the result. There isn't necessarily a character that matches a checked checkbox, so regular expressions alone are not going to cut it to determine if a box is checked or not.

This is where OMR comes into play. OMR stands for "Optical Mark Recognition". OMR determines checkbox states. The basic idea behind it is very simple. First, find a box. A box is just four lines connected to each other in a square-like fashion. If that box has a mark of some kind inside it, it is checked. If not, it is unchecked. Checked (or marked) boxes, whether marked with an "x", a checkmark, or a filled block, will have more black pixels inside the box than an unchecked (or unmarked) one. If the amount of black pixels inside the detected box exceeds a threshold, it's checked (or marked). If not, it's unchecked (or unmarked).
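The pixel-counting idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Grooper's implementation: the function names, the 0/1 pixel grid, and the 0.15 threshold are all assumptions made for the example.

```python
def fill_ratio(box):
    """Return the fraction of black pixels (1s) inside a box interior."""
    total = sum(len(row) for row in box)
    black = sum(sum(row) for row in box)
    return black / total if total else 0.0


def is_checked(box, threshold=0.15):
    """Treat the box as checked when enough of its interior is black.

    The threshold is a hypothetical value chosen for this sketch.
    """
    return fill_ratio(box) >= threshold


# A 5x5 box interior with an "x" drawn through it (1 = black pixel).
marked = [
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]
# The same box with nothing drawn inside it.
empty = [[0] * 5 for _ in range(5)]

print(is_checked(marked), is_checked(empty))
```

In this toy example the marked box has 9 black pixels out of 25, well above the assumed threshold, while the empty box has none.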

A simple example would be a document asking a question and giving two boxes to check, “Yes” or “No.” For example, see the portion of the document below asking if the applicant is a U.S. Citizen. “Yes” and “No” would be the labels. Either “Yes” or “No” would be the field's final result, depending on which box is checked. In this case, the result is "Yes".

The OMR Reader Post Processing option allows a Data Type to use checkboxes to return data from a document.

In general, what you want to extract is the text of the checked label. The OMR Reader allows you to do just that. You will set up the Data Type to locate the text label. Grooper's OMR detection will determine if the box next to the label is checked. If it is, the label is returned as the Data Type's result.

Use Cases

Any document using checkboxes can take advantage of this functionality. There is a wide variety of use cases, including application forms, surveys, and questionnaires.

How To

Configure the Extractor

Checkboxes convey information in one of three ways:

  • You have several checkboxes next to several label options, giving you a list of choices. Of these choices, you may choose only one.
  • You have several checkboxes next to several label options, and you may choose more than one.
  • You have a single checkbox, and all that matters is whether it is checked or not.

OMR Reader has three corresponding Modes to account for this:

  • CheckOne
  • CheckMulti
  • Boolean

How your document is structured will inform which mode you choose. However, most of the extractor's configuration is the same regardless of which one you choose.


Prereqs - Box Detection

In order for an OMR Reader enabled Data Type to return a result, it needs to be able to find a checkbox and tell whether that box is checked or unchecked. The checkbox locations and "check states" (checked or unchecked) must be saved to a page before extracting the value.

This information is saved to a Batch Page object's "LayoutData.json" file during permanent or temporary image processing, using a Box Detection or Box Removal command.

This means you must execute an IP Profile with a Box Detection or Box Removal command as one of its IP Steps.

  1. Here, we have an IP Profile titled "Layout Data".
  2. It has a Box Detection command as its first step.
  3. You can verify boxes are detected using the "Boxes" diagnostic image.
  4. Detected checked boxes are green.
  5. Detected unchecked boxes are red.

Then the IP Profile must be run on the pages in the Batch using the Image Processing activity (for permanent image processing) or the Recognize activity (for temporary image processing).

Extract the Checkbox Labels

The general purpose of the OMR Reader is to return the labels of boxes that are checked. First things first, we need our Data Type to locate and return the labels next to the checkboxes.

  1. Create and select a Data Type.
    • The one we're configuring here is named "OMR - (Check Multi) This estimate includes".
  2. Configure the Data Type to return the desired labels.
    • In this case, the labels are returned with a simple regular expression pattern. We are choosing to do this with the Pattern property.
    • However, labels can also be returned using child Data Formats or Data Types or using a Referenced Extractor.
  3. Our goal is to determine which of the following three options are checked: Property Taxes, Homeowner's Insurance, or Other.

The goal of the extractor is to produce a result for each checkbox option. We have created a regular expression pattern that is an "or separated list", with each label followed by the vertical pipe character (|) except the last.

  1. We have entered the following pattern in our Value Pattern:
Property Taxes|
Homeowner's Insurance|
Other:
  2. All three labels next to the checkboxes are returned as individual values.
  3. You may notice we're getting a lot more than just these three values in our results list.
    • That's actually ok in this case. None of these other results have boxes next to them. Because OMR Reader is looking for results that have checkboxes next to them, these results will ultimately be thrown out.
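To see how an "or separated list" behaves, here is a small sketch using Python's re module as a stand-in for Grooper's pattern editor. The three labels come from the pattern above. The sample text is hypothetical and only approximates what OCR might return for this part of the form.

```python
import re

# The three labels from the article, joined into a single alternation.
LABEL_PATTERN = re.compile(r"Property Taxes|Homeowner's Insurance|Other:")

# Hypothetical OCR text for the area of the form containing the checkboxes.
sample_text = (
    "This estimate includes\n"
    "Property Taxes\n"
    "Homeowner's Insurance\n"
    "Other: Flood Insurance\n"
)

# Each alternative that matches is returned as its own result, in page order.
labels = [match.group(0) for match in LABEL_PATTERN.finditer(sample_text)]
print(labels)
```

Each label comes back as a separate value. At this stage nothing is known about check states; the post processor described next is what narrows the list down.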

Configure the Post Processor

Now, we will use the Post Processing property of this Data Type to return labels next to checked boxes.

  1. Select the Post Processing property.
  2. Choose OMR Reader from the dropdown list.

Verify Extraction

We will go ahead and test our extraction and see what results we get. However, before continuing, we will point out the Box Location and Max Distance properties.

Box Location determines where the box is in relation to the label. The default of West assumes the box will be to the left of the label. Changing it to East assumes the box is to the right, and so on.

The Max Distance property sets the maximum space allowed between a box and a label. You expect checkboxes to be fairly close to their labels. If a box is on the other side of a page from a label, it typically does not pertain to that label at all. The default of 0.25in works well in most cases, but it can be edited.
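The pairing of a label with a nearby box can be pictured with the following Python sketch. It is a simplified model under stated assumptions, not Grooper's actual matching logic: the Rect class, the find_box_west function, and the vertical-overlap rule are all hypothetical, and only the West case with the 0.25 inch default is shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rect:
    """Axis-aligned rectangle in inches; y grows downward, as on a page image."""
    left: float
    top: float
    right: float
    bottom: float


def overlaps_vertically(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share some vertical extent (same text row)."""
    return a.top < b.bottom and b.top < a.bottom


def find_box_west(label: Rect, boxes: List[Rect],
                  max_distance: float = 0.25) -> Optional[Rect]:
    """Return the nearest box to the left of the label within max_distance."""
    best = None
    best_gap = None
    for box in boxes:
        gap = label.left - box.right  # horizontal space between box and label
        if gap < 0 or gap > max_distance:
            continue  # box is not to the west, or is too far away
        if not overlaps_vertically(label, box):
            continue  # box sits on a different row
        if best_gap is None or gap < best_gap:
            best, best_gap = box, gap
    return best


label = Rect(1.0, 2.0, 2.5, 2.2)
near_box = Rect(0.75, 2.0, 0.9, 2.15)   # about 0.1in left of the label
far_box = Rect(0.2, 2.0, 0.35, 2.15)    # about 0.65in left: beyond the default
other_row = Rect(0.75, 3.0, 0.9, 3.15)  # close horizontally, but on another row

print(find_box_west(label, [far_box, other_row, near_box]))
```

Only near_box qualifies here. Tightening max_distance to 0.05 would leave the label with no box at all, which is the situation in which a label would be discarded.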

  1. Press the "Test Single" button to refresh our extraction results.
  2. Instead of each label result coming in at 100% confidence, the label next to the checked box (Homeowner's Insurance) is returned with a 27% confidence.
  3. If we're only looking for the most confident result, this appears to be doing its job. The label next to the checked box is returned first, as the most confident result.

Not so fast. Let's check another document.

  1. Any combination of these boxes may be checked. One might be checked. They all might be checked. Even none might be checked!
  2. However, both checked labels are returned with the same confidence, here 30%.
  3. That is because the default Mode property of CheckOne assumes only one box may be checked. CheckOne isn't really the best Mode for this situation. We need to use a different Mode.

CheckMulti Mode

Since multiple boxes may be checked, we need to change the Mode property to CheckMulti. This will return all results as a concatenated string.

  1. Select the Mode property.
  2. Choose CheckMulti.

Upon extraction now, the two results return as a single string value, "Homeowner's InsuranceOther:"

  1. If you wish, you may use the Separator String property to insert a character (or several characters) between each result.
  2. For example, here we set the Separator String's value to the pipe character (|), returning a pipe delimited list, "Homeowner's Insurance|Other:".
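The effect of the Separator String can be sketched as a simple join over the checked labels. The check_multi function and the (label, checked) tuples are hypothetical names used for illustration; the labels and check states mirror the example document above.

```python
def check_multi(labels_with_state, separator=""):
    """Join the labels of all checked boxes, in document order."""
    return separator.join(
        label for label, checked in labels_with_state if checked
    )


# (label, checked) pairs mirroring the example document.
results = [
    ("Property Taxes", False),
    ("Homeowner's Insurance", True),
    ("Other:", True),
]

print(check_multi(results))        # no separator
print(check_multi(results, "|"))   # pipe-delimited list
```

With an empty separator the labels run together, which is why setting a Separator String is usually worthwhile when the value will be split again downstream.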

CheckOne Mode

As the name implies, the CheckOne Mode assumes only one check box is checked. It will only return a maximum of one result.

However, its Pattern is configured in much the same way. Here, we're looking for two options:

  1. the sentence indicating the lender "will allow" assumption of the loan or
  2. the sentence indicating the lender "will not allow" assumption of the loan.

Our regular expression is very similar to the one in our CheckMulti example. We've created an "or separated list" using the pipe character (|), matching the text next to the two checkboxes:

will allow|
will not allow

  1. Use the Post Processing property to enable OMR Reader.
  2. CheckOne is the default option for the Mode property.
  3. Labels next to checked boxes are returned as the most confident result.
    • Here we only have two options, "will allow" and "will not allow". "will allow" is returned at 100% confidence in this case because there are only two possible results.
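As a rough mental model of CheckOne, the sketch below returns at most one label. It is deliberately simplified: Grooper ranks results by confidence, whereas this hypothetical check_one function just takes the first label whose box is checked. The (label, checked) tuples are likewise an assumption of the example.

```python
from typing import List, Optional, Tuple


def check_one(candidates: List[Tuple[str, bool]]) -> Optional[str]:
    """Return at most one label: the first whose box is checked, else None."""
    for label, checked in candidates:
        if checked:
            return label
    return None


# (label, checked) pairs for the two options on the example document.
candidates = [
    ("will allow", True),
    ("will not allow", False),
]

print(check_one(candidates))
```

If neither box is checked, the function returns None, matching the idea that CheckOne yields a maximum of one result.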

Boolean Mode

The Boolean option will return a value of "True" or "False" depending on whether the box is checked. Sometimes checkboxes aren't used to indicate your choice from a list of options, but instead a binary "yes/no" or "true/false" type response.

In this case, the checkbox indicates whether or not an escrow account is used for the loan. Our regex only needs to capture a single label next to the box. The first part of the sentence will put us right next to that check box:

will have an escrow account

  1. Use the Post Processing property to enable OMR Reader.
  2. Change the Mode property to Boolean.
  3. Since we're looking for a boolean value rather than the text label, a value of "False" is returned if the box is unchecked and "True" if it is checked.
  4. The output value can also be changed using the Value If Checked and Value If Unchecked properties.
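The Boolean mode's output mapping can be sketched as follows. The boolean_result function and its parameter names are hypothetical; they simply mirror the Value If Checked and Value If Unchecked properties, with the "True" and "False" defaults described above.

```python
def boolean_result(checked, value_if_checked="True", value_if_unchecked="False"):
    """Map a check state to the configured output string."""
    return value_if_checked if checked else value_if_unchecked


# Default outputs.
print(boolean_result(True))
print(boolean_result(False))

# Custom outputs, as if Value If Checked and Value If Unchecked were changed.
print(boolean_result(True, "Escrow", "No Escrow"))
```

Custom values are handy when the downstream field expects something other than the literal strings "True" and "False".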

Version Differences

Prior to version 2.80, this functionality would have been performed using the "Data Element Profiles" tab of a Document Type and drawing "OMR Zones" around the checkboxes to read their check states. Grooper has moved away from "Data Element Profiles" in favor of configuring the functionality directly on Data Elements in a Data Model, using Value Extractors such as Labeled OMR or a Data Type using the OMR Reader result post processor.

Furthermore, OMR Reader gives OMR functionality to Data Type extractors where "Data Element Profiles" could only be configured to return OMR labels directly to a Data Model. Data Types can be used for any number of purposes in Grooper, not just for populating a Data Model. For example, an OMR Reader configured Data Type could be used to classify a Document Type. This functionality would not be possible prior to version 2.80.

See Also