Key-Value Pair (Collation Provider)

From Grooper Wiki
Kvp 1.png

Key-Value Pair is a Collation Provider for Data Type extractors. It uses the layout relationship between a key and a value on a document to return a result.

Key-Value Pair collation is one of the most commonly used Collation Providers. It provides an excellent way to extract data when a value exists next to a label on a document, whether next to it horizontally, vertically, or even in a "left-to-right & top-to-bottom" text flow.


About

The Key-Value Pair Collation Provider utilizes the spatial relationship between two related extractor results to return a single result, typically looking for a piece of data (the value) next to a label (the key).

For structured documents, it is common for a piece of data to be identified by some sort of label, usually to the left of it, or above it.

In these images, the field label, highlighted in blue, identifies the field's value, highlighted in yellow. We use this kind of labeling relationship to identify data on documents all the time. The Key-Value Pair Collation Provider is perfectly suited to use these labeling relationships.

Key-Value Pair collated Data Types (often just referred to as Key-Value Pairs) collate the results of two extractors, a "key extractor" and a "value extractor". The "key extractor" will locate the label (or whatever context is being used to return the data you want). The "value extractor" will return all possible values matching the data you want to return.

Once collated, the Key-Value Pair will return the closest value to the key, according to the assigned Layout settings. The top image uses a Horizontal Layout because each label sits beside its value horizontally. The bottom image uses a Vertical Layout because each label sits above its value.

Key-Value Pair 01.png

Key-Value Pair collation also has applications in unstructured document processing. Unstructured documents convey information in paragraphs and sentences more than they do with structured fields. Because of this, the value may not be horizontally or vertically aligned, but somewhere before or after a labeling key in the text flow.

For these situations, the Flow Layout can be used, which will use the relationship between the key and the value in the text data's left-to-right and top-to-bottom text flow.

A Key-Value Pair could be built to extract the driver name (highlighted here in yellow), using the phrase "driver's name" in the text flow before it.

Key-Value Pair 02.png

How To

Create A Key-Value Pair Extractor

If you would like to follow along with this tutorial, you can download the zip file below and import it into your Grooper repository (version 2.90).

Create a Data Type

Data Type extractors use Collation Providers to combine, filter, or otherwise manipulate extraction results. Collation Providers are set using the Data Type's Collation property.

So, the very first thing to do is create a Data Type. Here, we are creating the Data Type in the Local Resources folder of a Content Model.

  1. Right-click the location where you are adding the Data Type, here the Local Resources folder.
  2. Mouse over "Add" and select "Data Type".
  3. Name the Data Type. Here we are naming it "KVP - Home Phone"
    • Visit the Asset Management article for our best practices guide to naming your extractors.
  4. Press the "OK" button to create it.

Key-Value Pair - Grooper Screenshots 01.png

Create the Key Extractor

Key-Value Pair extractors must have exactly two extractors, a "Key Extractor" and a "Value Extractor". The "Value Extractor" is ultimately the value the Key-Value Pair returns. The "Key Extractor" is how you find the value. Its result will be used as a positional anchor to find the value. Our goal with the document seen here is to differentiate the "home phone" numbers from the "cell phone" numbers. So, our key extractor simply needs to find the label, "home phone".

  1. Add the "Key Extractor" to the Data Type.
    • Here, we've added a child Data Format to the Data Type and named it "KEY". However, the "Key Extractor" could also be a child Data Type, a referenced extractor, even the parent Data Type's internal Pattern.
  2. Configure the "Key Extractor" to return the key result you're looking for.
    • Here, we're looking for the label "home phone". We simply type "home phone" as the value pattern.
  3. Notice, we get three results. That's ok! The Key-Value Pair collation settings will help us narrow down the result ultimately returned.

Key-Value Pair - Grooper Screenshots 02.png

Create the Value Extractor

For Key-Value Pair extractors, the "Value Extractor" is the extractor looking for the data you ultimately want to return. Here, we're looking for a home phone number. So, we simply need an extractor that finds phone numbers.

  1. Add the "Value Extractor" to the Data Type.
    • Here, we've added a child Data Format to the Data Type and named it "VALUE". However, the "Value Extractor" could also be a child Data Type or a referenced extractor.
    • However, it cannot be the parent Data Type's internal Pattern. Key-Value Pair collation requires two extractors, the "Key Extractor" and the "Value Extractor", and the order in which the parent Data Type's extractors execute (or "fire") determines which is which. The first one to execute is always the "Key Extractor". The second is always the "Value Extractor". A Data Type's internal Pattern always executes first. Hence, if it is used as one of the two extractors, it will always be the "Key Extractor". Extractor execution order of operations is as follows:
    1. Internal Pattern
    2. Child Extractor (In order from top to bottom in the Node Tree)
      • Here, since the child Data Format named "VALUE" is below the Data Format named "KEY", the Data Format named "VALUE" is the "Value Extractor". It doesn't have anything to do with its name.
    3. Referenced Extractor (In order from top to bottom in the Referenced Extractor list)
  2. Notice, we get six results, all the phone numbers on the page. That's totally fine (Indeed, it's what we want). We will narrow down which specific phone number we're looking for using the Key-Value Pair collation settings.
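The execution order described above can be sketched in a few lines of Python. This is only an illustration of the ordering rule, not Grooper code: the DataType class and both function names are hypothetical, and extractors are represented by their names alone.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DataType:
    """Hypothetical stand-in for the extractor sources of a Data Type."""
    internal_pattern: Optional[str] = None                           # fires first, if set
    child_extractors: List[str] = field(default_factory=list)        # top to bottom in the Node Tree
    referenced_extractors: List[str] = field(default_factory=list)   # top to bottom in the list


def execution_order(dt: DataType) -> List[str]:
    """Return extractor names in the order the article describes."""
    order: List[str] = []
    if dt.internal_pattern is not None:
        order.append(dt.internal_pattern)
    order.extend(dt.child_extractors)
    order.extend(dt.referenced_extractors)
    return order


def key_and_value(dt: DataType) -> Tuple[str, str]:
    """The first extractor to fire is the key; the second is the value."""
    order = execution_order(dt)
    if len(order) != 2:
        raise ValueError("Key-Value Pair collation needs exactly two extractors")
    return order[0], order[1]


home_phone = DataType(child_extractors=["KEY", "VALUE"])
print(key_and_value(home_phone))  # ('KEY', 'VALUE')
```

Swapping the two children in the list swaps their roles, which mirrors the point that position, not name, decides which extractor is the key.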

Key-Value Pair - Grooper Screenshots 03.png

Set the Collation Provider

  1. Navigate back to the parent Data Type.
  2. Under Output, select the Collation property.
  3. Using the dropdown list, select Key-Value Pair from the list of Collation Providers.

Once you select Key-Value Pair, you will not see the results list change. It will still appear as if the two child extractors' results are being returned one by one (like the Individual Collation Provider). Some Collation Providers, such as Key-Value Pair, require some configuration before their results are collated. Specifically, you must choose which Layout method is used.

Key-Value Pair - Grooper Screenshots 04.png

Configure the Layout Settings

The Layout method will define how the "Key Extractor" and "Value Extractor" results are spatially related to each other. Is the key above the value? Is it to the left of it? The right? Configuring these settings will dictate where you expect to find the value in relation to the key.

All Collation Providers have their own set of configurable properties, including Key-Value Pair.

  1. To view the Key-Value pair properties, either double-click the Collation property or single click the arrow to the left of the property.
  2. To select and configure a Layout, enable one of the three Layout Providers.
    • The Layout may be:
      • Horizontal
      • Vertical
      • Flow

Key-Value Pair - Grooper Screenshots 09.png

FYI: It is possible to enable multiple Layouts on a single Data Type, but do so with caution.

If there is only a single key found by the "Key Extractor" (This is often the case), only a single result will return, using a single Layout. Be careful when enabling multiple Layouts as there is a certain order of operations when it comes to which layout is used first. It is as follows: First, the Horizontal Layout's result is returned. If there is no result from that layout, then the Flow Layout's result is returned. Last, if no other layouts produce results, the Vertical Layout's result is returned.

If there are multiple keys, this can get even more complicated with each key using its own Layout, according to this order of operations.
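For a single key, the fallback order just described (Horizontal, then Flow, then Vertical) can be sketched as follows. The function and the dictionary of layout callables are hypothetical and exist only to illustrate the ordering; they are not part of Grooper.

```python
from typing import Callable, Dict, Optional

# Order stated in the article: Horizontal, then Flow, then Vertical.
LAYOUT_PRIORITY = ("Horizontal", "Flow", "Vertical")


def first_layout_result(layouts: Dict[str, Callable[[], Optional[str]]]) -> Optional[str]:
    """Return the result of the first enabled layout that finds a value."""
    for name in LAYOUT_PRIORITY:
        finder = layouts.get(name)  # a missing entry means the layout is disabled
        if finder is None:
            continue
        result = finder()
        if result is not None:
            return result
    return None


enabled = {
    "Vertical": lambda: "405-555-0103",
    "Flow": lambda: "405-555-0102",
    "Horizontal": lambda: None,  # enabled, but nothing found beside the key
}
print(first_layout_result(enabled))  # 405-555-0102
```

Here the Vertical Layout also has a candidate, but it is never consulted because the Flow Layout already produced a result.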

You may find it easier or more prudent to create a separate Key-Value Pair collated Data Type for each Layout as children of a parent Data Type. You then have more control of which result is returned via the Order By property.

The Horizontal Layout

  1. View the Key-Value Pair configuration properties by expanding the Collation property.
  2. Select Horizontal Layout.
  3. Change the property from Disabled to Enabled.

This will look for results of the "Value Extractor" and return only the one closest to the right of the "Key Extractor's" result on the same horizontal line. None of the other phone numbers on the page are horizontally aligned with the key "home phone". So, only a single result is returned.
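As a rough sketch of the idea, the Horizontal Layout can be thought of as a nearest-neighbor search restricted to values on the same row as the key. The Hit class, the overlap test used to decide "same row", and the coordinates (in inches) are all assumptions made for illustration; the article does not describe how Grooper measures alignment internally.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """Hypothetical extractor result with a bounding box in inches."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def rows_overlap(a: Hit, b: Hit) -> bool:
    """True when the two boxes share some vertical extent (same text row)."""
    return a.top < b.bottom and b.top < a.bottom


def closest_to_right(key: Hit, values: List[Hit]) -> Optional[Hit]:
    """Pick the value on the key's row with the smallest gap to its right."""
    candidates = [v for v in values if rows_overlap(key, v) and v.left >= key.right]
    return min(candidates, key=lambda v: v.left - key.right, default=None)


key = Hit("home phone", 1.0, 1.00, 1.8, 1.15)
values = [
    Hit("405-555-0101", 2.0, 1.00, 3.0, 1.15),  # same row, nearest
    Hit("405-555-0102", 2.0, 1.40, 3.0, 1.55),  # a different row
    Hit("405-555-0103", 4.0, 1.00, 5.0, 1.15),  # same row, farther away
]
print(closest_to_right(key, values).text)  # 405-555-0101
```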

Key-Value Pair - Grooper Screenshots 05.png

The Vertical Layout

  1. View the Key-Value Pair configuration properties by expanding the Collation property.
  2. Select Vertical Layout.
  3. Change the property from Disabled to Enabled.

This will look for results of the "Value Extractor" and only return the closest vertically, below the "Key Extractor's" result. But... Something is not right here. What we want to extract is the phone number underneath the "home phone" label directly below it, as seen in the box labeled "Vertical Layout" on the document. But, we're also getting another result.

4. The problem is, we're picking up the wrong key. The key in the "Horizontal Layout" box uses a horizontal alignment. However, there just happens to be a number vertically below this key.

The "Key Extractor" is set up to return "home phone", which appears three times on this document. The first time it appears is at the top of the page. If you draw a vertical line down from that label, sure enough, there is a phone number there. It's just using the wrong label.

Key-Value Pair - Grooper Screenshots 06.png

Resolving Common Problems

This could coincidentally return the right value you're looking for, but more often than not, this will produce false positive results. For example, here the "Home Phone" key travels down to the middle of the page and finds a cell phone number in another section of the document. Typically, keys are much closer to their values than what we see here. We need a way to restrict the space between the two extractor results to toss out these false positives.

There are two very common ways to do this for Key-Value Pair Collation, the Maximum Distance and the Enforce Line Boundaries properties.

Maximum Distance

5a. Both Vertical Layout and Horizontal Layout have a Maximum Distance sub-property.
6a. You can set a length here, in inches, points, millimetres or centimetres. If the distance between the key and the value is greater than the length set, that result will be tossed out.
  • Here, we set the distance to 0.25in. This threw out the false positive result we were getting because the vertical distance between the first "home phone" label and the number below it is much greater than 0.25 inches. Then, the extractor looked at the next key, which does have a vertically aligned value below it, within 0.25 inches.
  • FYI: With the Maximum Distance property left blank, there is no maximum distance. Key-Value Pair collation will go from the "Key Extractor's" result all the way down the page looking for the "Value Extractor's" result. However, this will not span pages. The Horizontal and Vertical Layouts will only look for values next to a key on the same page.
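The effect of Maximum Distance on the Vertical Layout can be sketched like this. The Hit class, the column-overlap test, and the page coordinates (in inches) are hypothetical; the sketch only assumes that distance is measured from the bottom of the key to the top of the value.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """Hypothetical extractor result with a bounding box in inches."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def columns_overlap(a: Hit, b: Hit) -> bool:
    """True when the two boxes share some horizontal extent."""
    return a.left < b.right and b.left < a.right


def closest_below(key: Hit, values: List[Hit], max_distance: Optional[float] = None) -> Optional[Hit]:
    """Nearest value under the key; max_distance=None means no limit."""
    best = None
    for v in values:
        gap = v.top - key.bottom
        if not columns_overlap(key, v) or gap < 0:
            continue
        if max_distance is not None and gap > max_distance:
            continue
        if best is None or gap < best[0]:
            best = (gap, v)
    return best[1] if best else None


top_key = Hit("home phone", 1.0, 0.50, 1.8, 0.65)     # label in the horizontal box
lower_key = Hit("home phone", 1.0, 5.00, 1.8, 5.15)   # label in the vertical box
values = [
    Hit("405-555-0199", 1.0, 4.00, 2.0, 4.15),  # a cell number far below top_key
    Hit("405-555-0101", 1.0, 5.20, 2.0, 5.35),  # the home number under lower_key
]

print(closest_below(top_key, values).text)          # 405-555-0199 (false positive)
print(closest_below(top_key, values, 0.25))         # None
print(closest_below(lower_key, values, 0.25).text)  # 405-555-0101
```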

Key-Value Pair - Grooper Screenshots 07.png

Enforce Line Boundaries

The other solution to this problem will only be possible if your document utilizes lines to break up sections or fields. For documents that do this, information is visually divided from other information by putting it in a box. This example also does that. The values horizontally aligned are put in one box. The values vertically aligned are put in another box. Because of this fact, we can take advantage of the Enforce Line Boundaries property.

5b. Select the Enforce Line Boundaries property
6b. Set this to True. If there is a line physically between the key and the value on the page, that result will be thrown out.
  • Note: We set the Maximum Distance property back to blank. We're not cheating here.
This will only work if the document has line location information in its folder or page's LayoutData.json file. This information must be obtained before the extractor runs via the Image Processing or Recognize activity.
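Conceptually, Enforce Line Boundaries rejects a key and value pair when a detected line sits between them. The sketch below shows that test for a vertically aligned pair and a horizontal line. The HLine structure is hypothetical and is not the schema of LayoutData.json, which the article does not show.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HLine:
    """Hypothetical horizontal line segment, in inches from the page's top-left."""
    y: float
    x_start: float
    x_end: float


def blocked_by_line(key_bottom: float, value_top: float, x: float, lines: List[HLine]) -> bool:
    """True if any line lies between the key and the value at horizontal position x."""
    return any(
        key_bottom < ln.y < value_top and ln.x_start <= x <= ln.x_end
        for ln in lines
    )


box_edges = [HLine(y=2.0, x_start=0.5, x_end=8.0)]

# Key at the top of the page, candidate number far below in another box.
print(blocked_by_line(0.65, 4.0, 1.4, box_edges))  # True, so the pair is thrown out
# Key and value inside the same box, with no line between them.
print(blocked_by_line(5.15, 5.2, 1.4, box_edges))  # False, so the pair is kept
```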

Key-Value Pair - Grooper Screenshots 08.png

The Flow Layout

The Flow Layout is a little different. Instead of using the linear horizontal or vertical relationship between the key and value data instances, it uses their relationship in the "flow" of the text data. This method travels from the key looking for the value as an English reader would, starting at the key, going character-by-character left to right (typically) and line-by-line top to bottom.

  1. View the Key-Value Pair configuration properties by expanding the Collation property.
  2. Select Flow Layout.
  3. Change the property from Disabled to Enabled.
  4. This is not the result we're looking for. The result we're looking for is the one in the paragraph at the bottom of the document. The Flow Layout generally requires more configuration than just enabling the property.

Key-Value Pair - Grooper Screenshots 10.png

The Flow Layout by default will look for a value separated from the key by a single space character (or a value whose first character immediately follows the key in the text flow).

As you can see in the text data, the phone number value extracted is separated from the key "Home Phone" by a single space. That is why this value in particular returns with no configuration of the Flow Layout properties.

In order to change the amount of allowable text between a key and a value, either the Separator Expression or Maximum Character Distance properties must be configured.
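A minimal sketch of this default behavior, using Python's re module: each key is paired with the next value in the text, and the pair is kept only if the text between them fully matches a separator expression. The default separator here is an optional single space, mirroring the description above. The KEY and VALUE patterns, the function name, and the sample text are assumptions for illustration, not Grooper's implementation.

```python
import re
from typing import List, Tuple

KEY = re.compile(r"home phone", re.IGNORECASE)
VALUE = re.compile(r"\d{3}-\d{3}-\d{4}")  # hypothetical phone number pattern


def flow_pairs(text: str, separator: str = r" ?") -> List[Tuple[str, str]]:
    """Pair each key with the next value when the text between them
    fully matches the separator expression."""
    sep = re.compile(separator)
    values = list(VALUE.finditer(text))
    pairs = []
    for key in KEY.finditer(text):
        # First value that starts at or after the end of this key.
        nxt = next((v for v in values if v.start() >= key.end()), None)
        if nxt is None:
            continue
        between = text[key.end():nxt.start()]
        if sep.fullmatch(between):
            pairs.append((key.group(), nxt.group()))
    return pairs


text = "Home Phone: 405-555-0101\nHome Phone 405-555-0102\n"
print(flow_pairs(text))  # [('Home Phone', '405-555-0102')]
```

The first number is skipped because a colon and a space separate it from its key, which is more than the default separator allows.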

Key-Value Pair - Grooper Screenshots 11.png

The Greediest Separator Expression: The .* Expression

If you know the approximate number of characters between a key and a value in the text flow, you can use the Maximum Character Distance property. If you know the value is about 100 characters after the key, you can enter 100 for the Maximum Character Distance. If the value falls within 100 characters after the key, it will return. If it's 101 or more characters, it will not.
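The character-distance check can be sketched as a simple comparison of string offsets. This assumes the distance is counted from the end of the key to the start of the value, which the article does not specify; the function name, patterns, and sample sentence are hypothetical.

```python
import re
from typing import Optional


def value_within(text: str, key_pattern: str, value_pattern: str, max_chars: int) -> Optional[str]:
    """Return the first value after the key if it starts within max_chars characters."""
    key = re.search(key_pattern, text)
    if key is None:
        return None
    # Only look for values that start after the key ends.
    value = re.compile(value_pattern).search(text, key.end())
    if value is None:
        return None
    distance = value.start() - key.end()
    return value.group() if distance <= max_chars else None


sentence = "The driver's home phone, which was confirmed by the adjuster, is 405-555-0101."
phone = r"\d{3}-\d{3}-\d{4}"

print(value_within(sentence, r"home phone", phone, 100))  # 405-555-0101
print(value_within(sentence, r"home phone", phone, 10))   # None
```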

But what if you don't know how many characters separate the key from the value? One place people start is with a .* separator expression. The .* regular expression will match any character with a variable length of zero to unlimited characters. This expression is like setting a Maximum Character Distance of infinity.

This expression will work fine in some cases. However, if we do that here, you'll see we end up matching a lot of data we do not want.

For each key "Home Phone" found, the first phone number after it in the text flow is returned. Each value is satisfied by the Separator Expression .* used.

Key-Value Pair - Grooper Screenshots 12.png

A More Restrictive Separator Expression

Often, a more restrictive separator expression is needed. The good thing about these separator expressions is you have the whole toolbox of regular expressions at your disposal.

For the Flow Layout, you also have a set of Preprocessing Options available, including Tab Marking. With Tab Marking enabled, tab characters replace space characters where whitespace gaps are longer than a normal space character. Since we are trying to match something in a written paragraph, we shouldn't ever see a tab character between the key and the value. So, we can match what we want by enabling Tab Marking and setting the Separator Expression to [^\t]+

  1. Expand the Preprocessing Options under the Flow Layout properties.
  2. Change the Tab Marking property from Disabled to Enabled
  3. Select the Separator Expression property and enter [^\t]+
    • Now, the extractor will go character-by-character in the text flow looking for the value, traveling one to infinite characters, as long as that character is not a tab character. And, we end up getting the result we want.
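The difference between the .* and [^\t]+ separator expressions can be seen in a small Python sketch. Several things here are assumptions: mark_tabs is a crude stand-in for Tab Marking that turns runs of two or more spaces into a tab (the real feature works from whitespace gaps on the page), the .* expression is compiled with re.DOTALL so it can cross line breaks, and the patterns and sample text are hypothetical.

```python
import re
from typing import List

KEY = re.compile(r"home phone", re.IGNORECASE)
VALUE = re.compile(r"\d{3}-\d{3}-\d{4}")  # hypothetical phone number pattern


def mark_tabs(text: str) -> str:
    """Crude stand-in for Tab Marking: treat runs of 2+ spaces as one tab."""
    return re.sub(r" {2,}", "\t", text)


def flow_values(text: str, separator: str) -> List[str]:
    """Return each next-in-flow value whose gap from its key matches the separator."""
    sep = re.compile(separator, re.DOTALL)
    values = list(VALUE.finditer(text))
    found = []
    for key in KEY.finditer(text):
        nxt = next((v for v in values if v.start() >= key.end()), None)
        if nxt is not None and sep.fullmatch(text[key.end():nxt.start()]):
            found.append(nxt.group())
    return found


page = mark_tabs(
    "Home Phone     405-555-0101\n"
    "Please call the insured's home phone after 5 pm at 405-555-0102.\n"
)

print(flow_values(page, r".*"))      # ['405-555-0101', '405-555-0102']
print(flow_values(page, r"[^\t]+"))  # ['405-555-0102']
```

The form field at the top is separated from its label by a wide gap, so it carries a tab after marking and is rejected by [^\t]+, while the number in the sentence is reached through ordinary text and is kept.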

Key-Value Pair - Grooper Screenshots 13.png