2023:Labeled Value (Value Extractor)
Revision as of 10:20, 16 February 2023
WIP
This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly. This tag will be removed upon draft completion.
A Labeled Value is an extractor type that can be used when configuring several data extraction tools such as a Value Reader or Data Type. It is designed to return text segments that have a spatial relationship to a defined label.
About
A Labeled Value is configured using two extractors: the Label Extractor and the Value Extractor. When the Label Extractor is set, Grooper uses spatial context to determine what value to return (for more on spatial context, see the Data Context wiki article).
The Labeled Value extractor works much like collating a Key-Value Pair. Unlike a Key-Value Pair, however, it is self-contained in one object. There is no need to set one object as a "Key" and another object as a "Value". Instead, both the "Label" and the "Value" can be set on one object.
How To
Selecting the Extractor
Configuring on a Value Reader
Configuring on a Data Type
Configuring on Other Object Types
The Labeled Value extractor can be used on many object types. For each one, the configuration process is similar to that for Value Reader and Data Type objects. You may have to select a specific tab to find the extractor property.
Basic Setup
Once you select Labeled Value as your extractor, you will need to set both your label and your value to be extracted.
Configuring the Label Extractor
Configuring the Value Extractor
The goal when setting up this part of the Labeled Value extractor is not to collect one individual text segment, but rather to collect the type of information you are trying to extract. In this example, we are referencing an extractor designed to collect all dates on the document. The Label Extractor then tells Grooper which individual text segment to return.
The Final Result
Once your Label Extractor and Value Extractor are set up, you should see some results. In this example, because "Order Date" was set as the Label Extractor, the date returned is the one closest to that label. If both extractors are set up properly and Grooper is still not returning the results you want, see the Advanced Setup section below for ways to improve the accuracy of your results.
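The idea of pairing a specific label with a generic value pattern can be illustrated outside of Grooper. The following is only a simplified sketch in Python, not Grooper's implementation: it works on character offsets in a single string rather than on page coordinates, and the sample text, patterns, and the function name `labeled_value` are all hypothetical.

```python
import re

# Hypothetical sample text; not taken from the article's screenshots.
TEXT = "Invoice Date: 01/15/2023   Order Date: 01/10/2023   Ship Date: 01/20/2023"

# Stand-in for the "Value Extractor": a generic pattern matching every date.
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")
# Stand-in for the "Label Extractor": the label identifying the date we want.
LABEL_PATTERN = re.compile(r"Order Date")


def labeled_value(text):
    """Return the date that starts closest after the label, or None."""
    label = LABEL_PATTERN.search(text)
    if label is None:
        return None
    # Keep only date candidates that begin after the label ends.
    candidates = [m for m in DATE_PATTERN.finditer(text) if m.start() >= label.end()]
    if not candidates:
        return None
    # Of the remaining candidates, the one nearest to the label wins.
    nearest = min(candidates, key=lambda m: m.start() - label.end())
    return nearest.group()


print(labeled_value(TEXT))  # prints 01/10/2023
```

The value pattern alone matches three dates; the label is what narrows the result to one.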
Advanced Setup
So, you have your Label Extractor and Value Extractor properly configured, but you are still not getting the results you want. There are a couple of properties you can configure to try to improve your results: Maximum Distance and Maximum Noise.
Maximum Distance
Once the Label Extractor and the Value Extractor have been set, check whether the correct text segment is being extracted. Sometimes Grooper will produce an undesirable result because of the layout of the document. When this happens, the Maximum Distance property can be adjusted to improve your results. By default, the Maximum Distance is set to 2in to the right and 2in to the bottom. This setting tells Grooper to look for a value located within two inches to the right of and two inches below the label. If Grooper finds one result within that 2in by 2in region, it returns that result as the value. If it finds multiple results within the region, it returns the closest one.
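The region-and-closest-match behavior described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Grooper's actual algorithm: the `Box` type, the function `closest_in_region`, the sample coordinates, and the choice to measure distance between top-left corners are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """A text segment and its top-left corner, in inches (assumed convention)."""
    text: str
    x: float
    y: float


def closest_in_region(label: Box, candidates: List[Box],
                      max_right: float = 2.0,
                      max_down: float = 2.0) -> Optional[Box]:
    """Return the candidate nearest the label inside the search region."""
    # Keep candidates no more than max_right to the right of the label
    # and no more than max_down below it.
    in_region = [
        c for c in candidates
        if 0 <= c.x - label.x <= max_right and 0 <= c.y - label.y <= max_down
    ]
    if not in_region:
        return None
    # Squared straight-line distance is enough to rank candidates.
    return min(in_region,
               key=lambda c: (c.x - label.x) ** 2 + (c.y - label.y) ** 2)


label = Box("Order Date", 1.0, 1.0)
below = Box("01/15/2023", 1.2, 1.5)     # a date just under the label
to_right = Box("01/10/2023", 2.5, 1.0)  # a date on the same line, to the right

# With the 2in by 2in default, the date below is closer and wins.
print(closest_in_region(label, [below, to_right]).text)                # 01/15/2023
# Limiting the search to the right only excludes the date below.
print(closest_in_region(label, [below, to_right], max_down=0.0).text)  # 01/10/2023
```

In this sketch, shrinking the region changes which candidates are eligible at all, which is why narrowing it can fix a wrong match without touching either extractor.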
In the picture to the right, we can see that the extractor is working appropriately. With this layout of the Order Date and Invoice Date, Grooper is able to pick out the Order Date with the default settings.
However, in this picture, Grooper is extracting the wrong value. The spatial layout of a document often requires additional configuration. So, why is Grooper grabbing the wrong value?
If we change the Maximum Distance to just "2in" to the right, we see that the zone changes. Now only the correct date is within range of the Maximum Distance.
With the Maximum Distance properly configured, we see that Grooper now extracts the correct value.
Maximum Noise
After you have your Label Extractor and Value Extractor properly configured, you may find that you are still not getting results. This may be due to noise characters between your label and your value. Noise characters are any alphanumeric characters that come between your label and value. By default, the Labeled Value extractor allows up to 5 noise characters between the label and the value. If there are more than 5 noise characters, you may not get results until the Maximum Noise property is increased.
In this example we are trying to extract the Mailing Address for Tarah Dactyl. First, we have created a List Match extractor as our Label Extractor for "Mailing Address".
Next, we have referenced a pre-configured extractor designed to collect all generic text on the document.
However, with both of those properly configured, we are still not getting the desired result. To the right of the label, you can see that we have some noise characters: "Street" (highlighted in yellow). If we count the letters in the word "Street", we can see that we have 6 noise characters. That is more than the default of 5 allowed noise characters.
Here we have increased the Maximum Noise to 6, and as you can see, we are now getting the result that we want.
NOTE: It is best practice not to set your Maximum Noise too high. In some cases it could produce undesirable results, because the noise limit often helps prevent false positives.
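The noise check from this example can be sketched in Python. This is only an illustration of the counting rule as the article describes it (alphanumeric characters between the label and the value); the function names and the sample string are hypothetical, and Grooper's own counting may differ in detail.

```python
def noise_count(between: str) -> int:
    """Count the alphanumeric characters found between a label and a value."""
    return sum(ch.isalnum() for ch in between)


def within_noise_limit(between: str, max_noise: int = 5) -> bool:
    """True when the text between label and value stays within the limit."""
    return noise_count(between) <= max_noise


# Hypothetical text sitting between the "Mailing Address" label and the value.
between = " Street: "

print(noise_count(between))                      # 6 (spaces and ":" are not counted)
print(within_noise_limit(between))               # False with the default of 5
print(within_noise_limit(between, max_noise=6))  # True once the limit is raised
```

Raising the limit only as far as the document requires keeps the check useful as a guard against unrelated values further along the line.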