2023:Labeled Value (Value Extractor)

From Grooper Wiki
Revision as of 09:56, 27 August 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Labeled Value is a Value Extractor that identifies and extracts a value next to a label. This is one of the most commonly used extractors to extract data from structured documents (such as a standardized form) and static values on semi-structured documents (such as the header details on an invoice).

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

Labeled Value is configured using two extractors: the Label Extractor and the Value Extractor. When the Label Extractor is set, Grooper uses spatial context to determine what value to return (for more on spatial context, see the Data Context wiki article).

Labeled Value extracts information similarly to collating a Key-Value Pair. Unlike a Key-Value Pair, the extractor is self-contained in one object. There is no need to set one object as a "Key" and another object as a "Value". Instead, both the "Label" and the "Value" can be set on one object.

How To

Selecting the Extractor

Configuring on a Value Reader

  1. In your Node Tree, create or select a Value Reader.
    • Visit the Value Reader Wiki Page for instructions on how to create a Value Reader.
  2. Select the "Value Reader" tab.
  3. Click the drop down list next to Extractor and select Labeled Value.

Configuring on a Data Type

  1. In your Node Tree, create or select a Data Type.
    • Visit the Data Type Wiki Page for instructions on how to create a Data Type.
  2. Select the "Data Type" tab.
  3. Click the drop down list next to Local Extractor and select Labeled Value.


Configuring on Other Object Types

The Labeled Value extractor can be used on many object types. Any object that has an extractor property can be configured with a Labeled Value.

The configuration process on other objects is the same as for Value Reader and Data Type objects. Simply select Labeled Value as your extractor type.

Examples where you can use a Labeled Value include:

  • A Data Type's Value Extractor property
  • A Document Type's Positive Extractor property
  • The Labeled Value extractor's Label Extractor property (Yes, you can configure a Labeled Value within a Labeled Value).
  • The Pattern-Based Separation Provider's Value Extractor property


Basic Setup

Once you select Labeled Value as your extractor, you will need to set both your label and your value to be extracted.

Configuring the Label Extractor

  1. Once Labeled Value is set as the extractor on the object you are configuring, click on the "Tester" tab.
  2. Click the drop down next to the Label Extractor property and select an extractor to use.
    • For the purposes of this example, we are going to use a List Match extractor. However, any extractor can be used to capture the desired label.


  3. Configure your chosen extractor to collect the desired label.

Configuring the Value Extractor

  1. Once Labeled Value is set as the extractor on the object you are configuring, click on the "Tester" tab.
  2. Click the drop down next to the Value Extractor property and select an extractor to use.
    • For the purposes of this example, we are going to use a Reference to another extractor that has been previously configured. However, any extractor can be used to capture the desired values.

FYI

The extractor we are referencing is configured to return dates.

The goal when setting up the Value Extractor is not to collect one specific text segment, but to collect the type of information you are trying to extract. The Label Extractor tells Grooper which individual text segment to return. In this case, that is the "Invoice Date".


  3. Configure your chosen extractor to collect the desired values.
    • When both the Label Extractor and Value Extractor are set up properly, the label will be outlined in blue while the extracted value will be highlighted in green.


The Final Result

Once your Label Extractor and Value Extractor are set up, you should see some results. As you can see in this example, because "Order Date" was set as the label, the date returned is the one closest to that label.
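
The division of labor between the two extractors can be sketched in a few lines of Python. This is only an illustration, not Grooper's implementation: the page text, the regular expressions, and the `labeled_value` function are all hypothetical, and "closest" is approximated here by character offset on a single line, whereas Grooper works from the spatial layout of the page.

```python
import re

# Hypothetical page text; the labels and dates are made up for illustration.
PAGE_TEXT = "Invoice Date: 7/20/2022   Order Date: 7/25/2022"

# Stand-in for the Label Extractor: finds the label we care about.
LABEL_PATTERN = re.compile(r"Order Date")
# Stand-in for the Value Extractor: matches every date, not one specific date.
VALUE_PATTERN = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")


def labeled_value(text):
    """Return the date that starts closest after the label, or None."""
    label = LABEL_PATTERN.search(text)
    if label is None:
        return None
    # The value pattern matches all dates; the label decides which one to keep.
    candidates = [m for m in VALUE_PATTERN.finditer(text) if m.start() >= label.end()]
    if not candidates:
        return None
    nearest = min(candidates, key=lambda m: m.start() - label.end())
    return nearest.group()


print(labeled_value(PAGE_TEXT))  # prints 7/25/2022
```

Note that the value pattern on its own would return both dates. It is the label that narrows the result to one.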

If you set your Label Extractor and Value Extractor up properly and Grooper is still not returning the results you want, look below at the Advanced Setup section. There you will find information about more properties you can configure to increase the accuracy of your results.



Advanced Setup

So, you have your Label Extractor and Value Extractor properly configured, but you are still not getting the results you want. There are a couple of properties you can configure to try and improve your results: Maximum Distance and Maximum Noise.

Maximum Distance

Sometimes Grooper will produce an undesirable result because of the layout of the document. The Labeled Value extractor will not return a result if the value is farther away from the label than Grooper expects. When this happens, the Maximum Distance property can be configured to improve your results.

By default, the Maximum Distance is set at 2in to the right and 2in to the bottom. This tells Grooper to look for a value located within two inches to the right of the label and within two inches below it.

If Grooper finds a result within that 2in x 2in region, it will return that result as the value. If Grooper finds multiple results within that region, multiple factors will be considered when returning a result, including proximity to the label and the number of "noise" characters present (for more information on noise, see the Maximum Noise section).


In the picture to the right, we can see that the extractor is working appropriately. With the layout of the Order Date and Invoice Date, Grooper is able to pick out the Order Date with the default settings. The desired value falls within the Maximum Distance of 2x2 inches to the right and bottom of the label.

However, in this picture, Grooper is extracting the wrong value. There are many times where the spatial layout of a document would require additional configuration. So, why is Grooper grabbing the wrong value?


With the default Maximum Distance selected, all values within this zone are considered. Both the Invoice Date and the Order Date are within the range of the Maximum Distance. Since the Invoice Date is closer to the "Order Date" label, Grooper is returning this as the value.

If we change the Maximum Distance to just "2in" to the right, we see that the zone changes. Only the correct date is within the range of the Maximum Distance.

With the Maximum Distance properly configured, we see that Grooper now extracts the correct value.
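
The effect of narrowing the zone can be sketched as follows. This is a simplified model under stated assumptions, not Grooper's actual geometry: the `Box` class, the coordinates, the "candidate center must fall inside the zone" test, and the edge-to-edge `gap` measure are all hypothetical, and the noise factor is ignored.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in inches; y grows downward, as on a page."""
    text: str
    left: float
    top: float
    right: float
    bottom: float


def in_zone(label, cand, max_right, max_bottom):
    """True if the candidate's center lies in the zone right of / below the label."""
    cx = (cand.left + cand.right) / 2
    cy = (cand.top + cand.bottom) / 2
    return (label.left <= cx <= label.right + max_right
            and label.top <= cy <= label.bottom + max_bottom)


def gap(label, cand):
    """Edge-to-edge distance from the label to the candidate (assumed measure)."""
    dx = max(0.0, cand.left - label.right)
    dy = max(0.0, cand.top - label.bottom)
    return hypot(dx, dy)


def pick_value(label, candidates, max_right=2.0, max_bottom=2.0):
    """Return the nearest candidate inside the zone, or None if there is none."""
    eligible = [c for c in candidates if in_zone(label, c, max_right, max_bottom)]
    return min(eligible, key=lambda c: gap(label, c), default=None)


# Hypothetical layout: the wanted date sits to the right of the label,
# while another date sits just below it.
label = Box("Order Date", 1.0, 3.0, 1.8, 3.2)
order_value = Box("7/25/2022", 3.0, 3.0, 3.7, 3.2)
invoice_value = Box("7/20/2022", 1.0, 3.3, 1.7, 3.5)
candidates = [order_value, invoice_value]

print(pick_value(label, candidates).text)                  # prints 7/20/2022
print(pick_value(label, candidates, max_bottom=0.0).text)  # prints 7/25/2022
```

With the default zone both dates qualify and the nearer, unwanted one wins. Removing the downward distance leaves only the date on the label's own row.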

Maximum Noise

After you have your Label Extractor and Value Extractor properly configured you may find that you are still not getting results. This could possibly be due to noise characters between your label and your value.

Noise characters are any alphanumeric characters that fall between your label and value. By default, the Labeled Value extractor allows up to 5 noise characters between the label and the value. If there are more than 5 noise characters, you may not get results until the Maximum Noise property is increased.

In the picture to the right, with the Maximum Distance set to the default of 2 inches to the right and 2 inches to the bottom, we get two potential value hits.

In this case you can see we have 18 characters of noise between the "Order Date" label and the 7/20/2022 date. If the Maximum Noise property is left at the default of 5 noise characters, Grooper will not return this value since 18 is greater than 5.

FYI

Notice that the colon next to the label and the forward slashes in the date are not included in the noise character count. Only alphanumeric characters count as noise. Punctuation and whitespace do not.

Instead, Grooper will see that there are zero noise characters between the "Order Date" label and the 7/25/2022 date. Since 0 is less than the default of 5 noise characters, Grooper will return this value.
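
The counting rule described above can be sketched in Python. The line of text below is hypothetical and was chosen so that the counts match the 0 and 18 quoted above. The function names are illustrative, and treating the text between label and value as a plain string is an assumption, not a description of Grooper's internals.

```python
def count_noise(between):
    """Count alphanumeric characters; punctuation and whitespace are ignored."""
    return sum(ch.isalnum() for ch in between)


def within_noise_limit(between, max_noise=5):
    """True if the text between label and value has at most max_noise noise characters."""
    return count_noise(between) <= max_noise


# Hypothetical line of text (an assumption, not taken from the sample documents).
line = "Order Date: 7/25/2022 Invoice Date: 7/20/2022"
label = "Order Date"
start = line.index(label) + len(label)

to_first = line[start:line.index("7/25/2022")]   # ": "
to_second = line[start:line.index("7/20/2022")]  # ": 7/25/2022 Invoice Date: "

print(count_noise(to_first), within_noise_limit(to_first))    # prints 0 True
print(count_noise(to_second), within_noise_limit(to_second))  # prints 18 False
```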

Let's look at an example of how this works in Grooper. In this example we are trying to extract the Mailing Address for Tarah Dactyl. First, we have created a List Match extractor as our Label Extractor for "Mailing Address".

Next, we have referenced a pre-configured extractor designed to collect all generic text segments on the document.

However, with both of those properly configured, we are still not getting the desired result. To the right of the label, you can see that we have some noise characters: "Street" (highlighted in yellow). If we count the letters in the word "Street" we can see that we have 6 noise characters. That is more than the default limit of 5 noise characters.

Here we have increased the Maximum Noise to 6, and as you can see, we are now getting the result that we want.

FYI

It is best practice not to set your Maximum Noise too high. In some cases it could produce undesirable results. A low noise limit is often helpful in avoiding false positives.
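
The trade-off can be illustrated with a small sketch. The candidate values and the text between them and the label are hypothetical, and the filter is a simplified stand-in for the Maximum Noise check rather than Grooper's actual logic.

```python
def count_noise(between):
    """Count alphanumeric characters; punctuation and whitespace are ignored."""
    return sum(ch.isalnum() for ch in between)


# Each pair is (candidate value, text between the label and that candidate).
# Both entries are made up for illustration.
candidates = [
    ("123 Main St", "Street"),                              # 6 noise characters
    ("PO Box 99", "Street 123 Main St Billing Address"),    # 29 noise characters
]


def accepted(candidates, max_noise):
    """Return the candidate values whose noise count is within the limit."""
    return [value for value, between in candidates if count_noise(between) <= max_noise]


for limit in (5, 6, 40):
    print(limit, accepted(candidates, limit))
# prints:
# 5 []
# 6 ['123 Main St']
# 40 ['123 Main St', 'PO Box 99']
```

Raising the limit from 5 to 6 admits the wanted value. Raising it much further also admits a distant candidate that is probably not the value for this label.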




See Also

Glossary

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Data Context: Data Context refers to contextual information used to extract data, such as a label that identifies the value you want to collect.

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift, and manipulate results further.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Extractor Type:

Key-Value Pair: Key-Value Pair is a Collation Provider option for Data Type extractors. Key-Value Pair matches instances where a key is paired with a value on the document in a specific layout. Note: Key-Value Pair is an older technique in Grooper. In most cases, the Labeled Value extractor is preferable to Key-Value Pair collation.

Labeled Value: Labeled Value is a Value Extractor that identifies and extracts a value next to a label. This is one of the most commonly used extractors to extract data from structured documents (such as a standardized form) and static values on semi-structured documents (such as the header details on an invoice).

List Match: List Match is a Value Extractor designed to return values matching one or more items in a defined list. By default, the List Match extractor does not use or require regular expressions, but it can be configured to use regular expression syntax.

Node Tree: The Node Tree is the hierarchical list of Grooper node objects found in the left panel in the Design Page. It is the basis for navigation and creation in the Design Page.

Pattern-Based Separation: Pattern-Based Separation is a Separation Provider that creates a new document folder every time a value returned by a defined pattern is encountered on a page.

Pattern-Based: Pattern-Based is a Collation Provider option for Data Type extractors. Pattern-Based uses regular expressions to sequence returned results into a final result set.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects, such as Content Models, Batch Processes, and profile objects, are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Reference: Reference is a Value Extractor used to reference an Extractor Node. This allows users to create re-usable extractors and use the more complex Data Type and Field Class extractors throughout Grooper.

Separation: Separation is the process of taking an unorganized Batch of loose Batch Pages and organizing them into documents represented by Batch Folders in Grooper. This is done so Grooper can later assign a Document Type to each document folder in a process known as "classification".

Value Reader: Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions).

  • Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.