2023:List Match (Value Extractor)

From Grooper Wiki
Revision as of 09:54, 27 August 2024 by Randallkinard (talk | contribs)

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


List Match is a Value Extractor designed to return values matching one or more items in a defined list. By default, the List Match extractor does not use or require regular expressions, but it can be configured to use regular expression syntax.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

The List Match is one of the simplest extractors used in Grooper. It is designed to return values matching one or more items in a defined list. This can be used to extract specific words or full phrases contained within a document. A List Match extractor returns an exact match, including any spaces, numbers, punctuation, or special characters.
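As a rough illustration of the matching behavior described above, the following Python sketch treats each list entry as a literal string, so spaces, punctuation, and special characters must all be present for a hit. This is not Grooper's implementation. The function name `list_match`, the sample text, and the case-sensitive comparison are assumptions made for the example.

```python
import re


def list_match(text, entries):
    """Return (value, start_index) for every literal occurrence of an entry."""
    # Try longer entries first so a longer phrase wins over a shorter
    # entry that happens to be its prefix.
    ordered = sorted(entries, key=len, reverse=True)
    # re.escape makes every character literal (periods, parentheses, spaces).
    pattern = re.compile("|".join(re.escape(e) for e in ordered))
    return [(m.group(0), m.start()) for m in pattern.finditer(text)]


# Hypothetical sample text and list entries.
text = "Remit to: Acme Corp. (USA), 42 Main St."
hits = list_match(text, ["Acme Corp. (USA)", "42 Main St."])
print(hits)
```

Because every entry is escaped, an entry such as `Acme Corp. (USA)` does not match `Acme Corp (USA)`: the period is part of the value being searched for, not a wildcard.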

To configure a List Match, you can input the desired extracted value as a Local Entry or reference a pre-configured Lexicon.

Unlike a Pattern Match, the List Match extractor does not use or require regular expressions by default, but regex can be enabled in the properties menu. As with a Pattern Match, Prefix and Suffix Patterns can be added to help anchor the list item and limit the number of false positives extracted.
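The anchoring idea can be sketched in Python as follows. The list item stays literal, while the prefix and suffix are regular expressions that must match on either side of it, and only the list item itself is returned. The function `anchored_list_match`, its parameters, and the sample text are hypothetical, and Grooper's own handling of Prefix and Suffix Patterns may differ in detail.

```python
import re


def anchored_list_match(text, entries, prefix="", suffix=""):
    """Return list entries found where the prefix and suffix patterns also match."""
    ordered = sorted(entries, key=len, reverse=True)
    items = "|".join(re.escape(e) for e in ordered)
    # Prefix and suffix are regex; list items are literal.
    # Only the named group "value" is reported as the result.
    pattern = re.compile(
        f"(?:{prefix})(?P<value>{items})(?:{suffix})", re.MULTILINE
    )
    return [m.group("value") for m in pattern.finditer(text)]


# Hypothetical sample text: the word "Total" appears in three places.
text = "Grand Total: 110.00\nTotal: 100.00\nTotal items shipped"

unanchored = anchored_list_match(text, ["Total"])
anchored = anchored_list_match(text, ["Total"], prefix=r"^", suffix=r":")

print(len(unanchored), len(anchored))
```

Without anchors, every occurrence of the entry is returned. With a start-of-line prefix and a colon suffix, only the occurrence that is used as a field label survives, which is the false-positive reduction the paragraph above describes.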

How To

A List Match is most commonly used when configuring objects such as Value Readers or Data Types. It is great for extracting text information such as:

  • Specific company names
  • Field labels
  • Headers and Footers
  • Full phrases
  • Exact numbers

If the information you need to extract follows a specific pattern, such as a date or social security number, then it may be better to consider a different extractor like a Pattern Match.
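The difference between the two approaches can be sketched in Python. A list works when the values themselves are known ahead of time, and a pattern works when only the shape of the value is known. The sample text, company names, and date format below are assumptions for illustration and do not come from Grooper.

```python
import re

# Hypothetical sample text.
text = "Acme Corp. invoiced Globex LLC on 03/15/2024; payment due 04/14/2024."

# List-style: the values are known in advance, so enumerate them literally.
companies = ["Acme Corp.", "Globex LLC"]
list_pattern = re.compile("|".join(re.escape(c) for c in companies))
found_companies = list_pattern.findall(text)

# Pattern-style: the values are not known in advance, only their
# shape (two digits, slash, two digits, slash, four digits).
date_pattern = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")
found_dates = date_pattern.findall(text)

print(found_companies)
print(found_dates)
```

Listing every possible date would be impractical, while a regular expression for every possible company name would be unreliable, so each kind of value is matched by the approach that suits it.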

Configuring by Object Type

Configuring on a Value Reader

  1. In your Node Tree, create or select a Value Reader.
    • Visit the Value Reader Wiki Page for instructions on how to create a Value Reader.
  2. Select the "Value Reader" tab.
  3. Click the drop down list next to Extractor and select List Match.

Configuring on a Data Type

  1. In your Node Tree, create or select a Data Type.
    • Visit the Data Type Wiki Page for instructions on how to create a Data Type.
  2. Select the "Data Type" tab.
  3. Click the drop down list next to Local Extractor and select List Match.

Configuring on Other Object Types

The List Match extractor can be used on a multitude of object types. Any object that has an extractor property can be configured with a List Match.

The configuration process on other objects is the same as for Value Readers and Data Types. Simply select List Match as your extractor type.


Examples where you can use a List Match include:

  • A Data Type's Value Extractor property
  • A Document Type's Positive Extractor property
  • The Labeled Value extractor's Label Extractor property
  • The Pattern-Based Separation Provider's Value Extractor property


Local Entries vs Lexicons

A List Match can be configured using a Local Entry or a Lexicon. Local Entries are simple and easy to set up, especially if you only need to add a few entries. If you plan to extract a large number of items from a list or plan on building multiple extractors using the same list, it might be more efficient to set up a Lexicon to reference first.
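The trade-off can be pictured with a small Python sketch: one shared list is defined once and reused by several matchers, while a short extractor-specific list is kept local. All names here (`build_matcher`, `STATE_LEXICON`, the two matchers) are hypothetical, and treating the effective list as the union of shared and local entries is an assumption of this sketch rather than a statement about Grooper's internals.

```python
import re


def build_matcher(*entry_lists):
    """Compile one literal matcher from one or more lists of entries."""
    # Merge all lists, drop duplicates, and try longer entries first.
    merged = sorted(
        {e for entries in entry_lists for e in entries}, key=len, reverse=True
    )
    return re.compile("|".join(re.escape(e) for e in merged))


# Hypothetical shared list, standing in for a Lexicon referenced by
# more than one extractor.
STATE_LEXICON = ["Oklahoma", "Texas", "New Mexico"]

# Two hypothetical extractors reuse the shared list. The second also
# adds a short local list of its own.
shipping_states = build_matcher(STATE_LEXICON)
billing_states = build_matcher(STATE_LEXICON, ["Puerto Rico"])

sample = "Ship to Texas, bill to Puerto Rico"
print(shipping_states.findall(sample))
print(billing_states.findall(sample))
```

Updating the shared list in one place changes what both matchers find, which is the maintenance benefit of referencing a common list instead of retyping the same entries in several extractors.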

Configuring Local Entries

  1. For Value Readers, select the object you wish to configure and click the "Tester" tab.
    • When configuring a Data Type, first click the ellipsis button at the end of the Local Extractor property with List Match selected to bring up the editing window.
  2. Make sure the "Expressions" sub-tab is selected.
  3. Under LOCAL ENTRIES, type the desired text to be extracted.
    • Press Enter after each entry to extract multiple list items under one List Match.
  4. If needed, add a Prefix and Suffix Pattern to anchor your extraction to a regex pattern.
    • When using tabs as an anchor (\t), make sure Tab Marking is set to Enabled under Preprocessing in your "Properties" tab.
  5. Save and test your extraction.

Referencing Lexicons

  1. For Value Readers, select the object you wish to configure and click the "Tester" tab.
    • When configuring a Data Type, click the ellipsis button at the end of the Local Extractor property with List Match selected to bring up the editing window before continuing to the next step.
  2. Select the "Properties" tab.
  3. Click the arrow next to the Vocabulary property to expand its sub-properties.
  4. Click the ellipsis button at the end of the Included Lexicons property. This will open a new window where you can add pre-configured Lexicons.
  5. In the new window, click through the Projects and Folders until you find the desired Lexicon. Click the check boxes next to the desired Lexicons.
  6. Click OK to apply the Lexicons.
  7. Save and test your extraction.



See Also

Glossary

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift, and manipulate results further.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

Expressions: Expressions (not to be confused with regular expressions) are snippets of VB.NET code that expand Grooper's core functionality.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Extractor Type:

Labeled Value: Labeled Value is a Value Extractor that identifies and extracts a value next to a label. This is one of the most commonly used extractors to extract data from structured documents (such as a standardized form) and static values on semi-structured documents (such as the header details on an invoice).

Lexicon: Lexicons are dictionaries used throughout Grooper to store lists of words, phrases, weightings for Fuzzy RegEx, and more. Users can add entries to a Lexicon, Lexicons can import entries from other Lexicons by referencing them, and entries can be dynamically imported from a database using a Data Connection. Lexicons are commonly used to aid in data extraction, with the "List Match" and "Word Match" extractors utilizing them most commonly.

List Match: List Match is a Value Extractor designed to return values matching one or more items in a defined list. By default, the List Match extractor does not use or require regular expressions, but it can be configured to use regular expression syntax.

Node Tree: The Node Tree is the hierarchical list of Grooper node objects found in the left panel in the Design Page. It is the basis for navigation and creation in the Design Page.

Pattern Match: Pattern Match is a Value Extractor that extracts values from a document that match a specified regular expression, providing data collection following a known format or pattern.

Pattern-Based Separation: Pattern-Based Separation is a Separation Provider that creates a new document folder every time a value returned by a defined pattern is encountered on a page.

Pattern-Based: Pattern-Based is a Collation Provider option for Data Type extractors. Pattern-Based uses regular expressions to sequence returned results into a final result set.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects, such as Content Models, Batch Processes, and profile objects, are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Separation: Separation is the process of taking an unorganized Batch of loose Batch Pages and organizing them into documents represented by Batch Folders in Grooper. This is done so Grooper can later assign a Document Type to each document folder in a process known as "classification".

Tab Marking: Tab Marking allows you to insert tab characters into a document's text data.

Value Reader: Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions).

  • Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.