Data Extractor (Concept)

From Grooper Wiki
 
Latest revision as of 11:42, 28 August 2025

This article is about the current version of Grooper.

Note that some content may still need to be updated.


Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

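Data extractors (or simply "extractors") are used in a variety of ways, including (but not limited to) collecting field data during extraction, assisting document classification, and assisting document separation.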


Extractors are highly configurable in terms of how data is targeted, how it is ordered and sorted, how text is pre-processed, what tolerance the extractor has for "fuzzy" results, how results are post-processed and more. However, at their core, extractors are simply tools used to parse text data from a larger text source (e.g. a single field from a whole document).

Value Extractors

Value Extractors define an operation that reads data from the text (and sometimes visual) content of a page or document. There are over 20 unique Value Extractors, each using specialized logic to return results. Value Extractors are consumed by multiple higher-level objects in Grooper (such as Data Elements, Extractor Nodes, various Activities and more) to perform a diverse set of document processing duties.

  • Value Extractors return a list of one or more "data instances". Data instances contain both the value and its page location, which allows Grooper to highlight results in a Document Viewer.

"Value Extractors" are the primitive operators in Grooper that perform "data extraction". They read data values from the text (and sometimes visual content) of a page or document. This is the foundation of Grooper's capability to locate and collect document data, such as dates, numbers, entity names, barcodes, checkboxes, text labels, paragraphs, and more.


Value Extractors (or "extractors" for short) are executed when various Activities run through a Batch Process.

  • Example: Extractors can be used to collect a Data Model's data during an Extract step.
  • Example: Extractors can assist in document classification during a Classify step.
  • Example: Extractors can assist in document separation during a Separate step.


To accomplish these ends, Value Extractors are added to and configured in various nodes and their properties in Grooper.

  • Example: A Data Field's "Value Extractor" property.
  • Example: A Value Reader's "Extractor" property.
  • Example: Miscellaneous Data Type properties ("Local Extractor", "Input Filter", "Exclusion Extractor" and "Subtraction Extractor")


There are around 100 different configuration properties that allow you to configure an extractor to return data from a document. There are several different kinds of Value Extractors too. Think of these as specialized tools in your data extraction toolkit.

For any extractor property, you can choose one of Grooper's available Value Extractors (or just "extractor" for short).

  • Example: The Pattern Match extractor returns results that match a regular expression pattern, such as a pattern that matches various date/time formats.
  • Example: The List Match extractor returns results that match items in a list, such as a state name from a list of US states.
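The two examples above can be sketched in plain Python. This is only an illustration of the idea, not Grooper's implementation: the `DataInstance` class and the `pattern_match` and `list_match` functions are hypothetical names, and a character offset stands in for the page location Grooper records.

```python
import re
from dataclasses import dataclass


@dataclass
class DataInstance:
    """Hypothetical stand-in for a data instance: a value plus where it was found."""
    value: str
    start: int  # character offset of the first character
    end: int    # character offset just past the last character


def pattern_match(text: str, pattern: str) -> list[DataInstance]:
    """Return one instance per regular expression match, in document order."""
    return [DataInstance(m.group(), m.start(), m.end())
            for m in re.finditer(pattern, text)]


def list_match(text: str, items: list[str]) -> list[DataInstance]:
    """Return one instance per occurrence of any list item (whole words, any case)."""
    # Try longer items first so "West Virginia" is preferred over "Virginia".
    ordered = sorted(items, key=len, reverse=True)
    alternation = "|".join(re.escape(item) for item in ordered)
    return [DataInstance(m.group(), m.start(), m.end())
            for m in re.finditer(rf"\b(?:{alternation})\b", text, flags=re.IGNORECASE)]


text = "Shipped 03/14/2024 from Oklahoma to West Virginia, due 04/01/2024."
dates = pattern_match(text, r"\b\d{2}/\d{2}/\d{4}\b")
states = list_match(text, ["Oklahoma", "Virginia", "West Virginia"])
```

Each returned instance keeps its offsets, so a caller can map the value back to its position in the source text, much as Grooper highlights results in a Document Viewer.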

Value Extractor types in Grooper

"Value Extractor" is a base class of 24 unique extractor types. This includes:

Text parsing extractors

These extractors primarily rely on regular expressions, lists of values (such as a Lexicon of field labels) or other forms of text parsing to return values.

  • Please note, regular expressions and other forms of text parsing are the "bread and butter" of how Grooper data extraction works. Other extractors may also utilize regex or other forms of text parsing as part of their configuration. These extractors just rely on it more heavily.

OMR extractors

These extractors allow you to return values using optical mark recognition. These are useful for extracting values on documents that use checkboxes to detail information.

Barcode extractors

These extractors allow you to return a value encoded in a barcode.

Zonal extractors

These extractors extract data by drawing a logical rectangle somewhere on a document. These are useful for extracting values on highly structured documents where field values are consistently located in the same position on the page for every document.
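As a rough sketch of the zonal idea (not Grooper's implementation), the Python below keeps only the words whose center point falls inside a rectangle. The `Word` class, the `extract_zone` function, and the inch-based coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical OCR word with its bounding box (units are arbitrary, e.g. inches)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def extract_zone(words, left, top, right, bottom):
    """Return the text of all words whose center falls inside the rectangle."""
    hits = [wd for wd in words
            if left <= wd.x + wd.w / 2 <= right
            and top <= wd.y + wd.h / 2 <= bottom]
    # Approximate reading order: top to bottom, then left to right.
    hits.sort(key=lambda wd: (wd.y, wd.x))
    return " ".join(wd.text for wd in hits)


page = [
    Word("Invoice", 0.5, 0.5, 0.8, 0.2),
    Word("No:", 1.4, 0.5, 0.4, 0.2),
    Word("INV-1001", 6.0, 0.5, 1.0, 0.2),
    Word("Total", 0.5, 9.0, 0.6, 0.2),
]
invoice_number = extract_zone(page, left=5.5, top=0.4, right=7.5, bottom=0.8)
```

Because the rectangle is fixed, this approach only works when every document places the value in the same spot, which is why zonal extraction suits highly structured forms.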

LLM-based extractors

These extractors use generative AI to return results. The document text and other prompts defined by the user are fed to a large language model (LLM) for analysis. To utilize these extractors, you must add an "LLM Connector" Repository Option to the Grooper Root.

Text analysis extractors (experimental)

These extractors use the Azure AI Language cloud service to analyze document text. To utilize these extractors, you must add a "Text Analysis" Repository Option to the Grooper Root.

  • BE AWARE: These features are still in development and should be considered "experimental". They have not been extensively tested or implemented.
  • Entity Recognition
  • Key Phrase Recognition
  • PII Entity Recognition

Miscellaneous extractors

These extractors have specialized uses and don't fit well into the other categories.

The Reference extractor

The Reference extractor is unique among the extractor types. It allows users to reference the results of an extractor node (such as a Data Type or Value Reader).

Extractor Nodes

Types of Extractor Nodes

There are three types of Extractor Nodes in Grooper:

Value Reader
Data Type
Field Class
  • Advances in large language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with far less setup.

All three of these node types perform a similar function. They return data from documents. However, they differ in their configuration and utility.

Extractor Nodes are tools to extract/return document data. But they don't do anything by themselves. They are used by extractor properties on other nodes in Grooper.

  • Example: When Extract runs on a document, Data Elements (such as Data Fields) are ultimately used to collect document data.
    It is a Data Field's "Value Extractor" property that does this. You may configure this property with an Extractor Node to do so.
  • Example: When executed in a Separate step, the Pattern-Based Separation provider is ultimately what identifies patterns to separate Batch Pages into Batch Folders.
    It is its "Value Extractor" property that does this. You may configure this property with an Extractor Node to do so.
  • Example: When Classify runs on a document, a Document Type's "Positive Extractor" property will be used to assign a Batch Folder the Document Type if it returns a value.
    You may configure the Positive Extractor with an Extractor Node to do so.
  • And so on and so on for any extractor property for any node in Grooper.


To that end, Extractor Nodes serve three purposes:

  1. To be re-usable units of extraction
  2. To collate data
  3. To leverage machine learning algorithms to target data in the flow of text

Re-usability

Extractor nodes are meant to be referenced either by other extractor nodes or, importantly, by Data Elements such as Data Fields in a Data Model.

For example, an individual Data Field can be configured on its own to collect a date value, such as the "Received Date" on an invoice. However, what if another Data Field is collecting a different date format, like the "Due Date" on the same invoice? In this case you would create one extractor node, like a Value Reader, to collect any and all date formats. You could then have each Data Field reference that single Value Reader and further configure each individual Data Field to differentiate their specific date value.
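The re-use pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Grooper's configuration model: `date_reader` plays the role of the shared Value Reader, `field_value` plays the role of a Data Field that narrows the shared results by a nearby label, and the label-proximity rule (`max_gap`) is invented for the example.

```python
import re

# One shared pattern covering two date formats (slash dates and ISO dates).
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b|\b\d{4}-\d{2}-\d{2}\b")


def date_reader(text):
    """Shared extractor: return (value, start_offset) for every date found."""
    return [(m.group(), m.start()) for m in DATE_PATTERN.finditer(text)]


def field_value(text, label, reader=date_reader, max_gap=20):
    """Field-level narrowing: keep the first shared result that closely follows the label."""
    for label_match in re.finditer(re.escape(label), text, flags=re.IGNORECASE):
        for value, start in reader(text):
            # Accept a date that begins within max_gap characters after the label.
            if 0 <= start - label_match.end() <= max_gap:
                return value
    return None


invoice = "Received Date: 3/02/2024   Invoice Total: $120.00   Due Date: 2024-04-01"
received = field_value(invoice, "Received Date")
due = field_value(invoice, "Due Date")
```

Both fields call the same reader, so adding a new date format means changing one pattern rather than every field.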

Data collation

Another example would be configuring a Data Type to target entire rows of information within a table of data. Several Value Reader nodes could be made as children of the Data Type, each targeting a specific value within the table row. The parent Data Type would then collate the results of its child Value Reader nodes into one result. A Data Table would then reference the Data Type to collect the appropriate rows of information.
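A simplified sketch of this kind of collation is shown below. The child patterns, the `collate_rows` function, and the rule that a row is kept only when every child returns a value are assumptions for illustration. Grooper's Collation Providers are more capable than this.

```python
import re

# Hypothetical child extractors, one per value in a line-item row.
CHILD_READERS = {
    "quantity": re.compile(r"^\s*(\d+)\b"),
    "description": re.compile(r"^\s*\d+\s+(.+?)\s+\$"),
    "amount": re.compile(r"\$(\d+\.\d{2})\s*$"),
}


def collate_rows(text):
    """Combine child results into one row per line; skip lines where any child fails."""
    rows = []
    for line in text.splitlines():
        row = {}
        for name, pattern in CHILD_READERS.items():
            m = pattern.search(line)
            if m is None:
                break  # this line is not a complete row
            row[name] = m.group(1)
        else:
            # Runs only when no child failed, so the row is complete.
            rows.append(row)
    return rows


table = """Qty  Description        Amount
2    Widget, large      $19.98
10   Mounting bracket   $45.00
Subtotal                $64.98"""
rows = collate_rows(table)
```

The header and subtotal lines are dropped because the quantity child finds nothing on them, which mirrors how a parent only returns a result when its children line up.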

Machine learning

Many documents, such as legal documents, contain important pieces of information buried within the flow of text. These types of documents and the data they contain require an entirely different approach to extracting data than a highly structured document like an invoice. For these situations you can use a "trainable" extractor known as a Field Class to leverage machine learning algorithms to target important information.

  • Advances in large language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with far less setup.

Extractor Nodes vs Value Extractors

Extractor nodes should not be confused with "Value Extractors". There are many places in Grooper where extraction logic can be applied for one purpose or another. In these cases a Value Extractor is chosen to define the logic required to return a desired value.

In fact, the Extractor Nodes themselves will leverage specific Value Extractors to define their logic.

  • Example: "Value Readers" are configured using a single property "Extractor". This property specifies a single Value Extractor which determines how data is extracted. Value Readers are essentially an encapsulation of a single Value Extractor configuration that can be reused by multiple other extraction elements and properties, such as Data Fields and Data Types.
  • Example: "Data Types" have several properties that can be configured with Value Extractors, including their "Local Extractor", "Input Filter", and "Exclusion Extractor" properties.
  • Example: "Field Classes" cannot function without their "Value Extractor" and "Feature Extractor" properties configured, both of which specify a Value Extractor.

However, Extractor Nodes are used when you need to reference them for their designated strengths:

  • re-usability
  • collation
  • machine learning

Related Node Types

Value Reader

Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions).

  • Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.

Data Type

Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

  • For example, if you're extracting a date that could appear in multiple formats within a document, you'd use various extractor nodes (each capturing a different format) as children of a Data Type.

The Data Type also defines how to collate results from one or more extractors into a referenceable output. The simplest type of collation (Individual) would just return all individual extractors' results as a list of results.
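The simplest case can be sketched in Python: several child extractors, one per date format, whose results are returned together as one flat list. The child names, the patterns, and the choice to order results by position in the text are assumptions for illustration, not a description of how Grooper orders its output.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Hypothetical child extractors, each capturing one date format.
CHILD_EXTRACTORS = {
    "slash": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "iso": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "long": re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b"),
}


def collate_individual(text):
    """Return every child result as its own entry, ordered by position in the text."""
    results = []
    for name, pattern in CHILD_EXTRACTORS.items():
        results.extend((m.start(), m.group(), name) for m in pattern.finditer(text))
    # Sorting the tuples orders them by start offset first.
    return [(value, name) for _, value, name in sorted(results)]


text = "Signed March 5, 2024; effective 2024-04-01; expires 4/1/2025."
all_dates = collate_individual(text)
```

Anything that references this collated output sees one list of dates and does not need to know which child produced each value.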

Data Types are also used for recognizing complex 2D data structures, like address blocks or table rows. Different collation methods would be used in these cases to combine results in different ways.

Field Class

Field Classes are NLP (natural language processing) based extractor nodes. They find values based on some natural language context near that value. Values are positively or negatively associated with text-based "features" nearby by training the extractor. During extraction, the extractor collects values based on these training weightings.

  • Field Classes are most useful when attempting to find values within the flow of natural language.
  • Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler "extractor nodes" like Value Readers or Data Types.
  • Advances in large language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with far less setup.
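To make the idea of feature weightings concrete, here is a toy Python sketch. It is not the algorithm Field Classes use: the weights are set by hand where a real Field Class would learn them from training, and the `FEATURE_WEIGHTS` table, the 30-character context window, and both function names are assumptions for illustration.

```python
import re

# Hand-set weights standing in for what training would learn (assumption).
FEATURE_WEIGHTS = {"effective": 2.0, "commence": 1.5, "terminate": -2.0, "born": -1.5}


def score_candidates(text, value_pattern, window=30):
    """Score each candidate value by the weighted feature words just before it."""
    scored = []
    for m in re.finditer(value_pattern, text):
        context = text[max(0, m.start() - window):m.start()].lower()
        features = re.findall(r"[a-z]+", context)
        score = sum(FEATURE_WEIGHTS.get(word, 0.0) for word in features)
        scored.append((m.group(), score))
    return scored


def extract(text, value_pattern):
    """Keep only candidates whose surrounding features score positively."""
    return [value for value, score in score_candidates(text, value_pattern) if score > 0]


clause = ("This agreement shall be effective as of 01/15/2024 "
          "and shall terminate on 01/14/2025.")
effective_dates = extract(clause, r"\b\d{2}/\d{2}/\d{4}\b")
```

Both dates match the same value pattern, so only the nearby wording tells them apart. That is the situation Field Classes were designed for.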