Data Extractor (Concept): Difference between revisions

From Grooper Wiki
 
{{AutoVersion}}
<blockquote>{{#lst:Glossary|Data Extractor}}</blockquote>


Data extractors (or simply "extractors") are used in a variety of ways, including (but not limited to):
* '''Collecting data''' - Performed by extractors set up in a {{IconName|Data Model}} [[Data Model]] and its child [[Data Element]]s.
* '''Classifying documents''' - Performed by extractors set up on a {{IconName|Content Model}} [[Content Model]] and/or its child {{IconName|Document Type}} [[Document Type]]s.
* '''Separating documents''' - Performed by extractors set up using various [[Separation Provider]]s.




Extractors are highly configurable in terms of how data is targeted, how it is ordered and sorted, how text is pre-processed, what tolerance the extractor has for "fuzzy" results, how results are post-processed and more.  However, at their core, extractors are simply tools used to parse text data from a larger text source (e.g. a single field from a whole document).


== Value Extractors ==
<blockquote>{{#lst:Glossary|Value Extractor}}</blockquote>
 
"Value Extractors" are the primitive operators in Grooper that perform "data extraction". They read data values from the text (and sometimes visual content) of a page or document. This is the foundation of Grooper's capability to locate and collect document data, such as dates, numbers, entity names, barcodes, checkboxes, text labels, paragraphs, and more.
 


Value Extractors (or "extractors" for short) are executed when various [[Activity|Activities]] run through a [[Batch Process]].
* Example: Extractors can be used to collect a {{IconName|Data Model}} [[Data Model]]'s data during an {{IconName|Extract}} [[Extract]] step.
* Example: Extractors can assist in document classification during a {{IconName|Classify}} [[Classify]] step.
* Example: Extractors can assist in document separation during a {{IconName|Separate}} [[Separate]] step.


To accomplish these ends, Value Extractors are added to and configured in various nodes and their properties in Grooper.
* Example: A {{IconName|Data Field}} [[Data Field]]'s "Value Extractor" property.
* Example: A {{IconName|Value Reader}} [[Value Reader]]'s "Extractor" property.
* Example: Miscellaneous {{IconName|Data Type}} [[Data Type]] properties ("Local Extractor", "Input Filter", "Exclusion Extractor" and "Subtraction Extractor").


There are around 100 different configuration properties that allow you to configure an extractor to return data from a document.  There are several different kinds of Value Extractors too. Think of these as specialized tools in your data extraction toolkit.


For any extractor property, you can choose one of Grooper's available Value Extractors (or just "extractor" for short).
* Example: The [[Pattern Match]] extractor returns results that match a regular expression pattern, such as a pattern that matches various date/time formats.
* Example: The [[List Match]] extractor returns results that match items in a list, such as a state name from a list of US states.
 
=== Value Extractor types in Grooper ===
 
"Value Extractor" is a base class of 24 unique extractor types. This includes:
 
<div style="padding-left: 1.5em">
==== Text parsing extractors ====
These extractors primarily rely on regular expression, lists of values (such as a {{IconName|Lexicon}} Lexicon of field labels) or other forms of text parsing to return values.
:*<li class="fyi-bullet">Please note, regular expressions and other forms of text parsing are the "bread and butter" of how Grooper data extraction works.  Other extractors may also utilize regex or other forms of text parsing as part of their configuration.  These extractors just rely on them more heavily.
* [[Pattern Match]]
* [[List Match]]
* [[Field Match]]


==== OMR extractors ====


These extractors allow you to return values using optical mark recognition. These are useful for extracting values on documents that use checkboxes to detail information.
* [[Labeled OMR]]
* [[Ordered OMR]]
* [[Zonal OMR]]


==== Barcode extractors ====

These extractors allow you to return a value encoded in a barcode.
* [[Find Barcode]]
* [[Read Barcode]]


==== Zonal extractors ====

These extractors extract data by drawing a logical rectangle somewhere on a document.  These are useful for extracting values on highly structured documents where field values are consistently located on the same position on the page for every document.
* [[Read Zone]]
* [[Highlight Zone]]
* [[Detect Signature]]


==== LLM-based extractors ====
 
These extractors use generative AI to return results. The document text and other prompts defined by the user are fed to a large language model (LLM) for analysis. To utilize these extractors, you must add an "[[LLM Connector]]" Repository Option to the Grooper Root.
 
* [[Ask AI]]
* [[AI Column Extractor]] (experimental)
 
==== Text analysis extractors (experimental) ====
 
These extractors use the [https://learn.microsoft.com/en-us/azure/ai-services/language-service/overview Azure AI Language] cloud service to analyze document text. To utilize these extractors, you must add a "Text Analysis" [[Repository Option]] to the Grooper Root.
:*<li class="attn-bullet"> BE AWARE: These features are still in development and should be considered "experimental". They have not been extensively tested or implemented.
 
* Entity Recognition
* Key Phrase Recognition
* PII Entity Recognition


==== Miscellaneous extractors ====
These extractors have specialized uses and don't fit well into the other categories.
* [[Query HTML]]
* [[Query XML]]
* [[Read Metadata]]
* [[Select Page]]
 
==== The Reference extractor ====
The Reference extractor is unique among the extractor types. It allows users to reference the results of an extractor node (such as a {{IconName|Data Type}} Data Type or {{IconName|Value Reader}} Value Reader).
* [[Reference]]
</div>
== Extractor Nodes==
<blockquote>{{#lst:Glossary|Extractor Nodes}}</blockquote>
<section begin="Extractor Nodes" />
=== Types of Extractor Nodes ===
There are three types of Extractor Nodes in '''Grooper''':
: {{ValueReaderIcon}} '''[[Value Reader]]'''
: {{DataTypeIcon}} '''[[Data Type]]'''
: {{FieldClassIcon}} '''[[Field Class]]'''
:*<li class="attn-bullet"> Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as [[AI Extract]]) can achieve similar results with nowhere near the amount of set up.
All three of these node types perform a similar function. They return data from documents. However, they differ in their configuration and utility.
Extractor Nodes are tools to extract/return document data. But they don't do anything by themselves. They are used by extractor properties on other nodes in Grooper.
* Example: When {{IconName|Extract}} [[Extract]] runs on a document, [[Data Element]]s (such as {{IconName|Data Field}} [[Data Field]]s) are ultimately used to collect document data.
*: ''It is a Data Field's "Value Extractor" property that does this. You may configure this property with an Extractor Node to do so.''
* Example: When executed in a {{IconName|Separate}} [[Separate]] step, the [[Pattern-Based Separation]] provider is ultimately what identifies patterns to separate Batch Pages into Batch Folders.
*: ''It is the provider's "Value Extractor" property that does this. You may configure this property with an Extractor Node to do so.''
* Example: When {{IconName|Classify}} [[Classify]] runs on a document, a {{IconName|Document Type}} [[Document Type]]'s "Positive Extractor" property will be used to assign a Batch Folder the Document Type if it returns a value.
*: ''You may configure the Positive Extractor with an Extractor Node to do so.''
* And so on and so on for any extractor property for any node in Grooper.
To that end, Extractor Nodes serve three purposes:
# To be re-usable units of extraction
# To collate data
# To leverage machine learning algorithms to target data in the flow of text
#*<li class="attn-bullet"> Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as [[AI Extract]]) can achieve similar results with nowhere near the amount of set up.
<div style="padding-left: 1.5em">
=== Re-usability ===
Extractor nodes are meant to be referenced either by other extractor nodes or, importantly, by Data Elements such as '''Data Fields''' in a '''Data Model'''.
For example, an individual '''Data Field''' can be configured on its own to collect a date value, such as the "Received Date" on an invoice. However, what if another '''Data Field''' is collecting a different date format, like the "Due Date" on the same invoice? In this case you would create one extractor node, like a '''Value Reader''', to collect any and all date formats. You could then have each '''Data Field''' reference that ''single'' '''Value Reader''' and further configure each individual '''Data Field''' to differentiate their specific date value.
=== Data collation ===
Another example would be configuring a '''Data Type''' to target entire rows of information within a table of data. Several '''Value Reader''' nodes could be made as children of the '''Data Type''', each targeting a specific value within the table row. The parent '''Data Type''' would then collate the results of its child '''Value Reader''' nodes into one result. A '''Data Table''' would then reference the '''Data Type''' to collect the appropriate rows of information.
=== Machine learning ===
Many documents contain important pieces of information buried within the flow of text, like a legal document. These types of documents and the data they contain require an entirely different approach to extracting data than a highly structured document like an invoice. For these situations you can use a "trainable" extractor known as a '''Field Class''' to leverage machine learning algorithms to target important information.
*<li class="attn-bullet"> Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as [[AI Extract]]) can achieve similar results with nowhere near the amount of set up.
=== Extractor Nodes vs Value Extractors ===
Extractor nodes should not be confused with "Value Extractors". There are ''many'' places in '''Grooper''' where extraction logic can be applied for one purpose or another. In these cases a Value Extractor is chosen to define the logic required to return a desired value.
In fact, the Extractor Nodes themselves will leverage specific Value Extractors to define their logic.
* Example: "Value Readers" are configured using a single property "Extractor". This property specifies a single Value Extractor which determines how data is extracted. Value Readers are essentially an encapsulation of a single Value Extractor configuration that can be reused by multiple other extraction elements and properties, such as Data Fields and Data Types.
* Example: "Data Types" have several properties that can be configured with Value Extractors, including their "Local Extractor", "Input Filter", and "Exclusion Extractor" properties.
* Example: "Field Classes" cannot function without their "Value Extractor" and "Feature Extractor" properties configured, both of which specify a Value Extractor.
However, Extractor Nodes are used when you need to ''reference'' them for their designated strengths:
* re-usability
* collation
* machine learning
**<li class="attn-bullet"> Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as [[AI Extract]]) can achieve similar results with nowhere near the amount of set up.
=== Related Node Types ===
<div style="padding-left: 1.5em">
==== Value Reader ====
{{#lst:Glossary|Value Reader}}
==== Data Type ====
{{#lst:Glossary|Data Type}}


* For example, if you're extracting a date that could appear in multiple formats within a document, you'd use various extractor nodes (each capturing a different format) as children of a '''Data Type'''.


The '''Data Type''' also defines how to collate results from one or more extractors into a referenceable output. The simplest type of collation (''Individual'') would just return all individual extractors' results as a list of results.


'''Data Types''' are also used for recognizing complex 2D data structures, like address blocks or table rows. Different collation methods would be used in these cases to combine results in different ways.


==== Field Class ====
{{#lst:Glossary|Field Class}}
</div>
</div>
<section end="Extractor Nodes" />

Latest revision as of 11:42, 28 August 2025

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025 2.90

Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

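Data extractors (or simply "extractors") are used in a variety of ways, including (but not limited to):

  • Collecting data - Performed by extractors set up in a Data Model and its child Data Elements.
  • Classifying documents - Performed by extractors set up on a Content Model and/or its child Document Types.
  • Separating documents - Performed by extractors set up using various Separation Providers.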


Extractors are highly configurable in terms of how data is targeted, how it is ordered and sorted, how text is pre-processed, what tolerance the extractor has for "fuzzy" results, how results are post-processed and more. However, at their core, extractors are simply tools used to parse text data from a larger text source (e.g. a single field from a whole document).

Value Extractors

Value Extractors define an operation that reads data from the text (and sometimes visual) content of a page or document. There are over 20 unique Value Extractors, each using specialized logic to return results. Value Extractors are consumed by multiple higher-level objects in Grooper (such as Data Elements, Extractor Nodes, various Activities and more) to perform a diverse set of document processing duties.

  • Value Extractors return a list of one or more "data instances". Data instances contain both the value and its page location, which allows Grooper to highlight results in a Document Viewer.
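
The idea of a "data instance" (a value plus where it was found) can be illustrated outside of Grooper. The following is a minimal Python sketch, not Grooper's API: the `DataInstance` class, the `pattern_match` function, and the use of character offsets as a stand-in for page location are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class DataInstance:
    """Hypothetical stand-in for an extraction result: a value and its location."""
    value: str
    page: int   # 1-based page number
    start: int  # character offset where the match begins on that page
    end: int    # character offset just past the end of the match


def pattern_match(pattern: str, pages: list[str]) -> list[DataInstance]:
    """Return every regex match on every page, keeping track of where it was found."""
    regex = re.compile(pattern)
    results = []
    for page_number, text in enumerate(pages, start=1):
        for match in regex.finditer(text):
            results.append(
                DataInstance(match.group(), page_number, match.start(), match.end())
            )
    return results


pages = [
    "Invoice Date: 08/28/2025\nDue Date: 09/27/2025",
    "Shipped 09/01/2025",
]
instances = pattern_match(r"\b\d{2}/\d{2}/\d{4}\b", pages)
for instance in instances:
    print(instance)
```

Because each result carries its location, a viewer could highlight the exact span that produced the value, which is the role page location plays for data instances as described above.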

"Value Extractors" are the primitive operators in Grooper that perform "data extraction". They read data values from the text (and sometimes visual content) of a page or document. This is the foundation of Grooper's capability to locate and collect document data, such as dates, numbers, entity names, barcodes, checkboxes, text labels, paragraphs, and more.


Value Extractors (or "extractors" for short) are executed when various Activities run through a Batch Process.

  • Example: Extractors can be used to collect a Data Model's data during an Extract step.
  • Example: Extractors can assist in document classification during a Classify step.
  • Example: Extractors can assist in document separation during a Separate step.


To accomplish these ends, Value Extractors are added to and configured in various nodes and their properties in Grooper.

  • Example: A Data Field's "Value Extractor" property.
  • Example: A Value Reader's "Extractor" property.
  • Example: Miscellaneous Data Type properties ("Local Extractor", "Input Filter", "Exclusion Extractor" and "Subtraction Extractor").


There are around 100 different configuration properties that allow you to configure an extractor to return data from a document. There are several different kinds of Value Extractors too. Think of these as specialized tools in your data extraction toolkit.

For any extractor property, you can choose one of Grooper's available Value Extractors (or just "extractor" for short).

  • Example: The Pattern Match extractor returns results that match a regular expression pattern, such as a pattern that matches various date/time formats.
  • Example: The List Match extractor returns results that match items in a list, such as a state name from a list of US states.
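
To make the list-matching idea concrete, here is a small Python sketch of returning the items from a list that appear in a document's text. It is a conceptual illustration only and does not reflect how Grooper implements List Match; the `list_match` function, the abbreviated `US_STATES` list, and the case-insensitive behavior are assumptions.

```python
import re

# Abbreviated, hypothetical list; a real list would contain all US states.
US_STATES = ["Oklahoma", "Texas", "New Mexico", "Kansas"]


def list_match(items: list[str], text: str) -> list[str]:
    """Return each occurrence of a list item in the text, in reading order."""
    # Try longer items first so a multi-word entry is preferred over a shorter one.
    ordered = sorted(items, key=len, reverse=True)
    alternatives = "|".join(re.escape(item) for item in ordered)
    regex = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return [match.group() for match in regex.finditer(text)]


text = "Ship to: Tulsa, Oklahoma. Bill to: Santa Fe, NEW MEXICO."
states = list_match(US_STATES, text)
print(states)
```

A list-based approach suits values drawn from a known, finite vocabulary, while a regular expression pattern suits values that follow a format (such as dates) but cannot be enumerated.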

Value Extractor types in Grooper

"Value Extractor" is a base class of 24 unique extractor types. This includes:

Text parsing extractors

These extractors primarily rely on regular expressions, lists of values (such as a Lexicon of field labels) or other forms of text parsing to return values.

  • Please note, regular expressions and other forms of text parsing are the "bread and butter" of how Grooper data extraction works. Other extractors may also utilize regex or other forms of text parsing as part of their configuration. These extractors just rely on them more heavily.

OMR extractors

These extractors allow you to return values using optical mark recognition. These are useful for extracting values on documents that use checkboxes to detail information.

Barcode extractors

These extractors allow you to return a value encoded in a barcode.

Zonal extractors

These extractors extract data by drawing a logical rectangle somewhere on a document. These are useful for extracting values on highly structured documents where field values are consistently located on the same position on the page for every document.
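
The "logical rectangle" idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Grooper's implementation: it assumes the page text is already available as words with bounding boxes (in inches from the top-left corner), and the `Box`, `Word`, and `read_zone` names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Box") -> bool:
        """True when the other box lies entirely inside this one."""
        return (self.left <= other.left and other.right <= self.right
                and self.top <= other.top and other.bottom <= self.bottom)


@dataclass(frozen=True)
class Word:
    text: str
    box: Box


def read_zone(words: list[Word], zone: Box) -> str:
    """Join the words that fall inside the zone, top-to-bottom then left-to-right."""
    inside = [word for word in words if zone.contains(word.box)]
    inside.sort(key=lambda word: (word.box.top, word.box.left))
    return " ".join(word.text for word in inside)


page_words = [
    Word("Invoice", Box(0.5, 0.5, 1.2, 0.7)),
    Word("No.", Box(1.3, 0.5, 1.6, 0.7)),
    Word("INV-1042", Box(6.0, 0.5, 7.0, 0.7)),
    Word("Total", Box(5.0, 9.0, 5.6, 9.2)),
    Word("$150.00", Box(6.0, 9.0, 6.9, 9.2)),
]

# The same rectangle is reused for every document that shares this layout.
invoice_number_zone = Box(5.5, 0.25, 8.0, 1.0)
invoice_number = read_zone(page_words, invoice_number_zone)
print(invoice_number)
```

The sketch also shows why zonal extraction depends on consistent layouts: if the invoice number moved outside the rectangle on another document, the zone would return nothing or the wrong text.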

LLM-based extractors

These extractors use generative AI to return results. The document text and other prompts defined by the user are fed to a large language model (LLM) for analysis. To utilize these extractors, you must add an "LLM Connector" Repository Option to the Grooper Root.

Text analysis extractors (experimental)

These extractors use the Azure AI Language cloud service to analyze document text. To utilize these extractors, you must add a "Text Analysis" Repository Option to the Grooper Root.

  • BE AWARE: These features are still in development and should be considered "experimental". They have not been extensively tested or implemented.
  • Entity Recognition
  • Key Phrase Recognition
  • PII Entity Recognition

Miscellaneous extractors

These extractors have specialized uses and don't fit well into the other categories.

The Reference extractor

The Reference extractor is unique among the extractor types. It allows users to reference the results of an extractor node (such as a Data Type or Value Reader).

Extractor Nodes

Types of Extractor Nodes

There are three types of Extractor Nodes in Grooper:

Value Reader
Data Type
Field Class
  • Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with nowhere near the amount of set up.

All three of these node types perform a similar function. They return data from documents. However, they differ in their configuration and utility.

Extractor Nodes are tools to extract/return document data. But they don't do anything by themselves. They are used by extractor properties on other nodes in Grooper.

  • Example: When Extract runs on a document, Data Elements (such as Data Fields) are ultimately used to collect document data.
    It is a Data Field's "Value Extractor" property that does this. You may configure this property with an Extractor Node to do so.
  • Example: When executed in a Separate step, the Pattern-Based Separation provider is ultimately what identifies patterns to separate Batch Pages into Batch Folders.
    It is the provider's "Value Extractor" property that does this. You may configure this property with an Extractor Node to do so.
  • Example: When Classify runs on a document, a Document Type's "Positive Extractor" property will be used to assign a Batch Folder the Document Type if it returns a value.
    You may configure the Positive Extractor with an Extractor Node to do so.
  • And so on and so on for any extractor property for any node in Grooper.


To that end, Extractor Nodes serve three purposes:

  1. To be re-usable units of extraction
  2. To collate data
  3. To leverage machine learning algorithms to target data in the flow of text
    • Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with nowhere near the amount of set up.

Re-usability

Extractor nodes are meant to be referenced either by other extractor nodes or, importantly, by Data Elements such as Data Fields in a Data Model.

For example, an individual Data Field can be configured on its own to collect a date value, such as the "Received Date" on an invoice. However, what if another Data Field is collecting a different date format, like the "Due Date" on the same invoice? In this case you would create one extractor node, like a Value Reader, to collect any and all date formats. You could then have each Data Field reference that single Value Reader and further configure each individual Data Field to differentiate their specific date value.
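
The reuse pattern described here can be sketched in Python: one shared date reader that understands several formats, and two fields that each reference it and then narrow the results by their own label. This is a conceptual analogy rather than Grooper configuration; `DATE_READER`, `read_dates`, `field_value`, and the "label on the same line" rule are all assumptions.

```python
import re

# One shared "reader" that recognizes more than one date format.
DATE_READER = re.compile(
    r"\b(?:\d{1,2}/\d{1,2}/\d{4}|[A-Z][a-z]+ \d{1,2}, \d{4})\b"
)


def read_dates(text: str) -> list[str]:
    """Return every date the shared reader recognizes in the text."""
    return [match.group() for match in DATE_READER.finditer(text)]


def field_value(text: str, label: str):
    """Reuse the shared reader, keeping only a date that shares a line with the label."""
    for line in text.splitlines():
        if label.lower() in line.lower():
            dates = read_dates(line)
            if dates:
                return dates[0]
    return None


invoice_text = "Received Date: 08/01/2025\nDue Date: August 31, 2025"
received_date = field_value(invoice_text, "Received Date")
due_date = field_value(invoice_text, "Due Date")
print(received_date, due_date)
```

The date logic lives in one place, so supporting a new date format means changing only the shared reader and not every field that uses it.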

Data collation

Another example would be configuring a Data Type to target entire rows of information within a table of data. Several Value Reader nodes could be made as children of the Data Type, each targeting a specific value within the table row. The parent Data Type would then collate the results of its child Value Reader nodes into one result. A Data Table would then reference the Data Type to collect the appropriate rows of information.
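
As a rough illustration of collation, the Python sketch below treats three regular expressions as "child readers" and combines their matches into one result per table row. It is only an analogy for the parent/child relationship described above; the `CHILD_READERS` patterns, the `collate_rows` function, and the rule that a row needs a match from every child are assumptions, not Grooper behavior.

```python
import re

# Each hypothetical child reader targets one value within a table row.
CHILD_READERS = {
    "quantity": re.compile(r"^\d+(?=\s)"),
    "part_number": re.compile(r"\b[A-Z]{2}-\d{4}\b"),
    "amount": re.compile(r"\$\d+\.\d{2}$"),
}


def collate_rows(text: str) -> list[dict]:
    """Combine the child readers' matches into one result per line of text."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        row = {}
        for name, reader in CHILD_READERS.items():
            match = reader.search(line)
            if match:
                row[name] = match.group()
        # Only lines where every child reader found a value count as table rows.
        if len(row) == len(CHILD_READERS):
            rows.append(row)
    return rows


table_text = """Qty  Part     Amount
2    AB-1001  $19.98
10   CD-2040  $150.00
Total         $169.98"""

rows = collate_rows(table_text)
for row in rows:
    print(row)
```

The header and total lines are skipped because at least one child reader finds nothing on them, which mirrors how requiring all parts of a row helps filter out text that only partially resembles table data.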

Machine learning

Many documents contain important pieces of information buried within the flow of text, like a legal document. These types of documents and the data they contain require an entirely different approach to extracting data than a highly structured document like an invoice. For these situations you can use a "trainable" extractor known as a Field Class to leverage machine learning algorithms to target important information.

  • Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with nowhere near the amount of set up.

Extractor Nodes vs Value Extractors

Extractor nodes should not be confused with "Value Extractors". There are many places in Grooper where extraction logic can be applied for one purpose or another. In these cases a Value Extractor is chosen to define the logic required to return a desired value.

In fact, the Extractor Nodes themselves will leverage specific Value Extractors to define their logic.

  • Example: "Value Readers" are configured using a single property "Extractor". This property specifies a single Value Extractor which determines how data is extracted. Value Readers are essentially an encapsulation of a single Value Extractor configuration that can be reused by multiple other extraction elements and properties, such as Data Fields and Data Types.
  • Example: "Data Types" have several properties that can be configured with Value Extractors, including their "Local Extractor", "Input Filter", and "Exclusion Extractor" properties.
  • Example: "Field Classes" cannot function without their "Value Extractor" and "Feature Extractor" properties configured, both of which specify a Value Extractor.

However, Extractor Nodes are used when you need to reference them for their designated strengths:

  • re-usability
  • collation
  • machine learning
    • Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with nowhere near the amount of set up.

Related Node Types

Value Reader

Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions).

  • Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.

Data Type

Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

  • For example, if you're extracting a date that could appear in multiple formats within a document, you'd use various extractor nodes (each capturing a different format) as children of a Data Type.

The Data Type also defines how to collate results from one or more extractors into a referenceable output. The simplest type of collation (Individual) would just return all individual extractors' results as a list of results.
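
The simplest case, returning every child extractor's results as one list, might look like the Python sketch below. It is an illustrative analogy and not Grooper's Individual collation; the `CHILD_EXTRACTORS` patterns, the `collate_individual` function, and the choice to order results by position in the text are assumptions.

```python
import re

# Hypothetical child extractors, each capturing one date format.
CHILD_EXTRACTORS = [
    re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),  # 09/01/2025
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),  # 2025-08-28
    re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{1,2}, \d{4}\b"
    ),                                     # Aug 31, 2026
]


def collate_individual(text: str) -> list[str]:
    """Return every child's matches as one flat list, ordered by position in the text."""
    hits = []
    for extractor in CHILD_EXTRACTORS:
        hits.extend((match.start(), match.group()) for match in extractor.finditer(text))
    return [value for _, value in sorted(hits)]


text = "Signed 2025-08-28, effective 09/01/2025, expires Aug 31, 2026."
dates = collate_individual(text)
print(dates)
```

Anything that references the combined result sees one list of dates and does not need to know which child pattern produced each one.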

Data Types are also used for recognizing complex 2D data structures, like address blocks or table rows. Different collation methods would be used in these cases to combine results in different ways.

Field Class

Field Classes are NLP (natural language processing) based extractor nodes. They find values based on some natural language context near that value. Values are positively or negatively associated with text-based "features" nearby by training the extractor. During extraction, the extractor collects values based on these training weightings.

  • Field Classes are most useful when attempting to find values within the flow of natural language.
  • Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler "extractor nodes" like Value Readers or Data Types.
  • Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with nowhere near the amount of set up.