Data Extractor (Concept)

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Data extractors (or simply "extractors") are used in a variety of ways, including (but not limited to):

Classify documents
Find data on a page you wish to store outside of Grooper
Separate documents

Extractors are highly configurable in terms of how data is targeted, how it is ordered and sorted, how text is pre-processed, what tolerance the extractor has for "fuzzy" results, how results are post-processed and more. However, at their core, extractors are simply tools used to parse text data from a larger text source (e.g. a single field from a whole document).

Extractor Types

Extractor Type (Property)

Extractor Nodes

There are three types of "extractor nodes" in Grooper:

quick_reference_all Value Reader

pin Data Type

input Field Class

All three of these node types perform a similar function. They are configured to return data from documents. However, they differ in their configuration and data extraction purpose.

Extractor nodes are tools to extract/return data. Ultimately, "Data Elements" are what collects data. They may use extractor nodes to help collect data in a Data Model.

To that end, extractor nodes serve three purposes:

To be re-usable units of extraction
To collate data
To leverage machine learning algorithms to target data in the flow of text

Re-usability

Extractor nodes are meant to be referenced either by other extractor nodes or, importantly, by Data Elements such as Data Fields in a Data Model.

For example, an individual Data Field can be configured on its own to collect a date value, such as the "Received Date" on an invoice. However, what if another Data Field is collecting a different date format, like the "Due Date" on the same invoice? In this case you would create one extractor node, like a Value Reader, to collect any and all date formats. You could then have each Data Field reference that single Value Reader and further configure each individual Data Field to differentiate their specific date value.

Data collation

Another example would be configuring a Data Type to target entire rows of information within a table of data. Several Value Reader nodes could be made as children of the Data Type, each targeting a specific value within the table row. The parent Data Type would then collate the results of its child Value Reader nodes into one result. A Data Table would then reference the Data Type to collect the appropriate rows of information.

Machine learning

Many documents contain important pieces of information buried within the flow of text, like a legal document. These types of documents and the data they contain require an entirely different approach to extracting data than a highly structured document like an invoice. For these situations you can use a "trainable" extractor known as a Field Class to leverage machine learning algorithms to target important information.

Extractor nodes vs Value Extractors

Extractor nodes should not be confused with "Value Extractors". There are many places in Grooper where extraction logic can be applied for one purpose or another. In these cases a Value Extractor is chosen to define the logic required to return a desired value. In fact, the extractor nodes themselves each leverage specific Value Extractors to define their logic.

Value Extractor examples:

"Pattern-Match" uses regular expressions to return results.
"Labeled OMR" uses a regex and computer vision to return results for checkboxes.
Other Value Extractors may use a combination of Value Extractors that work together to return results in specific ways.
- The Labeled Value is a Value Extractor that defines its own set of Value Extractors: One for both its Label Extractor and Value Extractor properties.

However, extractor nodes are used when you need to reference them for their designated strengths:

re-usability
collation
machine learning

Related Node Types

Value Reader

quick_reference_all Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions).

Value Readers are can be used on their own or in conjunction with pin Data Types for more complex data extraction and collation.

Data Type

pin Data Types are nodes used to extract text data from a document. Data Types have more capabilities than quick_reference_all Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

For example, if you're extracting a date that could appear in multiple formats within a document, you'd use various extractor nodes (each capturing a different format) as children of a Data Type.

The Data Type also defines how to collate results from one or more extractors into a referenceable output. The simplest type of collation (Individual) would just return all individual extractors' results as a list of results.

Data Types are also used for recognizing complex 2D data structures, like address blocks or table rows. Different collation methods would be used in these cases to combine results in different ways.

Field Class

input Field Classes are NLP (natural language processing) based extractor nodes. They find values based on some natural language context near that value. Values are positively or negatively associated with text-based "features" nearby by training the extractor. During extraction, the extractor collects values based on these training weightings.

Field Classes are most useful when attempting to find values within the flow of natural language.
Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler "extractor nodes" like quick_reference_all Value Readers or pin Data Types.
Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with nowhere near the amount of set up.