Extractor Node: Difference between revisions

From Grooper Wiki
<blockquote>{{#lst:Glossary|Extractor Node}}</blockquote>
#REDIRECT [[Data Extractor (Concept)#Extractor Nodes]]
 
== About ==
<section begin="Extractor Nodes" />
=== Types of Extractor Nodes ===
There are three types of Extractor Nodes in '''Grooper''':
: {{ValueReaderIcon}} '''[[Value Reader]]'''
: {{DataTypeIcon}} '''[[Data Type]]'''
: {{FieldClassIcon}} '''[[Field Class]]''' ''Uncommonly used in current Grooper implementations''
All three of these node types perform a similar function. They are configured to return data from documents. However, they differ in their configuration and data extraction purpose.
 
Extractor nodes are tools to extract/return data. Ultimately, "[[Data Element]]s" are what collect data. They may ''use'' extractor nodes to help collect data in a '''Data Model'''.
 
To that end, extractor nodes serve three purposes:
# To be re-usable units of extraction
# To collate data
# To leverage machine learning algorithms to target data in the flow of text
<div style="padding-left: 1.5em">
=== Re-usability ===
Extractor nodes are meant to be referenced either by other extractor nodes or, importantly, by Data Elements such as '''Data Fields''' in a '''Data Model'''.
 
For example, an individual '''Data Field''' can be configured on its own to collect a date value, such as the "Received Date" on an invoice. However, what if another '''Data Field''' is collecting a date in a different format, like the "Due Date" on the same invoice? In this case, you could create one extractor node, like a '''Value Reader''', to collect any and all date formats. Each '''Data Field''' could then reference that ''single'' '''Value Reader''' and be further configured to isolate its own specific date value.
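The re-use pattern described above can be sketched outside of Grooper in plain Python. This is only an analogy, not Grooper's API: <code>DATE_READER</code>, <code>read_dates</code>, and <code>field_value</code> are hypothetical names, the shared pattern covers just two date formats for brevity, and matching a label on the same line is an assumed way for each field to isolate its own value.

```python
import re

# Hypothetical stand-in for a shared Value Reader: one pattern that
# covers several date formats (only MM/DD/YYYY and YYYY-MM-DD here).
DATE_READER = re.compile(r"\b(?:\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})\b")


def read_dates(text):
    """Return every date-like string found by the shared reader."""
    return DATE_READER.findall(text)


def field_value(text, label):
    """Hypothetical stand-in for a Data Field.

    Re-uses the shared reader, then keeps only the first date on the
    line that carries this field's label.
    """
    for line in text.splitlines():
        if label.lower() in line.lower():
            dates = read_dates(line)
            if dates:
                return dates[0]
    return None


invoice = """Invoice 1001
Received Date: 08/27/2025
Due Date: 2025-09-26
"""

received = field_value(invoice, "Received Date")
due = field_value(invoice, "Due Date")
print(received, due)
```

Both fields rely on the same date logic, so supporting a new date format would mean changing only the shared pattern.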
 
=== Data collation ===
Another example would be configuring a '''Data Type''' to target entire rows of information within a table of data. Several '''Value Reader''' nodes could be made as children of the '''Data Type''', each targeting a specific value within the table row. The parent '''Data Type''' would then collate the results of its child '''Value Reader''' nodes into one result. A '''Data Table''' would then reference the '''Data Type''' to collect the appropriate rows of information.
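As a rough illustration of the parent/child idea above, the following Python sketch collates the results of several child patterns into one result per table row. It is an analogy only and does not reflect how Grooper implements collation: <code>CHILD_READERS</code> and <code>collate_rows</code> are hypothetical names, and the rule that a row counts only when every child matches on the same line is an assumption made for the example.

```python
import re

# Hypothetical child extractors, one per value within a table row.
CHILD_READERS = {
    "quantity": re.compile(r"^\s*(\d+)\b"),
    "description": re.compile(r"^\s*\d+\s+(.+?)\s+\$"),
    "amount": re.compile(r"\$(\d+\.\d{2})\s*$"),
}


def collate_rows(text):
    """Hypothetical parent: one collated result per matching row.

    A line is treated as a row only if every child reader matches it.
    """
    rows = []
    for line in text.splitlines():
        matches = {name: rx.search(line) for name, rx in CHILD_READERS.items()}
        if all(matches.values()):
            rows.append({name: m.group(1) for name, m in matches.items()})
    return rows


table = """Qty  Description      Amount
2    Widget, blue     $19.98
10   Bracket          $5.00
Total                 $24.98
"""

rows = collate_rows(table)
for row in rows:
    print(row)
```

The header and total lines are skipped because the quantity reader does not match them, so only the two item rows are returned.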
 
=== Machine learning ===
Many documents, such as legal documents, contain important pieces of information buried within the flow of text. These documents and the data they contain require an entirely different extraction approach than a highly structured document like an invoice. For these situations you can use a "trainable" extractor known as a '''Field Class''' to leverage machine learning algorithms to target important information.
*<li class="attn-bullet"> Advances in large language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as [[AI Extract]]) can achieve similar results with far less setup.
 
=== Extractor Nodes vs Value Extractors ===
Extractor nodes should not be confused with "Value Extractors". There are ''many'' places in '''Grooper''' where extraction logic can be applied for one purpose or another. In these cases a Value Extractor is chosen to define the logic required to return a desired value.
 
In fact, the Extractor Nodes themselves will leverage specific Value Extractors to define their logic.
* Example: "Value Readers" are configured using a single property, "Extractor". This property specifies a single Value Extractor, which determines how data is extracted. A Value Reader is essentially an encapsulation of a single Value Extractor configuration that can be reused by multiple other extraction elements and properties, such as Data Fields and Data Types.
* Example: "Data Types" have several properties that can be configured with Value Extractors, including their "Local Extractor", "Input Filter", and "Exclusion Extractor" properties.
* Example: "Field Classes" cannot function unless their "Value Extractor" and "Feature Extractor" properties are configured, both of which specify a Value Extractor.
 
However, Extractor Nodes are used when you need to ''reference'' them for their designated strengths:
* re-usability
* collation
* machine learning
**<li class="attn-bullet"> Advances in large language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as [[AI Extract]]) can achieve similar results with far less setup.
 
=== Related Node Types ===
<div style="padding-left: 1.5em">
==== Value Reader ====
{{#lst:Glossary|Value Reader}}
 
==== Data Type ====
{{#lst:Glossary|Data Type}}
 
* For example, if you're extracting a date that could appear in multiple formats within a document, you'd use various extractor nodes (each capturing a different format) as children of a '''Data Type'''.
 
The '''Data Type''' also defines how to collate results from one or more extractors into a referenceable output. The simplest type of collation (''Individual'') simply returns each extractor's individual results as one list.
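A minimal Python sketch of the ''Individual'' idea, assuming it amounts to gathering every child's matches into a single flat list. The names <code>CHILD_PATTERNS</code> and <code>collate_individual</code> are hypothetical, and sorting the results by position in the text is an assumption of this example, not a documented Grooper behavior.

```python
import re

# Hypothetical child extractors, each capturing one date format.
CHILD_PATTERNS = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),       # e.g. 9/15/2025
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),           # e.g. 2025-09-02
    re.compile(r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b"),  # e.g. August 27, 2025
]


def collate_individual(text):
    """Return every child's matches as one flat list.

    Results are ordered by where they start in the text (an assumption
    made for this sketch).
    """
    hits = []
    for pattern in CHILD_PATTERNS:
        for m in pattern.finditer(text):
            hits.append((m.start(), m.group()))
    return [value for _, value in sorted(hits)]


text = "Signed August 27, 2025; filed 2025-09-02; served 9/15/2025."
results = collate_individual(text)
print(results)
```

Anything that references the parent sees a single list of dates, regardless of which child pattern produced each one.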
 
'''Data Types''' are also used for recognizing complex 2D data structures, like address blocks or table rows. Different collation methods would be used in these cases to combine results in different ways.
 
==== Field Class ====
{{#lst:Glossary|Field Class}}
</div>
</div>
<section end="Extractor Nodes" />

Latest revision as of 11:36, 27 August 2025