Data Extractor (Concept): Difference between revisions

From Grooper Wiki

Revision as of 10:48, 26 April 2024

This article is about the current version of Grooper.

Note that some content may still need to be updated.


Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Extractors are used in a variety of ways, including (but not limited to):

  • Classifying documents
  • Finding data on a page that you wish to store outside of Grooper
  • Separating documents

Extractors are highly configurable in terms of how data is targeted, how it is ordered and sorted, how text is pre-processed, what tolerance the extractor has for "fuzzy" results, how results are post-processed and more. However, at their core, extractors are simply tools used to parse text data from a larger text source (e.g. a single field from a whole document).
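
The stages named above (pre-processing, matching with a tolerance for "fuzzy" results, and post-processing) can be pictured with a short Python sketch. This is not Grooper's implementation: the function names `preprocess`, `fuzzy_find`, and `postprocess` are hypothetical, and the similarity ratio from Python's `difflib` is only a stand-in for whatever fuzzy matching the product actually performs.

```python
import re
from difflib import SequenceMatcher

def preprocess(text):
    """Collapse runs of whitespace so line breaks do not split a phrase."""
    return re.sub(r"\s+", " ", text).strip()

def fuzzy_find(text, target, tolerance=0.8):
    """Return words whose similarity to `target` meets the tolerance."""
    hits = []
    for word in re.findall(r"\w+", text):
        ratio = SequenceMatcher(None, word.lower(), target.lower()).ratio()
        if ratio >= tolerance:
            hits.append(word)
    return hits

def postprocess(hits):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(hits))

# Assumed sample text: "lnvoice" imitates an OCR misread of "Invoice".
raw = "lnvoice   number\n1043\nInvoice date 04/26/2024 Invoice"
results = postprocess(fuzzy_find(preprocess(raw), "invoice"))
print(results)
```

Lowering `tolerance` admits more distant misspellings at the cost of more false positives, which is the trade-off a fuzzy setting generally controls.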

Extractor Types

Extractor Type (Property)

Extractor Objects

There are three types of "Extractor Objects" in Grooper:

Value Reader
Data Type
Field Class

All three of these objects perform a similar function: each is configured to return data from documents. However, they differ in their configuration and data extraction purpose.

"Extractor Objects" are tools to extract/return data. Ultimately, "Data Elements" are what collect data. They may use extractor objects to help collect data in a Data Model.

To that end, extractor objects serve three purposes:

  1. To be re-usable units of extraction
  2. To collate data
  3. To leverage machine learning algorithms to target data in the flow of text

Re-Usability

"Extractor Objects" are meant to be referenced either by other "Extractor Objects", or more importantly, by "Data Elements". For example, an individual Data Field can be configured on its own to collect a date value, such as the "Received Date" on an invoice. However, what if another Data Field is collecting a date in a different format, like the "Due Date" on the same invoice? In this case you would create one "Extractor Object", like a Value Reader, to collect any and all date formats. You could then have each Data Field reference that one Value Reader and further configure each individual Data Field to single out its specific date value.

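The re-use pattern described in this section can be sketched in plain Python. This is not Grooper's API: `DATE_READER`, `find_dates`, and `field_value` are hypothetical names, and the regular expression and sample invoice text are assumptions made for illustration. One shared pattern plays the role of the Value Reader, and each field narrows the shared results by looking for its own label.

```python
import re

# Hypothetical stand-in for one shared "Value Reader": a single pattern
# that matches several common date formats.
DATE_READER = re.compile(
    r"\b(?:\d{1,2}/\d{1,2}/\d{4}"      # 04/26/2024
    r"|\d{4}-\d{2}-\d{2}"              # 2024-04-26
    r"|[A-Z][a-z]+ \d{1,2}, \d{4})\b"  # May 26, 2024
)

def find_dates(text):
    """Return every date-like match as (start_offset, value)."""
    return [(m.start(), m.group()) for m in DATE_READER.finditer(text)]

def field_value(text, label):
    """Hypothetical 'Data Field': reuse the shared reader, then keep the
    first date that appears after this field's label."""
    label_pos = text.find(label)
    if label_pos == -1:
        return None
    for start, value in find_dates(text):
        if start > label_pos:
            return value
    return None

invoice = "Received Date: 04/26/2024\nDue Date: May 26, 2024\n"
received = field_value(invoice, "Received Date")
due = field_value(invoice, "Due Date")
print(received, "|", due)
```

Because both fields call the same `find_dates`, supporting a new date format means changing one pattern rather than every field.
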
Data Collation

Another example would be configuring a Data Type to target entire rows of information within a table of data. Several Value Reader "Extractor Objects" could be made as children of the Data Type, each targeting a specific value within the table row. The parent Data Type would then collate the results of its child Value Reader "Extractor Objects" into one result. A Data Table would then reference the Data Type to collect the appropriate rows of information.
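
As a rough illustration of this parent and child arrangement, the Python sketch below runs several child patterns against each line and collates their matches into one row result. It is only a sketch under stated assumptions: `CHILDREN`, `collate_rows`, the column patterns, and the sample table are all hypothetical, and Grooper's own collation options are configured in the product rather than written as code like this.

```python
import re

# Hypothetical child extractors, one per column of a table row.
CHILDREN = {
    "quantity": re.compile(r"^\s*(\d+)\s"),
    "description": re.compile(r"^\s*\d+\s+(.+?)\s+\$"),
    "unit_price": re.compile(r"\$(\d+\.\d{2})\s*$"),
}

def collate_rows(text):
    """Hypothetical parent: run every child on each line and keep the line
    as one collated row only when all children return a result."""
    rows = []
    for line in text.splitlines():
        row = {}
        for name, pattern in CHILDREN.items():
            match = pattern.search(line)
            if match is None:
                break  # a child found nothing, so this line is not a row
            row[name] = match.group(1)
        else:
            rows.append(row)
    return rows

table_text = (
    "Qty  Description      Price\n"
    "2    Widget, large    $19.99\n"
    "10   Gasket           $0.45\n"
    "Thank you for your business\n"
)
rows = collate_rows(table_text)
for row in rows:
    print(row)
```

The header and footer lines are skipped because at least one child finds nothing on them, so only complete rows are collated.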

Machine Learning

Many documents, such as legal documents, contain important pieces of information buried within the flow of text. These documents and the data they contain require an entirely different extraction approach than a highly structured document like an invoice. For these situations you can use a "trainable" "Extractor Object" known as a Field Class to leverage machine learning algorithms to target important information.

Extractor Objects vs Value Extractors

"Extractor Objects" should not be confused with "Value Extractors". There are many places in Grooper where extraction logic can be applied for one purpose or another. In these cases a "Value Extractor" is chosen to define the logic required to return a desired value. In fact, the "Extractor Objects" themselves each leverage specific "Value Extractors" to define their logic.

"Value Extractor" examples:

  • Pattern-Match uses regular expressions to return results.
  • Labeled OMR uses a regex and computer vision to return results for checkboxes.
  • Other "Value Extractors" may use a combination of "Value Extractors" that work together to return results in specific ways.
    • The Labeled Value "Value Extractor" defines a "Value Extractor" for both its Label Extractor and Value Extractor properties.
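
The idea of pairing a label extractor with a value extractor can be sketched in Python as follows. The function `labeled_value` and both patterns are hypothetical, and the sketch only looks to the right of the label on the same line, which is a simplifying assumption rather than a description of how Labeled Value behaves in Grooper.

```python
import re

def labeled_value(text, label_pattern, value_pattern):
    """For each line containing a label match, return the first value
    match found to the right of that label on the same line."""
    results = []
    for line in text.splitlines():
        label = re.search(label_pattern, line)
        if label is None:
            continue
        value = re.search(value_pattern, line[label.end():])
        if value is not None:
            results.append(value.group())
    return results

# Assumed sample text and patterns, for illustration only.
text = "Invoice No: INV-1043\nPO Number: 77812\nTotal: $310.00\n"
invoice_numbers = labeled_value(text, r"Invoice\s+No\.?:?", r"[A-Z]{3}-\d+")
totals = labeled_value(text, r"Total:?", r"\$\d+\.\d{2}")
print(invoice_numbers, totals)
```

Keeping the label and value patterns separate means either one can be swapped without touching the other.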

However, "Extractor Objects" are used when you need to reference them for their designated strengths:

  • re-usability
  • collation
  • machine learning

Related Objects

Value Reader

Value Readers define a single extraction operation. You set the type of extractor on the Value Reader that matches the specific data you're aiming to capture. For example, you would use the Pattern-Match "Value Extractor" to return data using a regular expression. You would use a Value Reader when you need to extract a single result or a list of simple results from a document.

Data Type

Data Types in Grooper hold a collection of extractors and settings that manage how multiple matches from extractors are consolidated into a result set.

  • For example, if you're extracting a date that could appear in multiple formats within a document, you'd use various "Extractor Objects" (each capturing a different format) as children of a Data Type.

The Data Type also defines how to collate results from one or more extractors into a referenceable output. The simplest type of collation (Individual) would just return all individual extractors' results as a list of results.
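
A minimal Python sketch of this simplest kind of collation is shown below: each child pattern captures one date format, and the parent returns all of their results as a single list. The names `CHILD_PATTERNS` and `collate_individual` are hypothetical, and sorting the results into document order is an assumption of this sketch, not a documented behavior.

```python
import re

# Hypothetical child extractors, each capturing one date format.
CHILD_PATTERNS = [
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
]

def collate_individual(text):
    """Return every child's matches as one flat list, in document order."""
    hits = []
    for pattern in CHILD_PATTERNS:
        hits.extend((m.start(), m.group()) for m in pattern.finditer(text))
    return [value for _, value in sorted(hits)]

text = "Issued 2024-04-26, revised 05/02/2024, effective 2024-06-01."
dates = collate_individual(text)
print(dates)
```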

Data Types are also used for recognizing complex 2D data structures, like address blocks or table rows. Different collation methods would be used in these cases to combine results in different ways.

Field Class

Field Classes are trainable extractors that distinguish between multiple instances of similar data within a document by understanding the context in which they occur. Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler "Extractor Objects" like Value Readers or Data Types.

Field Classes are most useful when attempting to find values within the flow of natural language. This method involves training with positive and negative examples to distinguish the right context. You'd opt for a Field Class when the value you're after is an entire clause within a contract, or a specific value defined within the flow of text.
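
To make the role of positive and negative examples concrete, here is a deliberately simple Python sketch that scores the words preceding each candidate value. It is not the algorithm Grooper uses: the word-count model, the function names (`train`, `score`, `pick_value`), the 40-character context window, and the sample contract text are all assumptions chosen for illustration.

```python
import re
from collections import Counter

def words(text):
    """Lowercase alphabetic tokens of a context string."""
    return re.findall(r"[a-z]+", text.lower())

def train(positive_contexts, negative_contexts):
    """Count how often each word appears around wanted vs. unwanted values."""
    pos = Counter(w for c in positive_contexts for w in words(c))
    neg = Counter(w for c in negative_contexts for w in words(c))
    return pos, neg

def score(context, model):
    """Higher scores mean the context resembles the positive examples."""
    pos, neg = model
    return sum(pos[w] - neg[w] for w in words(context))

def pick_value(text, value_pattern, model, window=40):
    """Return the candidate whose preceding context scores highest."""
    best = None
    for m in re.finditer(value_pattern, text):
        context = text[max(0, m.start() - window):m.start()]
        s = score(context, model)
        if best is None or s > best[0]:
            best = (s, m.group())
    return best[1] if best else None

model = train(
    positive_contexts=["this agreement is effective as of", "agreement dated"],
    negative_contexts=["shall terminate on", "notice must be received by"],
)
contract = (
    "This Agreement is effective as of 04/26/2024 between the parties. "
    "Unless renewed, it shall terminate on 04/26/2026."
)
agreement_date = pick_value(contract, r"\d{2}/\d{2}/\d{4}", model)
print(agreement_date)
```

Both dates match the same pattern, so only the surrounding wording separates the agreement date from the termination date.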