Data Extractor (Concept): Difference between revisions

From Grooper Wiki
No edit summary
Line 17: Line 17:
<section begin="Extractor Objects" />
<section begin="Extractor Objects" />
There are three types of "Extractor Objects" in '''Grooper''':
There are three types of "Extractor Objects" in '''Grooper''':
: [[image:GrooperIcon_ValueReader.png]] '''[[Value Reader]]'''
: {{ValueReaderIcon}} '''[[Value Reader]]'''
: [[image:GrooperIcon_DataType.png]] '''[[Data Type]]'''
: {{DataTypeIcon}} '''[[Data Type]]'''
: [[image:GrooperIcon_FieldClass.png]] '''[[Field Class]]'''
: {{FieldClassIcon}} '''[[Field Class]]'''
All three of these objects perform a similar function. They are objects that are configured to return data from documents. However, they differ in their configuration and data extraction purpose.
All three of these objects perform a similar function. They are objects that are configured to return data from documents. However, they differ in their configuration and data extraction purpose.



Revision as of 09:17, 28 August 2024

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025 2.90

Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Data extractors (or simply "extractors") are used in a variety of ways, including (but not limited to):

  • Classify documents
  • Find data on a page you wish to store outside of Grooper
  • Separate documents

Extractors are highly configurable in terms of how data is targeted, how it is ordered and sorted, how text is pre-processed, what tolerance the extractor has for "fuzzy" results, how results are post-processed and more. However, at their core, extractors are simply tools used to parse text data from a larger text source (e.g. a single field from a whole document).

Extractor Types

Extractor Type (Property)

Extractor Objects

There are three types of "Extractor Objects" in Grooper:

quick_reference_all Value Reader
pin Data Type
input Field Class

All three of these objects perform a similar function. They are objects that are configured to return data from documents. However, they differ in their configuration and data extraction purpose.

"Extractor Objects" are tools to extract/return data. Ultimately, "Data Elements" are what collects data. They may use extractor objects to help collect data in a Data Model.

To that end, extractor objects serve three purposes:

  1. To be re-usable units of extraction
  2. To collate data
  3. To leverage machine learning algorithms to target data in the flow of text

Re-Usability

"Extractor Objects" are meant to be referenced either by other "Extractor Objects", or more importantly, by "Data Elements". For example, an individual Data Field can be configured on its own to collect a date value, such as the "Received Date" on an invoice. However, what if another Data Field is collectig a different date format, like the "Due Date" on the same invoice? In this case you would create one "Extractor Object", like a Value Reader, to collect any and all date formats. You could then have each Data Field reference that one Value Reader and further configure each individual Data Field to differentiate their specific date value.

Data Collation

Another example would be configuring a Data Type to target entire rows of information within a table of data. Several Value Reader "Extractor Objects" could be made as children of the Data Type, each targeting a specific value within the table row. The parent Data Type would then collate the results of its child Value Reader "Extractor Objects" into one result. A Data Table would then reference the Data Type to collect the appropriate rows of information.

Machine Learning

Many documents contain important pieces of information buried within the flow of text, like a legal document. These types of documents and the data they contain require an entirely different approach to extracting data than a highly structured document like an invoice. For these situations you can use a "trainable" "Extractor Object" known as a Field Class to leverage machine learning algorithms to target important information.

Extractor Objects vs Value Extractors

"Extractor Objects" should not be confused with "Value Extractors". There are many places in Grooper where extraction logic can be applied for one purpose or another. In these cases a "Value Extractor" is chosen to define the logic required to return a desired value. In fact, the "Extractor Objects" themselves each leverage specific "Value Extractors" to define their logic.

"Value Extractor" examples:

  • Pattern-Match uses regular expressions to return results.
  • Labeled OMR uses a regex and computer vision to return results for checkboxes.
  • Other "Value Extractors" may use a combination of "Value Extractors" that work together to return results in specific ways.
    • The Labeled Value "Value Extractor" defines a "Value Extractor" for both its Label Extractor and Value Extractor properties.

However, "Extractor Objects" are used when you need to reference them for their designated strengths:

  • re-usbaility
  • collation
  • machine learning

Related Objects

Value Reader

quick_reference_all Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions).

  • Value Readers are can be used on their own or in conjunction with pin Data Types for more complex data extraction and collation.

Data Type

pin Data Types are nodes used to extract text data from a document. Data Types have more capabilities than quick_reference_all Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

  • For example, if you're extracting a date that could appear in multiple formats within a document, you'd use various "Extractor Objects" (each capturing a different format) as children of a Data Type.

The Data Type also defines how to collate results from one or more extractors into a referenceable output. The simplest type of collation (Individual) would just return all individual extractors' results as a list of results.

Data Types are also used for recognizing complex 2D data structures, like address blocks or table rows. Different collation methods would be used in these cases to combine results in different ways.

Field Class

input Field Classes are NLP (natural language processing) based extractor nodes. They find values based on some natural language context near that value. Values are positively or negatively associated with text-based "features" nearby by training the extractor. During extraction, the extractor collects values based on these training weightings.

  • Field Classes are most useful when attempting to find values within the flow of natural language.
  • Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler "extractor nodes" like quick_reference_all Value Readers or pin Data Types.
  • Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with nowhere near the amount of set up.


Glossary

Classify: unknown_document Classify is an Activity that "classifies" folder Batch Folders in a inventory_2 Batch by assigning them a description Document Type.

  • Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
  • Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.

Data Element: Data Elements are a class of node types used to collect data from a document. These include: data_table Data Models, insert_page_break Data Sections, variables Data Fields, table Data Tables, and view_column Data Columns.

Data Extractor: Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Data Field: variables Data Fields represent a single value targeted for data extraction on a document. Data Fields are created as child nodes of a data_table Data Model and/or insert_page_break Data Sections.

  • Data Fields are frequently referred to simply as "fields".

Data Model: data_table Data Models are leveraged during the Extract activity to collect data from documents (folder Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) define data extraction logic and settings for how data is reviewed in a Data Viewer.

Data Table: A table Data Table is a Data Element specialized in extracting tabular data from documents (i.e. data formatted in rows and columns).

  • The Data Table itself defines the "Table Extract Method". This is configured to determine the logic used to locate and return the table's rows.
  • The table's columns are defined by adding view_column Data Column nodes to the Data Table (as its children).

Data Type: pin Data Types are nodes used to extract text data from a document. Data Types have more capabilities than quick_reference_all Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

Detect Signature: Detect Signature is a Value Extractor that cant detect if a handwritten signature is present on a document. It detects signatures within a specified rectangular region on a document page by measuring the "fill percentage" (what percentage of pixels are filled in the region).

Extract: export_notes Extract is an Activity that retrieves information from folder Batch Folder documents, as defined by Data Elements in a data_table Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Extractor Type:

Field Class: input Field Classes are NLP (natural language processing) based extractor nodes. They find values based on some natural language context near that value. Values are positively or negatively associated with text-based "features" nearby by training the extractor. During extraction, the extractor collects values based on these training weightings.

  • Field Classes are most useful when attempting to find values within the flow of natural language.
  • Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler "extractor nodes" like quick_reference_all Value Readers or pin Data Types.
  • Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with nowhere near the amount of set up.

Field Match: Field Match is a Value Extractor that matches the value stored in a previously-extracted variables Data Field or view_column Data Column.

Find Barcode: Find Barcode is a Value Extractor that searches for and returns barcode values previously stored in a folder Batch Folder or contract Batch Page's layout data.

  • Note: Find Barcode differs slightly from Read Barcode. Read Barcode performs barcode recognition when the extractor executes. Find Barcode can only look up barcode data stored in the document or page's layout data. Find Barcode runs quicker than Read Barcode, but barcode values must have previously been collected in the Batch Process by the Image Processing or Recognize activities.

GPT Complete: GPT Complete is a Value Extractor that leverages Open AI's GPT models to generate chat completions for inputs, returning one hit for each result choice provided by the model's response.

PLEASE NOTE: GPT Complete is a deprecated Value Extractor. It uses an outdated method to call the OpenAI API. Please use the Ask AI extractor going forward.

Highlight Zone: Highlight Zone is a Value Extractor that sets a highlight region on a document without performing any actual data extraction. This "extractor" is used to mark areas of interest or importance for Review users or for uncommon scenarios where a data instance location is needed with no actual value.

Label Match: Label Match is a Value Extractor that matches a list of one or more values using matching options defined by a Labeling Behavior. It is similar to List Match but uses shared settings defined in a Labeling Behavior for Fuzzy Matching, Vertical Wrap, and Constrained Wrap.

Labeled OMR: Labeled OMR is a Value Extractor used to output OMR checkbox labels. It determines whether labeled checkboxes are checked or not. If checked, it outputs the label(s) or a Boolean true/false value as the result.

Labeled Value: Labeled Value is a Value Extractor that identifies and extracts a value next to a label. This is one of the most commonly used extractors to extract data from structured documents (such as a standardized form) and static values on semi-structured documents (such as the header details on an invoice).

Lexicon: dictionary Lexicons are dictionaries used throughout Grooper to store lists of words, phrases, weightings for Fuzzy RegEx, and more. Users can add entries to a Lexicon, Lexicons can import entries from other Lexicons by referencing them, and entries can be dynamically imported from a database using a database Data Connection. Lexicons are commonly used to aid in data extraction, with the "List Match" and "Word Match" extractors utilizing them most commonly.

List Match: List Match is a Value Extractor designed to return values matching one or more items in a defined list. By default, the List Match extractor does not use or require regular expression, but can be configured to utilize regular expression syntax.

Machine: computer Machine nodes represent servers that have connected to the Grooper Repository. They are essential for distributing task processing loads across multiple servers. Grooper creates Machine nodes automatically whenever a server makes a new connection to a Grooper Repository's database. Once added, Machine nodes can be used to view server information and to manage Grooper Service instances.

Ordered OMR: Ordered OMR is a Value Extractor used to return OMR check box information. Ordered OMR returns information for multiple check boxes within a defined zone based on their order and layout. The zone may be optionally fixed on the page or anchored to a static text value (such as a label).

Pattern Match: Pattern Match is a Value Extractor that extracts values from a document that match a specified regular expression, providing data collection following a known format or pattern.

Query HTML: Query HTML is a Value Extractor specialized for HTML documents. It uses either CSS or XPath selectors to return the inner text or an attribute of an HTML element.

Read Barcode: Read Barcode is a Value Extractor that uses barcode recognition technology to read and extract values from barcodes found in the document content.

  • Note: Read Barcode differs slightly from Find Barcode. Read Barcode performs barcode recognition when the extractor executes. Find Barcode can only look up barcode data stored in the document or page's layout data. Find Barcode runs quicker than Read Barcode, but barcode values must have previously been collected in the Batch Process by the Image Processing or Recognize activities.

Read Meta Data: Read Metadata is a Value Extractor retrieves metadata values associated with a document. Read Metadata can return metadata from a folder Batch Folder's attachment file based on its MIME type, such as PDF, Word and Mail Message ('message/rfc822' or 'application/vnd.ms-outlook'). It can also return data using a Document Link in Grooper, such as a File System Link or a CMIS Document Link.

Read Zone: Read Zone is a Value Extractor that allows you to extract text data in a rectangular region (called an "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text from the same location on a document, or a zone relative to a text value (such as a label) or a shape location on the document.

Reference: Reference is a Value Extractor used to reference an Extractor Node. This allows users to create re-usable extractors and use the more complex pin Data Type and input Field Class extractors throughout Grooper.

Separate: insert_page_break Separate is an Activity that sorts contract Batch Pages into individual folder Batch Folders. This distinguishes "loose pages" from the documents formed by those pages. Once loose pages are separated into Batch Folder documents, they can be further processed by unknown_document Classify, export_notes Extract, output Export and other Activities that need to run on the folder (i.e. document) level.

Value Reader: quick_reference_all Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions).

  • Value Readers are can be used on their own or in conjunction with pin Data Types for more complex data extraction and collation.

Word Match: Word Match is a Value Extractor that extracts individual words or phrases from documents. It is used for n-gram extraction. Each gram may be optionally executed against a dictionary Lexicon to ensure words and phrases only match a set vocabulary.

Zonal OMR: Zonal OMR is a Value Extractor that reads one or more OMR checkboxes using manually-configured zones. The zone may be optionally fixed on the page or anchored to a static text value (such as a label).

BE AWARE: Zonal OMR is outdated compared to Labeled OMR and Ordered OMR. It requires the most manual setup of any OMR extractor to configure. Use this as a last resort when other OMR extractor options have been exhausted.