Data Extractor (Concept)

From Grooper Wiki
Revision as of 11:16, 20 June 2024 by Randallkinard (talk | contribs) (added indentation // via Wikitext Extension for VSCode)

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025 2.90

Data Extractor (or just "extractor") refers to all Extractor Types and extractor node objects. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Data extractors (or simply "extractors") are used for a variety of purposes, including (but not limited to):

  • Classifying documents
  • Finding data on a page you wish to store outside of Grooper
  • Separating documents

Extractors are highly configurable in terms of how data is targeted, how it is ordered and sorted, how text is pre-processed, what tolerance the extractor has for "fuzzy" results, how results are post-processed and more. However, at their core, extractors are simply tools used to parse text data from a larger text source (e.g. a single field from a whole document).

Glossary

Classify: Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Content Type (e.g. a Document Type) using patterns, lexical understanding, or rules as defined by a Content Model.

Data Element: Data Element refers to the objects in Grooper used to collect data from a document. These include: Data Models, Data Sections, Data Fields, Data Tables, and Data Columns.

Data Extractor: Data Extractor (or just "extractor") refers to all Extractor Types and extractor node objects. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Data Field: Data Field node objects are created as child objects of a Data Model. A Data Field is a representation of a single piece of data targeted for extraction on a document.

Data Fields are frequently referred to simply as "fields".

Data Model: Data Model node objects serve as the top-tier structure defining the taxonomy for Data Elements and are leveraged during the Extract Activity to extract data from Batch Folders. They are a hierarchy of Data Elements that sets the stage for the extraction logic and review of data collected from documents.

Data Table: Data Table objects are utilized for extracting repeating data that's formatted in rows and columns, allowing for complex multi-instance data organization that would be present in table-formatted content.

Data Type: Data Type objects hold a collection of child, referenced, and locally defined Data Extractors and settings that manage how multiple (even differing) matches from Data Extractors are consolidated (via Collation) into a result set.

Detect Signature: Detect Signature is an Extractor Type that can detect if a handwritten signature is present on a document. It detects signatures within a specified rectangular region on a document page by measuring the "fill percentage" (what percentage of pixels are filled in the region).
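To make the "fill percentage" idea concrete, here is a minimal sketch in Python. It is not Grooper's implementation: the binarized page representation, the function names, and the threshold values are all assumptions chosen for illustration.

```python
def fill_percentage(pixels, left, top, width, height):
    """Return the percentage of filled pixels inside a rectangular zone.

    `pixels` is a hypothetical binarized page: a list of rows in which
    1 is a filled (dark) pixel and 0 is background.
    """
    region = [row[left:left + width] for row in pixels[top:top + height]]
    total = sum(len(row) for row in region)
    if total == 0:
        return 0.0
    filled = sum(sum(row) for row in region)
    return 100.0 * filled / total


def signature_present(pixels, zone, min_fill=5.0, max_fill=60.0):
    """Decide whether the zone looks signed.

    The min/max thresholds are illustrative assumptions, not Grooper defaults.
    """
    pct = fill_percentage(pixels, *zone)
    return min_fill <= pct <= max_fill


# A tiny 6x4 "page" with a few filled pixels in the middle.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
zone = (1, 1, 4, 2)  # left, top, width, height
print(fill_percentage(page, *zone), signature_present(page, zone))
```

A zone that is almost empty falls below the minimum and is treated as unsigned; the upper bound is there only to show how a solid block of ink could be excluded.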

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Extractor Type: An Extractor Type (shorthand for Value Extractor Type) is configured for numerous properties on a wide array of Grooper objects. They are used to return "data instances" from documents for one purpose or another. The Extractor Type defines an operation that reads data from the text or visual content of a document and returns one or more results. Each different Extractor Type uses a specialized logic to return results. Extractor Types are consumed by higher-level objects such as Data Elements, extractor objects, Content Types and more.

Field Class: Field Class node objects are used to find values based on some natural language context near that value. Values are positively or negatively associated with text-based "features" nearby by training the extractor. During extraction, the extractor collects values based on these training weightings.

  • Field Classes are most useful when attempting to find values within the flow of natural language.
  • Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler "Extractor Objects" like Value Readers or Data Types.

Field Match: Field Match is an Extractor Type that matches the value stored in a previously-extracted Data Field or Data Column.

Find Barcode: Find Barcode is an Extractor Type that searches for and returns barcode values previously stored in a Batch Folder or Batch Page's layout data.

Note: Find Barcode differs slightly from Read Barcode. Read Barcode performs barcode recognition when the extractor executes. Find Barcode can only look up barcode data stored in the document or page's layout data. Find Barcode runs quicker than Read Barcode, but barcode values must have previously been collected in the Batch Process by the Image Processing or Recognize activities.

GPT Complete: GPT Complete is an Extractor Type that leverages OpenAI's GPT models to generate chat completions for inputs, returning one hit for each result choice provided by the model's response.

PLEASE NOTE: GPT Complete is a deprecated extractor type. It uses an outdated method to call the OpenAI API. Please use the Ask AI extractor type going forward.

Highlight Zone: Highlight Zone is an Extractor Type that sets a highlight region on a document without performing any actual data extraction. This "extractor" is used to mark areas of interest or importance for Review users or for uncommon scenarios where a data instance location is needed with no actual value.

Label Match: Label Match is an Extractor Type that matches a list of one or more values using matching options defined by a Labeling Behavior. It is similar to List Match but uses shared settings defined in a Labeling Behavior for Fuzzy Matching, Vertical Wrap, and Constrained Wrap.

Labeled OMR: Labeled OMR is an Extractor Type used to output OMR checkbox labels. It determines whether labeled checkboxes are checked or not. If checked, it outputs the label(s) or a Boolean true/false value as the result.

Labeled Value: Labeled Value is an Extractor Type that identifies and extracts a value next to a label. This is one of the most commonly used extractors to extract data from structured documents (such as a standardized form) and static values on semi-structured documents (such as the header details on an invoice).

Lexicon: Lexicon node objects are dictionary objects that store a list of keys or key-value pairs. Lexicons can define local entries and/or import entries from other Lexicons and even import entries using a Data Connection. The entries in a Lexicon can be utilized in different areas of Grooper, such as data extraction, fuzzy matching, or OCR correction, providing a reference point that enhances the accuracy and consistency of the software's operations.

List Match: List Match is an Extractor Type designed to return values matching one or more items in a defined list. By default, the List Match extractor does not use or require regular expression, but can be configured to utilize regular expression syntax.

Machine: Machine node objects represent servers that have connected to the Grooper Repository. They allow for the management of Grooper Service instances and serve as connection points for processing jobs to be executed on the server hardware. Machine objects are essential for the scaling of processing capabilities and for distributing processing loads across multiple servers.

Ordered OMR: Ordered OMR is an Extractor Type used to return OMR check box information. Ordered OMR returns information for multiple check boxes within a defined zone based on their order and layout. The zone may be optionally fixed on the page or anchored to a static text value (such as a label).

Pattern Match: Pattern Match is an Extractor Type that extracts values from a document that match a specified regular expression, providing data collection following a known format or pattern.

Query HTML: Query HTML is an Extractor Type specialized for HTML documents. It uses either CSS or XPath selectors to return the inner text or an attribute of an HTML element.

Read Barcode: Read Barcode is an Extractor Type that uses barcode recognition technology to read and extract values from barcodes found in the document content.

Note: Read Barcode differs slightly from Find Barcode. Read Barcode performs barcode recognition when the extractor executes. Find Barcode can only look up barcode data stored in the document or page's layout data. Find Barcode runs quicker than Read Barcode, but barcode values must have previously been collected in the Batch Process by the Image Processing or Recognize activities.

Read Meta Data: Read Meta Data is an Extractor Type that retrieves metadata values associated with a document. Read Meta Data can return metadata from a Batch Folder's attachment file based on its MIME type, such as PDF, Word and Mail Message ('message/rfc822' or 'application/vnd.ms-outlook'). It can also return data using a Document Link in Grooper, such as a File System Link or a CMIS Document Link.

Read Zone: Read Zone is an Extractor Type that allows you to extract text data in a rectangular region (called an "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text from the same location on a document, or a zone relative to a text value (such as a label) or a shape location on the document.

Reference: Reference is an Extractor Type used to reference an external extractor object within a Grooper property configuration. This allows users to create re-usable extractors and use the more complex Data Type and Field Class extractors throughout Grooper.

Separate: Separate is an Activity that sorts Batch Pages into individual Batch Folders. This distinguishes "loose pages" from the documents formed by those pages. Once loose pages are separated into Batch Folder documents, they can be further processed by Classify, Extract, Export and other Activities that need to run on the folder (i.e. document) level.

Value Reader: Value Reader objects define a single data extraction operation. You set the Extractor Type on the Value Reader that matches the specific data you're aiming to capture. For example, you would use the Pattern Match Extractor Type to return data using regular expression. You would use a Value Reader when you need to extract a single result or list of simple results from a document.

Word Match: Word Match is an Extractor Type that extracts individual words or phrases from documents. It is used for n-gram extraction. Each gram may be optionally executed against a Lexicon to ensure words and phrases only match a set vocabulary.
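As a rough illustration of n-gram extraction with an optional vocabulary check, the following Python sketch splits text into word n-grams and keeps only those found in a lexicon. The tokenization rule, the function names, and the sample lexicon are assumptions for this example and do not reflect how Grooper tokenizes text or stores Lexicon entries.

```python
import re


def ngrams(text, n):
    """Return every run of n consecutive words in text."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def word_match(text, n, lexicon=None):
    """Return n-grams, optionally restricted to entries in a lexicon.

    `lexicon` is a plain list of allowed phrases (a stand-in for a Lexicon).
    Matching is case-insensitive in this sketch.
    """
    grams = ngrams(text, n)
    if lexicon is None:
        return grams
    vocabulary = {entry.lower() for entry in lexicon}
    return [gram for gram in grams if gram.lower() in vocabulary]


text = "Please remit payment to Acme Supply within thirty days"
lexicon = ["Acme Supply", "Globex Corporation"]
print(word_match(text, 2, lexicon))
```

Without a lexicon every bigram in the sentence is returned; with one, only the phrases in the set vocabulary survive.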

Zonal OMR: Zonal OMR is an Extractor Type that reads one or more OMR checkboxes using manually-configured zones. The zone may be optionally fixed on the page or anchored to a static text value (such as a label).

BE AWARE: Zonal OMR is outdated compared to Labeled OMR and Ordered OMR. It requires the most manual setup of any OMR extractor to configure. Use this as a last resort when other OMR extractor options have been exhausted.

Extractor Types

Extractors are configured throughout Grooper. There are around 100 different configuration properties that allow you to configure an extractor to return data from a document. In older versions of Grooper, extraction was limited to simple regular expression pattern matching. As Grooper has evolved, we have developed a number of different mechanisms to extract information from a page or document. We call these different kinds of extractors "Extractor Types".

For any extractor property, you can choose one of the Extractor Types available in Grooper. For example, you might use the List Match extractor to match a state on a document from a list of US states. Or, you might use the Pattern Match extractor to extract dates of various date/time formats.
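The two examples above can be approximated in a few lines of Python. This is only a sketch of the general techniques (matching against a list and matching a regular expression), not Grooper configuration: the shortened state list, the date pattern, and the function names are all assumptions.

```python
import re

# Abbreviated, illustrative list; a real list would contain every state.
US_STATES = ["Oklahoma", "Texas", "New Mexico"]

# Two common date layouts; real-world patterns are usually broader.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b|\b\d{4}-\d{2}-\d{2}\b")


def list_match(text, items):
    """Return list items found in text, in document order (case-insensitive)."""
    hits = []
    for item in items:
        for m in re.finditer(re.escape(item), text, flags=re.IGNORECASE):
            hits.append((m.start(), m.group(0)))
    return [value for _, value in sorted(hits)]


def pattern_match(text, pattern):
    """Return every substring of text that matches the compiled pattern."""
    return [m.group(0) for m in pattern.finditer(text)]


text = "Shipped from Oklahoma to New Mexico on 06/20/2024; invoice dated 2024-06-18."
print(list_match(text, US_STATES))
print(pattern_match(text, DATE_PATTERN))
```

The list-based approach needs no regular expression knowledge from the person maintaining the list, while the pattern-based approach covers values that follow a format but cannot be enumerated.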

Currently, there are the following Extractor Types in Grooper:

Text Parsing Extractors

These Extractor Types primarily rely on regular expressions, lists of values (such as a Lexicon of field labels) or other forms of text parsing to return values.

FYI

Please note, regular expressions and other forms of text parsing are the "bread and butter" of how Grooper data extraction works. Other Extractor Types may also utilize regex or other forms of text parsing as part of their configuration. The Text Parsing Extractors just rely on them more heavily.

OMR Extractors

These Extractor Types allow you to return values using optical mark recognition. These are useful for extracting values on documents that use checkboxes to detail information.

Barcode Extractors

These Extractor Types allow you to return a value encoded in a barcode.

Zonal Extractors

These Extractor Types extract by drawing a logical rectangle somewhere on a document. These are useful for extracting values on highly structured documents where field values are consistently located in the same position on the page for every document.

Miscellaneous Extractors

These Extractor Types have specialized uses and don't fit well into the other categories.


Extractor Objects

There are three types of "Extractor Objects" in Grooper:

Value Reader
Data Type
Field Class

All three of these objects perform a similar function. They are objects that are configured to return data from documents. However, they differ in their configuration and data extraction purpose.

"Extractor Objects" are tools to extract/return data. Ultimately, "Data Elements" are what collect data. They may use extractor objects to help collect data in a Data Model.

To that end, extractor objects serve three purposes:

  1. To be re-usable units of extraction
  2. To collate data
  3. To leverage machine learning algorithms to target data in the flow of text

Re-Usability

"Extractor Objects" are meant to be referenced either by other "Extractor Objects" or, more importantly, by "Data Elements". For example, an individual Data Field can be configured on its own to collect a date value, such as the "Received Date" on an invoice. However, what if another Data Field is collecting a date in a different format, like the "Due Date" on the same invoice? In this case you would create one "Extractor Object", like a Value Reader, to collect any and all date formats. You could then have each Data Field reference that one Value Reader and further configure each individual Data Field to differentiate its specific date value.

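The re-usability idea can be sketched outside Grooper as one shared date pattern that two "fields" both reference, each adding only its own label to pick out its specific date. Everything here (the pattern, the function name, the sample invoice text) is a hypothetical illustration rather than Grooper configuration.

```python
import re

# One shared date definition, playing the role of the re-usable Value Reader.
# It accepts both 06/03/2024 and "July 3, 2024" style dates.
DATE = r"\d{1,2}/\d{1,2}/\d{4}|[A-Z][a-z]+ \d{1,2}, \d{4}"


def labeled_date(text, label):
    """Return the first date that follows label, reusing the shared DATE pattern."""
    m = re.search(re.escape(label) + r"\s*:?\s*(" + DATE + ")", text)
    return m.group(1) if m else None


invoice = "Received Date: 06/03/2024\nDue Date: July 3, 2024"

# Each "field" references the same date logic and differs only by its label.
fields = {name: labeled_date(invoice, name) for name in ("Received Date", "Due Date")}
print(fields)
```

If a new date format shows up later, only the shared definition changes and both fields pick it up.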
Data Collation

Another example would be configuring a Data Type to target entire rows of information within a table of data. Several Value Reader "Extractor Objects" could be made as children of the Data Type, each targeting a specific value within the table row. The parent Data Type would then collate the results of its child Value Reader "Extractor Objects" into one result. A Data Table would then reference the Data Type to collect the appropriate rows of information.
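A simplified picture of this parent/child arrangement is shown below in Python. Each child pattern targets one value in a row, and the parent function combines the child results for a line into a single row result. The patterns, field names, and the "all children must match" rule are assumptions for illustration; Grooper's collation options are richer than this.

```python
import re

# Child extractors: each one targets a single value within a table row.
CHILD_EXTRACTORS = {
    "quantity": re.compile(r"^\d+(?=\s)"),
    "part_number": re.compile(r"\b[A-Z]{2}-\d{4}\b"),
    "amount": re.compile(r"\d+\.\d{2}$"),
}


def collate_rows(text):
    """Combine child results line by line; keep lines where every child matched."""
    rows = []
    for line in text.splitlines():
        row = {}
        for name, pattern in CHILD_EXTRACTORS.items():
            m = pattern.search(line.strip())
            if m:
                row[name] = m.group(0)
        if len(row) == len(CHILD_EXTRACTORS):
            rows.append(row)
    return rows


table_text = """Qty Part Amount
2 AB-1001 19.98
10 CD-2040 150.00
Total 169.98"""

for row in collate_rows(table_text):
    print(row)
```

The header and total lines are skipped because not every child extractor finds a value on them, so only the two line items come back as rows.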

Machine Learning

Many documents, such as legal documents, contain important pieces of information buried within the flow of text. These types of documents and the data they contain require an entirely different approach to extracting data than a highly structured document like an invoice. For these situations you can use a "trainable" "Extractor Object" known as a Field Class to leverage machine learning algorithms to target important information.

Extractor Objects vs Value Extractors

"Extractor Objects" should not be confused with "Value Extractors". There are many places in Grooper where extraction logic can be applied for one purpose or another. In these cases a "Value Extractor" is chosen to define the logic required to return a desired value. In fact, the "Extractor Objects" themselves each leverage specific "Value Extractors" to define their logic.

"Value Extractor" examples:

  • Pattern Match uses regular expressions to return results.
  • Labeled OMR uses a regex and computer vision to return results for checkboxes.
  • Other "Value Extractors" may use a combination of "Value Extractors" that work together to return results in specific ways.
    • The Labeled Value "Value Extractor" defines a "Value Extractor" for both its Label Extractor and Value Extractor properties.
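The Labeled Value case in the list above, where one extractor is built from a label extractor and a value extractor, might look like the following in plain Python. The same-line, value-to-the-right layout and the function and parameter names are assumptions made for this sketch; Grooper's Labeled Value supports more layouts than this.

```python
import re


def labeled_value(text, label_pattern, value_pattern):
    """Return the first value found directly to the right of a label.

    `label_pattern` and `value_pattern` are regular expressions standing in
    for the label extractor and the value extractor respectively.
    """
    for line in text.splitlines():
        label = re.search(label_pattern, line)
        if not label:
            continue
        # Only look at the text that follows the label on the same line.
        value = re.match(r"\s*:?\s*(" + value_pattern + ")", line[label.end():])
        if value:
            return value.group(1)
    return None


header = "Invoice No: INV-00421\nPO Number 7731"
print(labeled_value(header, r"Invoice No", r"[A-Z]+-\d+"))
print(labeled_value(header, r"PO Number", r"\d+"))
```

Swapping either pattern changes what is found without touching the logic that relates a label to its value.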

However, "Extractor Objects" are used when you need to reference them for their designated strengths:

  • re-usability
  • collation
  • machine learning

Related Objects

Value Reader

Value Reader objects define a single data extraction operation. You set the Extractor Type on the Value Reader that matches the specific data you're aiming to capture. For example, you would use the Pattern Match Extractor Type to return data using regular expression. You would use a Value Reader when you need to extract a single result or list of simple results from a document.

Data Type

Data Type objects hold a collection of child, referenced, and locally defined Data Extractors and settings that manage how multiple (even differing) matches from Data Extractors are consolidated (via Collation) into a result set.

  • For example, if you're extracting a date that could appear in multiple formats within a document, you'd use various "Extractor Objects" (each capturing a different format) as children of a Data Type.

The Data Type also defines how to collate results from one or more extractors into a referenceable output. The simplest type of collation (Individual) would just return all individual extractors' results as a list of results.
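As a loose analogy for Individual collation, the Python sketch below runs several child patterns (one per date format) and returns all of their results as a single list in document order. The patterns and the function name are assumptions, and the ordering rule is a choice made for this example rather than documented Grooper behavior.

```python
import re

# Child extractors, each capturing one date format.
DATE_EXTRACTORS = [
    re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
]


def collate_individual(text, extractors):
    """Return every child extractor's results as one flat list, by position."""
    hits = []
    for pattern in extractors:
        hits.extend((m.start(), m.group(0)) for m in pattern.finditer(text))
    return [value for _, value in sorted(hits)]


text = "Signed 2024-06-18, effective 07/01/2024, expires 2025-06-30."
print(collate_individual(text, DATE_EXTRACTORS))
```

Anything that references the parent sees one combined result list and does not need to know which child format produced each value.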

Data Types are also used for recognizing complex 2D data structures, like address blocks or table rows. Different collation methods would be used in these cases to combine results in different ways.

Field Class

Field Class node objects are used to find values based on some natural language context near that value. Values are positively or negatively associated with text-based "features" nearby by training the extractor. During extraction, the extractor collects values based on these training weightings.

  • Field Classes are most useful when attempting to find values within the flow of natural language.
  • Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler "Extractor Objects" like Value Readers or Data Types.