Data Extractor (Concept): Difference between revisions

Revision as of 15:43, 24 April 2024

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

Data Extractor (or just "extractor") refers to all Extractor Types and extractor node objects. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Extractors are used for a variety of purposes, including (but not limited to):

Classify documents
Find data on a page you wish to store outside of Grooper
Separate documents

Extractors are highly configurable in terms of how data is targeted, how it is ordered and sorted, how text is pre-processed, what tolerance the extractor has for "fuzzy" results, how results are post-processed and more. However, at their core, extractors are simply tools used to parse text data from a larger text source (e.g. a single field from a whole document).
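To make the idea of a "fuzzy" tolerance concrete, here is a minimal Python sketch. It is not Grooper's implementation: the function name `fuzzy_find`, the `min_similarity` parameter, and the use of the standard library's `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def fuzzy_find(target, text, min_similarity=0.8):
    """Return (word, score) pairs for words in text that resemble target.

    min_similarity is a hypothetical tolerance setting: 1.0 accepts only
    exact (case-insensitive) matches, lower values accept noisier text.
    """
    hits = []
    for word in text.split():
        # ratio() is difflib's similarity score between 0.0 and 1.0.
        score = SequenceMatcher(None, target.lower(), word.lower()).ratio()
        if score >= min_similarity:
            hits.append((word, round(score, 2)))
    return hits
```

With a tolerance below 1.0, a sketch like this would accept an OCR misread such as "lnvoice" when looking for "Invoice"; with a tolerance of 1.0 it would reject it.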

Extractor Types

Extractors are configured in many places throughout Grooper. There are around 100 different configuration properties that accept an extractor configured to return data from a document. In older versions of Grooper, extraction was limited to simple regular expression pattern matching. As Grooper has evolved, we have developed a number of different mechanisms to extract information from a page or document. We call these different kinds of extractors "Extractor Types".

For any extractor property, you can choose one of the Extractor Types available in Grooper. For example, you might use the List Match extractor to match a state on a document against a list of US states. Or, you might use the Pattern Match extractor to extract dates written in various date/time formats.
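The difference between matching against a list and matching a pattern can be sketched in plain Python. This is only an illustration of the two ideas using the standard `re` module, not Grooper's List Match or Pattern Match configuration; the sample text, the shortened state list, and the function names are hypothetical.

```python
import re

# Hypothetical sample text, not taken from a real document.
TEXT = ("Agreement dated 04/24/2024 between parties in Oklahoma and Texas, "
        "amended 2024-05-01.")

# Pattern-style extraction: one regular expression covering two date formats.
DATE_PATTERN = re.compile(r"\b(?:\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})\b")

# List-style extraction: match any entry from a fixed list of values.
US_STATES = ["Oklahoma", "Texas", "Kansas"]  # shortened list for illustration
STATE_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(s) for s in US_STATES) + r")\b"
)


def extract_dates(text):
    """Return every substring that looks like MM/DD/YYYY or YYYY-MM-DD."""
    return DATE_PATTERN.findall(text)


def extract_states(text):
    """Return every listed state name found in the text, in reading order."""
    return STATE_PATTERN.findall(text)
```

The list approach suits a closed set of known values, while the pattern approach suits values that follow a format but cannot be enumerated.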

Currently, there are the following Extractor Types in Grooper:

Text Parsing Extractors

These Extractor Types primarily rely on regular expressions, lists of values (such as a Lexicon of field labels) or other forms of text parsing to return values.

FYI

Please note, regular expressions and other forms of text parsing are the "bread and butter" of how Grooper data extraction works. Other Extractor Types may also utilize regex or other forms of text parsing as part of their configuration. The Text Parsing Extractor Types just rely on it more heavily.

OMR Extractors

These Extractor Types allow you to return values using optical mark recognition. These are useful for extracting values on documents that use checkboxes to detail information.
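As a rough illustration of the idea behind optical mark recognition, the sketch below decides whether a checkbox is marked by measuring how much of its interior is dark. It is a simplification under stated assumptions and not how Grooper's OMR extractors are implemented: the function `is_checked`, the 0/1 pixel grid, and the `min_fill` threshold are all hypothetical.

```python
def is_checked(pixels, min_fill=0.2):
    """Decide whether a checkbox is marked.

    pixels: rows of 0 (white) or 1 (dark) sampled from the box interior.
    min_fill: hypothetical threshold; the fraction of dark pixels required
    for the box to count as checked.
    """
    total = sum(len(row) for row in pixels)
    if total == 0:
        return False  # nothing to measure
    dark = sum(sum(row) for row in pixels)
    return dark / total >= min_fill
```

For example, an empty 3x3 interior has a fill of 0 and is treated as unchecked, while an "X" drawn across it fills 5 of 9 pixels and is treated as checked.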

Barcode Extractors

These Extractor Types allow you to return a value encoded in a barcode.

Zonal Extractors

These Extractor Types extract by drawing a logical rectangle somewhere on a document. These are useful for extracting values from highly structured documents where field values are consistently located in the same position on the page for every document.
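The "logical rectangle" idea can be sketched as a filter over words that carry position data. This is a minimal illustration, not Grooper's zonal extractors: the `Word` class, the `read_zone` function, and the coordinate units (inches from the top-left corner of the page) are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def read_zone(words, left, top, right, bottom):
    """Join the text of all words whose boxes lie fully inside the zone."""
    inside = [
        wd for wd in words
        if wd.x >= left and wd.y >= top
        and wd.x + wd.w <= right and wd.y + wd.h <= bottom
    ]
    # Reading order: top to bottom, then left to right.
    inside.sort(key=lambda wd: (wd.y, wd.x))
    return " ".join(wd.text for wd in inside)
```

Because the zone is fixed, a sketch like this only returns the right value when the field sits in the same place on every page, which is why the approach suits highly structured documents.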

Miscellaneous Extractors

These Extractor Types have specialized uses and don't fit well into the other categories.

Extractor Objects

@@ Line 12: / Line 12: @@
 == Extractor Types ==
+{{#lst:Extractor Type (Property)|Extractor Types}}
-Extractors are configured all over the place in Grooper.  There are around 100 different configuration properties that allow you to configure an extractor to return data from a document.  In older versions of Grooper, extraction was limited to simple regular expression pattern matching.  As Grooper has evolved, we have developed a number of different mechanisms to extract information from a page or document.  We call these different kinds of extractors "'''''Extractor Types'''''".
-For any extractor property, you can choose one of the '''''Extractor Types''''' available in Grooper.  For example, you might use the '''''List Match''''' extractor to match a state on a document from a list of US states.  Or, you might use the '''''Pattern Match''''' extractor to extract dates of various date/time formats.
-Currently, there are the following '''''Extractor Types''''' in Grooper:
-=== Text Parsing Extractors ===
-These '''''Extractor Types''''' primarily rely on regular expression, lists of values (such as a '''Lexicon''' of field labels) or other forms of text parsing to return values
-* [[Pattern Match]]
-* [[List Match]]
-* [[Label Match]]
-* [[Word Match]]
-* [[Labeled Value]]
-* [[Field Match]]
-{|class="fyi-box"
-|
-'''FYI'''
-|
-Please note, regular expression and other forms of text parsing is the "bread and butter" of how Grooper data extraction works.  Other '''''Extractor Types''''' may also utilize regex or other forms of text parsing as part of their configuration.  These '''''Extractor Types''''' just rely on it more heavily.
-|}
-=== OMR Extractors ===
-These '''''Extractor Types''''' allow you to return values using optical mark recognition. These are useful for extracting values on documents that use checkboxes to detail information.
-* [[Labeled OMR]]
-* [[Ordered OMR]]
-* [[Zonal OMR]]
-=== Barcode Extractors ===
-These '''''Extractor Types''''' allow you to return a value encoded in a barcode.
-* [[Find Barcode]]
-* [[Read Barcode]]
-=== Zonal Extractors ===
-These '''''Extractor Types''''' extract by drawing a logical rectangle somewhere on a document.  These are useful for extracting values on highly structured documents where field values are consistently located on the same position on the page for every document.
-* [[Read Zone]]
-* [[Highlight Zone]]
-* [[Detect Signature]]
-=== Miscellaneous Extractors ===
-These '''''Extractor Types''''' have specialized uses and don't fit in well into the other categories.
-* [[GPT Complete]]
-* [[Query HTML]]
-* [[Read Meta Data]]
-* [[Reference]]
 == Extractor Objects ==
 {{#lst:Object Nomenclature|Extractor Objects}}