Data Extractor (Concept)

From Grooper Wiki
Revision as of 14:10, 16 April 2024 by Randallkinard (Randallkinard moved page Data Extractor to Data Extractor (Concept): new naming convention)

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025 2.90

"Data extractors" are Grooper objects or property configurations used to isolate and return information from text data on a page.

Data extractors (or simply "extractors") are used for a variety of tasks, including (but not limited to):

  • Classifying documents
  • Finding data on a page that you wish to store outside of Grooper
  • Separating documents

Extractors are highly configurable in terms of how data is targeted, how it is ordered and sorted, how text is pre-processed, what tolerance the extractor has for "fuzzy" results, how results are post-processed and more. However, at their core, extractors are simply tools used to parse text data from a larger text source (e.g. a single field from a whole document).
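As a rough illustration of what tolerance for "fuzzy" results means, the sketch below finds a label in OCR text even when a character was misread. It is not Grooper's implementation: the function name fuzzy_find, the min_similarity parameter, and the use of Python's difflib similarity ratio are all assumptions made for this example.

```python
from difflib import SequenceMatcher

def fuzzy_find(text, target, min_similarity=0.8):
    """Return (candidate, score) pairs whose similarity to target meets the threshold."""
    # Pre-processing: split() collapses runs of whitespace and line breaks.
    tokens = text.split()
    n = len(target.split())
    hits = []
    # Slide a window of n words across the text and score each candidate.
    for i in range(len(tokens) - n + 1):
        candidate = " ".join(tokens[i:i + n])
        score = SequenceMatcher(None, candidate.lower(), target.lower()).ratio()
        if score >= min_similarity:
            hits.append((candidate, round(score, 2)))
    return hits

# "lnvoice" simulates an OCR misread of "Invoice" (lowercase L instead of capital I).
ocr_text = "lnvoice   Number: 1001\nInvoice Date: 04/16/2024"
hits = fuzzy_find(ocr_text, "Invoice Number:")
```

Lowering min_similarity accepts more OCR errors but also raises the chance of false positives, which is the trade-off any fuzzy tolerance setting has to balance.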

Extractor Types

Extractors are configured in many places throughout Grooper. Around 100 different configuration properties allow you to configure an extractor to return data from a document. In older versions of Grooper, extraction was limited to simple regular expression pattern matching. As Grooper has evolved, we have developed a number of different mechanisms to extract information from a page or document. We call these different kinds of extractors "Extractor Types".

For any extractor property, you can choose one of the Extractor Types available in Grooper. For example, you might use the List Match extractor to match a state on a document from a list of US states. Or, you might use the Pattern Match extractor to extract dates in various date/time formats.
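The difference between matching from a list and matching a pattern can be sketched conceptually in Python. This is only an analogy using Python's re module, not Grooper's configuration or API. The names list_match, pattern_match, US_STATES and DATE_PATTERN are hypothetical, and the state list is deliberately abbreviated.

```python
import re

# Hypothetical, abbreviated list of values to look for.
US_STATES = ["Oklahoma", "Texas", "Kansas", "New Mexico"]

def list_match(text, entries):
    """Return (offset, value) for every list entry found as a whole word or phrase."""
    results = []
    for entry in entries:
        for m in re.finditer(r"\b" + re.escape(entry) + r"\b", text):
            results.append((m.start(), m.group()))
    return sorted(results)

# One pattern covering two date formats: MM/DD/YYYY and YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\b(?:\d{1,2}/\d{1,2}/\d{4}|\d{4}-\d{2}-\d{2})\b")

def pattern_match(text, pattern):
    """Return (offset, value) for every match of the compiled pattern."""
    return [(m.start(), m.group()) for m in pattern.finditer(text)]

page_text = "Invoice date: 04/16/2024. Ship to: Tulsa, Oklahoma. Due 2024-05-16."
states = list_match(page_text, US_STATES)
dates = pattern_match(page_text, DATE_PATTERN)
```

A list works well when the set of valid values is known in advance, while a pattern suits values that follow a format but cannot be enumerated.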

Currently, there are the following Extractor Types in Grooper:

Text Parsing Extractors

These Extractor Types primarily rely on regular expressions, lists of values (such as a Lexicon of field labels) or other forms of text parsing to return values.

FYI

Please note, regular expressions and other forms of text parsing are the "bread and butter" of how Grooper data extraction works. Other Extractor Types may also utilize regex or other forms of text parsing as part of their configuration. These Extractor Types just rely on it more heavily.

OMR Extractors

These Extractor Types allow you to return values using optical mark recognition. These are useful for extracting values on documents that use checkboxes to detail information.

Barcode Extractors

These Extractor Types allow you to return a value encoded in a barcode.

Zonal Extractors

These Extractor Types extract values from within a logical rectangle drawn somewhere on a document. These are useful for extracting values on highly structured documents where field values are consistently located in the same position on the page for every document.
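The idea of zonal extraction can be sketched as filtering OCR words by position. The example below is a simplified illustration, not how Grooper stores or processes layout data: the Word class, the extract_zone function, the inch-based coordinates, and the rule of testing each word's center point are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float  # left edge (hypothetical units: inches from the page's left side)
    y: float  # top edge (inches from the top of the page)
    w: float  # width
    h: float  # height

def extract_zone(words, left, top, right, bottom):
    """Join the words whose center points fall inside the rectangle, in reading order."""
    inside = [
        wd for wd in words
        if left <= wd.x + wd.w / 2 <= right
        and top <= wd.y + wd.h / 2 <= bottom
    ]
    inside.sort(key=lambda wd: (wd.y, wd.x))
    return " ".join(wd.text for wd in inside)

# Made-up word positions standing in for OCR output from one page.
words = [
    Word("Invoice", 0.5, 0.5, 0.8, 0.2),
    Word("No.", 1.4, 0.5, 0.3, 0.2),
    Word("INV-1001", 6.0, 0.5, 1.0, 0.2),
    Word("Total", 5.0, 9.0, 0.5, 0.2),
]
invoice_number = extract_zone(words, left=5.5, top=0.3, right=7.5, bottom=0.9)
```

Because the rectangle is fixed, this approach only holds up when every document places the value in the same spot, which is why it suits highly structured forms.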

Miscellaneous Extractors

These Extractor Types have specialized uses and don't fit well into the other categories.

Extractor Objects

Extractor objects are created as Grooper nodes in your Project. As objects, these extractors can be referenced by any extractor property, allowing them to be used over and over again by other resources in your Project.

There are three types of extractor objects:

  • Value Reader - This is the most basic extractor object, allowing you to configure a single Extractor Type.
  • Data Type - This extractor object allows you to reference other extractor objects and collate their results using one of Grooper's various Collation Providers.
  • Field Class - This is a special extractor that uses machine learning to return results.
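The relationship between a Value Reader and a Data Type can be pictured as small reusable extractors feeding a parent that merges their results. The sketch below is a conceptual analogy only. The functions value_reader and data_type are hypothetical, and the merge step (deduplicate, then sort by position) is an assumed stand-in that does not correspond to any specific Grooper Collation Provider.

```python
import re
from typing import Callable, List, Tuple

Result = Tuple[int, str]                    # (character offset, matched text)
Extractor = Callable[[str], List[Result]]

def value_reader(pattern: str) -> Extractor:
    """A basic extractor built from a single pattern."""
    compiled = re.compile(pattern)
    return lambda text: [(m.start(), m.group()) for m in compiled.finditer(text)]

def data_type(children: List[Extractor]) -> Extractor:
    """Run each referenced extractor and merge the results in document order."""
    def run(text: str) -> List[Result]:
        merged = set()
        for child in children:
            merged.update(child(text))
        return sorted(merged)
    return run

# Two reusable single-purpose extractors...
slash_dates = value_reader(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
iso_dates = value_reader(r"\b\d{4}-\d{2}-\d{2}\b")
# ...referenced by one parent that collates their results.
any_date = data_type([slash_dates, iso_dates])

results = any_date("Issued 2024-04-16, due 05/16/2024.")
```

Because any_date is itself an extractor, it could be referenced by yet another parent, which mirrors how extractor objects are reused across a Project.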