Data Extractor (Concept)

Revision as of 12:56, 28 December 2023

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

Data Extractors are Grooper objects or property configurations used to isolate and return information from text data on a page.

Data extractors (or simply "extractors") are used to accomplish a variety of tasks, including (but not limited to):

Classify documents
Find data on a page you wish to store outside of Grooper
Separate documents

Extractors are highly configurable in terms of how data is targeted, how it is ordered and sorted, how text is pre-processed, what tolerance the extractor has for "fuzzy" results, how results are post-processed, and more. At their core, however, extractors are simply tools that parse a smaller piece of text out of a larger text source (e.g. a single field from a whole document).

About

Extractor Types

Extractors are configured in many places throughout Grooper. Around 100 different configuration properties allow you to configure an extractor to return data from a document. In older versions of Grooper, extraction was limited to simple regular expression pattern matching. As Grooper has evolved, we have developed a number of different mechanisms to extract information from a page or document. We call these different kinds of extractors "Extractor Types".

For any extractor property, you can choose one of the Extractor Types available in Grooper. For example, you might use the List Match extractor to match a state on a document from a list of US states. Or, you might use the Pattern Match extractor to extract dates of various date/time formats.
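The difference between the two approaches mentioned above can be illustrated with a minimal Python sketch. This is not Grooper's implementation or API: the function names `list_match` and `pattern_match`, the shortened state list, and the date pattern are all hypothetical, chosen only to show list-based matching next to regular expression matching.

```python
import re

# Hypothetical stand-ins for two extraction styles; these are NOT Grooper APIs.
# Shortened list of states, for illustration only.
US_STATES = ["Oklahoma", "Texas", "Kansas", "New Mexico"]

# Assumed date formats for this sketch: M/D/YYYY and YYYY-MM-DD.
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b|\b\d{4}-\d{2}-\d{2}\b")


def list_match(text, entries):
    """Return (position, matched_text) for every list entry found in text."""
    hits = []
    for entry in entries:
        # Escape the entry so it is matched literally, as a whole word.
        entry_pattern = r"\b" + re.escape(entry) + r"\b"
        for m in re.finditer(entry_pattern, text, flags=re.IGNORECASE):
            hits.append((m.start(), m.group(0)))
    return sorted(hits)


def pattern_match(text, pattern):
    """Return (position, matched_text) for every regex match in text."""
    return [(m.start(), m.group(0)) for m in pattern.finditer(text)]


page_text = "Invoice dated 12/28/2023. Ship to: Tulsa, Oklahoma. Due 2024-01-15."
states = list_match(page_text, US_STATES)
dates = pattern_match(page_text, DATE_PATTERN)
```

In this sketch both functions return results in the same shape (position plus matched text), which mirrors the idea that different Extractor Types are interchangeable ways of producing results from the same text.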

Currently, there are the following Extractor Types in Grooper:

Extractor Objects

Extractor objects are created as Grooper nodes in your Project. As objects, these extractors can be referenced by any extractor property, allowing them to be reused by other resources in your Project.

There are three types of extractor objects:

Value Reader - This is the most basic extractor object, allowing you to configure a single Extractor Type.
Data Type - This extractor object allows you to reference other extractor objects and collate their results using one of Grooper's various Collation Providers.
Field Class - This is a special extractor that uses machine learning to return results.
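
The relationship between the first two object types can be sketched conceptually in Python. The classes below are hypothetical models, not Grooper's object model: `ValueReader` wraps a single extraction method (here, one regular expression), and `DataType` references other extractors and combines their results. The collation shown, merging child results and ordering them by position, is an assumption made for illustration and does not represent any specific Collation Provider.

```python
import re


# Hypothetical model; class names mirror the concepts above but are NOT Grooper APIs.
class ValueReader:
    """Wraps a single extraction method (here, one regular expression)."""

    def __init__(self, name, pattern):
        self.name = name
        self.pattern = re.compile(pattern)

    def extract(self, text):
        # Each result records where it was found, what was found, and by whom.
        return [(m.start(), m.group(0), self.name) for m in self.pattern.finditer(text)]


class DataType:
    """References other extractors and collates their results."""

    def __init__(self, name, children):
        self.name = name
        self.children = children

    def extract(self, text):
        # Assumed collation: merge all child results, ordered by position.
        results = []
        for child in self.children:
            results.extend(child.extract(text))
        return sorted(results)


# Two reusable readers, referenced by one collating extractor.
slash_dates = ValueReader("slash_date", r"\b\d{1,2}/\d{1,2}/\d{4}\b")
iso_dates = ValueReader("iso_date", r"\b\d{4}-\d{2}-\d{2}\b")
any_date = DataType("any_date", [iso_dates, slash_dates])

results = any_date.extract("Signed 2023-12-28, recorded 1/5/2024.")
```

Because `DataType` exposes the same `extract` method as its children, one `DataType` could itself be referenced by another, which is the kind of reuse that referencing extractor objects makes possible.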
