2.80:Data Type (Node Type)

From Grooper Wiki
A sample Data Type extractor in the Node Tree

Data Types are Data Extractors that use regular expressions to match text on a document, then return and collate the results.

The pattern or patterns return their matches as a list of values. Those values can be further manipulated, isolated, and adjusted by configuring the properties of the Data Type.

About

Data Type extractors are the main way information is found and used on a document. Say you want to use the form number on the document below to separate a document.

You need a Data Type! The Data Type will find the form number. The Separate activity will use that Data Type to separate this page into a new folder.


Say you want to classify this contract as a "Lease" document type, using the header title "Oil and Gas Lease".

You need a Data Type! The Data Type will find the document heading. You can then set up a Content Model and make a rule where if that Data Type finds that heading, the document gets classified as "Lease" during the Classify activity.


Say you want to grab all the highlighted information from this form.

You need a Data Type! (Technically, you need multiple Data Types, one for each data element.) You can create a Data Model with fields for the "Production Unit Number", the "Gross Volume", the "Taxable Value", and all the other data elements on the page. You then point your Data Model to the Data Type extractors that find the corresponding values.

How Data Types Locate Data

Data Types return information on a page by using regular expression pattern matching. Regular expressions (or regex) use a standard syntax to match patterns within a block of text. For example, the regular expression "ball" would match every occurrence of "ball" in the string "ball, football, baseball, 8-ball ball in hand, balloon", including the ones inside longer words such as "football" and "balloon".

Regex Pattern | Text | Matches
ball | ball, football, baseball, 8-ball ball in hand, balloon | ball, football, baseball, 8-ball ball in hand, balloon
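The literal match described above can be reproduced outside Grooper with a short sketch. This uses Python's `re` module purely for illustration; it is not Grooper's own engine, and the variable names are made up for the example.

```python
import re

# The sample string from the article.
text = "ball, football, baseball, 8-ball ball in hand, balloon"

# A literal pattern matches the exact character sequence wherever it occurs,
# including inside longer words such as "football" and "balloon".
matches = [(m.start(), m.group()) for m in re.finditer(r"ball", text)]

for start, value in matches:
    print(f"found {value!r} at character offset {start}")

print(len(matches))  # 6
```

Every one of the six hits is the same four characters; only the position differs. That is why a purely literal pattern is described as very specific.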

This is a very specific pattern, literally matching only the string of characters "ball". Regex patterns can also use special syntax to match more general patterns. For example, the "\d" token in regex will match any single digit character, 0 through 9.
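As a rough illustration of a more general pattern, the sketch below applies "\d" to a made-up line of text using Python's `re` module (an assumption for demonstration, not Grooper's engine). The `re.ASCII` flag restricts "\d" to the digits 0 through 9, matching the description above. The "\d+" variant, which matches a run of one or more digits, is included only for contrast.

```python
import re

# Hypothetical sample text, not taken from any real document.
sample = "Form 1040, page 2 of 12"

# "\d" matches one digit character at a time.
single_digits = re.findall(r"\d", sample, flags=re.ASCII)

# "\d+" matches each unbroken run of digits as one value.
digit_runs = re.findall(r"\d+", sample, flags=re.ASCII)

print(single_digits)  # ['1', '0', '4', '0', '2', '1', '2']
print(digit_runs)     # ['1040', '2', '12']
```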

For more information on regular expression pattern matching, visit the Regular Expression article.

Before regex can match text on a document, you have to extract machine-readable text from the page! Data Type extractors will return no results without any raw text data to match. You must first obtain text from your documents via the Recognize activity. The Recognize activity extracts machine-readable text from images through OCR and native text from digital PDFs.

Once you have extracted text for a document via the Recognize activity (either through OCR for image-based documents or native text extraction from digital PDFs), Data Type extractors can use regular expressions to match text in whatever way you deem necessary. The simplest configuration of a Data Type extractor uses a regular expression pattern (written using the "Pattern" property and the Pattern Editor) to match text on a document and return the matches as individual results.


A simple Data Type extractor returning the match from a simple regex pattern on a simple document.
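The simplest configuration described above, one pattern returning each match as an individual result, can be sketched in a few lines of Python. Everything here is hypothetical: `Result`, `run_pattern`, the sample page text, and the form number format are invented for illustration and do not reflect Grooper's internals.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    """One match: the matched text and where it starts in the page text."""
    value: str
    start: int


def run_pattern(pattern: str, recognized_text: str) -> List[Result]:
    """Return every match of `pattern` in the text as an individual result."""
    return [Result(m.group(), m.start())
            for m in re.finditer(pattern, recognized_text)]


# Hypothetical text, standing in for what a recognition step might produce.
page_text = "OIL AND GAS LEASE\nForm No. 88-1234\nLessor: Jane Doe"

# Assumed form number shape: two digits, a hyphen, four digits.
results = run_pattern(r"\d{2}-\d{4}", page_text)
print(results)

# With no recognized text there is nothing to match, so no results come back.
no_text_results = run_pattern(r"\d{2}-\d{4}", "")
print(no_text_results)  # []
```

The empty-text call mirrors the point made earlier: without machine-readable text on the page, a pattern has nothing to match.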


Data Types are also much more robust than simple regex pattern matching. While regular expressions are a huge part of how Data Types return data from a document, they are only the beginning. Two other concepts are critically important to understanding how Data Types work: Inheritance and Collation.

Inheritance

Data Types inherit the values returned by any child extractors created under them (as well as any extractors they reference). This allows a single extractor to return multiple values using multiple patterns and extractor configurations.

Data Types can have both Data Format and Data Type extractors as children.
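The idea of a parent returning its children's results can be sketched as a small tree in Python. This is a conceptual model only: `SketchExtractor`, its fields, and the date patterns are assumptions made for the example, and referenced extractors are left out for brevity.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchExtractor:
    """Hypothetical extractor node: an optional pattern plus child extractors."""
    name: str
    pattern: str = ""  # empty string means this node has no pattern of its own
    children: List["SketchExtractor"] = field(default_factory=list)

    def extract(self, text: str) -> List[str]:
        # Start with this node's own matches, if it has a pattern.
        results = ([m.group() for m in re.finditer(self.pattern, text)]
                   if self.pattern else [])
        # Then inherit whatever each child returns.
        for child in self.children:
            results.extend(child.extract(text))
        return results


# A parent with no pattern of its own and two children, one per date layout.
dates = SketchExtractor(
    name="Dates",
    children=[
        SketchExtractor("Slash Date", r"\d{2}/\d{2}/\d{4}"),
        SketchExtractor("ISO Date", r"\d{4}-\d{2}-\d{2}"),
    ],
)

found = dates.extract("Signed 03/15/2021, recorded 2021-04-02")
print(found)  # ['03/15/2021', '2021-04-02']
```

In this sketch the parent simply concatenates results in child order. How a real Data Type combines inherited results depends on its configuration.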

Data Formats are very simple extractors using only

Collation

How the Data Type uses those results is configured in its properties, specifically the "Collation" property.

Use Cases

The uses for Data Types are innumerable. However, they fall into three main categories.

Document Separation

Document Classification

Populating a Data Model

How To

Create a New Data Type