2.80:Data Type (Node Type)

A sample Data Type extractor in the Node Tree

Data Types are Data Extractors that use regular expressions to match text on a document, then return and collate the results.

The text matched by the pattern or patterns is returned as a list of values. The returned values can be further manipulated, isolated, and adjusted by configuring the properties of the Data Type.

About

Data Type extractors are the main way information is found and used on a document. Say you want to use the form number on the document below to separate a document.

You need a Data Type! The Data Type will find the form number. The Separate activity will use that Data Type to separate this page into a new folder.

Say you want to classify this contract as a "Lease" document type, using the header title "Oil and Gas Lease".

You need a Data Type! The Data Type will find the document heading. You can then set up a Content Model and make a rule where if that Data Type finds that heading, the document gets classified as "Lease" during the Classify activity.

Say you want to grab all the highlighted information from this form.

You need a Data Type! (Technically, you need multiple Data Types, one for each data element.) You can create a Data Model with fields for the "Production Unit Number", the "Gross Volume", the "Taxable Value", and all the other data elements on the page. You then point each field in your Data Model to the Data Type extractor that finds its corresponding value.

How Data Types Locate Data

Data Types return information on a page by using regular expression pattern matching. Regular expressions (or regex) use a standard syntax to match patterns of text in a block of text. For example, the regular expression "ball" would match every occurrence of the characters "ball" in the string of text "ball, football, baseball, 8-ball ball in hand, balloon", including the occurrences inside longer words.

Regex Pattern	Text	Matches
`ball`	ball, football, baseball, 8-ball ball in hand, balloon	6 matches: every occurrence of "ball", including those inside "football", "baseball", "8-ball", and "balloon"

This is a very specific pattern literally matching only the string of characters "ball". Regex patterns also take advantage of a specific syntax to match more general patterns. For example, the "\d" character in regex will match any digit character 0 through 9.
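As a rough illustration outside of Grooper, both patterns can be tried with Python's built-in `re` module. This is only a sketch of general regex behavior. The variable names are made up for the example, and Grooper's own pattern matching may differ in details.

```python
import re

text = "ball, football, baseball, 8-ball ball in hand, balloon"

# A literal pattern matches every occurrence of those exact characters,
# including occurrences inside longer words such as "football".
literal_matches = re.findall(r"ball", text)
print(len(literal_matches))  # 6

# \d matches any single digit character, 0 through 9.
digit_matches = re.findall(r"\d", text)
print(digit_matches)  # ['8']
```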

For more information on regular expression pattern matching, visit the Regular Expression article.


Before regex can match text on a document, you have to extract machine-readable text from the page! Data Type extractors will return no results without any raw text data to match. You must first obtain text from your documents via the Recognize activity. The Recognize activity extracts machine-readable text from images through OCR and extracts native text from digital PDFs.

Once you have extracted text for a document via the Recognize activity, Data Type extractors can use regular expressions to match text in whatever way you deem necessary. The simplest configuration of a Data Type extractor uses a regular expression pattern (written using the "Pattern" property and the Pattern Editor) to match text on a document and return the matches as individual results.

A simple Data Type extractor returning the match from a simple regex pattern on a simple document.
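The idea can be sketched in a few lines of Python. This is not Grooper's implementation: `run_pattern` is a hypothetical helper, and the sample text is invented to stand in for whatever the Recognize activity would produce.

```python
import re

def run_pattern(pattern, page_text):
    """Return each match as an individual result with its position in the text."""
    return [
        {"value": m.group(0), "start": m.start(), "end": m.end()}
        for m in re.finditer(pattern, page_text)
    ]

# Hypothetical text, standing in for the output of a text-recognition step.
recognized_text = "Form No. 1040 filed with Form No. 8862"

results = run_pattern(r"Form No\. \d+", recognized_text)
for result in results:
    print(result["value"], result["start"], result["end"])

# With no extracted text there is nothing to match, so no results come back.
empty_results = run_pattern(r"Form No\. \d+", "")
print(empty_results)  # []
```

Each match is its own result, and a page with no text yields an empty list, which mirrors why text must be obtained before any pattern can return anything.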

Data Types are also much more robust than simple regex pattern matching. While regular expression is a huge part of how Data Types return data from a document, it is only the beginning. Two other concepts are critically important to understanding how Data Types work: Inheritance and Collation.

Inheritance

A Data Type inherits the values returned by any child extractor created under it (as well as any extractor it references). This allows a single extractor to return multiple values using multiple patterns and extractor configurations.

Data Types can have both Data Format and Data Type extractors as children.

For example, the extractor below has two "Data Format" children. One finds the word "HELLO". The other finds the word "WORLD!". Both results are returned by the parent Data Type.
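A toy model in Python may make the inheritance idea more concrete. The `DataFormat` and `DataType` classes below are simplified stand-ins written for this example, not Grooper objects or a Grooper API.

```python
import re
from dataclasses import dataclass, field

@dataclass
class DataFormat:
    """Toy stand-in for a Data Format: one regex pattern, nothing else."""
    pattern: str

    def extract(self, text):
        return re.findall(self.pattern, text)

@dataclass
class DataType:
    """Toy stand-in for a Data Type: an optional own pattern plus child extractors."""
    pattern: str = ""
    children: list = field(default_factory=list)

    def extract(self, text):
        results = re.findall(self.pattern, text) if self.pattern else []
        # The parent inherits every result its children return.
        for child in self.children:
            results.extend(child.extract(text))
        return results

parent = DataType(children=[DataFormat(r"HELLO"), DataFormat(r"WORLD!")])
parent_results = parent.extract("HELLO WORLD!")
print(parent_results)  # ['HELLO', 'WORLD!']
```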

Data Format Extractors

Data Formats are very simple extractors. They are only created as children of Data Type extractors and cannot be created as free-standing objects. They are bitty baby objects that need to hold mommy's hand.

They too use regular expressions to return matches against the raw text data. They are configured using only the Pattern Editor and the properties available to the Pattern Editor.

Data Format extractors are useful for matching the multiple ways in which a piece of data can be formatted. Think about the different ways in which a date can be formatted.

These are all different ways to express the same information.

06/12/1985
June 12, 1985
12 June 1985
1985-06-12
12th day of June 1985

It would be difficult to match each one of these five date formats using a single regular expression. However, it's relatively easy to match each format with five different regex patterns.
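As a sketch of the "one pattern per format" approach, the Python below uses five separate patterns, one for each of the date formats listed above. The patterns are illustrative assumptions and deliberately loose (they do not validate month names or day ranges). They are not patterns taken from Grooper.

```python
import re

# One pattern per date format, analogous to one Data Format per variation.
date_patterns = [
    r"\b\d{2}/\d{2}/\d{4}\b",                                # 06/12/1985
    r"\b[A-Z][a-z]+ \d{1,2}, \d{4}\b",                       # June 12, 1985
    r"\b\d{1,2} [A-Z][a-z]+ \d{4}\b",                        # 12 June 1985
    r"\b\d{4}-\d{2}-\d{2}\b",                                # 1985-06-12
    r"\b\d{1,2}(?:st|nd|rd|th) day of [A-Z][a-z]+ \d{4}\b",  # 12th day of June 1985
]

text = "\n".join([
    "06/12/1985",
    "June 12, 1985",
    "12 June 1985",
    "1985-06-12",
    "12th day of June 1985",
])

# Run every pattern and pool the results, as a parent extractor would.
all_dates = []
for pattern in date_patterns:
    all_dates.extend(re.findall(pattern, text))

print(all_dates)
```

Each pattern stays short and readable, and supporting a sixth format means adding one more pattern rather than reworking a single large expression.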

Data Types as Children of Data Types

Data Type extractors can also be children of other Data Type extractors. Any result the child Data Type returns will be fed to the parent Data Type. This includes the results of the child Data Type's own children! This way, the child Data Type can take advantage of the properties available to Data Type objects that are not available to Data Formats, such as collation (more on collation below).

As shown below, the parent Data Type named "Sample Data Type" has three children: two Data Formats and one Data Type. Every result returned by each of the three child extractors is returned by the parent Data Type.
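The nesting can be sketched as a small recursive function in Python. The tree below is a hypothetical model of "Sample Data Type". The child names and patterns are invented for illustration and the dictionary layout is not how Grooper stores extractors.

```python
import re

def extract(node, text):
    """Collect a node's own matches plus everything its children return."""
    results = re.findall(node["pattern"], text) if node.get("pattern") else []
    for child in node.get("children", []):
        # Recursion passes a grandchild's results up through the child to the parent.
        results.extend(extract(child, text))
    return results

sample_data_type = {
    "name": "Sample Data Type",
    "children": [
        {"name": "Format 1", "pattern": r"HELLO"},   # Data Format
        {"name": "Format 2", "pattern": r"WORLD!"},  # Data Format
        {   # child Data Type with a child of its own
            "name": "Child Data Type",
            "children": [{"name": "Format 3", "pattern": r"\d+"}],
        },
    ],
}

nested_results = extract(sample_data_type, "HELLO WORLD! 42")
print(nested_results)  # ['HELLO', 'WORLD!', '42']
```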

Referenced Extractors

Instead of creating a Data Type as a direct child of another Data Type, you can also reference Data Types in the Node Tree to return their result. Functionally, the parent Data Type uses the reference as if it were a child without changing the child Data Type's location in the Node Tree.

This can be very helpful from an asset management perspective. When a Data Type's results need to be used by multiple different parent Data Types, there's no need to create multiple separate child Data Types for each parent. Instead, a single Data Type can be created as its own object in the Node Tree and all parent Data Types can reference the same object as if it were a child, using the "Referenced Extractors" property.
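The reuse described here can be pictured with a short Python sketch. The `Extractor` class and its `references` argument are assumptions made for this example. They loosely mirror the "Referenced Extractors" property but are not Grooper code.

```python
import re

class Extractor:
    """Toy extractor: own pattern, child extractors, and referenced extractors."""

    def __init__(self, name, pattern=None, children=(), references=()):
        self.name = name
        self.pattern = pattern
        self.children = list(children)
        self.references = list(references)

    def extract(self, text):
        results = re.findall(self.pattern, text) if self.pattern else []
        # Referenced extractors are treated exactly like children.
        for extractor in (*self.children, *self.references):
            results.extend(extractor.extract(text))
        return results

# Defined once, as its own object.
shared_date = Extractor("Date", pattern=r"\d{2}/\d{2}/\d{4}")

# Two different parents reference the same object instead of each owning a copy.
parent_a = Extractor("Parent A", references=[shared_date])
parent_b = Extractor("Parent B", pattern=r"Invoice", references=[shared_date])

text = "Invoice dated 06/12/1985"
results_a = parent_a.extract(text)
results_b = parent_b.extract(text)
print(results_a)  # ['06/12/1985']
print(results_b)  # ['Invoice', '06/12/1985']
```

Because both parents point at the same object, a change to the shared pattern is picked up by every parent that references it.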

Collation

How a Data Type uses the results it matches and inherits is configured in its properties, specifically by the "Collation" property.

Use Cases

The uses for Data Types are innumerable. However, they fall into three main categories.

Document Separation

Document Classification

Populating a Data Model

How To

Create a New Data Type

Create a Child Data Format or Data Type

Reference an Extractor on a Data Type