Data Type

From Grooper Wiki
A sample Data Type extractor in the Node Tree

Data Types are Data Extractors that use regular expression to match text on a document, returning and collating the results.

The matches for the pattern or patterns are returned as a list of values. The returned values can be further collated, isolated, and manipulated by configuring the properties of the Data Type. Data Types have a variety of uses in Grooper. Not only are they used to extract text to populate Data Fields in a Data Model, but they can also be used to separate pages into document folders, classify documents, and more.

About

Data Type extractors are the main way information is found and used on a document. Say you want to use the form number on the document below to separate loose pages into documents.

Data type 2.png
You need a Data Type! The Data Type will find the form number. The Separate activity will use that Data Type to separate this page into a new folder.


Say you want to classify this contract as a "Lease" document type, using the header title "Oil and Gas Lease".

Data type 3.png
You need a Data Type! The Data Type will find the document heading. You can then set up a Content Model and make a rule where if that Data Type finds that heading, the document gets classified as "Lease" during the Classify activity.


Say you want to grab all the highlighted information from this form.

Data type 4.png
You need a Data Type! You can create a Data Model with fields for the "Production Unit Number", the "Gross Volume", the "Taxable Value", and all the other data elements on the page. (Technically, you need multiple Data Types, one for each data element.) You then point your Data Model to the Data Type extractors that find the corresponding values.

How Data Types Locate Data

Data Types return information on a page by using regular expression pattern matching. Regular expression (or regex) uses a standard syntax to match patterns of text in a block of text. For example, the regular expression "ball" would match every occurrence of the characters "ball" in the string of text "ball, football, baseball, 8-ball ball in hand, balloon", including those inside longer words such as "football" and "balloon".

Regex Pattern: ball
Text: ball, football, baseball, 8-ball ball in hand, balloon
Matches: all six occurrences of "ball" (in ball, football, baseball, 8-ball, ball, and balloon)

This is a very specific pattern literally matching only the string of characters "ball". Regex patterns also take advantage of a specific syntax to match more general patterns. For example, the "\d" character in regex will match any digit character 0 through 9.
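As a minimal sketch of these two patterns, the snippet below uses Python's standard "re" module rather than Grooper's Pattern Editor. That substitution is an assumption made purely for illustration; the variable names are hypothetical.

```python
import re

text = "ball, football, baseball, 8-ball ball in hand, balloon"

# The literal pattern matches every occurrence of the characters "ball",
# including occurrences inside longer words such as "football".
literal_matches = re.findall(r"ball", text)

# \d matches any single digit character, 0 through 9.
digit_matches = re.findall(r"\d", text)

print(len(literal_matches))  # 6
print(digit_matches)         # ['8']
```

The literal pattern finds six matches because it has no notion of word boundaries, while "\d" finds only the "8" in "8-ball".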

For more information on regular expression pattern matching, visit the Regular Expression article.

! Before regex can match text on a document, you have to extract machine readable text from the page! Data Type extractors will return no results without any raw text data to match. You must first obtain text from your documents via the Recognize activity. The Recognize activity will extract machine readable text from images through OCR as well as extracting native text from digital PDFs.

Once you have extracted text for a document via the Recognize activity (either through OCR for image based documents or native text extraction from digital PDFs), Data Type extractors can use regular expression to match text in whatever way you deem necessary. The simplest configuration of a Data Type extractor uses a regular expression pattern (written using the "Pattern" property and the Pattern Editor) to match text on a document and return the matches as individual results.


A simple Data Type extractor returning the match from a simple regex pattern on a simple document.

Data type 5.png


Data Types are also much more robust than simple regex pattern matching. While regular expression is a huge part of how Data Types return data from a document, it is only the beginning. Two other concepts are critically important to understanding how Data Types work: Inheritance and Collation.

Inheritance

A Data Type inherits the values returned by any child extractor created under it (as well as any extractor it references). This allows a single extractor to return multiple values using multiple patterns and extractor configurations.

Data Types can have both Data Format and Data Type extractors as children.

Data type 6.png

For example, the extractor below has two "Data Format" children. One finds the word "HELLO". The other finds the word "WORLD!". Both results are returned by the parent Data Type.
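The idea of a parent returning its children's results can be sketched in Python. This is only a conceptual model, not Grooper's implementation: the "DataFormat" and "DataType" classes and their "extract" method are hypothetical stand-ins invented for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DataFormat:
    """Hypothetical stand-in for a Data Format: a single regex pattern."""
    pattern: str

    def extract(self, text: str) -> list:
        return [m.group(0) for m in re.finditer(self.pattern, text)]


@dataclass
class DataType:
    """Hypothetical stand-in for a Data Type: returns its children's results."""
    children: list = field(default_factory=list)

    def extract(self, text: str) -> list:
        results = []
        for child in self.children:
            # A child may be a DataFormat or another DataType.
            results.extend(child.extract(text))
        return results


parent = DataType(children=[DataFormat(r"HELLO"), DataFormat(r"WORLD!")])
results = parent.extract("HELLO WORLD!")
print(results)  # ['HELLO', 'WORLD!']
```

Because both classes expose the same "extract" method in this sketch, a DataType can also be placed in another DataType's list of children, and its results flow up to the parent in the same way.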


Data type 7.png


Data Format Extractors

Data Formats are very simple extractors. They can only be created as children of Data Type extractors, not as free-standing objects. They are bitty baby objects that need to hold mommy's hand.

They too use regular expression to return matches against the raw text data. They are configured using only the Pattern Editor and the properties available to the Pattern Editor.


Data type 8.png


Data type 9.png


Data Format extractors are useful for matching the multiple ways a piece of data can be formatted. Think about the different ways in which a date can be formatted.

These are all different ways to express the same information.

  • 06/12/1985
  • June 12, 1985
  • 12 June 1985
  • 1985-06-12
  • 12th day of June 1985

It would be difficult to match all five of these date formats using a single regular expression. However, it's relatively easy to match them with five different regex patterns, one per format.
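To make the "one pattern per format" idea concrete, here is a rough sketch using Python's "re" module. The five patterns are simplified illustrations written for this example (they are not taken from a Grooper project), and the format labels and the "matching_formats" helper are hypothetical.

```python
import re

# One simplified, illustrative pattern per date format.
DATE_PATTERNS = {
    "MM/DD/YYYY": r"\d{2}/\d{2}/\d{4}",
    "Month D, YYYY": r"[A-Z][a-z]+ \d{1,2}, \d{4}",
    "D Month YYYY": r"\d{1,2} [A-Z][a-z]+ \d{4}",
    "YYYY-MM-DD": r"\d{4}-\d{2}-\d{2}",
    "Dth day of Month YYYY": r"\d{1,2}(?:st|nd|rd|th) day of [A-Z][a-z]+ \d{4}",
}

samples = [
    "06/12/1985",
    "June 12, 1985",
    "12 June 1985",
    "1985-06-12",
    "12th day of June 1985",
]


def matching_formats(value):
    """Return the names of every pattern that matches the whole value."""
    return [
        name
        for name, pattern in DATE_PATTERNS.items()
        if re.fullmatch(pattern, value)
    ]


for sample in samples:
    print(sample, "->", matching_formats(sample))
```

Each sample is matched by exactly one of the five patterns. In Grooper terms, each pattern would correspond to its own Data Format child under a single date Data Type.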


Data type 10.png


FYI The "Pattern" property on a Data Type also brings up the Pattern Editor to enter a regex pattern. It's the same Pattern Editor with the same "Properties" panel as Data Formats. In fact, you could consider the "Pattern" property as the "primary" Data Format. If you created a Data Format as the first child extractor, with the same regex pattern and properties set in the "Pattern" property, it would behave exactly as if it were created using the "Pattern" property.

Data Types as Children of Data Types

Data Type extractors can also be children of other Data Type extractors. Any result the child Data Type returns will be fed to the parent Data Type. This includes the results of the child Data Type's own children! This way, the child Data Type can take advantage of the properties available to Data Type objects that are not available to Data Formats, such as collation (more on collation below).

See below: the parent Data Type named "Sample Data Type" has three children, two Data Formats and one Data Type. Every result that each of the three child extractors returns is returned by the parent Data Type.


Data type 11.png


Referenced Extractors

Instead of creating a Data Type as a direct child of another Data Type, you can also reference Data Types in the Node Tree to return their results. Functionally, the parent Data Type uses the reference as if it were a child, without changing the referenced Data Type's location in the Node Tree.


Data type 12.png


This can be very helpful from an asset management perspective. When a Data Type's results need to be used by multiple different parent Data Types, there's no need to create multiple separate child Data Types for each parent. Instead, a single Data Type can be created as its own object in the Node Tree and all parent Data Types can reference the same object as if it were a child, using the "Referenced Extractors" property.

Collation

One of the main benefits of Data Type extractors is their ability to manipulate data through various collation providers. As much as a Data Type is a "data finder", it is also a "data collator".

For example, our "Sample Data Type" extractor has two children, one to find the word "HELLO" on the page and one to find the word "WORLD!". The default collation type is "Individual", which returns the individual results of the parent Data Type and all its children.


Data type 13.png


But, what if we don't want the two separate words returned but the whole phrase "HELLO WORLD!"? That's where collation providers come into play. For instance, the "Combine" collation provider will combine all results into a single result.


Data type 14.png


Furthermore, each collation provider has its own set of configurable properties. You can see in the example above that the words "HELLO" and "WORLD!" were added together as one string without spaces, yielding the result "HELLOWORLD!". This is because there is no space character at the end of the string "HELLO". We can, however, add a space using the "Collation" sub-property "Result Separator". Entering a space character here will insert a space character between each combined result.
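Conceptually, this behaves like joining a list of strings with a configurable separator. The sketch below models that idea in Python. The "combine" function and its "result_separator" parameter are hypothetical names that mirror the property described above; they are not Grooper's actual API.

```python
def combine(results, result_separator=""):
    """Join all results into a single string, in the order they were returned."""
    return result_separator.join(results)


child_results = ["HELLO", "WORLD!"]

print(combine(child_results))       # HELLOWORLD!
print(combine(child_results, " "))  # HELLO WORLD!
```

With an empty separator the two results run together. Passing a single space as the separator yields the phrase with a space between the words.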


Data type 15.png


There are nine different Collation Providers available to Data Types. Each one manipulates or organizes the results returned by the Data Type and its children in a different way. For more information on each provider, visit the provider type's full article.

Use Cases

The total number of uses for Data Types is quite large. Essentially, any time you need to extract text data to store a value or use that value to do something, you need a Data Type. However, most uses fall into three main categories.

Document Separation

Data Types are used by various Separation Providers to determine at what point document folders are created in a Batch.

For example, the Change In Value provider creates a new folder every time an extractor returns a result that is different from the value returned previously. For a Batch of invoices, a Data Type could be created to find the invoice number on a document and use that as the separation point. Every time the Data Type finds a different invoice number on subsequent pages, a new Batch Folder will be created and Batch Pages will be placed into it until a new invoice number is found.
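The logic of separating on a change in value can be sketched as follows. This is an illustrative model only, not the Change In Value provider's actual code. The "separate_on_change" function is hypothetical, and the choice to keep pages with no extractor result in the current folder is an assumption of this sketch.

```python
def separate_on_change(page_values):
    """Group page indexes into folders, starting a new folder on each new value.

    page_values holds the extractor result for each page, or None when the
    extractor found nothing on that page (assumed to stay in the current folder).
    """
    folders = []
    previous = None
    for index, value in enumerate(page_values):
        is_new_value = value is not None and value != previous
        if is_new_value or not folders:
            folders.append([])  # start a new folder
        if value is not None:
            previous = value
        folders[-1].append(index)
    return folders


# Invoice numbers found on six loose pages (None = no result on that page).
pages = ["INV-001", None, "INV-001", "INV-002", None, "INV-003"]
print(separate_on_change(pages))  # [[0, 1, 2], [3, 4], [5]]
```

Pages 0 through 2 land in the first folder because the invoice number never changes, and a new folder starts at page 3 and again at page 5 when a different number is found.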

Document Classification

For Lexical classification, text features are used to assign Document Types, and Data Types are used to locate and return those features. For example, a Data Type can be configured to locate single words (also called unigrams) on a document. The features are trained across sample documents for a particular Document Type and weighted using a TF-IDF algorithm. For unclassified documents, the features returned by the Data Type are compared to the trained documents according to these weights. If the document's features match heavily weighted features of a particular Document Type, it is classified as that Document Type.
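For readers unfamiliar with TF-IDF, the sketch below shows one common textbook variant (term frequency multiplied by the log of inverse document frequency) applied to unigram features. It is not Grooper's implementation, whose exact formula is not described here. The function names and the two sample texts are made up for illustration.

```python
import math
import re
from collections import Counter


def unigrams(text):
    """Return the single-word features of a text, lowercased."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [unigrams(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


samples = [
    "oil and gas lease between lessor and lessee",  # a "Lease" sample
    "invoice number and amount due",                # an "Invoice" sample
]
lease_weights, invoice_weights = tf_idf(samples)

print(lease_weights["and"])        # 0.0, because "and" appears in every sample
print(lease_weights["lease"] > 0)  # True, because "lease" is distinctive
```

Words that appear in every sample receive no weight, while words that appear in only one Document Type's samples receive a positive weight and therefore count more toward classification.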

Using a Rules-Based classification method, you can create a "Positive Rule" on the Document Type to classify Batch Folders. Data Types can be referenced as this Positive Rule (using the Document Type's Positive Extractor property). If the Data Type returns a result, the Batch Folder is assigned the Document Type. A simple example would be a Data Type returning a header title for a particular type of document.
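The positive-rule idea reduces to "if the extractor returns a result, assign the Document Type". A minimal Python sketch of that logic follows. The "POSITIVE_RULES" mapping and "classify" function are hypothetical and simply stand in for Document Types and their Positive Extractor settings.

```python
import re

# Hypothetical mapping of Document Type name -> positive-rule regex pattern.
POSITIVE_RULES = {
    "Lease": r"Oil and Gas Lease",
}


def classify(text):
    """Return the first Document Type whose pattern returns a result, else None."""
    for doc_type, pattern in POSITIVE_RULES.items():
        if re.search(pattern, text):
            return doc_type
    return None


print(classify("Oil and Gas Lease\nThis agreement is made between..."))  # Lease
print(classify("Invoice Number: 12345"))                                # None
```

A document whose text contains the header title is classified as "Lease". A document with no match is left unclassified in this sketch.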

Populating a Data Model

Possibly the most obvious reason you need a data extractor is to locate and extract data from your documents! Data Types can do that too. Once a Data Model is created and Data Fields, Data Sections, and Data Tables are established in the model, Data Types return values from the document for each data element in the model.

How To

Create a New Data Type

Before you begin

There are no "hard" or "absolute" prerequisites to creating a Data Type. You could technically create one the very first time you open Grooper Design Studio after install.

However, you likely will want a Test Batch of documents, with text already obtained via the Recognize activity. This way, you'll be able to verify the Data Type you configure is extracting the right data from the right documents.

Where Will the Data Type Live?

Data Types can be created in one of three locations:

Create data type.png

1. In any "Local Resources" folder (or subfolder) in a Content Model.

Example Path: Root/Content Models/Content Model Name/(local resources)/


2. As a child object of a Data Field in a Data Model (or as a child of a Data Column)

Example Path: Root/Content Models/Content Model Name/(data model)/Data Field Name/


3. The "Data Types" folder (or subfolder) in the "Data Extraction" folder in the Node Tree.

Path: Root/Data Extraction/Data Types/

FYI Most often Data Types will be created and stored within folders in the Local Resources folder. For more information on extractor organization, visit the Asset Management article.


Add the Data Type

Locate where you want to create the Data Type. For this example, we will add the Data Type to the Local Resources folder of a Content Model (named "Demo Model").

Right click the Local Resources folder. Hover over "Add" and select "Data Type..."


Create data type 2.png


The following window will pop up. Give your Data Type a descriptive name. Press the "OK" button when finished.


Create data type 3.png


FYI A single Content Model may use hundreds of Data Types depending on the complexity of your project. Furthermore, Data Types are not used for just one thing. A standard asset naming and foldering convention can be very helpful to keep yourself organized and quickly identify your extractors. For more information on this topic, visit the Asset Management article.

Configure the Data Type

After you name the Data Type, it will be created as a child of whatever object you added it to. Below is the configuration screen for a blank Data Type.


Create data type 4.png

Create a Child Data Format or Data Type

Reference an Extractor on a Data Type

Anatomy of a Data Type - Navigating the Configuration Screen and Property Panel

There are four windows to the Data Type's configuration screen.

The Property Panel

The property panel contains all the editable properties for a Data Type object. This includes a "Pattern" property to set a regex pattern via the Pattern Editor, the "Collation" property to set the Data Type's collation method, and the "Referenced Extractors" property to reference the result of other Data Types discussed earlier. Tab through the remaining tabs for more information about each property.


Data type anat 1.png


The Batch Selector

Here, you can select documents from a Test Batch, using a drop down list. These will be any batches in the Root Node/Batch Processing/Batches/Test/ folder. Use the documents in a test batch to verify the Data Type returns accurate results.


Data type anat 2.png


The Document Viewer

Here, you view the document currently selected in the Batch Selector. The "Image View" tab shows the document's (or page's) image. There are magnification and selection tool icons at the top of the screen. The "Text View" tab shows the text flow of extracted text from the Recognize activity. Results the Data Type returns will be highlighted on the page in green.


Data type anat 3.png


The Results Screen

All results the Data Type returns show up in the "Results Screen".


Data type anat 4.png