Data Type (Object)

From Grooper Wiki

This article is about the current version of Grooper.

Note that some content may still need to be updated.

A sample Data Type extractor in the Node Tree

Data Type objects hold a collection of child, referenced, and locally defined Data Extractors, along with settings that manage how multiple (even differing) matches from those Data Extractors are consolidated (via Collation) into a result set.

The matches for the pattern (set with the Data Type's Pattern property) or patterns (set on child Value Readers or Data Types) are returned as a list of values. The returned values can be further collated, isolated, and manipulated by configuring the properties of the Data Type. Data Types have a variety of uses in Grooper. They are used not only to populate Data Fields in a Data Model with extracted text, but also to separate pages into document folders, classify documents, and more.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2024). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

Data Type extractors are the main way information is found and used on a document. Say you want to use the form number on the document below to separate loose pages into documents.

You need a Data Type! The Data Type will find the form number. The Separate activity will use that Data Type to separate this page into a new folder.


Say you want to classify this contract as a "Lease" document type, using the header title "Oil and Gas Lease".

You need a Data Type! The Data Type will find the document heading. You can then set up a Content Model and make a rule where if that Data Type finds that heading, the document gets classified as "Lease" during the Classify activity.


Say you want to grab all the highlighted information from this form.

You need a Data Type! You can create a Data Model with fields for the "Production Unit Number", the "Gross Volume", the "Taxable Value", and all the other data elements on the page. (Technically, you need multiple Data Types, one for each data element.) You then point each field in your Data Model to the Data Type extractor that finds its corresponding value.

How Data Types Locate Data

Data Types return information on a page by using regular expression pattern matching. Regular expressions (or regex) use a standard syntax to match patterns of characters in a block of text. For example, the regular expression "ball" would match every occurrence of "ball" in the string "ball, football, baseball, 8-ball ball in hand, balloon", including those inside longer words such as "football" and "balloon".

Regex Pattern: ball
Text: ball, football, baseball, 8-ball ball in hand, balloon
Matches: every occurrence of "ball" in the text (six in total), including those inside "football", "baseball", "8-ball", and "balloon"

This is a very specific pattern literally matching only the string of characters "ball". Regex patterns also take advantage of a specific syntax to match more general patterns. For example, the "\d" character in regex will match any digit character 0 through 9.
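The two patterns described above can be tried out with a minimal sketch in Python's `re` module. Python is used here purely for illustration and is an assumption of this example; Grooper's own regex engine and Pattern Editor may differ in details.

```python
import re

text = "ball, football, baseball, 8-ball ball in hand, balloon"

# A literal pattern matches the exact characters "ball" wherever they occur,
# including inside longer words such as "football" and "balloon".
literal_hits = [m.group() for m in re.finditer(r"ball", text)]

# \d matches a single digit character; the only digit in this text is "8".
digit_hits = re.findall(r"\d", text)

print(len(literal_hits), "literal matches")
print("digits found:", digit_hits)
```

The literal pattern is specific to one string of characters, while `\d` describes a whole class of characters. Most practical patterns mix the two.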

For more information on regular expression pattern matching, visit the Regular Expression article.

Before regex can match text on a document, you have to extract machine-readable text from the page! Data Type extractors will return no results without any raw text data to match. You must first obtain text from your documents via the Recognize activity. The Recognize activity will extract machine-readable text from images through OCR as well as extracting native text from digital PDFs.

Once you have extracted text for a document via the Recognize activity (either through OCR for image-based documents or native text extraction from digital PDFs), Data Type extractors can use regular expressions to match text in whatever way you deem necessary. The simplest configuration of a Data Type extractor uses a regular expression pattern (written using the "Pattern" property and the Pattern Editor) to match text on a document and return the matches as individual results.


A simple Data Type extractor returning the match from a simple regex pattern on a simple document.

  1. The Data Type's Local Extractor property is set to Pattern Match and the Value Pattern property is given a regular expression.
  2. The returned result of the regular expression from the Value Pattern property is highlighted in the Document Viewer...
  3. ...and displayed in the Result List view.


Data Types are also much more robust than simple regex pattern matching. While regular expression is a huge part of how Data Types return data from a document, it is only the beginning. Two other concepts are critically important to understanding how Data Types work: Inheritance and Collation.

Inheritance

Data Types inherit the values returned by any child extractors created under them (as well as any extractors they reference). This allows a single Data Type to return multiple values using multiple patterns and extractor configurations.


Data Types can have both Value Reader and Data Type extractors as children.


For example, the extractor below has two Value Reader child extractors. One finds the word "HELLO". The other finds the word "WORLD!". Both results are returned by the parent Data Type.
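The inheritance behavior in this example can be modeled with a short Python sketch. The `ValueReader` and `DataType` classes here are hypothetical stand-ins written for this illustration, not Grooper's actual object model or API.

```python
import re


class ValueReader:
    """Hypothetical stand-in for a Value Reader: one regex pattern."""

    def __init__(self, pattern):
        self.pattern = pattern

    def extract(self, text):
        # Return every match of this reader's pattern.
        return [m.group() for m in re.finditer(self.pattern, text)]


class DataType:
    """Hypothetical stand-in for a parent Data Type."""

    def __init__(self, children):
        self.children = children

    def extract(self, text):
        # The parent "inherits" whatever each child returns.
        results = []
        for child in self.children:
            results.extend(child.extract(text))
        return results


parent = DataType(children=[ValueReader(r"HELLO"), ValueReader(r"WORLD!")])
results = parent.extract("HELLO WORLD!")
print(results)
```

The parent has no pattern of its own in this sketch. Every result it returns comes from one of its two children.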

Value Readers as Children of Data Types

Value Readers are very simple extractors. While they can act as standalone extractor objects, it is very common to use Value Readers as children of parent Data Types. Value Readers can use any extractor type on their Extractor property, but it's very common for them to use text parsing extractor types that rely on regular expressions.

  • Value Readers leveraging text parsing extractor types on their Extractor property (such as Pattern Match) configure their regular expressions via the Value Pattern, Prefix Pattern, Suffix Pattern, and Output Format properties within the Expressions tab.


  • Text parsing extractors can be configured further via the many properties found within the Properties tab.


Value Reader extractors are useful for matching the multiple ways a piece of data can be formatted. Think about the different ways in which a date can be formatted.

These are all different ways to express the same information.

  • 06/12/1985
  • June 12, 1985
  • 12 June 1985
  • 1985-06-12
  • 12th day of June 1985

It would be difficult to match each one of these five date formats using a single regular expression. However, it's relatively easy to match each format with five different regex patterns.

  1. In this example a Data Type has several child Value Reader extractor objects.
  2. Each Value Reader has its Extractor property set to Pattern Match and is configured with a different regular expression to match a different date format. The results are highlighted in the Document Viewer.
  3. And the results are also displayed in the Result List view.
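As a rough illustration of the "one pattern per format" approach, the Python sketch below pairs each of the five date formats listed earlier with its own regular expression. The patterns are assumptions written for this example and are deliberately loose (they do not validate day or month ranges). They are not the patterns used in the screenshots.

```python
import re

MONTH = r"(?:January|February|March|April|May|June|July|August|September|October|November|December)"

# One pattern per date format, in the same order as the list above.
date_patterns = [
    r"\b\d{2}/\d{2}/\d{4}\b",                                   # 06/12/1985
    r"\b" + MONTH + r" \d{1,2}, \d{4}\b",                       # June 12, 1985
    r"\b\d{1,2} " + MONTH + r" \d{4}\b",                        # 12 June 1985
    r"\b\d{4}-\d{2}-\d{2}\b",                                   # 1985-06-12
    r"\b\d{1,2}(?:st|nd|rd|th) day of " + MONTH + r" \d{4}\b",  # 12th day of June 1985
]

samples = [
    "06/12/1985",
    "June 12, 1985",
    "12 June 1985",
    "1985-06-12",
    "12th day of June 1985",
]
text = "\n".join(samples)

# Run every pattern over the text and pool the matches, much as a parent
# Data Type pools the results of its child Value Readers.
matches = [m.group() for p in date_patterns for m in re.finditer(p, text)]
print(matches)
```

Each pattern stays short and readable because it only has to describe one format. Supporting a sixth format means adding a sixth pattern rather than reworking a single large expression.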

Data Types as Children of Data Types

Data Type extractors can also be children of other Data Type extractors. Any result the child Data Type returns will be fed to the parent Data Type. This includes the results of the child Data Type's own children! This way, the child Data Type can take advantage of the properties available to Data Type objects that are not available to Value Readers, such as collation (more on collation below).

See below: the parent Data Type named "Sample Data Type" has three children, two Value Readers and one Data Type. Every result returned by each of the three child extractors is returned by the parent Data Type.

  1. In the below example a parent Data Type has several child extractor objects including Value Readers and a Data Type.
  2. Each extractor is using a text parsing extractor type, like Pattern Match, and their results are displayed in the Document Viewer.
  3. And the results are also displayed in the Result List view.

Referenced Extractors

  1. In the below example a parent Data Type has child Value Readers, but instead of nesting another Data Type as a child object...
  2. ...the Referenced Extractors property is used to point at a Data Type that is not a child object.
  3. The parent Data Type inherits from not only its child Value Reader objects, but also from the extractor (in this case another Data Type) leveraged in the Referenced Extractors property.
  4. Each extractor is using a text parsing extractor type, like Pattern Match, and their results are displayed in the Document Viewer.
  5. And the results are also displayed in the Result List view.


This can be very helpful from an asset management perspective. When a Data Type's results need to be used by multiple different parent Data Types, there's no need to create multiple separate child Data Types for each parent. Instead, a single Data Type can be created as its own object in the Node Tree and all parent Data Types can reference the same object as if it were a child, using the "Referenced Extractors" property.
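The reuse described above can be sketched in Python. The `DataType` class, its `referenced` argument, and the sample patterns are hypothetical and exist only for this illustration. The point is that two parents hold a reference to the same shared extractor instead of each owning a copy.

```python
import re


class DataType:
    """Hypothetical stand-in: an optional local pattern plus references."""

    def __init__(self, name, pattern=None, referenced=()):
        self.name = name
        self.pattern = pattern
        self.referenced = list(referenced)

    def extract(self, text):
        results = re.findall(self.pattern, text) if self.pattern else []
        # Results of referenced extractors are returned as if they were children.
        for extractor in self.referenced:
            results.extend(extractor.extract(text))
        return results


# One shared extractor, defined once...
shared_dates = DataType("Dates", pattern=r"\d{2}/\d{2}/\d{4}")

# ...and referenced by two different parents.
invoice_values = DataType("Invoice Values", pattern=r"INV-\d+", referenced=[shared_dates])
lease_values = DataType("Lease Values", pattern=r"Oil and Gas Lease", referenced=[shared_dates])

invoice_results = invoice_values.extract("INV-1001 dated 06/12/1985")
lease_results = lease_values.extract("Oil and Gas Lease signed 06/12/1985")
print(invoice_results, lease_results)
```

Because both parents point at the same object, a change to the shared date pattern takes effect everywhere it is referenced.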

Collation

One of the main benefits of Data Type extractors is their ability to manipulate data through various collation providers. As much as Data Types are "data finders", they are also "data collators".


For example, our "Sample Data Type" extractor has two children, one to find the word "HELLO" on the page and one to find the word "WORLD!". The default collation type is "Individual", which returns the individual results of the parent Data Type and all its children.


But, what if we don't want the two separate words returned but the whole phrase "HELLO WORLD!"? That's where collation providers come into play. For instance, the "Combine" collation provider will combine all results into a single result.


Furthermore, each collation provider has its own set of configurable properties. You can see in the example above that the words "HELLO" and "WORLD!" were added together as one string without spaces, yielding the result "HELLOWORLD!". This is because there is no space character at the end of the string "HELLO". We can, however, add a space using the "Collation" sub-property "Result Separator". Entering a space character here will insert a space character between each combined result.
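The difference between the two collation behaviors discussed so far can be sketched in a few lines of Python. The function names are hypothetical, and the real Combine provider has more properties than the single separator modeled here.

```python
def collate_individual(results):
    # "Individual": every result is returned on its own.
    return list(results)


def collate_combine(results, result_separator=""):
    # "Combine": all results are joined into a single result.
    # result_separator mirrors the "Result Separator" sub-property.
    return [result_separator.join(results)] if results else []


child_results = ["HELLO", "WORLD!"]

print(collate_individual(child_results))
print(collate_combine(child_results))
print(collate_combine(child_results, result_separator=" "))
```

With an empty separator the two words run together. Supplying a single space as the separator produces the full phrase.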


There are nine different Collation Providers available to Data Types.

Each one manipulates or organizes results returned by the Data Type and its children in different ways. For more information on each provider, visit the provider type's full article.

Furthermore, data context is critical to understanding your documents and building your Data Types. For more information on this topic, visit the Data Context article.

Use Cases

The total number of uses for Data Types is quite large. Essentially, any time you need to extract text data to store a value or use that value to do something, you need a Data Type. However, most uses fall into three main categories.

Document Separation

Data Types are used by various Separation Providers to determine at what point document folders are created in a Batch.

For example, the Change In Value provider creates a new folder every time an extractor returns a result that is different from the value returned previously. For a Batch of invoices, a Data Type could be created to find the invoice number on a document and use that as the separation point. Every time the Data Type finds a different invoice number on subsequent pages, a new Batch Folder will be created and Batch Pages will be placed into it until a new invoice number is found.
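The Change In Value logic described above can be approximated with a short Python sketch. The page texts, the `Invoice No:` label, and the handling of pages with no result (they stay in the current folder) are assumptions made for this example, not a description of Grooper's implementation.

```python
import re

# Hypothetical extractor: finds an invoice number on a page.
INVOICE_NO = re.compile(r"Invoice No: (\d+)")

pages = [
    "Invoice No: 1001 - page 1",
    "Invoice No: 1001 - page 2",
    "Invoice No: 1002 - page 1",
    "Terms and conditions",        # no invoice number on this page
    "Invoice No: 1003 - page 1",
]


def separate(page_texts):
    """Group page indexes into folders, starting a folder when the value changes."""
    folders = []
    previous = None
    for index, text in enumerate(page_texts):
        match = INVOICE_NO.search(text)
        value = match.group(1) if match else None
        if value is not None and value != previous:
            folders.append([])      # a different value starts a new folder
            previous = value
        if not folders:
            folders.append([])      # assumption: leading pages with no value share a folder
        folders[-1].append(index)
    return folders


folders = separate(pages)
print(folders)
```

Pages 0 and 1 share an invoice number and land in one folder. Page 3 has no number, so it stays with the invoice that precedes it.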

Document Classification

For Lexical classification, text features are used to assign Document Types. In order to locate these features, Data Types are used to return them. These features are trained across sample documents for a particular Document Type and given a weighting using a TF-IDF algorithm. For unclassified documents, features returned by the Data Type are compared to the trained documents according to these weighting values. If the document's features match heavily weighted features of a particular Document Type, it is classified as that Document Type. For example, a Data Type can be configured to locate single words (also called unigrams) on a document.
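To give a feel for how TF-IDF weighting favors distinctive features, here is a small Python sketch. It uses one common smoothed TF-IDF formulation, treats each Document Type's training text as a single document, and scores by summing weights. All of these choices, along with the sample training text, are assumptions for illustration and are not Grooper's actual algorithm.

```python
import math
import re
from collections import Counter


def unigrams(text):
    # Feature extractor: single lowercase words.
    return re.findall(r"[a-z]+", text.lower())


# Hypothetical training samples per Document Type.
training = {
    "Lease": ["oil and gas lease between lessor and lessee",
              "lease of mineral rights by lessor"],
    "Invoice": ["invoice number and amount due",
                "invoice total amount payable"],
}

type_counts = {
    name: Counter(word for doc in docs for word in unigrams(doc))
    for name, docs in training.items()
}
n_types = len(type_counts)


def weight(word, name):
    # Term frequency within the Document Type's training text.
    tf = type_counts[name][word] / sum(type_counts[name].values())
    # Smoothed inverse document frequency across Document Types.
    df = sum(1 for counts in type_counts.values() if word in counts)
    idf = math.log((1 + n_types) / (1 + df)) + 1
    return tf * idf


def classify(text):
    words = set(unigrams(text))
    scores = {name: sum(weight(w, name) for w in words) for name in type_counts}
    return max(scores, key=scores.get)


print(classify("gas lease signed by lessor"))
print(classify("amount due on invoice"))
```

Words that appear in only one Document Type's training text, such as "lessor", receive a higher weight than words shared by both, such as "and".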

Using a Rules-Based classification method, you can create a "Positive Rule" on the Document Type to classify Batch Folders. Data Types can be referenced as this Positive Rule (using the Document Type's Positive Extractor property). If the Data Type returns a result, the Batch Folder is assigned the Document Type. A simple example would be a Data Type returning a header title for a particular type of document.
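A positive rule of this kind boils down to "if the extractor returns anything, assign the Document Type". The Python sketch below shows that idea with hypothetical patterns. Returning the first Document Type that matches is a simplification chosen for this example.

```python
import re

# Hypothetical positive extractors, one per Document Type.
positive_extractors = {
    "Lease": r"Oil and Gas Lease",
    "Invoice": r"Invoice No",
}


def classify(text):
    """Return the first Document Type whose positive extractor finds a result."""
    for doc_type, pattern in positive_extractors.items():
        if re.search(pattern, text):
            return doc_type
    return None  # no rule fired, so the document stays unclassified


lease_type = classify("OIL AND GAS LEASE\nOil and Gas Lease made this 12th day of June 1985")
unknown_type = classify("Meeting minutes")
print(lease_type, unknown_type)
```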

Populating a Data Model

Possibly the most obvious reason you need a data extractor is to locate and extract data from your documents! Data Types can do that too. Once a Data Model is created and Data Fields, Data Sections, and Data Tables are established in the model, Data Types return values from the document for each data element in the model.
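Conceptually, populating a Data Model means pairing each data element with an extractor and storing what that extractor returns. The Python sketch below uses the field names from the form example earlier in this article. The sample text, labels, and patterns are hypothetical.

```python
import re

# Hypothetical page text; the labels and values are made up for this example.
sample_text = (
    "Production Unit Number: 123-456789\n"
    "Gross Volume: 1,250\n"
    "Taxable Value: $3,400.00\n"
)

# One extractor pattern per Data Field; the capture group is the field value.
field_extractors = {
    "Production Unit Number": r"Production Unit Number: (\d{3}-\d{6})",
    "Gross Volume": r"Gross Volume: ([\d,]+)",
    "Taxable Value": r"Taxable Value: (\$[\d,]+\.\d{2})",
}


def populate(text):
    """Fill each field with its extractor's first result, or None if nothing is found."""
    record = {}
    for name, pattern in field_extractors.items():
        match = re.search(pattern, text)
        record[name] = match.group(1) if match else None
    return record


record = populate(sample_text)
print(record)
```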

How To

Create a New Data Type

Before you begin

There are no "hard" or "absolute" prerequisites to creating a Data Type. You could technically create one the very first time you open the Grooper Design page after install.

However, you likely will want a Test Batch of documents, with text already obtained via the Recognize activity. This way, you'll be able to verify the Data Type you configure is extracting the right data from the right documents.

Where Will the Data Type Live?

Data Types can be created in one of three different locations within the Node Tree of the Design page.

  1. As direct children of a Project.
  2. Inside any Local Resources folder, or sub-folders of the Local Resources folder, within a Content Model.
  3. Within any folder of the Essentials Project provided with a default Grooper installation.

FYI

Most often Data Types will be created and stored within folders in the Local Resources folder. For more information on extractor organization, visit the Asset Management article.

Add the Data Type

  1. Right-click on an appropriate location and choose "Add > Data Type".


  2. In the "Add" window, give the Data Type an appropriate name in the Name property, then click the "Execute" button.


FYI

A single Content Model may use hundreds of Data Types depending on the complexity of your project. Furthermore, Data Types are not used for just one thing. A standard asset naming and foldering convention can be very helpful to keep yourself organized and quickly identify your extractors. For more information on this topic, visit the Asset Management article.

Configure the Data Type

After the Data Type is created, its configuration depends entirely on what type of data you are attempting to capture with it. However, the most important and most commonly configured settings are:

  • Local Extractor property: This is the primary extractor type set on the Data Type. The results returned from the configuration of this property are returned before child extractors or Referenced Extractors. Please see the Extractor Type article for more information on extractor types.
  • Child extractors: You may wish to add extractor objects, such as other Data Types or Value Readers, as child objects to a Data Type. To do so, right-click the Data Type, choose "Add", and choose an appropriate extractor object to add. The configuration of child extractor objects will depend on the type chosen and the data they seek to capture. The results from child extractors of a Data Type are returned after the results from the Data Type's Local Extractor property, but before those of Referenced Extractors.
  • Referenced Extractors property: If you want the results of other extractor objects to be returned by a Data Type, but you do not want them to be immediate children of the Data Type, you can use the Referenced Extractors property to point a Data Type at one or more extractor objects. Clicking the ellipsis button for this property will open a collection editor allowing you to select multiple extractor objects. The results from Referenced Extractors of a Data Type are returned after the results from the Data Type's Local Extractor property, as well as after the results of child extractor objects of the Data Type.
  • Collation property: One of the more powerful aspects of using a Data Type is how it can collate returned results. This property can be set to several different styles of Collation Providers, and those different providers will have their own configuration as well. Please see the Collation Provider article for more information.
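The result ordering described in the list above (Local Extractor first, then child extractors, then Referenced Extractors) can be sketched in Python. The classes and argument names are hypothetical stand-ins for this illustration, and collation is left out.

```python
import re


class PatternExtractor:
    """Hypothetical stand-in for a child or referenced extractor."""

    def __init__(self, pattern):
        self.pattern = pattern

    def extract(self, text):
        return re.findall(self.pattern, text)


class DataType:
    """Hypothetical stand-in for a Data Type with all three result sources."""

    def __init__(self, local_pattern=None, children=(), referenced=()):
        self.local_pattern = local_pattern
        self.children = list(children)
        self.referenced = list(referenced)

    def extract(self, text):
        results = []
        if self.local_pattern:                  # 1. Local Extractor results
            results.extend(re.findall(self.local_pattern, text))
        for child in self.children:             # 2. child extractor results
            results.extend(child.extract(text))
        for extractor in self.referenced:       # 3. Referenced Extractors results
            results.extend(extractor.extract(text))
        return results


data_type = DataType(
    local_pattern=r"\d+",
    children=[PatternExtractor(r"HELLO")],
    referenced=[PatternExtractor(r"WORLD!")],
)
ordered_results = data_type.extract("HELLO 42 WORLD!")
print(ordered_results)
```

In this sketch the results are ordered by their source rather than by their position on the page, so the local match "42" comes first even though "HELLO" appears earlier in the text.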