2023:Data Type (Node Type)

From Grooper Wiki
Revision as of 11:02, 26 August 2024 by Randallkinard

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

A sample Data Type extractor in the Node Tree

Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

The matching pattern (using the Data Type's Pattern property) or patterns (using child Data Formats or Data Types) return their matches as a list of values. The returned values can be further collated, isolated, and manipulated by configuring the properties of the Data Type. Data Types have a variety of uses in Grooper. Not only are they used to populate Data Fields in a Data Model with extracted text, but they can also be used to separate pages into document folders, classify documents, and more.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

Data Type extractors are the main way information is found and used on a document. Say you want to use the form number on the document below to separate loose pages into documents.

You need a Data Type! The Data Type will find the form number. The Separate activity will use that Data Type to separate this page into a new folder.


Say you want to classify this contract as a "Lease" document type, using the header title "Oil and Gas Lease".

You need a Data Type! The Data Type will find the document heading. You can then set up a Content Model and make a rule where if that Data Type finds that heading, the document gets classified as "Lease" during the Classify activity.


Say you want to grab all the highlighted information from this form.

You need a Data Type! You can create a Data Model with fields for the "Production Unit Number", the "Gross Volume", "Taxable Value", and all the other data elements on the page. (Technically, you need multiple Data Types, one for each data element.) You will then point your Data Model to the Data Type extractors that find their corresponding values.

How Data Types Locate Data

Data Types return information on a page by using regular expression pattern matching. Regular expression (or regex) uses a standard syntax to match patterns of text in a block of text. For example, the regular expression "ball" would match every occurrence of the characters "ball" in the string of text "ball, football, baseball, 8-ball ball in hand, balloon".

Regex Pattern: ball
Text: ball, football, baseball, 8-ball ball in hand, balloon
Matches (in brackets): [ball], foot[ball], base[ball], 8-[ball] [ball] in hand, [ball]oon

This is a very specific pattern literally matching only the string of characters "ball". Regex patterns also take advantage of a specific syntax to match more general patterns. For example, the "\d" character in regex will match any digit character 0 through 9.
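
The two patterns described above can be tried out in a few lines of code. This is only a sketch: it uses Python's built-in re module as a stand-in for Grooper's own regex engine, so syntax details may differ, and the second sample string is made up for illustration.

```python
import re

text = "ball, football, baseball, 8-ball ball in hand, balloon"

# A literal pattern matches the characters "ball" wherever they occur,
# including inside longer words such as "football" and "balloon".
literal_matches = re.findall(r"ball", text)
print(len(literal_matches))  # 6

# \d matches any single digit character, 0 through 9.
# The sample string below is hypothetical.
digit_matches = re.findall(r"\d", "Form 1040, page 2")
print(digit_matches)  # ['1', '0', '4', '0', '2']
```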

For more information on regular expression pattern matching, visit the Regular Expression article.

Before regex can match text on a document, you have to extract machine readable text from the page! Data Type extractors will return no results without any raw text data to match. You must first obtain text from your documents via the Recognize activity. The Recognize activity will extract machine readable text from images through OCR as well as extracting native text from digital PDFs.

Once you have extracted text for a document via the Recognize activity (either through OCR for image based documents or native text extraction from digital PDFs), Data Type extractors can use regular expression to match text in whatever way you deem necessary. The simplest configuration of a Data Type extractor uses a regular expression pattern (written using the "Pattern" property and the Pattern Editor) to match text on a document and return the matches as individual results.


A simple Data Type extractor returning the match from a simple regex pattern on a simple document.


Data Types are also much more robust than simple regex pattern matching. While regular expression is a huge part of how Data Types return data from a document, it is only the beginning. Two other concepts are critically important to understanding how Data Types work: Inheritance and Collation.

Inheritance

A Data Type inherits the values returned by any child extractor created under it (as well as any extractor it references). This allows a single extractor to return multiple values using multiple patterns and extractor configurations.

Data Types can have both Data Format and Data Type extractors as children.



For example, the extractor below has two "Data Format" children. One finds the word "HELLO". The other finds the word "WORLD!". Both results are returned by the parent Data Type.
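
To make the idea of inheritance concrete, here is a minimal sketch in Python. The Extractor class is a hypothetical stand-in invented for this illustration (it is not a Grooper API); it only models the behavior described above, where a parent returns its own matches plus everything its children return.

```python
import re

class Extractor:
    """Hypothetical stand-in for a Data Type or Data Format (not a Grooper API)."""

    def __init__(self, name, pattern=None, children=()):
        self.name = name
        self.pattern = pattern          # optional regex defined on this node
        self.children = list(children)  # child extractors, if any

    def extract(self, text):
        results = []
        # Results from this node's own pattern, if it has one.
        if self.pattern:
            results.extend(m.group() for m in re.finditer(self.pattern, text))
        # Results inherited from every child (and, recursively, their children).
        for child in self.children:
            results.extend(child.extract(text))
        return results

hello = Extractor("Hello Format", pattern=r"HELLO")
world = Extractor("World Format", pattern=r"WORLD!")
parent = Extractor("Sample Data Type", children=[hello, world])

print(parent.extract("HELLO WORLD!"))  # ['HELLO', 'WORLD!']
```

The parent has no pattern of its own here, yet it still returns both results because it collects them from its two children.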

Data Format Extractors

Data Formats are very simple extractors. They are only created as children of Data Type extractors. They cannot be created as a free-standing object. They are bitty baby objects that need to hold mommy's hand.

They too use regular expression to return matches against the raw text data. They are configured using only the Pattern Editor and the properties available to the Pattern Editor.





Data Format extractors are useful for patterning the multiple ways in which a piece of data can be formatted. Think about the different ways in which a date can be formatted. These are all different ways to express the same information.
  • 06/12/1985
  • June 12, 1985
  • 12 June 1985
  • 1985-06-12
  • 12th day of June 1985
It would be difficult to match each one of these five date formats using a single regular expression. However, it's relatively easy to match each format with five different regex patterns.
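
As a rough illustration of the "one pattern per format" approach, the sketch below writes one regex for each of the five date formats listed above and runs them all against a piece of text. The patterns and names are assumptions made for this example, written for Python's re module; they are deliberately simple and are not taken from a Grooper project.

```python
import re

MONTH = (r"(?:January|February|March|April|May|June|July|"
         r"August|September|October|November|December)")

# One pattern per format, playing the role of five child Data Formats.
DATE_FORMATS = {
    "numeric with slashes": r"\b\d{2}/\d{2}/\d{4}\b",
    "month day, year": r"\b" + MONTH + r" \d{1,2}, \d{4}\b",
    "day month year": r"\b\d{1,2} " + MONTH + r" \d{4}\b",
    "ISO style": r"\b\d{4}-\d{2}-\d{2}\b",
    "ordinal day of month": r"\b\d{1,2}(?:st|nd|rd|th) day of " + MONTH + r" \d{4}\b",
}

def find_dates(text):
    """Return (format name, matched text) for every pattern that matches."""
    results = []
    for name, pattern in DATE_FORMATS.items():
        for match in re.finditer(pattern, text):
            results.append((name, match.group()))
    return results

samples = ["06/12/1985", "June 12, 1985", "12 June 1985",
           "1985-06-12", "12th day of June 1985"]
for sample in samples:
    print(find_dates(sample))
```

Each sample is picked up by exactly one of the five patterns, which is the behavior you would want from five separate Data Formats under one parent.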

Data Types as Children of Data Types

Data Type extractors can also be children of other Data Type extractors. Any result the child Data Type returns will be fed to the parent Data Type. This includes the results of the child Data Type's own children! This way, the child Data Type can take advantage of the properties available to Data Type objects that are not available to Data Formats, such as collation (more on collation below).

See below, the parent Data Type named "Sample Data Type" has three children: two Data Formats and one Data Type. Every result each of the three child extractors returns is returned by the parent Data Type.

Referenced Extractors



Referencing an extractor can be very helpful from an asset management perspective. When a Data Type's results need to be used by multiple parent Data Types, there's no need to create a separate child Data Type for each parent. Instead, a single Data Type can be created as its own object in the Node Tree and all parent Data Types can reference the same object as if it were a child, using the "Referenced Extractors" property.

Collation

One of the main benefits of Data Type extractors is their ability to manipulate data through various collation providers. As much as a Data Type is a "data finder", it is also a "data collator".

For example, our "Sample Data Type" extractor has two children, one to find the word "HELLO" on the page and one to find the word "WORLD!". The default collation type is "Individual", which returns the individual results of the parent Data Type and all its children.

But, what if we don't want the two separate words returned but the whole phrase "HELLO WORLD!"? That's where collation providers come into play. For instance, the "Combine" collation provider will combine all results into a single result.

Furthermore, each collation provider has its own set of configurable properties. You can see in the example above the words "HELLO" and "WORLD!" were added together as one string without spaces, yielding the result "HELLOWORLD!" This is because there is no space character at the end of the string "HELLO". We can however add a space, using the "Collation" sub-property "Result Separator". Entering a space character here will insert a space character between each combined result.
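
The effect of the "Result Separator" can be sketched in a couple of lines. The combine function below is a hypothetical helper written for this illustration; it only mimics the behavior described above and is not how Grooper implements the Combine provider.

```python
def combine(results, result_separator=""):
    """Join individual results into one string (hypothetical stand-in for Combine)."""
    return result_separator.join(results)

individual_results = ["HELLO", "WORLD!"]

# With no separator, the results are simply run together.
print(combine(individual_results))                        # HELLOWORLD!

# A single space as the separator is inserted between each combined result.
print(combine(individual_results, result_separator=" "))  # HELLO WORLD!
```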

There are nine different Collation Providers available to Data Types. Each one manipulates or organizes results returned by the Data Type and its children in different ways. For more information on each provider, visit the provider type's full article. Furthermore, data context is critical to understanding your documents and building your Data Types. For more information on this topic, visit the Data Context article.

Use Cases

The total number of uses for Data Types is quite large. Essentially, any time you need to extract text data to store a value or use that value to do something, you need a Data Type. However, most uses fall into three main categories.

Document Separation

Data Types are used by various Separation Providers to determine at what point document folders are created in a Batch.

For example, the Change In Value provider creates a new folder every time an extractor returns a result that is different from the value returned previously. For a Batch of invoices, a Data Type could be created to find the invoice number on a document and use that as the separation point. Every time the Data Type finds a different invoice number on subsequent pages, a new Batch Folder will be created and Batch Pages will be placed into it until a new invoice number is found.
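
The logic of this kind of separation can be sketched as follows. This is a simplified model, not the Change In Value provider itself: the function name, the sample pages, and the invoice number pattern are all made up for illustration, and the choice to keep a page with no invoice number in the current folder is an assumption of this sketch.

```python
import re

def separate_by_change_in_value(pages, pattern):
    """Group page texts into folders, starting a new folder when the value changes."""
    folders = []
    previous_value = None
    for page in pages:
        match = re.search(pattern, page)
        value = match.group(1) if match else None
        # A new folder starts when a value is found and it differs from the last one.
        starts_new_folder = value is not None and value != previous_value
        if starts_new_folder or not folders:
            folders.append([])
        if value is not None:
            previous_value = value
        # Assumption: pages with no value stay in the current folder.
        folders[-1].append(page)
    return folders

pages = [
    "Invoice No: 1001 page 1",
    "Invoice No: 1001 page 2",
    "Invoice No: 1002 page 1",
    "continued, no number on this page",
    "Invoice No: 1003 page 1",
]
folders = separate_by_change_in_value(pages, r"Invoice No: (\d+)")
print([len(folder) for folder in folders])  # [2, 2, 1]
```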

Document Classification

For Lexical classification, text features are used to assign Document Types. In order to locate these features, Data Types are used to return them. These features are trained across sample documents for a particular Document Type and given a weighting using a TF-IDF algorithm. For unclassified documents, features returned by the Data Type are compared to the trained documents according to these weighting values. If the document's features match heavily weighted features of a particular Document Type, it is classified as that Document Type. For example, a Data Type can be configured to locate single words (also called unigrams) on a document.
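
For readers unfamiliar with TF-IDF, the toy sketch below shows the general idea of weighting unigram features and scoring an unclassified document against them. It is not Grooper's implementation: the function names, the training text, and the exact weighting formula (term frequency within a Document Type multiplied by the log of an inverse type frequency) are assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter

def unigrams(text):
    """Return single lowercase words, standing in for a unigram Data Type."""
    return re.findall(r"[a-z]+", text.lower())

def train(samples):
    """samples maps a Document Type name to a list of training texts."""
    counts = {doc_type: Counter(w for text in texts for w in unigrams(text))
              for doc_type, texts in samples.items()}
    weights = {}
    for doc_type, type_counts in counts.items():
        total = sum(type_counts.values())
        weights[doc_type] = {}
        for word, count in type_counts.items():
            types_with_word = sum(1 for c in counts.values() if word in c)
            # Words seen in every type get a weight of zero (log of 1).
            idf = math.log(len(counts) / types_with_word)
            weights[doc_type][word] = (count / total) * idf
    return weights

def classify(text, weights):
    """Pick the Document Type whose weighted features best match the text."""
    features = set(unigrams(text))
    scores = {doc_type: sum(w.get(f, 0.0) for f in features)
              for doc_type, w in weights.items()}
    return max(scores, key=scores.get)

weights = train({
    "Lease": ["oil and gas lease between lessor and lessee"],
    "Invoice": ["invoice number and total amount due"],
})
print(classify("this lease names the lessor", weights))  # Lease
print(classify("amount due on invoice", weights))        # Invoice
```

Note how the word "and" appears in both training samples and so carries no weight, while words unique to one type drive the decision.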

Using a Rules-Based classification method, you can create a "Positive Rule" on the Document Type to classify Batch Folders. Data Types can be referenced as this Positive Rule (using the Document Type's Positive Extractor property). If the Data Type returns a result, the Batch Folder is assigned the Document Type. A simple example would be a Data Type returning a header title for a particular type of document.
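
The rule described above boils down to "if the extractor returns a result, assign the Document Type". Here is a minimal sketch of that logic, using a plain regex in place of a Data Type. The dictionary, function name, and sample text are hypothetical and exist only for this illustration.

```python
import re

# Hypothetical mapping of Document Type name to its positive extractor pattern.
POSITIVE_EXTRACTORS = {
    "Lease": r"oil and gas lease",
}

def classify_by_rule(text, rules):
    """Return the first Document Type whose pattern returns a result, else None."""
    for doc_type, pattern in rules.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            return doc_type
    return None

print(classify_by_rule("OIL AND GAS LEASE\nThis agreement is made...", POSITIVE_EXTRACTORS))  # Lease
print(classify_by_rule("Invoice No: 1001", POSITIVE_EXTRACTORS))  # None
```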

Populating a Data Model

Possibly the most obvious reason you need a data extractor is to locate and extract data from your documents! Data Types can do that too. Once a Data Model is created and Data Fields, Data Sections, and Data Tables are established in the model, Data Types return values from the document for each data element in the model.

How To

Create a New Data Type

Before you begin

There are no "hard" or "absolute" prerequisites to creating a Data Type. You could technically create one the very first time you open Grooper Design Studio after install.

However, you likely will want a Test Batch of documents, with text already obtained via the Recognize activity. This way, you'll be able to verify the Data Type you configure is extracting the right data from the right documents.

Where Will the Data Type Live?



FYI

Most often Data Types will be created and stored within folders in the Local Resources folder. For more information on extractor organization, visit the Asset Management article.


Add the Data Type



FYI

A single Content Model may use hundreds of Data Types depending on the complexity of your project. Furthermore, Data Types are not used for just one thing. A standard asset naming and foldering convention can be very helpful to keep yourself organized and quickly identify your extractors. For more information on this topic, visit the Asset Management article.

Configure the Data Type

Create a Child Data Format or Data Type

Reference an Extractor on a Data Type

Anatomy of a Data Type - Navigating the Configuration Screen and Property Panel

There are four windows to the Data Type's configuration screen.

The Property Panel

The Batch Selector

The Document Viewer

The Results Screen

Glossary

AND: AND is a Collation Provider option for Data Type extractors. AND returns results only when each of its referenced or child extractors gets at least one hit, thus acting as a logical “AND” operator across multiple extractors.

Array: Array is a Collation Provider option for Data Type extractors. Array matches a list of values arranged in horizontal, vertical, or text-flow order, combining instances that qualify into a single result.

Batch Folder: The Batch Folder is an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or Batch Page nodes as children.

  • Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Batch Page: Batch Page nodes represent individual pages within a Batch. Batch Pages are created in one of two ways: (1) When images are scanned into a Batch using the Scan Viewer. (2) Or, when split from a PDF or TIFF file using the Split Pages activity.

  • Batch Pages are frequently referred to simply as "pages".

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

Classify: Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Document Type.

  • Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
  • Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Combine: Combine is a Collation Provider option for Data Type extractors. Combine combines instances from returned results based on a specified grouping, controlling how extractor results are assembled together for output.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Data Context: Data Context refers to contextual information used to extract data, such as a label that identifies the value you want to collect.

Data Extractor: Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Data Field: Data Fields represent a single value targeted for data extraction on a document. Data Fields are created as child nodes of a Data Model and/or Data Sections.

  • Data Fields are frequently referred to simply as "fields".

Data Model: Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) defines data extraction logic and settings for how data is reviewed in a Data Viewer.

Data Section: A Data Section is a container for Data Elements in a Data Model. Data Sections can contain Data Fields, Data Tables, and even other Data Sections as child nodes, adding hierarchy to a Data Model. They serve two main purposes:

  1. They can simply act as organizational buckets for Data Elements in larger Data Models.
  2. By configuring its "Extract Method", a Data Section can subdivide larger and more complex documents into smaller parts to assist in extraction.
    • "Single Instance" sections define a division (or "record") that appears only once on a document.
    • "Multi-Instance" sections define a collection of repeating divisions (or "records").

Data Table: A Data Table is a Data Element specialized in extracting tabular data from documents (i.e. data formatted in rows and columns).

  • The Data Table itself defines the "Table Extract Method". This is configured to determine the logic used to locate and return the table's rows.
  • The table's columns are defined by adding Data Column nodes to the Data Table (as its children).

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

Document Viewer: The Grooper Document Viewer is the portal to your documents. It is the UI that allows you to see a Batch Folder's (or a Batch Page's) image, text content, and more.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Key-Value List: Key-Value List is a Collation Provider option for Data Type extractors. Key-Value List matches instances where a key and a list of one or more values appear together on the document, adhering to a specific layout pattern.

Key-Value Pair: Key-Value Pair is a Collation Provider option for Data Type extractors. Key-Value Pair matches instances where a key is paired with a value on the document in a specific layout. Note: Key-Value Pair is an older technique in Grooper. In most cases, the Labeled Value extractor is preferable to Key-Value Pair collation.

Lexical: "Lexical" is a Classify Method that classifies Batch Folders based on the text content of trained document examples. This is achieved through the statistical analysis of word frequencies that identify Document Types.

Multi-Column: Multi-Column is a Collation Provider option for Data Type extractors. Multi-Column combines multiple columns on a page into a single column for extraction.

Node Tree: The Node Tree is the hierarchical list of Grooper node objects found in the left panel in the Design Page. It is the basis for navigation and creation in the Design Page.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized, in order to be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine readable, encoded text.

Ordered Array: Ordered Array is a Collation Provider option for Data Type extractors. Ordered Array finds sequences of values where one result is present for each extractor, in the order they appear, according to a specified horizontal, vertical or text-flow layout.

Pattern-Based: Pattern-Based is a Collation Provider option for Data Type extractors. Pattern-Based uses regular expressions to sequence returned results into a final result set.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as Content Models, Batch Processes, and profile objects are stored. This makes resources easier to manage and save, and simplifies how node references are made in a Grooper Repository.

Recognize: Recognize is an Activity that obtains machine-readable text from Batch Pages and Batch Folders. When properly configured with an OCR Profile, Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also reference an IP Profile to collect "layout data" like lines, checkboxes, and barcodes. Other Activities then use this machine-readable text and layout data for document analysis and data extraction.

Reference: Reference is a Value Extractor used to reference an Extractor Node. This allows users to create re-usable extractors and use the more complex Data Type and Field Class extractors throughout Grooper.

Regular Expression: Regular Expression (or regex) is a standard syntax designed to parse text strings. This is a way of finding information in text. It is the primary method by which Grooper extracts and returns data from documents.

Rules-Based: "Rules-Based" is a Classify Method that employs "rules" defined on each Document Type to classify Batch Folders. Positive Extractor and Negative Extractor properties are configured for each Document Type to positively or negatively associate a Batch Folder based on predefined criteria.

  • Where the Positive and Negative Extractors will impact all Classify Method results, the Rules-Based method classifies using only these properties and nothing else.

Separate: Separate is an Activity that sorts Batch Pages into individual Batch Folders. This distinguishes "loose pages" from the documents formed by those pages. Once loose pages are separated into Batch Folder documents, they can be further processed by Classify, Extract, Export and other Activities that need to run on the folder (i.e. document) level.

Separation Provider: The Provider property of the Separate Activity defines the type of separation to be performed at the designated Scope.

Separation: Separation is the process of taking an unorganized Batch of loose Batch Pages and organizing them into documents represented by Batch Folders in Grooper. This is done so Grooper can later assign a Document Type to each document folder in a process known as "classification".

Split: Split is a Collation Provider option for Data Type extractors. Split separates a data instance at each match returned by the Data Type. The results are used as anchor points to "split" text into one or more smaller parts.

Test Batch: "Test Batch" is a specialized Import Provider designed to facilitate the import of content from an existing Batch in the test environment. This provider is most commonly used for testing, development, and validation scenarios, and is not intended for production use.

  • Looking for information on "production" vs "test" Batches in Grooper? See here.