2023:Collation Provider (Property)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


STUB

This article is a stub. It contains minimal information on the topic and should be expanded.

The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Results can be combined, organized into arrays, returned as a key-value pair's value, and more.

The following Collation Providers are available in Grooper:

  • Individual
  • Combine
  • AND
  • Key-Value Pair
  • Key-Value List
  • Array
  • Ordered Array
  • Split
  • Pattern-Based
  • Multi-Column

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

Glossary

Activity: Grooper Activities define specific document processing operations performed on a Batch, Batch Folder, or Batch Page. In a Batch Process, each Batch Process Step executes a single Activity (determined by the step's "Activity" property).

  • Batch Process Steps are frequently referred to by the name of their configured Activity followed by the word "step". For example: "Classify step".

AND: AND is a Collation Provider option for Data Type extractors. AND returns results only when each of its referenced or child extractors gets at least one hit, thus acting as a logical “AND” operator across multiple extractors.

Array: Array is a Collation Provider option for Data Type extractors. Array matches a list of values arranged in horizontal, vertical, or text-flow order, combining instances that qualify into a single result.

Batch Folder: The Batch Folder is an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or Batch Page nodes as children.

  • Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Batch Page: Batch Page nodes represent individual pages within a Batch. Batch Pages are created in one of two ways: (1) when images are scanned into a Batch using the Scan Viewer, or (2) when pages are split from a PDF or TIFF file using the Split Pages activity.

  • Batch Pages are frequently referred to simply as "pages".

Batch Process: Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the set of step-by-step processing instructions given to a Batch. Each step consists of a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps, which are added as child nodes.
  • A Batch Process is often referred to as simply a "process".

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

Classify: Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Document Type.

  • Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
  • Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Combine: Combine is a Collation Provider option for Data Type extractors. Combine combines instances from returned results based on a specified grouping, controlling how extractor results are assembled together for output.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Content Type: Content Types are a class of node types used to classify Batch Folders. They represent categories of documents (Content Models and Content Categories) or distinct types of documents (Document Types). Content Types serve an important role in defining Data Elements and Behaviors that apply to a document.

Data Column: Data Columns represent columns in a table extracted from a document. They are added as child nodes of a Data Table. They define the type of data each column holds along with its data extraction properties.

  • Data Columns are frequently referred to simply as "columns".
  • In the context of reviewing data in a Data Viewer, a single Data Column instance in a single Data Table row is most frequently called a "cell".

Data Element: Data Elements are a class of node types used to collect data from a document. These include: Data Models, Data Sections, Data Fields, Data Tables, and Data Columns.

Data Extractor: Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Data Model: Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) define data extraction logic and settings for how data is reviewed in a Data Viewer.

Data Table: A Data Table is a Data Element specialized in extracting tabular data from documents (i.e. data formatted in rows and columns).

  • The Data Table itself defines the "Table Extract Method". This is configured to determine the logic used to locate and return the table's rows.
  • The table's columns are defined by adding Data Column nodes to the Data Table (as its children).

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift, and manipulate results further.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Extractor Type:

Key-Value List: Key-Value List is a Collation Provider option for Data Type extractors. Key-Value List matches instances where a key and a list of one or more values appear together on the document, adhering to a specific layout pattern.

Key-Value Pair: Key-Value Pair is a Collation Provider option for Data Type extractors. Key-Value Pair matches instances where a key is paired with a value on the document in a specific layout. Note: Key-Value Pair is an older technique in Grooper. In most cases, the Labeled Value extractor is preferable to Key-Value Pair collation.

Labeled Value: Labeled Value is a Value Extractor that identifies and extracts a value next to a label. This is one of the most commonly used extractors to extract data from structured documents (such as a standardized form) and static values on semi-structured documents (such as the header details on an invoice).

Machine: Machine nodes represent servers that have connected to the Grooper Repository. They are essential for distributing task processing loads across multiple servers. Grooper creates Machine nodes automatically whenever a server makes a new connection to a Grooper Repository's database. Once added, Machine nodes can be used to view server information and to manage Grooper Service instances.

Multi-Column: Multi-Column is a Collation Provider option for Data Type extractors. Multi-Column combines multiple columns on a page into a single column for extraction.

Ordered Array: Ordered Array is a Collation Provider option for Data Type extractors. Ordered Array finds sequences of values where one result is present for each extractor, in the order they appear, according to a specified horizontal, vertical, or text-flow layout.

Pattern-Based: Pattern-Based is a Collation Provider option for Data Type extractors. Pattern-Based uses regular expressions to sequence returned results into a final result set.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects, such as Content Models, Batch Processes, and profile objects, are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Row Match: The Row Match Table Extract Method uses regular expression pattern matching to determine a table's structure based on the pattern of each row and to extract cell data from each column.

Separate: Separate is an Activity that sorts Batch Pages into individual Batch Folders. This distinguishes "loose pages" from the documents formed by those pages. Once loose pages are separated into Batch Folder documents, they can be further processed by Classify, Extract, Export, and other Activities that need to run on the folder (i.e. document) level.

Split: Split is a Collation Provider option for Data Type extractors. Split separates a data instance at each match returned by the Data Type. The results are used as anchor points to "split" text into one or more smaller parts.

Table Extract Method: A Table Extract Method defines the settings and logic for a Data Table to perform extraction. It is set by configuring the Extract Method property of the Data Table.

Table Extraction: "Table Extraction" refers to Grooper's ability to extract data from cells in tables on documents. This is accomplished by configuring the Data Table and its child Data Column elements in a Data Model.

About

Data Type extractors in Grooper use regular expressions to match a document's text data in order to return a particular piece of information. Extractors serve a variety of purposes. They can be used to populate fields in a Data Model, to separate and classify documents, to break up a document into sections, and more. For the most part, any time part of a document's text data is needed or useful to do something, you need an extractor to find and return it.

Often, this requires something more complex than returning a single result. The relationships between multiple extraction results are often important. The fact that results are physically related to each other on the page, that text exists between one or more results, or that results appear in one order rather than another can be used to accomplish various goals in Grooper.

For example, the Individual, Array, and Ordered Array Collation Providers all collate results differently.

Individual

The Individual Collation Provider returns all extraction results individually. This is the default Collation Provider for Data Type extractors.

  1. This Data Type extractor, whose results are seen here, has five child extractors, all passing their own results up to the parent extractor. The child extractors are as follows:
    • Name - This Data Type extractor returns names on the document. Here, US presidents.
    • Date - This Data Type extractor returns dates. Here, presidents' birthdays and inauguration days, depending on the table.
    • City/State - This Data Type extractor returns the city and state values listed for a president's birthday.
    • Number of Days - This Data Format extractor returns numbers. Here, the number of days in office.
    • Party - This Data Format extractor returns the results of a list of political party names.
  2. The parent Data Type's Collation property is set to Individual.
    • The Collation property determines the Collation Provider used.
  3. In the "Results" panel, everything each child extractor returns to the parent Data Type is listed as a distinct, individual result. There are 56 results in total, with the first item physically on the page listed first.
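
The behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Grooper's implementation: the Hit class, its coordinate fields, and the collate_individual function are hypothetical names invented for this example, and "physically first on the page" is assumed to mean top-to-bottom, then left-to-right.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for one extraction result."""
    extractor: str  # name of the child extractor that produced the hit
    text: str       # matched text
    top: float      # assumed vertical position on the page
    left: float     # assumed horizontal position on the page


def collate_individual(hits):
    """Return every hit as its own result, ordered by page position."""
    return sorted(hits, key=lambda h: (h.top, h.left))


# Hits arrive grouped by child extractor, not by position.
hits = [
    Hit("Party", "Whig", top=2.0, left=5.0),
    Hit("Name", "George Washington", top=1.0, left=0.5),
    Hit("Date", "02/22/1732", top=1.0, left=3.0),
]

results = collate_individual(hits)
for hit in results:
    print(hit.extractor, "->", hit.text)
```

Nothing is merged or discarded here: the number of results always equals the number of hits passed up by the child extractors.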

Array

The Array provider organizes and returns results much differently.

  • First, it will only return results if multiple extraction results are lined up in a particular order on the page, according to the "layout" set for this provider. For example, an Array collated extractor using a Horizontal Layout will only return results if they are aligned horizontally, one result after another from left to right.
  • Second, instead of each result being returned individually, all results meeting the layout requirements are returned as a single value.

Essentially, an Array collated result is a collection of results that share a layout relationship and are all lined up together (either horizontally, vertically, or in the left-to-right, top-to-bottom text flow of the document).

  1. This Data Type extractor has the exact same child extractors, but uses the Array provider instead of Individual.
  2. The parent Data Type's Collation property is set to Array.
  3. The Minimum Elements property defaults to 2.
    • This means the array must contain at least two extraction results. For this extractor, it could be a name and a date. It could be two dates. It could be forty dates. It could be a name, a date, a city/state location, a number, and a political party. It doesn't matter, as long as there are two results.
  4. The Horizontal Layout is set to Enabled.
    • At least one of the three Layout properties must be enabled. Using the Horizontal Layout, only results aligned horizontally with each other will count as an array. Here, this effectively returns full rows of each table, since one extraction result follows the other from left to right in a horizontal line.
  5. Notice we now only return 12 results. Rather than each individual result from each child extractor, results are ordered and returned according to the Array Collation Provider and its configuration.
    • Results are combined into a single result, as long as they are aligned horizontally with each other.
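
As a rough mental model of the Horizontal Layout and Minimum Elements settings, the following Python sketch groups hits into rows and keeps only rows with enough elements. It is a simplification under stated assumptions, not Grooper's actual alignment logic: the Hit class, the collate_horizontal_array function, and the tolerance parameter are all hypothetical, and "aligned horizontally" is approximated as "nearly the same vertical position".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical stand-in for one extraction result."""
    extractor: str
    text: str
    top: float   # assumed vertical position on the page
    left: float  # assumed horizontal position on the page


def collate_horizontal_array(hits, minimum_elements=2, tolerance=0.1):
    """Group hits that sit on (roughly) the same line into arrays."""
    rows = []
    for hit in sorted(hits, key=lambda h: (h.top, h.left)):
        # Start a new row unless this hit lines up with the current one.
        if rows and abs(rows[-1][0].top - hit.top) <= tolerance:
            rows[-1].append(hit)
        else:
            rows.append([hit])
    # Keep rows with enough elements, ordered left to right.
    return [
        sorted(row, key=lambda h: h.left)
        for row in rows
        if len(row) >= minimum_elements
    ]


hits = [
    Hit("Name", "George Washington", top=1.0, left=0.5),
    Hit("Date", "02/22/1732", top=1.0, left=3.0),
    Hit("Party", "Whig", top=2.0, left=5.0),  # alone on its line
    Hit("Date", "10/30/1735", top=3.02, left=3.0),
    Hit("Name", "John Adams", top=3.0, left=0.5),
]

arrays = collate_horizontal_array(hits)
joined = [" ".join(h.text for h in row) for row in arrays]
print(joined)
```

The lone "Whig" hit is dropped because a single element cannot satisfy a Minimum Elements value of 2, which mirrors why the Array example returns fewer results than the Individual example.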

Ordered Array

The Ordered Array provider is similar to the Array provider, but it is much more restrictive about how allowable results can be organized. Only arrays whose extracted elements are in the listed order of the child extractors are returned.

  1. This Data Type extractor has the exact same child extractors, but uses the Ordered Array provider.
    • Notice the order of each child extractor: first "Name", then "Date", then "City/State", then "Number of Days", and last, "Party".
  2. The parent Data Type's Collation property is set to Ordered Array.
  3. The Horizontal Layout is set to Enabled, just like our example of the Array provider.
  4. Notice several arrays were tossed out of our "Results" list.
    • The second table has the "Days in Office" column before the "Birthplace" column. All the elements of the array are there, but they are not in the order of the child elements in the Node Tree, where the "Number of Days" extractor comes after the "City/State" extractor.
    • The third table has four out of the five elements present, and in the right order, but is missing the "Political Party" column (picked up by the "Party" extractor). Not only must the array's extracted elements match the order of the child extractors locating them, but all elements must be present.
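
The two rejection cases above can be expressed as a simple check. This is a hedged sketch only: is_ordered_array and the row lists are hypothetical, and each row is reduced to the names of the child extractors that matched, read left to right. The rows are modeled on the three tables as described in this article, not taken from actual Grooper output.

```python
# Order of the child extractors as listed in this article's example.
CHILD_ORDER = ["Name", "Date", "City/State", "Number of Days", "Party"]


def is_ordered_array(row_extractors, child_order=CHILD_ORDER):
    """Accept a row only if every child extractor matched, in listed order."""
    return list(row_extractors) == list(child_order)


# Rows modeled on the three tables described above (illustrative only).
table_1_row = ["Name", "Date", "City/State", "Number of Days", "Party"]
table_2_row = ["Name", "Date", "Number of Days", "City/State", "Party"]
table_3_row = ["Name", "Date", "City/State", "Number of Days"]

for label, row in [("table 1", table_1_row),
                   ("table 2", table_2_row),
                   ("table 3", table_3_row)]:
    print(label, "kept" if is_ordered_array(row) else "tossed out")
```

Table 2 fails because two elements are swapped, and table 3 fails because one element is missing, even though the elements it does contain are in the right order.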

Key-Value Pair and Key-Value List

Combine (and Combine Methods)

Split

Pattern-Based

AND

Multi-Column

Confidence Mode

The Confidence Mode property is new to the Combine, AND, Array, Ordered Array, Key-Value Pair and Key-Value List Collation Providers in version 2.90. Each of these providers orders and/or combines multiple extraction results in various ways. If any of the results are matched using FuzzyRegEx, the overall confidence score of the collated result must be determined, in some way, based on the confidence of each individual result.

For example, an Array may have three results as its elements. Result 1 may have a confidence score of 100%. Result 2 may have a confidence score of 90%. Result 3 may have a confidence score of 80%. So, what is the confidence of the collated array? Is it 100%? Is it 80%? Is it an average of the three scores (90%)?

Previously, the collated result would always take the average of the individual results' confidence scores. The Confidence Mode property allows you to choose Average to take the average confidence score of the individual results, Min to take the smallest confidence score of all the individual results, or Max to take the largest confidence score.
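
The arithmetic behind the three modes can be worked through with the 100%, 90%, and 80% example above. The sketch below is illustrative only: collated_confidence is a hypothetical helper, not a Grooper function, and scores are written as whole-number percentages to keep the arithmetic exact. Only the mode names Average, Min, and Max come from this article.

```python
from statistics import mean


def collated_confidence(scores, mode="Average"):
    """Combine individual confidence scores (in percent) into one score."""
    if not scores:
        raise ValueError("at least one confidence score is required")
    if mode == "Average":
        return mean(scores)
    if mode == "Min":
        return min(scores)
    if mode == "Max":
        return max(scores)
    raise ValueError(f"unknown Confidence Mode: {mode!r}")


# The three array elements from the example above.
scores = [100, 90, 80]

for mode in ("Average", "Min", "Max"):
    print(mode, collated_confidence(scores, mode))
```

Min is the most conservative choice, since a single weak fuzzy match lowers the score of the whole collated result, while Max lets one exact match carry the rest.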