2023:Collation Provider (Property)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

This article is a stub. It contains minimal information on the topic and should be expanded.

The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Results can be combined, organized into arrays, returned as a key-value pair's value, and more.

The following Collation Providers are available in Grooper:

  • Individual
  • Combine
  • AND
  • Key-Value Pair
  • Key-Value List
  • Array
  • Ordered Array
  • Split
  • Pattern-Based
  • Multi-Column

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

Glossary

Activity: Activity is a property on Batch Process Steps. Activities define specific document processing operations done to a Batch, Batch Folder, or Batch Page. Batch Process Steps configured with specific Activities are frequently referred to by the name of the Activity followed by the word "step". For example: Classify step.

AND: AND is a Collation Provider option for Data Type extractors. AND returns results only when each of its referenced or child extractors gets at least one hit, thus acting as a logical “AND” operator across multiple extractors.

Array: Array is a Collation Provider option for Data Type extractors. Array matches a list of values arranged in horizontal, vertical, or text-flow order, combining instances that qualify into a single result.

Batch Folder: Batch Folder objects are defined as container objects within a Batch that are used to represent and organize both folders and pages. They can hold other Batch Folders or Batch Page objects as children. The Batch Folder acts as an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents.

  • Batch Folders are frequently referred to simply as "documents".

Batch Page: Batch Page objects represent individual pages within a Batch. The Batch Page object is the most granular unit in the hierarchy of Batch Objects in Grooper.

  • Batch Pages are frequently referred to simply as "pages".

Batch Process: Batch Process objects are crucial components in Grooper's architecture. A Batch Process orchestrates the document processing strategy and ensures each Batch of documents is managed systematically and efficiently.

  • Batch Processes by themselves do nothing. Instead, the workflows they execute are designed by adding child Batch Process Steps.
  • A Batch Process is often referred to as simply a "process".

Batch: Batch objects are fundamental in Grooper's architecture as they are the containers of documents that get moved through Grooper's workflow mechanisms known as Batch Processes.

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

Classify: Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Content Type (e.g. a Document Type) using patterns, lexical understanding, or rules as defined by a Content Model.

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Combine: Combine is a Collation Provider option for Data Type extractors. Combine combines instances from returned results based on a specified grouping, controlling how extractor results are assembled together for output.

Content Model: Content Model node objects define the taxonomy of document sets in terms of the Document Types they contain. They also house the Data Elements that appear on each Content Category and Document Type within them. Content Models serve as the root of a Content Type hierarchy and are crucial for organizing the different types of documents that Grooper can recognize and process.

Content Type: Content Type refers to objects in Grooper used to classify Batch Folders. These include: Content Models, Content Categories, and Document Types.

Data Column: Data Column node objects are child objects of a Data Table, representing individual columns and defining the type of data each column holds along with its data extraction properties.

Data Element: Data Element refers to the objects in Grooper used to collect data from a document. These include: Data Models, Data Sections, Data Fields, Data Tables, and Data Columns.

Data Extractor: Data Extractor (or just "extractor") refers to all Extractor Types and extractor node objects. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Data Model: Data Model node objects serve as the top-tier structure defining the taxonomy for Data Elements and are leveraged during the Extract Activity to extract data from Batch Folders. They are a hierarchy of Data Elements that sets the stage for the extraction logic and review of data collected from documents.

Data Table: Data Table objects are used for extracting repeating data that's formatted in rows and columns, allowing for complex multi-instance data organization that would be present in table-formatted content.

Data Type: Data Type objects hold a collection of child, referenced, and locally defined Data Extractors and settings that manage how multiple (even differing) matches from Data Extractors are consolidated (via Collation) into a result set.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Extractor Type: An Extractor Type (shorthand for Value Extractor Type) is configured for numerous properties on a wide array of Grooper objects. They are used to return "data instances" from documents for one purpose or another. The Extractor Type defines an operation that reads data from the text or visual content of a document and returns one or more results. Each different Extractor Type uses a specialized logic to return results. Extractor Types are consumed by higher-level objects such as Data Elements, extractor objects, Content Types and more.

Key-Value List: Key-Value List is a Collation Provider option for Data Type extractors. Key-Value List matches instances where a key and a list of one or more values appear together on the document, adhering to a specific layout pattern.

Key-Value Pair: Key-Value Pair is a Collation Provider option for Data Type extractors. Key-Value Pair matches instances where a key is paired with a value on the document in a specific layout. Note: Key-Value Pair is an older technique in Grooper. In most cases, the Labeled Value extractor type is preferable to Key-Value Pair collation.

Labeled Value: Labeled Value is an Extractor Type that identifies and extracts a value next to a label. This is one of the most commonly used extractors to extract data from structured documents (such as a standardized form) and static values on semi-structured documents (such as the header details on an invoice).

Machine: Machine node objects represent servers that have connected to the Grooper Repository. They allow for the management of Grooper Service instances and serve as connection points for processing jobs to be executed on the server hardware. Machine objects are essential for the scaling of processing capabilities and for distributing processing loads across multiple servers.

Multi-Column: Multi-Column is a Collation Provider option for Data Type extractors. Multi-Column combines multiple columns on a page into a single column for extraction.

Ordered Array: Ordered Array is a Collation Provider option for Data Type extractors. Ordered Array finds sequences of values where one result is present for each extractor, in the order they appear, according to a specified horizontal, vertical or text-flow layout.

Pattern-Based: Pattern-Based is a Collation Provider option for Data Type extractors. Pattern-Based uses regular expressions to sequence returned results into a final result set.

Project: Project node objects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as Content Models, Batch Processes, profile objects, and more are organized and managed. It allows for the encapsulation and modularization of these resources for easier management and reusability.

Row Match: The Row Match Table Extract Method uses regular expression pattern matching to determine a table's structure based on the pattern of each row and extract cell data from each column.

Separate: Separate is an Activity that sorts Batch Pages into individual Batch Folders. This distinguishes "loose pages" from the documents formed by those pages. Once loose pages are separated into Batch Folder documents, they can be further processed by Classify, Extract, Export, and other Activities that need to run on the folder (i.e. document) level.

Split: Split is a Collation Provider option for Data Type extractors. Split separates a data instance at each match returned by the Data Type. The results are used as anchor points to "split" text into one or more smaller parts.

Table Extract Method: A Table Extract Method defines the settings and logic for a Data Table to perform extraction. It is set by configuring the Extract Method property of the Data Table.

Table Extraction: "Table Extraction" refers to Grooper's ability to extract data from cells in tables on documents. This is accomplished by configuring the Data Table and its child Data Column elements in a Data Model.

About

Data Type extractors in Grooper use regular expressions to match a document's text data in order to return a particular piece of information. Extractors serve a variety of purposes. They can be used to populate fields in a Data Model, to separate and classify documents, to break up a document into sections, and more. For the most part, any time part of a document's text data is needed or useful to do something, you need an extractor to find and return it.

Often, this requires something more complex than returning a single result. The relationships between multiple extraction results are often important. The fact that results are physically related to each other on the page, that text exists between one or more results, or that results are in one order versus another can be used to accomplish various goals in Grooper.

For example, the Individual, Array, and Ordered Array Collation Providers all collate results differently.

Individual

The Individual Collation Provider returns all extraction results individually. This is the default Collation Provider for Data Type extractors.

  1. This Data Type extractor, whose results are seen here, has five child extractors, all passing their own results up to the parent extractor. The child extractors are as follows:
    • Name - This Data Type extractor returns names on the document. Here, US presidents.
    • Date - This Data Type extractor returns dates. Here, presidents' birthdays and inauguration days, depending on the table.
    • City/State - This Data Type extractor returns the city and state values listed for a president's birthday.
    • Number of Days - This Data Format extractor returns numbers. Here, the number of days in office.
    • Party - This Data Format extractor returns the results of a list of political party names.
  2. The parent Data Type's Collation property is set to Individual.
    • The Collation property determines the Collation Provider used.
  3. In the "Results" panel, everything each child extractor returns to the parent Data Type is listed as a distinct, individual result. There are 56 results in total, with the first item physically on the page listed first.
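
As a rough illustration of the behavior described above, the following Python sketch models Individual collation as "flatten every child extractor's results and list them in page order". It is not Grooper code: the `Hit` class, the `collate_individual` function, the coordinate fields, and the top-to-bottom, left-to-right ordering rule are all assumptions made for illustration, and the sample values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One extraction result (hypothetical model, not a Grooper type)."""
    extractor: str  # name of the child extractor that produced the hit
    text: str       # matched text
    x: float        # left edge on the page
    y: float        # top edge on the page


def collate_individual(hits_by_extractor):
    """Return every child result as its own item, in page order."""
    all_hits = [hit for hits in hits_by_extractor.values() for hit in hits]
    # Assumed page order: top to bottom, then left to right.
    return sorted(all_hits, key=lambda hit: (hit.y, hit.x))


# Illustrative values only; they are not taken from the article's sample Batch.
hits_by_extractor = {
    "Name": [Hit("Name", "John Adams", 10, 120),
             Hit("Name", "George Washington", 10, 100)],
    "Date": [Hit("Date", "10/30/1735", 200, 120),
             Hit("Date", "02/22/1732", 200, 100)],
}
individual_results = collate_individual(hits_by_extractor)
print([hit.text for hit in individual_results])
```

Nothing is merged or discarded here: four child hits go in and four individual results come out, which mirrors how the example above lists all 56 results separately.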

Array

The Array provider organizes and returns results much differently.

  • First, it will only return results if multiple extraction results are lined up in a particular order on the page, according to the "layout" set for this provider. For example, an Array collated extractor using a Horizontal Layout will only return results if they are aligned horizontally, one result after another from left to right.
  • Second, instead of each result being returned individually, all results meeting the layout requirements are returned as a single value.

Essentially, an Array collated result is a collection of results that share a layout relationship and are all lined up together (either horizontally, vertically, or in the left/right and top/bottom text flow of the document).

  1. This Data Type extractor has the exact same child extractors, but uses the Array provider instead of Individual.
  2. The parent Data Type's Collation property is set to Array.
  3. The Minimum Elements property defaults to 2.
    • This means the array must contain at least two extraction results. For this extractor, it could be a name and a date. It could be two dates. It could be forty dates. It could be a name, a date, a city/state location, a number, and a political party. It doesn't matter, as long as there are two results.
  4. The Horizontal Layout is set to Enabled.
    • At least one of the three Layout properties must be enabled. Using the Horizontal Layout, only results aligned horizontally with each other will count as an array. Here, this effectively returns full rows of each table, since one extraction result follows the other from left to right in a horizontal line.
  5. Notice we now only return 12 results. Rather than each individual result from each child extractor, results are ordered and returned according to the Array Collation Provider and its configuration.
    • Results are combined into a single result, as long as they are aligned horizontally with each other.
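
The Horizontal Layout and Minimum Elements behavior can be sketched in Python as follows. This is a simplified model under stated assumptions, not Grooper's implementation: `Hit`, `collate_array_horizontal`, and the `y_tolerance` parameter are hypothetical names, and "aligned horizontally" is approximated as "top edges within a small tolerance of each other". The sample values are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One extraction result (hypothetical model, not a Grooper type)."""
    extractor: str
    text: str
    x: float  # left edge on the page
    y: float  # top edge on the page


def collate_array_horizontal(hits, minimum_elements=2, y_tolerance=2.0):
    """Group hits that sit on the same line and join each group into one result."""
    rows = []
    for hit in sorted(hits, key=lambda h: (h.y, h.x)):
        # Assumption: a hit belongs to the current row when its top edge is
        # within y_tolerance of the row's first hit.
        if rows and abs(rows[-1][0].y - hit.y) <= y_tolerance:
            rows[-1].append(hit)
        else:
            rows.append([hit])

    results = []
    for row in rows:
        if len(row) >= minimum_elements:  # mirrors the Minimum Elements property
            row.sort(key=lambda h: h.x)   # left to right within the row
            results.append(" ".join(h.text for h in row))
    return results


# Illustrative values only.
hits = [
    Hit("Name", "George Washington", 10, 100),
    Hit("Date", "02/22/1732", 200, 100),
    Hit("Party", "None", 400, 101),
    Hit("Name", "John Adams", 10, 120),
    Hit("Date", "10/30/1735", 200, 120),
    Hit("Number of Days", "1460", 300, 300),  # alone on its line
]
array_results = collate_array_horizontal(hits)
print(array_results)
```

The lone "1460" hit is dropped because its row has fewer than two elements, while the other five hits collapse into two combined results, one per row.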

Ordered Array

The Ordered Array provider is similar to the Array provider, but it is much more restrictive about how allowable results can be organized. Only arrays whose extracted elements are in the listed order of the child extractors are returned.

  1. This Data Type extractor has the exact same child extractors, but uses the Ordered Array provider.
    • Notice the order of each child extractor. First "Name", then "Date", then "City/State", then "Number of Days", and last, "Party".
  2. The parent Data Type's Collation property is set to Ordered Array.
  3. The Horizontal Layout is set to Enabled, just like our example of the Array provider.
  4. Notice several arrays were tossed out of our "Results" list.
    • The second table has the "Days in Office" column before the "Birthplace" column. All the elements of that array are present, but they are not in the order of the child extractors in the Node Tree, where the "Number of Days" extractor comes after the "City/State" extractor.
    • The third table has four out of the five elements present, and in the right order, but is missing the "Political Party" column (picked up by the "Party" extractor). Not only must the array's extracted elements match the order of the child extractors locating them, but all elements must be present.
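
The two rejection cases above (wrong order, missing element) can be sketched as a simple filter over candidate rows. This is a minimal model, not Grooper's implementation: `CHILD_ORDER`, `collate_ordered_array`, and the row representation are hypothetical, and the rule is simplified to "the row's extractor sequence must equal the child extractor order exactly". The placeholder texts are invented.

```python
# Order of the child extractors, as listed in the example above.
CHILD_ORDER = ["Name", "Date", "City/State", "Number of Days", "Party"]


def collate_ordered_array(rows, child_order):
    """Keep only rows whose extractor sequence equals the child order exactly."""
    kept = []
    for row in rows:
        # Each row is a list of (extractor_name, text) pairs, already sorted
        # left to right, as a Horizontal Layout array would be.
        sequence = [extractor for extractor, _ in row]
        if sequence == list(child_order):
            kept.append(row)
    return kept


# Placeholder rows standing in for one row from each of the three tables.
rows = [
    # Table 1: all five elements, in child order -> kept
    [("Name", "n1"), ("Date", "d1"), ("City/State", "c1"),
     ("Number of Days", "100"), ("Party", "p1")],
    # Table 2: "Number of Days" appears before "City/State" -> rejected
    [("Name", "n2"), ("Date", "d2"), ("Number of Days", "200"),
     ("City/State", "c2"), ("Party", "p2")],
    # Table 3: right order, but no "Party" element -> rejected
    [("Name", "n3"), ("Date", "d3"), ("City/State", "c3"),
     ("Number of Days", "300")],
]
ordered_results = collate_ordered_array(rows, CHILD_ORDER)
print(len(ordered_results))
```

Only the first row survives, which matches the article's observation that arrays from the second and third tables were dropped from the results list.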

Key-Value Pair and Key-Value List

Combine (and Combine Methods)

Split

Pattern-Based

AND

Multi-Column

Confidence Mode

The Confidence Mode property is new to the Combine, AND, Array, Ordered Array, Key-Value Pair and Key-Value List Collation Providers in version 2.90. Each of these providers orders and/or combines multiple extraction results in various ways. If any of the results are matched using FuzzyRegEx, the overall confidence score of the collated result must be determined, in some way, based on the confidence of each individual result.

For example, an Array may have three results as its elements. Result 1 may have a confidence score of 100%. Result 2 may have a confidence score of 90%. Result 3 may have a confidence score of 80%. So, what is the confidence of the collated array? Is it 100%? Is it 80%? Is it an average of the three scores (90%)?

Previously, the collated result would always take the average of the individual results' confidence scores. The Confidence Mode property allows you to choose Average to take the average confidence score of the individual results, Min to take the smallest confidence score of all the individual results, or Max to take the largest confidence score.
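
The arithmetic behind the three modes can be sketched in Python using the 100%, 90%, and 80% example above. The function name `collated_confidence` and the use of percentage numbers are assumptions for illustration; only the mode names Average, Min, and Max come from the property itself.

```python
def collated_confidence(scores, mode="Average"):
    """Combine individual confidence scores (in percent) into one score."""
    if not scores:
        raise ValueError("at least one confidence score is required")
    if mode == "Average":
        return sum(scores) / len(scores)
    if mode == "Min":
        return min(scores)
    if mode == "Max":
        return max(scores)
    raise ValueError(f"unknown Confidence Mode: {mode}")


# The three-element array from the example above.
scores = [100, 90, 80]
for mode in ("Average", "Min", "Max"):
    print(mode, collated_confidence(scores, mode))
```

For this array, Average gives 90, Min gives 80, and Max gives 100. Min is the most conservative choice, since one weak fuzzy match lowers the score of the whole collated result.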