2023.1:Collation Provider (Property)

Revision as of 10:51, 2 May 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

Glossary

AND: AND is a Collation Provider option for Data Type extractors. AND returns results only when each of its referenced or child extractors gets at least one hit, thus acting as a logical "AND" operator across multiple extractors.

Array: Array is a Collation Provider option for Data Type extractors. Array matches a list of values arranged in horizontal, vertical, or text-flow order, combining instances that qualify into a single result.

Batch: Batch objects are fundamental in Grooper's architecture: they are the containers of documents that move through Grooper's workflow mechanisms, known as Batch Processes.

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Combine: Combine is a Collation Provider option for Data Type extractors. Combine merges instances from returned results based on a specified grouping, controlling how extractor results are assembled together for output.

Data Model: Data Model node objects serve as the top-tier structure defining the taxonomy for Data Elements and are leveraged during the Extract Activity to extract data from Batch Folders. They are a hierarchy of Data Elements that sets the stage for the extraction logic and review of data collected from documents.

Data Type: Data Type objects hold a collection of child, referenced, and locally defined Data Extractors and settings that manage how multiple (even differing) matches from Data Extractors are consolidated (via Collation) into a result set.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Key-Value List: Key-Value List is a Collation Provider option for Data Type extractors. Key-Value List matches instances where a key and a list of one or more values appear together on the document, adhering to a specific layout pattern.

Key-Value Pair: Key-Value Pair is a Collation Provider option for Data Type extractors. Key-Value Pair matches instances where a key is paired with a value on the document in a specific layout. Note: Key-Value Pair is an older technique in Grooper. In most cases, the Labeled Value extractor type is preferable to Key-Value Pair collation.

Labeled Value: Labeled Value is an Extractor Type that identifies and extracts a value next to a label. This is one of the most commonly used extractors to extract data from structured documents (such as a standardized form) and static values on semi-structured documents (such as the header details on an invoice).

Multi-Column: Multi-Column is a Collation Provider option for Data Type extractors. Multi-Column combines multiple columns on a page into a single column for extraction.

Ordered Array: Ordered Array is a Collation Provider option for Data Type extractors. Ordered Array finds sequences of values where one result is present for each extractor, in the order they appear, according to a specified horizontal, vertical, or text-flow layout.

Pattern-Based: Pattern-Based is a Collation Provider option for Data Type extractors. Pattern-Based uses regular expressions to sequence returned results into a final result set.

Project: Project node objects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as Content Models, Batch Processes, profile objects, and more are organized and managed. It allows for the encapsulation and modularization of these resources for easier management and reusability.

Row Match: The Row Match Table Extract Method uses regular expression pattern matching to determine a table's structure based on the pattern of each row and to extract cell data from each column.

Separate: Separate is an Activity that sorts Batch Pages into individual Batch Folders. This distinguishes "loose pages" from the documents formed by those pages. Once loose pages are separated into Batch Folder documents, they can be further processed by Classify, Extract, Export, and other Activities that need to run on the folder (i.e. document) level.

Split: Split is a Collation Provider option for Data Type extractors. Split separates a data instance at each match returned by the Data Type. The results are used as anchor points to "split" text into one or more smaller parts.

Table Extraction: "Table Extraction" refers to Grooper's ability to extract data from cells in tables on documents. This is accomplished by configuring the Data Table and its child Data Column elements in a Data Model.

About

Data Type extractors in Grooper use regular expressions to match a document's text data in order to return a particular piece of information. Extractors serve a variety of purposes. They can be used to populate fields in a Data Model, to separate and classify documents, to break up a document into sections, and more. For the most part, any time part of a document's text data is needed or useful to do something, you need an extractor to find and return it.
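
As a rough illustration of what "raw results" means here, the following Python sketch runs a regular expression over some text and returns every hit along with its position. It is not Grooper code: the `Hit` type, the `run_extractor` function, the sample text, and the date pattern are all hypothetical names and values chosen for the example.

```python
import re
from typing import NamedTuple


class Hit(NamedTuple):
    """One raw extractor result: the matched text and where it was found."""
    value: str
    start: int
    end: int


def run_extractor(pattern: str, text: str) -> list[Hit]:
    """Return every non-overlapping regex match, in document order."""
    return [Hit(m.group(0), m.start(), m.end()) for m in re.finditer(pattern, text)]


# Hypothetical document text and pattern, for illustration only.
sample_text = "Invoice Date: 05/02/2024  Due Date: 06/01/2024"
date_hits = run_extractor(r"\d{2}/\d{2}/\d{4}", sample_text)

for hit in date_hits:
    print(hit.value, hit.start, hit.end)
```

Each hit carries its location as well as its text. That positional information is what a collation step would need in order to reason about how results relate to one another.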

Often, this requires something more complex than returning a single result. The relationships between multiple extraction results are often important. The fact that results are physically related to each other on the page, that text exists between one or more results, or that results appear in one order rather than another can be used to accomplish various goals in Grooper.
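
To make the idea of "text between results" concrete, here is a minimal Python sketch in the spirit of the Split provider defined in the glossary, where matches act as anchor points for cutting text into smaller parts. It is an illustration only, not Grooper's implementation. The function name `split_at_matches`, the sample text, and the choice to start each segment at its anchor and drop any text before the first anchor are assumptions made for this example.

```python
import re


def split_at_matches(pattern: str, text: str) -> list[str]:
    """Cut text into segments, starting a new segment at each match.

    Assumption: each segment begins at its anchor match and runs up to the
    next anchor (or the end of the text). Text before the first anchor is
    dropped.
    """
    starts = [m.start() for m in re.finditer(pattern, text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e].strip() for s, e in zip(starts, ends)]


# Hypothetical document text, for illustration only.
sample_text = "Section 1 Scope of work. Section 2 Payment terms. Section 3 Signatures."
segments = split_at_matches(r"Section \d+", sample_text)

for segment in segments:
    print(segment)
```

Each resulting segment could then be handed to a more targeted extractor, which is the use the Split entry describes: breaking a document into smaller pieces for more accurate extraction.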

The following Collation Providers are available in Grooper:

Individual - The Collation property is set to Individual by default. Each result is returned individually.
Combine - Takes multiple extracted instances and combines them into a single result.
AND - Returns results only when each of its referenced or child extractors gets at least one hit. Useful in Classification.
Key-Value Pair - Matches a "Key" with a "Value" on a document to collect information following a label. This Collation method is less commonly used now because the Labeled Value extractor is an easier way to get similar results.
Key-Value List - Returns a list of results based on the layout relationship to a "Key" or label.
Array - Combines a list of values that are arranged in a horizontal, vertical, or flow layout into a single result. Can be useful for Row Match Table Extraction.
Ordered Array - Combines a list of values that are arranged in a horizontal, vertical, or flow layout into a single result. However, unlike Array, the order in which the individual values are extracted matters, and each extractor must return a value. Also useful in Row Match Table Extraction.
Split - Separates a data instance at each match returned by a Data Type. Useful for splitting up a document into smaller segments for more accurate extraction.
Pattern-Based - Uses regular expressions to sequence returned results into a final result set.
Multi-Column - Combines multiple columns on a page into a single column result for easier and more accurate extraction.
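
The differences between several of these providers can be sketched in a few lines of Python. This is a simplified model, not Grooper's implementation: the `Hit` type, the function names, and the sample hits are hypothetical. Layout is reduced to a single character offset (text-flow order only), Combine's configurable grouping is reduced to "merge everything", and what AND returns once its condition is met is an assumption.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    extractor: str  # hypothetical name of the extractor that produced the hit
    value: str
    start: int      # character offset in the text (text-flow order only)


def individual(hits):
    """Individual: each raw hit is returned as its own result."""
    return [h.value for h in sorted(hits, key=lambda h: h.start)]


def combine(hits, sep=" "):
    """Combine (simplified): merge all hits into one result, in text order."""
    ordered = sorted(hits, key=lambda h: h.start)
    return [sep.join(h.value for h in ordered)] if ordered else []


def and_collation(hits, required):
    """AND: return results only if every required extractor produced a hit."""
    seen = {h.extractor for h in hits}
    return individual(hits) if set(required) <= seen else []


def ordered_array(hits, order):
    """Ordered Array (simplified): one hit per extractor, in the given order."""
    ordered = sorted(hits, key=lambda h: h.start)
    results, i = [], 0
    while i + len(order) <= len(ordered):
        window = ordered[i:i + len(order)]
        if [h.extractor for h in window] == list(order):
            results.append(" ".join(h.value for h in window))
            i += len(order)  # consume the whole matched sequence
        else:
            i += 1           # slide forward and try again
    return results


# Hypothetical raw hits from two extractors, for illustration only.
hits = [
    Hit("qty", "2", 0), Hit("price", "9.99", 2),
    Hit("qty", "5", 10), Hit("price", "1.50", 12),
]

print(individual(hits))
print(combine(hits))
print(and_collation(hits, ["qty", "price", "total"]))
print(ordered_array(hits, ["qty", "price"]))
```

With the same four raw hits, the choice of collation changes the final result set: four separate values, one merged value, nothing at all (because the hypothetical "total" extractor produced no hit), or two quantity-and-price sequences.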
