2023:Collation Provider (Property)
Dgreenwood (talk | contribs) m Dgreenwood moved page Collation Provider - 2023 to 2023:Collation Provider without leaving a redirect |
Revision as of 13:58, 28 December 2023
WIP
This article is a work-in-progress or was created as a placeholder for testing purposes. It is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly. This tag will be removed upon draft completion.
Collation Providers allow Data Type extractor results to be combined, organized, or utilized in specific ways.
Results can be combined, organized into arrays, returned as a key-value pair's value, and more.
The following Collation Providers are available in Grooper:
- Individual
- Combine
- AND
- Key-Value Pair
- Key-Value List
- Array
- Ordered Array
- Split
- Pattern-Based
- Multi-Column
About
Data Type extractors in Grooper use regular expressions to match a document's text data in order to return a particular piece of information. Extractors serve a variety of purposes. They can be used to populate fields in a Data Model, to separate and classify documents, to break up a document into sections, and more. For the most part, any time part of a document's text data is needed or useful to do something, you need an extractor to find and return it.
Often, this requires something more complex than returning a single result. The relationships between multiple extraction results are often important. The fact that results are physically related to each other on the page, that text exists between one or more results, or that results appear in one order versus another can be used to accomplish various goals in Grooper.
For example, the Individual, Array, and Ordered Array Collation Providers all collate results differently.
Individual, Array, and Ordered Array
Individual
The Individual Collation Provider returns all extraction results individually. This is the default Collation Provider for Data Type extractors.
Array
The Array Collation Provider organizes and returns results very differently.
- First, it will only return results if multiple extraction results are lined up in a particular order on the page, according to the "layout" set for this provider. For example, an Array collated extractor using a Horizontal Layout will only return results if they are aligned horizontally, one result after another from left to right.
- Second, instead of each result being returned individually, all results meeting the layout requirements are returned as a single value.
Essentially, an Array collated result is a collection of results that share a layout relationship and are all lined up together (either horizontally, vertically, or in the left/right and top/bottom text flow of the document).
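The two behaviors described above (require a layout relationship, then return the aligned results as one value) can be illustrated with a minimal Python sketch. This is not Grooper's implementation. The `Result` class, the `collate_horizontal_array` function, the pixel tolerance, and the space separator are all hypothetical names and assumptions chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical extraction result: matched text plus its position on the page."""
    text: str
    x: float  # left edge of the match
    y: float  # vertical position of the match


def collate_horizontal_array(results, y_tolerance=2.0, separator=" "):
    """Group results that sit on the same line and join each group into one value.

    Assumption: "aligned horizontally" means the vertical positions of
    neighboring results differ by at most y_tolerance.
    """
    rows = []
    for r in sorted(results, key=lambda r: (r.y, r.x)):
        if rows and abs(rows[-1][-1].y - r.y) <= y_tolerance:
            rows[-1].append(r)  # same line as the previous result
        else:
            rows.append([r])    # start a new line

    collated = []
    for row in rows:
        # A lone result has no layout relationship, so it is not returned.
        if len(row) >= 2:
            row.sort(key=lambda r: r.x)  # left to right
            collated.append(separator.join(r.text for r in row))
    return collated


results = [
    Result("A", 10, 100),
    Result("B", 50, 100.5),
    Result("C", 90, 100),
    Result("D", 10, 200),  # alone on its line, so it is dropped
]
print(collate_horizontal_array(results))
```

In this sketch the three results on the first line come back as a single combined value, while the isolated result is not returned at all.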
Ordered Array
The Ordered Array Collation Provider is similar to the Array provider, but it is much more restrictive about how allowable results can be organized. Only arrays whose extracted elements appear in the listed order of the child extractors are returned.
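As a rough illustration of that ordering restriction, the Python sketch below filters candidate arrays by the order of the extractors that produced their elements. It is not Grooper's implementation. The function names, the `(extractor_name, text)` tuple shape, and the assumption that "listed order" means an exact match against the child extractor list are all hypothetical.

```python
def is_ordered_array(elements, child_order):
    """Return True if the elements were produced by the child extractors in listed order.

    elements: list of (extractor_name, text) tuples in page order (hypothetical shape).
    child_order: child extractor names in the order they are listed.
    """
    return [name for name, _ in elements] == list(child_order)


def collate_ordered_arrays(candidates, child_order, separator=" "):
    """Keep only candidate arrays whose element order matches the child extractor order."""
    return [
        separator.join(text for _, text in candidate)
        for candidate in candidates
        if is_ordered_array(candidate, child_order)
    ]


child_order = ["Date", "Amount", "Code"]
candidates = [
    [("Date", "12/28/2023"), ("Amount", "$100.00"), ("Code", "AB12")],  # listed order
    [("Amount", "$55.00"), ("Date", "01/02/2023"), ("Code", "CD34")],   # wrong order
]
print(collate_ordered_arrays(candidates, child_order))
```

An unordered Array would accept both candidates here. The ordered check rejects the second one because its Amount precedes its Date.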
Key-Value Pair and Key-Value List
Combine (and Combine Methods)
Split
Pattern-Based
AND
Multi-Column
Version Differences
The AND Collation Provider
The AND Collation Provider is a brand new provider in version 2.90. In some cases, aspects of this collation could have been approximated using the Combine or Array providers. However, its functionality is unique and distinct from these two Collation Providers.
Confidence Mode
The Confidence Mode property is new to the Combine, AND, Array, Ordered Array, Key-Value Pair, and Key-Value List Collation Providers in version 2.90. Each of these providers orders and/or combines multiple extraction results in various ways. If any of the results are matched using FuzzyRegEx, the overall confidence score of the collated result must be determined, in some way, from the confidence of each individual result.
For example, an Array may have three results as its elements. Result 1 may have a confidence score of 100%. Result 2 may have a confidence score of 90%. Result 3 may have a confidence score of 80%. So, what is the confidence of the collated array? Is it 100%? Is it 80%? Is it an average of the three scores (90%)?
Previously, the collated result would always take the average of the individual results' confidence scores. The Confidence Mode property allows you to choose Average to take the average confidence score of the individual results, Min to take the smallest confidence score of all the individual results, or Max to take the largest confidence score.
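The arithmetic behind the three modes can be sketched in a few lines of Python, using the three example scores above. The `collated_confidence` function is a hypothetical helper written for illustration, not part of Grooper. Only the mode names Average, Min, and Max come from the property itself.

```python
from statistics import mean


def collated_confidence(scores, mode="Average"):
    """Combine per-result confidence scores into one score for the collated result."""
    if not scores:
        raise ValueError("at least one confidence score is required")
    if mode == "Average":
        return mean(scores)
    if mode == "Min":
        return min(scores)
    if mode == "Max":
        return max(scores)
    raise ValueError(f"unknown Confidence Mode: {mode!r}")


# Confidence scores of the three array elements from the example above.
scores = [1.00, 0.90, 0.80]
for mode in ("Average", "Min", "Max"):
    print(mode, round(collated_confidence(scores, mode), 2))
```

For the example array, Average gives 90%, Min gives 80%, and Max gives 100%. Min is the most conservative choice, because a single weak fuzzy match lowers the score of the whole collated result.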


