2023:Collation Provider (Property): Difference between revisions

From Grooper Wiki
No edit summary
Tag: Manual revert
 
(9 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{|cellpadding=10 cellspacing=5 style="margin:12px"
{{AutoVersion}}
|-style="background-color:#ed2330; color:white"
|style="font-size:14pt"|'''WIP'''
|
This article is a work-in-progress or created as a placeholder for testing purposes.  This article is subject to change and/or expansion.  It may be incomplete, inaccurate, or stop abruptly.


This tag will be removed upon draft completion.
{{stubs}}
|}


<onlyinclude>
<blockquote>{{#lst:Glossary|Collation Provider}}</blockquote>
<blockquote style="font-size:14pt">
'''''Collation Providers''''' allow '''[[Data Type]]''' extractor results to be combined, organized, or utilized in specific ways.
</blockquote>


Results can be combined, organized into arrays, returned as a key-value pair's value, and more.
Line 27: Line 19:
* Pattern-Based
* Multi-Column
</onlyinclude>
 
{|class="download-box"
|
[[File:Asset 22@4x.png]]
|
You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more '''Batches''' of sample documents. The second contains one or more '''Projects''' with resources used in examples throughout this article.
* [[Media:2023.1_Wiki_Collation-Provider_Batch.zip]]
* [[Media:2023.1_Wiki_Collation-Provider_Project.zip]]
|}


== About ==
'''[[Data Type]]''' extractors in Grooper use regular expressions to match a document's text data in order to return a particular piece of information. Extractors serve a variety of purposes. They can be used to populate fields in a [[Data Model]], to separate and classify documents, to break up a document into sections, and more. For the most part, any time part of a document's text data is needed or useful to do something, you need an extractor to find and return it.

Often, this requires something more complex than returning a single result. The relationships between multiple extraction results are often important. The fact that results are physically related to each other on the page, or that text exists between one or more results, or that results are in one order versus another can be used to accomplish various goals in Grooper.

For example, the '''''Individual''''', '''''Array''''', and '''''Ordered Array''''' Collation Providers all collate results differently.


For example, the ''Individual'', ''Array'', and ''Ordered Array'' Collation Providers all collate results differently.
=== Individual, Array, and Ordered Array ===
<tabs style="margin:20px">
<tab name="Individual" style="margin:20px">
=== Individual ===
The ''Individual'' Collation Provider returns all extraction results individually. This is the default Collation Provider for '''Data Type''' extractors.


#This '''Data Type''' extractor, whose results are seen here, has five child extractors, all passing their own results up to the parent extractor. The child extractors are as follows:
 
#* Name - This '''Data Type''' extractor returns names on the document. Here, US presidents.
{|cellpadding=10 cellspacing=5
#* Date - This '''Data Type''' extractor returns dates. Here, presidents' birthdays and inauguration days, depending on the table.
|style="width:40%" valign=top|
#* City/State - This '''Data Type''' extractor returns the city and state values listed for a president's birthplace.
#* Number of Days - This '''Data Format''' extractor returns numbers. Here, the number of days in office.
#* Party - This '''Data Format''' extractor returns the results of a list of political party names.
# The parent '''Data Type's''' '''''Collation''''' property is set to ''Individual''.
#* The '''''Collation''''' property determines the Collation Provider used.
# In the "Results" panel, you can see that everything each child extractor returns to the parent '''Data Type''' is listed as a distinct, individual result. There are 56 results in total, with the first item physically on the page listed first.
|
 
[[File:2023_Collation_Provider_About_Individual_Array_and_Ordered_Array_Individual_01.png]]
|}
 
</tab>
<tab name="Array" style="margin:20px">
=== Array ===
The '''''Array''''' provider organizes and returns results much differently.


The ''Array'' Collation Provider organizes and returns results much differently.
* First, it will only return results if multiple extraction results are lined up in a particular order on the page, according to the "layout" set for this provider. For example, an '''''Array''''' collated extractor using a '''''Horizontal Layout''''' will only return results if they are aligned horizontally, one result after another from left to right.
* First, it will only return results if multiple extraction results are lined up in a particular order on the page, according to the "layout" set for this provider. For example, an ''Array'' collated extractor using a '''''Horizontal Layout''''' will only return results if they are aligned horizontally, one result after another from left to right.
* Second, instead of each result being returned individually, all results meeting the layout requirements are returned as a single value.  


Essentially, an ''Array'' collated result is a collection of results that share a layout relationship and are all lined up together (either horizontally, vertically, or in the left/right and top/bottom text flow of the document).
Essentially, an '''''Array''''' collated result is a collection of results that share a layout relationship and are all lined up together (either horizontally, vertically, or in the left/right and top/bottom text flow of the document).


{|cellpadding=10 cellspacing=5
# This '''Data Type''' extractor has the exact same child extractors, but uses the '''''Array''''' provider instead of ''Individual''.
|style="width:40%" valign=top|
# This '''Data Type''' extractor has the exact same child extractors, but uses the ''Array'' Collation Provider instead of ''Individual''.
# The parent '''Data Type's''' '''''Collation''''' property is set to ''Array''.
# The '''''Minimum Elements''''' property defaults to ''2''.
#* This means the array must contain ''at least'' two extraction results. For this extractor, it could be a name and a date. It could be two dates. It could be forty dates. It could be a name, a date, a city/state location, a number, and a political party. It doesn't matter, as long as there are two results.
# The '''''Horizontal Layout''''' is set to ''Enabled''.  
#* At least one of the three '''''Layout''''' properties must be enabled. Using the '''''Horizontal Layout''''', only results aligned horizontally with each other will count as an array. Here, this effectively returns full rows of each table, since one extraction result follows the other from left to right in a horizontal line.
# Notice we now only return 12 results. Rather than each individual result from each child extractor, results are ordered and returned according to the ''Array'' Collation Provider and its configuration.
#* Results are combined into a single result, as long as they are aligned horizontally with each other.
|
 
[[File:2023_Collation_Provider_About_Individual_Array_and_Ordered_Array_Array_01.png]]
|}
 
</tab>
<tab name="Ordered Array" style="margin:20px">
=== Ordered Array ===
The '''''Ordered Array''''' provider is similar to the '''''Array''''' provider, but it is much more restrictive about how allowable results can be organized. ''Only'' arrays whose extracted elements are in the listed order of the child extractors are returned.


The ''Ordered Array'' Collation Provider is similar to the ''Array'' provider, but it is much more restrictive about how allowable results can be organized.  ''Only'' arrays whose extracted elements are in the listed order of the child extractors are returned.
# This '''Data Type''' extractor has the exact same child extractors, but uses the '''''Ordered Array''''' provider.
 
#* Notice the order of each child extractor. First "Name" then "Date" then "City/State" then "Number of Days" and last, "Party".
{|cellpadding=10 cellspacing=5
|style="width:40%" valign=top|
# This '''Data Type''' extractor has the exact same child extractors, but uses the ''Ordered Array'' Collation Provider.
# The parent '''Data Type's''' '''''Collation''''' property is set to ''Ordered Array''.
# The '''''Horizontal Layout''''' is set to ''Enabled'', just like our example of the ''Array'' Collation Provider.
# The '''''Horizontal Layout''''' is set to ''Enabled'', just like our example of the '''''Array''''' provider.
# Notice several arrays were tossed out of our "Results" list.
#* The second table has the "Days in Office" column ''before'' the "Birthplace". When it came time to find that array, all the elements were there, but they were not in the order of the child elements in the Node Tree. The "Number of Days" extractor comes ''after'' the "City/State" extractor.
#* The third table has four out of the five elements present, and in the right order, but is missing the "Political Party" column (picked up by the "Party" extractor). Not only must the array's extracted elements match the order of the child extractors locating them, but ''all'' elements must be present.
|
 
[[File:2023_Collation_Provider_About_Individual_Array_and_Ordered_Array_Ordered_Array_01.png]]
</tab>
</tabs>


=== Key-Value Pair and Key-Value List ===
Line 113: Line 94:


=== Multi-Column ===
== Version Differences ==
=== The AND Collation Provider ===
The ''AND'' Collation Provider is a brand new provider in version 2.90. In some cases, aspects of this collation could have been approximated using the ''Combine'' or ''Array'' providers. However, its functionality is unique and distinct from these two Collation Providers.


=== Confidence Mode ===
The '''''Confidence Mode''''' property is new to the '''''Combine''''', '''''AND''''', '''''Array''''', '''''Ordered Array''''', '''''Key-Value Pair''''' and '''''Key-Value List''''' Collation Providers in version 2.90. Each of these providers orders and/or combines multiple extraction results in various ways. If any of the results are matched using ''FuzzyRegEx'', the overall confidence score of the collated result must be determined, in some way, based on the confidence of each individual result.


The '''''Confidence Mode''''' property is new to the ''Combine'', ''AND'', ''Array'', ''Ordered Array'', ''Key-Value Pair'' and ''Key-Value List'' Collation Providers in version 2.90. Each of these providers orders and/or combines multiple extraction results in various ways. If any of the results are matched using ''FuzzyRegEx'', the overall confidence score of the collated result must be determined, in some way, based on the confidence of each individual result.
For example, an '''''Array''''' may have three results as its elements. Result 1 may have a confidence score of 100%. Result 2 may have a confidence score of 90%. Result 3 may have a confidence score of 80%. So, what is the confidence of the collated array?  Is it 100%?  Is it 80%?  Is it an average of the three scores (90%)?   
 
For example, an ''Array'' may have three results as its elements. Result 1 may have a confidence score of 100%. Result 2 may have a confidence score of 90%. Result 3 may have a confidence score of 80%. So, what is the confidence of the collated array?  Is it 100%?  Is it 80%?  Is it an average of the three scores (90%)?   
 


[[Category:Articles]]
Previously, the collated result would always take the average of the individual results' confidence scores. The '''''Confidence Mode''''' property allows you to choose ''Average'' to take the average confidence score of the individual results, ''Min'' to take the smallest confidence score of all the individual results, or ''Max'' to take the largest confidence score.
[[Category:Stub]]

Latest revision as of 16:34, 12 May 2025

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


STUB

This article is a stub. It contains minimal information on the topic and should be expanded.

The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Results can be combined, organized into arrays, returned as a key-value pair's value, and more.

The following Collation Providers are available in Grooper:

  • Individual
  • Combine
  • AND
  • Key-Value Pair
  • Key-Value List
  • Array
  • Ordered Array
  • Split
  • Pattern-Based
  • Multi-Column

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

Data Type extractors in Grooper use regular expressions to match a document's text data in order to return a particular piece of information. Extractors serve a variety of purposes. They can be used to populate fields in a Data Model, to separate and classify documents, to break up a document into sections, and more. For the most part, any time part of a document's text data is needed or useful to do something, you need an extractor to find and return it.

Often, this requires something more complex than returning a single result. The relationships between multiple extraction results are often important. The fact that results are physically related to each other on the page, or that text exists between one or more results, or that results are in one order versus another can be used to accomplish various goals in Grooper.

For example, the Individual, Array, and Ordered Array Collation Providers all collate results differently.

Individual

The Individual Collation Provider returns all extraction results individually. This is the default Collation Provider for Data Type extractors.

  1. This Data Type extractor, whose results are seen here, has five child extractors, all passing their own results up to the parent extractor. The child extractors are as follows:
    • Name - This Data Type extractor returns names on the document. Here, US presidents.
    • Date - This Data Type extractor returns dates. Here, presidents' birthdays and inauguration days, depending on the table.
    • City/State - This Data Type extractor returns the city and state values listed for a president's birthplace.
    • Number of Days - This Data Format extractor returns numbers. Here, the number of days in office.
    • Party - This Data Format extractor returns the results of a list of political party names.
  2. The parent Data Type's Collation property is set to Individual.
    • The Collation property determines the Collation Provider used.
  3. In the "Results" panel, you can see that everything each child extractor returns to the parent Data Type is listed as a distinct, individual result. There are 56 results in total, with the first item physically on the page listed first.
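
As a rough mental model of the behavior described above, the sketch below flattens the results of several child extractors and lists them in page order. It is not Grooper code: the Result class, its position fields, and the collate_individual function are hypothetical names invented for illustration, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for one extraction result (not a Grooper type)."""
    extractor: str  # name of the child extractor that produced the result
    text: str       # matched text
    top: float      # vertical position on the page; smaller means higher up
    left: float     # horizontal position; smaller means further left


def collate_individual(child_results):
    """Return every child result on its own, sorted top-to-bottom, then left-to-right."""
    flat = [r for results in child_results for r in results]
    return sorted(flat, key=lambda r: (r.top, r.left))


# Illustrative data only: positions and values are made up for this sketch.
names = [Result("Name", "George Washington", 1.0, 0.5)]
dates = [
    Result("Date", "10/30/1735", 1.5, 2.0),
    Result("Date", "02/22/1732", 1.0, 2.0),
]
individual = collate_individual([names, dates])
for r in individual:
    print(r.extractor, r.text)
```

Nothing is merged or filtered here; every child result survives as its own entry, which is the defining trait of Individual collation.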

Array

The Array provider organizes and returns results much differently.

  • First, it will only return results if multiple extraction results are lined up in a particular order on the page, according to the "layout" set for this provider. For example, an Array collated extractor using a Horizontal Layout will only return results if they are aligned horizontally, one result after another from left to right.
  • Second, instead of each result being returned individually, all results meeting the layout requirements are returned as a single value.

Essentially, an Array collated result is a collection of results that share a layout relationship and are all lined up together (either horizontally, vertically, or in the left/right and top/bottom text flow of the document).

  1. This Data Type extractor has the exact same child extractors, but uses the Array provider instead of Individual.
  2. The parent Data Type's Collation property is set to Array.
  3. The Minimum Elements property defaults to 2.
    • This means the array must contain at least two extraction results. For this extractor, it could be a name and a date. It could be two dates. It could be forty dates. It could be a name, a date, a city/state location, a number, and a political party. It doesn't matter, as long as there are two results.
  4. The Horizontal Layout is set to Enabled.
    • At least one of the three Layout properties must be enabled. Using the Horizontal Layout, only results aligned horizontally with each other will count as an array. Here, this effectively returns full rows of each table, since one extraction result follows the other from left to right in a horizontal line.
  5. Notice we now only return 12 results. Rather than each individual result from each child extractor, results are ordered and returned according to the Array Collation Provider and its configuration.
    • Results are combined into a single result, as long as they are aligned horizontally with each other.
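
The steps above can be approximated with a small sketch. It assumes a simplified model in which each result carries a page position, results count as "aligned horizontally" when their vertical positions differ by no more than a tolerance, and a combined result is the member texts joined by spaces. The Result class, the tolerance parameter, and collate_array_horizontal are hypothetical and are not Grooper's actual implementation; only the Minimum Elements default of 2 comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Hypothetical stand-in for one extraction result (not a Grooper type)."""
    text: str
    top: float   # vertical position on the page
    left: float  # horizontal position on the page


def collate_array_horizontal(results, minimum_elements=2, tolerance=0.1):
    """Group horizontally aligned results into arrays and return each array as one value."""
    rows = []
    for r in sorted(results, key=lambda r: (r.top, r.left)):
        # A result joins the current row if it sits at (nearly) the same height.
        if rows and abs(rows[-1][0].top - r.top) <= tolerance:
            rows[-1].append(r)
        else:
            rows.append([r])
    arrays = []
    for row in rows:
        # Rows with too few elements are not arrays and are dropped.
        if len(row) >= minimum_elements:
            row.sort(key=lambda r: r.left)  # left-to-right within the row
            arrays.append(" ".join(r.text for r in row))
    return arrays


# Illustrative data only: positions and values are made up for this sketch.
results = [
    Result("John Adams", 2.0, 0.5),
    Result("10/30/1735", 2.0, 2.0),
    Result("Federalist", 2.0, 4.0),
    Result("George Washington", 1.0, 0.5),
    Result("02/22/1732", 1.02, 2.0),
    Result("Whig", 5.0, 4.0),  # stands alone on its line, so it is not part of any array
]
arrays = collate_array_horizontal(results)
print(arrays)
```

Raising minimum_elements to 3 in this sketch would drop the two-element row as well, which mirrors how the Minimum Elements property sets the smallest acceptable array.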

Ordered Array

The Ordered Array provider is similar to the Array provider, but it is much more restrictive about how allowable results can be organized. Only arrays whose extracted elements are in the listed order of the child extractors are returned.

  1. This Data Type extractor has the exact same child extractors, but uses the Ordered Array provider.
    • Notice the order of each child extractor. First "Name" then "Date" then "City/State" then "Number of Days" and last, "Party".
  2. The parent Data Type's Collation property is set to Ordered Array.
  3. The Horizontal Layout is set to Enabled, just like our example of the Array provider.
  4. Notice several arrays were tossed out of our "Results" list.
    • The second table has the "Days in Office" column before the "Birthplace". When it came time to find that array, all the elements were there, but they were not in the order of the child elements in the Node Tree. The "Number of Days" extractor comes after the "City/State" extractor.
    • The third table has four out of the five elements present, and in the right order, but is missing the "Political Party" column (picked up by the "Party" extractor). Not only must the array's extracted elements match the order of the child extractors locating them, but all elements must be present.
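
The two rejection cases above (wrong order, missing element) can be illustrated with a short sketch. It assumes each candidate row is already a left-to-right list of (extractor name, text) pairs. The function name collate_ordered_array and the placeholder values are hypothetical; this is a model of the rule as described here, not Grooper's implementation.

```python
# Child extractor order as listed in the example above.
CHILD_ORDER = ["Name", "Date", "City/State", "Number of Days", "Party"]


def collate_ordered_array(rows, child_order=CHILD_ORDER):
    """Keep only rows whose extractor sequence equals child_order exactly."""
    expected = list(child_order)
    return [row for row in rows if [name for name, _ in row] == expected]


# Placeholder values only; each row is listed left-to-right as it appears on the page.
table_1_row = [("Name", "<name>"), ("Date", "<date>"), ("City/State", "<city, state>"),
               ("Number of Days", "<days>"), ("Party", "<party>")]
# "Number of Days" appears before "City/State": every element is present, but out of order.
table_2_row = [("Name", "<name>"), ("Date", "<date>"), ("Number of Days", "<days>"),
               ("City/State", "<city, state>"), ("Party", "<party>")]
# Correct order, but the "Party" element is missing.
table_3_row = [("Name", "<name>"), ("Date", "<date>"), ("City/State", "<city, state>"),
               ("Number of Days", "<days>")]

kept = collate_ordered_array([table_1_row, table_2_row, table_3_row])
print(len(kept))
```

Only the first row survives in this sketch, matching the outcome described for the three tables.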

Key-Value Pair and Key-Value List

Combine (and Combine Methods)

Split

Pattern-Based

AND

Multi-Column

Confidence Mode

The Confidence Mode property is new to the Combine, AND, Array, Ordered Array, Key-Value Pair and Key-Value List Collation Providers in version 2.90. Each of these providers orders and/or combines multiple extraction results in various ways. If any of the results are matched using FuzzyRegEx, the overall confidence score of the collated result must be determined, in some way, based on the confidence of each individual result.

For example, an Array may have three results as its elements. Result 1 may have a confidence score of 100%. Result 2 may have a confidence score of 90%. Result 3 may have a confidence score of 80%. So, what is the confidence of the collated array? Is it 100%? Is it 80%? Is it an average of the three scores (90%)?

Previously, the collated result would always take the average of the individual results' confidence scores. The Confidence Mode property allows you to choose Average to take the average confidence score of the individual results, Min to take the smallest confidence score of all the individual results, or Max to take the largest confidence score.
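
The arithmetic behind the three modes can be sketched as follows, using the 100%, 90%, and 80% scores from the example above. The function collated_confidence is a hypothetical helper written for illustration; only the mode names Average, Min, and Max come from the article.

```python
def collated_confidence(scores, mode="Average"):
    """Combine individual confidence scores (in percent) into one collated score."""
    if not scores:
        raise ValueError("at least one confidence score is required")
    if mode == "Average":
        return sum(scores) / len(scores)
    if mode == "Min":
        return min(scores)
    if mode == "Max":
        return max(scores)
    raise ValueError(f"unknown confidence mode: {mode}")


# Scores from the three-element Array example, expressed as whole percentages.
scores = [100, 90, 80]
average_score = collated_confidence(scores, "Average")  # 90.0
min_score = collated_confidence(scores, "Min")          # 80
max_score = collated_confidence(scores, "Max")          # 100
print(average_score, min_score, max_score)
```

Min is the most conservative choice, since a single weak fuzzy match pulls the whole collated result down, while Max reports the collated result as only as uncertain as its strongest element.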