Collation Provider

From Grooper Wiki

Collation Providers allow Data Type extractor results to be combined, organized, or utilized in specific ways.

Results can be combined, organized into arrays, returned as a key-value pair's value, and more.

The following Collation Providers are available in Grooper:

  • Individual
  • Combine
  • AND
  • Key-Value Pair
  • Key-Value List
  • Array
  • Ordered Array
  • Split
  • Pattern-Based
  • Multi-Column


About

Data Type extractors in Grooper use regular expressions to match a document's text data in order to return a particular piece of information. Extractors serve a variety of purposes. They can be used to populate fields in a Data Model, to separate and classify documents, to break up a document into sections, and more. For the most part, any time part of a document's text data is needed or useful to do something, you need an extractor to find and return it.

Often, this requires something more complex than returning a single result. The relationships between multiple extraction results are often important. Whether results are physically related to each other on the page, whether text exists between one or more results, and whether results appear in one order versus another can all be used to accomplish various goals in Grooper.

For example, the Individual, Array, and Ordered Array Collation Providers all collate results differently.

Individual, Array, and Ordered Array

Individual

The Individual Collation Provider returns all extraction results individually. This is the default Collation Provider for Data Type extractors.

  1. This Data Type extractor, whose results are seen here, has five child extractors, all passing their own results up to the parent extractor. The child extractors are as follows:
    • Name - This Data Type extractor returns names on the document. Here, US presidents.
    • Date - This Data Type extractor returns dates. Here, presidents' birthdays and inauguration days, depending on the table.
    • City/State - This Data Type extractor returns the city and state values listed for a president's birthplace.
    • Number of Days - This Data Format extractor returns numbers. Here, the number of days in office.
    • Party - This Data Format extractor returns matches from a list of political party names.
  2. The parent Data Type's Collation property is set to Individual.
    • The Collation property determines the Collation Provider used.
  3. In the "Results" panel, everything each child extractor returns to the parent Data Type is listed as a distinct, individual result. There are 56 results in total, with the first item physically on the page listed first.

Collation Provider - About 01.png
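
As a rough illustration of Individual collation, the following Python sketch flattens the results of several child extractors into one list and sorts it by page position. It is not Grooper's implementation. The `collate_individual` function, the dictionary layout, the `top` and `left` coordinates, and the sample values are all assumptions made up for this example.

```python
def collate_individual(child_results):
    """Return every child result on its own, ordered by position on the page.

    child_results maps a (hypothetical) child extractor name to a list of
    matches, each a dict with 'text', 'top', and 'left' keys.
    """
    flat = [match for matches in child_results.values() for match in matches]
    # Top-to-bottom, then left-to-right, so the first item on the page is first.
    return sorted(flat, key=lambda m: (m["top"], m["left"]))


# Made-up sample matches; coordinates are arbitrary page units.
sample = {
    "Name": [{"text": "Name A", "top": 100, "left": 10}],
    "Date": [{"text": "01/02/1900", "top": 100, "left": 200}],
    "Party": [{"text": "Party X", "top": 50, "left": 400}],
}

for match in collate_individual(sample):
    print(match["text"])
```

Each match stays a separate entry in the returned list; nothing is merged or discarded.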

Array

The Array Collation Provider organizes and returns results much differently.

  • First, it will only return results if multiple extraction results are lined up in a particular order on the page, according to the "layout" set for this provider. For example, an Array collated extractor using a Horizontal Layout will only return results if they are aligned horizontally, one result after another from left to right.
  • Second, instead of each result being returned individually, all results meeting the layout requirements are returned as a single value.

Essentially, an Array collated result is a collection of results that share a layout relationship: they are all lined up together (either horizontally, vertically, or in the left/right and top/bottom text flow of the document).

  1. This Data Type extractor has the exact same child extractors, but uses the Array Collation Provider instead of Individual.
  2. The parent Data Type's Collation property is set to Array.
  3. The Minimum Elements property defaults to 2.
    • This means the array must contain at least two extraction results. For this extractor, it could be a name and a date. It could be two dates. It could be forty dates. It could be a name, a date, a city/state location, a number, and a political party. It doesn't matter, as long as there are two results.
  4. The Horizontal Layout is set to Enabled.
    • At least one of the three Layout properties must be enabled. Using the Horizontal Layout, only results aligned horizontally with each other will count as an array. Here, this effectively returns full rows of each table, since one extraction result follows the other from left to right in a horizontal line.
  5. Notice we now only return 12 results. Rather than each individual result from each child extractor, results are ordered and returned according to the Array Collation Provider and its configuration.
    • Results are combined into a single result, as long as they are aligned horizontally with each other.

Collation Provider - About 02.png
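
The behavior described above can be approximated with a short Python sketch: results on the same horizontal line are grouped, rows with fewer than the minimum number of elements are dropped, and each remaining row is returned as a single value. This is only a sketch under stated assumptions, not how Grooper works internally. The `Result` class, the `collate_horizontal_arrays` function, the `line_tolerance` parameter, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Result:
    extractor: str  # hypothetical: name of the child extractor that matched
    text: str
    left: float     # x position of the match on the page
    top: float      # y position of the match on the page


def collate_horizontal_arrays(results, minimum_elements=2, line_tolerance=5.0):
    """Group results lying on the same horizontal line into single values."""
    rows = []
    for result in sorted(results, key=lambda r: (r.top, r.left)):
        # Join the current row when the result is vertically close to its first element.
        if rows and abs(rows[-1][0].top - result.top) <= line_tolerance:
            rows[-1].append(result)
        else:
            rows.append([result])

    arrays = []
    for row in rows:
        # Mirrors the Minimum Elements idea: too-short rows are not arrays.
        if len(row) >= minimum_elements:
            row.sort(key=lambda r: r.left)  # left-to-right order within the row
            arrays.append(" ".join(r.text for r in row))
    return arrays


# Made-up sample matches: three on one line, one isolated further down.
results = [
    Result("Name", "Name A", left=10, top=100),
    Result("Date", "01/02/1900", left=200, top=101),
    Result("Party", "Party X", left=400, top=100),
    Result("Name", "Name B", left=10, top=300),
]

print(collate_horizontal_arrays(results))
```

The isolated "Name B" match is not returned, because a row of one element falls below the minimum of two.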

Ordered Array

The Ordered Array Collation Provider is similar to the Array provider, but it is much more restrictive about how allowable results can be organized. Only arrays whose extracted elements are in the listed order of the child extractors are returned.

  1. This Data Type extractor has the exact same child extractors, but uses the Ordered Array Collation Provider.
    • Notice the order of the child extractors: first "Name", then "Date", then "City/State", then "Number of Days", and last, "Party".
  2. The parent Data Type's Collation property is set to Ordered Array.
  3. The Horizontal Layout is set to Enabled, just like our example of the Array Collation Provider.
  4. Notice several arrays were tossed out of our "Results" list.
    • The second table has the "Days in Office" column before the "Birthplace" column. All the elements of the array are present, but they are not in the order of the child extractors in the Node Tree, where the "Number of Days" extractor comes after the "City/State" extractor.
    • The third table has four out of the five elements present, and in the right order, but is missing the "Political Party" column (picked up by the "Party" extractor). Not only must the array's extracted elements match the order of the child extractors locating them, but all elements must be present.

Collation Provider - About 03.png
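
The ordering rule can be sketched in Python as a filter over rows that have already been grouped horizontally: a row is kept only when its elements come from the child extractors in exactly the listed order, with none missing. This is an illustrative sketch, not Grooper's implementation. The `collate_ordered_arrays` function, the row representation, and the sample values are assumptions.

```python
def collate_ordered_arrays(rows, child_order):
    """Keep only rows whose elements follow the child extractor order exactly.

    Each row is a left-to-right list of (extractor_name, text) pairs.
    """
    kept = []
    for row in rows:
        extractors = [name for name, _ in row]
        # Every child must be present, and in the same order as the children.
        if extractors == list(child_order):
            kept.append(" ".join(text for _, text in row))
    return kept


child_order = ["Name", "Date", "City/State", "Number of Days", "Party"]

# Made-up rows shaped like the three tables in the example above.
rows = [
    # All five elements, in child order: kept.
    [("Name", "Name A"), ("Date", "01/02/1900"), ("City/State", "Town, ST"),
     ("Number of Days", "1,461"), ("Party", "Party X")],
    # All five elements, but "Number of Days" precedes "City/State": dropped.
    [("Name", "Name B"), ("Date", "03/04/1901"), ("Number of Days", "2,922"),
     ("City/State", "City, ST"), ("Party", "Party Y")],
    # Right order, but the "Party" element is missing: dropped.
    [("Name", "Name C"), ("Date", "05/06/1902"), ("City/State", "Village, ST"),
     ("Number of Days", "730")],
]

print(collate_ordered_arrays(rows, child_order))
```

Only the first row survives, which matches the two reasons given above for arrays being tossed out.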

Key-Value Pair and Key-Value List

Combine (and Combine Methods)

Split

Pattern-Based

AND

Multi-Column

Version Differences

The AND Collation Provider

The AND Collation Provider is new in version 2.90. In some cases, aspects of this collation could previously be approximated using the Combine or Array providers. However, its functionality is unique and distinct from these two Collation Providers.

Confidence Mode

The Confidence Mode property is new to the Combine, AND, Array, Ordered Array, Key-Value Pair, and Key-Value List Collation Providers in version 2.90. Each of these providers orders and/or combines multiple extraction results in various ways. If any of the results are matched using FuzzyRegEx, the overall confidence score of the collated result must be determined in some way from the confidence of each individual result.

For example, an Array may have three results as its elements. Result 1 may have a confidence score of 100%. Result 2 may have a confidence score of 90%. Result 3 may have a confidence score of 80%. So, what is the confidence of the collated array? Is it 100%? Is it 80%? Is it an average of the three scores (90%)?

Previously, the collated result would always take the average of the individual results' confidence scores. The Confidence Mode property allows you to choose Average to take the average confidence score of the individual results, Min to take the smallest confidence score of all the individual results, or Max to take the largest confidence score.
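
The arithmetic behind the three modes can be shown with a small Python sketch using the scores from the example above (100%, 90%, and 80%). The `collated_confidence` function is hypothetical and only mirrors the property's option names. It does not reflect how Grooper computes scores internally.

```python
from statistics import mean


def collated_confidence(scores, mode="Average"):
    """Combine individual confidence scores (in percent) into one score."""
    if not scores:
        raise ValueError("at least one score is required")
    if mode == "Average":
        return mean(scores)
    if mode == "Min":
        return min(scores)
    if mode == "Max":
        return max(scores)
    raise ValueError(f"unknown mode: {mode}")


scores = [100, 90, 80]  # the three array elements from the example

for mode in ("Average", "Min", "Max"):
    print(mode, collated_confidence(scores, mode))
# Average 90
# Min 80
# Max 100
```

Min is the most conservative choice, since one weak fuzzy match lowers the score of the whole collated result, while Max lets one strong match carry it.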