2023:Ordered Array (Collation Provider)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Ordered Array is a Collation Provider option for Data Type extractors. Ordered Array finds sequences of values where one result is present for each extractor, in the order the extractors appear, according to a specified horizontal, vertical, or text-flow layout.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains a Project with resources used in examples throughout this article. The second contains one or more Batches of sample documents.

About

Ordered Array is one of the Collation Providers and can be used to organize data, depending on what you want to extract from your documents. All of the Collation Providers (except for Individual) essentially take multiple results and combine them into one. Ordered Array returns results based on the orientation of the information: if the data is lined up horizontally or vertically, you must enable the corresponding layout property for Grooper to return it.

From this basic description, you might think that Ordered Array and Array do the same thing. They are very similar, but Ordered Array has additional rules. The order of the data has to match the order of the child extractor objects under the object the collation is set on. Also, every child extractor must return a result for Ordered Array to work.

Essentially, similar to an Array, an Ordered Array collated result is a collection of results that share a layout relationship and are all lined up together (either horizontally, vertically, or in the left/right and top/bottom text flow of the document). However, unlike Array, the order and number of the results matter.

Before continuing with this tutorial, you should have a good understanding of how the Array Collation Provider works. Take a look at our wiki page on Arrays.

Array vs. Ordered Array

First we're going to discuss some of the differences between an Array and Ordered Array collated extractor. There are three main differences you need to be aware of:

  • For an Ordered Array the order of your child extractors matters, whereas with Array it does not.
  • All elements of your extractor must be present in order for Ordered Array to extract. Unlike Array, Ordered Array has no Minimum Elements property.
  • Unlike Array, Ordered Array will not collect information that is repeated unless there is a repeated child extractor.

Below we will illustrate what this looks like.

Array

  1. First, let's look at this Data Type being collated as an Array.
  2. The Horizontal Layout is enabled since the information we want extracted is in a horizontal layout on the document.


  1. This Data Type has three Value Readers. Each Value Reader is extracting one word: "documents," "Grooper," and "data."
  2. We get the first line here because all three words are present.
  3. This next line is returned because we have all three words present, even though the words are in a different order.


  1. We get this full line because all of the words are present, even if one of the words is repeated.
  2. Since three of the four words in this line are part of our extraction, we are getting a result here.


  1. Notice that we are collecting only two words here.
  2. This is possible because our Maximum Elements property is set to "2".


Ordered Array

  1. Now we have changed the Collation property to an Ordered Array with the Horizontal Layout enabled.
  2. Notice that we have fewer results returned now.


  1. This first line is returned because all three words are present AND they appear in the same order as the Value Reader children under the Data Type.


  1. We are no longer collecting this line, though. This is because the words are not in the order established by the child extractors. For Ordered Arrays, order matters.
  2. We are also no longer collecting this line. This is because only two of the three words from the extractor are present. In an Ordered Array, all terms must be present in order for the line to be extracted. There is no Minimum Elements property.


  1. We are collecting this line, but we are only collecting the first three words. An Array will collect any duplicated terms, whereas an Ordered Array will not, unless the duplicated term is added as a fourth extractor.
  2. Here we are collecting the same results as the Array. The first three terms are in order of the child extractors and the fourth term is not part of the extraction.
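The comparison above can be sketched in a few lines of Python. This is only a toy model of the behavior described in this article, not Grooper's implementation: the function names, the `min_elements` parameter, and the sample lines are all made up for illustration, and layout geometry is reduced to "words on one line".

```python
# Toy model of Array vs. Ordered Array on a single horizontal line.
# Hypothetical names throughout; this is not Grooper code.

TERMS = ["documents", "Grooper", "data"]  # child extractors, in their listed order


def array_match(tokens, terms, min_elements=1):
    """Array-like: collect every token any extractor matches, in any order."""
    hits = [t for t in tokens if t in terms]
    return hits if len(hits) >= min_elements else None


def ordered_array_match(tokens, terms):
    """Ordered-Array-like: exactly one hit per extractor, in extractor order."""
    hits = [t for t in tokens if t in terms]
    n = len(terms)
    # Look for the first run of adjacent hits that mirrors the extractor order.
    for start in range(len(hits) - n + 1):
        if hits[start:start + n] == list(terms):
            return hits[start:start + n]
    return None


# Made-up sample lines that mimic the cases walked through above.
SAMPLE_LINES = [
    "documents Grooper data",          # all present, in order
    "Grooper data documents",          # all present, wrong order
    "documents Grooper",               # one term missing
    "documents Grooper data data",     # one term repeated
    "documents Grooper data capture",  # extra word no extractor matches
]

if __name__ == "__main__":
    for line in SAMPLE_LINES:
        tokens = line.split()
        print(line, "|", array_match(tokens, TERMS), "|",
              ordered_array_match(tokens, TERMS))
```

In this sketch, the out-of-order line and the two-word line come back from `array_match` but not from `ordered_array_match`, and the repeated word is dropped by the ordered version, which mirrors the three differences listed at the top of this section.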


How To

In the Array Wiki article, we began configuring an Array collated extractor for documents containing multiple street addresses. In the last document of the Batch we ran into a problem that the Array provider could not fix. We will continue where we left off in that article and solve this problem with an Ordered Array.

  1. In the Array Wiki article we set the Collation to Array, enabled Vertical Layout, set a Maximum Distance of 0.25 in, and set Enforce Line Boundaries to True.
  2. On the last document in the Batch, we ran into this situation where none of the settings for Array would allow us to collect these addresses.


  1. We are going to reset the Collation to default settings and then configure this as an Ordered Array.
  2. We are still going to have two extractors referenced on this Data Type. Click the ellipsis icon.


  1. These two extractors are selected.
  2. The selected extractors show up in the order you select them on this side of the window.
  3. Remember, order matters when configuring an Ordered Array. If the extractors are not in the desired order, use these buttons to change the order.


  1. Now we need to set the Collation method. Click the hamburger icon to access the drop down.
  2. Select Ordered Array.


  1. These addresses are stacked on top of each other in a "vertical layout".
  2. An Ordered Array, just like Array collation, needs a layout option selected. We're going to enable the Vertical Layout property.


  1. Now we are collecting each address as desired. Since the order of the extractors matters, we do not need to define a Maximum Distance or Enforce Line Boundaries for this example.
  2. The Address Line and the City, State, Zip extractors are collated and returned as one result.


Order Matters

  1. Just a reminder that the extractor order matters. If we were to invert the order of the extractors, we would get a very different result.


  1. Now we only get a result if the City, State, Zip extractor is found before the Address Line extractor. Be careful how you set up your extractors.
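To see why inverting the extractors changes the result, here is a rough Python sketch of vertical collation over stacked lines. It is an illustration under stated assumptions, not how Grooper works internally: the two regular expressions stand in for the Address Line and City, State, Zip extractors, the sample addresses are invented, and "vertical layout" is simplified to "consecutive lines".

```python
import re

# Hypothetical stand-ins for the two referenced extractors.
ADDRESS_LINE = re.compile(r"^\d+\s+\w+(\s\w+)*$")
CITY_STATE_ZIP = re.compile(r"^[A-Za-z ]+, [A-Z]{2} \d{5}$")


def collate_vertical(lines, extractors):
    """Group consecutive lines that match the extractors one-for-one, in order."""
    results = []
    n = len(extractors)
    i = 0
    while i + n <= len(lines):
        window = lines[i:i + n]
        if all(rx.match(line) for rx, line in zip(extractors, window)):
            results.append(window)
            i += n  # consume the lines that formed this result
        else:
            i += 1
    return results


# Invented sample data: two stacked addresses.
LINES = [
    "123 Main Street",
    "Oklahoma City, OK 73102",
    "456 Oak Avenue",
    "Tulsa, OK 74103",
]

if __name__ == "__main__":
    print(collate_vertical(LINES, [ADDRESS_LINE, CITY_STATE_ZIP]))
    print(collate_vertical(LINES, [CITY_STATE_ZIP, ADDRESS_LINE]))
```

With the extractors in the intended order, the sketch returns both addresses. With the order inverted, the only result it finds pairs the first city line with the second street line, which is the kind of unwanted grouping the warning above is about.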


Execution Order

It is important to note that there is an Execution Order hierarchy for how different extractors fire.

There are three different ways to set an extractor:

  • A Local Extractor
  • Child extractor objects
  • Referenced Extractors

The extractors fire in that same order:

  1. The Local Extractor fires off first.
  2. Any child extractors fire off second.
  3. Referenced Extractors are the last extractors to return a result.


Testing Execution Order

To illustrate the order of execution, we have set up a Data Type capturing three pieces of data using the three different ways to set an extractor in the following example.

  1. The Local Extractor has been configured with a List Match to collect the word "documents".


  1. The child object is configured with a List Match to collect the word "Grooper".


  1. In the Data Type's Referenced Extractors, we are referencing an object that is extracting the word "data".


  1. Now, when set to Ordered Array, the first set (documents, Grooper, data) is returned because of the order the extractors are firing.
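The interaction between Execution Order and Ordered Array can be modeled with a small Python sketch. Everything here is hypothetical (function names, parameters, and the sample line); it only restates the rule described above, that the Local Extractor comes first, child extractors second, and Referenced Extractors last, and then feeds that sequence into a toy ordered match.

```python
# Toy model only: hypothetical names, not Grooper's API.

def firing_order(local=None, children=(), referenced=()):
    """Return extractors in the order this article says they fire."""
    order = []
    if local is not None:
        order.append(local)   # 1. Local Extractor
    order.extend(children)    # 2. child extractor objects
    order.extend(referenced)  # 3. Referenced Extractors
    return order


def ordered_match(tokens, terms):
    """Return the first run of hits that follows the extractor order, else None."""
    hits = [t for t in tokens if t in terms]
    n = len(terms)
    for start in range(len(hits) - n + 1):
        if hits[start:start + n] == list(terms):
            return hits[start:start + n]
    return None


if __name__ == "__main__":
    line = "documents Grooper data".split()

    # Same setup as the example above: local, child, then referenced.
    terms = firing_order(local="documents", children=["Grooper"], referenced=["data"])
    print(terms, ordered_match(line, terms))

    # Swapping which extractor is local vs. referenced changes the expected order.
    swapped = firing_order(local="data", children=["Grooper"], referenced=["documents"])
    print(swapped, ordered_match(line, swapped))
```

In the sketch, the first configuration matches the line and the swapped one does not, because the expected sequence follows how the extractors are set on the Data Type rather than how the words are listed in any one place.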


Ordered Array and Data Tables

An Ordered Array can also be used to return tables from a page using the Row Match Extract Method.

This works by using an Ordered Array Collation on a Data Type with child objects to find one row of a table. That extractor is then used to detect all of the rows in the table so that it can determine where the table begins and ends. Grooper then can use the child objects of the Data Type to understand where the columns should be. Once the rows and columns are understood, Grooper can then find the values in the individual cells.

  1. In the example below, we have a Data Type with five child objects, each one extracting one part of a row in the table on the page.
  2. With the Collation set to Individual...
  3. ... each item in the table is being returned individually.


  1. When we change the Collation to Ordered Array with the Horizontal Layout enabled...
  2. ... each row in the table is returned as an individual result.


  1. Select the Data Table.
  2. Click the hamburger icon to the right of the Extract Method property.
  3. Select Row Match from the drop down menu.


  1. Set the Row Extractor to a Reference.
  2. Reference the configured Ordered Array extractor.

Notice that the Data Columns are named exactly the same as the child objects of the Data Type we are referencing. This is important for Grooper to be able to determine where the columns in the table are.


  1. Now the table will be extracted properly.
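As a rough mental model of Row Match, the Python sketch below treats each child extractor as a named pattern and turns every line that satisfies all of them, in left-to-right order, into a row keyed by column name. This is an assumption-heavy illustration rather than Grooper's algorithm: the three column names, their patterns, the sample page, and the rule that cells are separated by two or more spaces are all invented for the example (the article's own Data Type has five child objects).

```python
import re

# Hypothetical child extractors in left-to-right order.
# Each name doubles as the Data Column name, as the note above requires.
CHILD_EXTRACTORS = [
    ("Quantity", r"\d+"),
    ("Description", r"[A-Za-z ]+"),
    ("Unit Price", r"\d+\.\d{2}"),
]


def match_row(line, extractors):
    """Return a {column: value} dict if the line has one cell per extractor, in order."""
    # Assumption: cells on a line are separated by two or more spaces.
    cells = re.split(r"\s{2,}", line.strip())
    if len(cells) != len(extractors):
        return None
    row = {}
    for (name, pattern), cell in zip(extractors, cells):
        if not re.fullmatch(pattern, cell):
            return None
        row[name] = cell
    return row


def extract_table(lines, extractors):
    """Keep only the lines that match as complete rows."""
    rows = []
    for line in lines:
        row = match_row(line, extractors)
        if row is not None:
            rows.append(row)
    return rows


# Invented sample page.
PAGE = [
    "Invoice 1001",
    "Qty  Description  Unit Price",
    "2  Widget  4.50",
    "10  Large Gasket  12.00",
    "Total  21.00",
]

if __name__ == "__main__":
    for row in extract_table(PAGE, CHILD_EXTRACTORS):
        print(row)
```

Only the two item lines survive in this sketch; the title, header, and total lines are skipped because they do not supply one value per extractor in the expected order. Keying each value by the extractor's name is the sketch's analogue of naming the Data Columns the same as the child objects.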