2023:Ordered Array (Collation Provider): Difference between revisions

From Grooper Wiki
{{AutoVersion}}


<blockquote>{{#lst:Glossary|Ordered Array}}</blockquote>

{|class="wip-box"
|
'''WIP'''
|
This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.
|}

{|class="download-box"
|
[[File:Asset 22@4x.png]]
|
You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains a '''Project''' with resources used in examples throughout this article. The second contains one or more '''Batches''' of sample documents.
* [[Media:2023 Wiki Ordered-Array Project.zip]]
* [[Media:2023 Wiki Ordered-Array Batches.zip]]
|}
<section begin="glossary" />
<blockquote>
An '''''Ordered Array''''' is one of many '''''Collation Providers''''' you can use in Grooper to combine or organize extracted data based on the data's layout relationship.
</blockquote>
<section end="glossary" />
== About ==
'''''Ordered Array''''' is one of the '''''Collation Providers''''' and can be used to organize data, depending on what you want to extract from your documents. All of the '''''Collation Providers''''' (except for ''Individual'') essentially take multiple results and combine them into one. '''''Ordered Array''''' returns results based on the orientation of the information on the page. Whether the data is lined up horizontally or vertically, you must enable the corresponding layout property for Grooper to return the information.


{|class="attn-box"
|
|
Before continuing with this tutorial, it is advised that you have a good understanding of how the '''''Array''''' '''''Collation Provider''''' works. Take a look at our wiki page on [[2023:Array (Collation Provider)|Arrays]].
|}


==== Array ====

# First, let's look at this [[image:GrooperIcon_DataType.png]]'''Data Type''' being collated as an '''''Array'''''.
# The '''''Horizontal Layout''''' is ''enabled'' since the information we want extracted is in a horizontal layout on the document.

[[File:2023 Ordered Array 01 About 01 Array vs Ordered Array 01.png]]

# This '''Data Type''' has three [[image:GrooperIcon_ValueReader.png]]'''Value Readers'''. Each '''Value Reader''' is extracting one word: "documents," "Grooper," and "data."
# We get the first line here because all three words are present.
# This next line is returned because we have all three words present, even though the words are in a different order.
[[File:2023 Ordered Array 01 About 01 Array vs Ordered Array 02.png]]

# We get this full line because all of the words are present, even if one of the words is repeated.
# Since three of the four words in this line are part of our extraction, we are getting a result here.

[[File:2023 Ordered Array 01 About 01 Array vs Ordered Array 03.png]]

# Notice that we are collecting only two words here.


[[File:2023 Ordered Array 01 About 01 Array vs Ordered Array 05.png]]

# This first line is returned because all three words are present AND they appear in the same order as the '''Value Reader''' children under the '''Data Type'''.

[[File:2023 Ordered Array 01 About 01 Array vs Ordered Array 06.png]]

# We are no longer collecting this line, though, because the words are not in the order established by the child extractors. For '''''Ordered Arrays''''', order matters.


[[File:2023 Ordered Array 01 About 01 Array vs Ordered Array 07.png]]

# We are collecting this line, but only the first three words. An '''''Array''''' will collect duplicated terms, whereas an '''''Ordered Array''''' will not unless the duplicated term is added as a fourth extractor.


[[File:2023 Ordered Array 01 About 01 Array vs Ordered Array 08.png]]

== How To ==


In the [[2023:Array (Collation Provider)|Array Wiki article]], we began configuring an '''''Array''''' collated extractor for documents containing multiple street addresses. In the last document of the '''Batch''' we ran into a problem that the '''''Array''''' provider could not fix. We will continue where we left off in that article and solve this problem with an '''''Ordered Array'''''.

# In the [[2023:Array (Collation Provider)|Array Wiki article]] we set the '''''Collation''''' to ''Array'', enabled '''''Vertical Layout''''', set a '''''Maximum Distance''''' of 0.25 in, and set '''''Enforce Line Boundaries''''' to ''True''.
# On the last document in the '''Batch''', we ran into this situation where none of the settings for '''''Array''''' would allow us to collect these addresses.




# We are going to reset the '''''Collation''''' to default settings and then configure this as an '''''Ordered Array'''''.
# We are still going to have two extractors referenced on this '''Data Type'''. Click the ellipsis icon.

[[File:2023 Ordered Array 02 How To 01 Address Example 02.png]]


[[File:2023 Ordered Array 02 How To 01 Address Example 06.png]]

=== Order Matters ===


[[File:2023 Ordered Array 02 How To 02 Order Matters 02.png]]
=== Execution Order ===
It is important to note that there is an Execution Order hierarchy for how different extractors fire.
There are three different ways to set an extractor:
* A Local Extractor
* Child extractor objects
* Referenced Extractors
The priority of this hierarchy is in that order:
# The '''''Local Extractor''''' fires off first.
# Any child extractors fire off second.
# '''''Referenced Extractors''''' are the last extractors to return a result.
[[File:2023 Ordered Array 02 How To 03 Execution Order 01.png]]
<big>'''Testing Execution Order'''</big>
To illustrate the order of execution, we have set up a '''Data Type''' capturing three pieces of data using the three different ways to set an extractor in the following example.
# The '''''Local Extractor''''' has been configured with a ''List Match'' to collect the word "documents".
[[File:2023 Ordered Array 02 How To 03 Execution Order 02.png]]
#<li value=2> The child object is configured with a ''List Match'' to collect the word "Grooper".
[[File:2023 Ordered Array 02 How To 03 Execution Order 03.png]]
#<li value=3> In the '''Data Type's''' '''''Referenced Extractors''''', we are referencing an object that is extracting the word "data".
[[File:2023 Ordered Array 02 How To 03 Execution Order 04.png]]
#<li value=4> Now, when set to ''Ordered Array'', the first set (documents, Grooper, data) is returned because of the order the extractors are firing.
[[File:2023 Ordered Array 02 How To 03 Execution Order 05.png]]
=== Ordered Array and Data Tables ===
An ''Ordered Array'' can also be used to return tables from a page using the ''Row Match'' '''''Extract Method'''''.
This works by using an ''Ordered Array'' '''''Collation''''' on a '''Data Type''' with child objects to find one row of a table. That extractor is then used to detect all of the rows in the table so that it can determine where the table begins and ends. Grooper then can use the child objects of the '''Data Type''' to understand where the columns should be. Once the rows and columns are understood, Grooper can then find the values in the individual cells.
# In the example below, we have a '''Data Type''' with five child objects, each one extracting one part of a row in the table on the page.
# With the '''''Collation''''' set to ''Individual''...
# ... each item in the table is being returned individually.
[[File:2023 Ordered Array 02 How To 04 Data Tables 01.png]]
#<li value=4> When we change the '''''Collation''''' to ''Ordered Array'' with the '''''Horizontal Layout''''' enabled...
# ... each row in the table is returned as an individual result.
[[File:2023 Ordered Array 02 How To 04 Data Tables 02.png]]
#<li value=6> Select the '''Data Table'''.
# Click the hamburger icon to the right of the '''''Extract Method''''' property.
# Select ''Row Match'' from the drop down menu.
[[File:2023 Ordered Array 02 How To 04 Data Tables 03.png]]
#<li value=9> Set the '''''Row Extractor''''' to a ''Reference''.
# Reference the configured ''Ordered Array'' extractor.
{|class="attn-box"
|
|
Notice that the '''Data Columns''' are named exactly the same as the child objects of the '''Data Type''' we are referencing. This is important for Grooper to be able to determine where the columns in the table are.
|}
[[File:2023 Ordered Array 02 How To 04 Data Tables 04.png]]
#<li value=11> Now the table will be extracted properly.
[[File:2023 Ordered Array 02 How To 04 Data Tables 05.png]]

Latest revision as of 10:02, 22 November 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Ordered Array is a Collation Provider option for Data Type extractors. Ordered Array finds sequences of values where one result is present for each extractor, in the order they appear, according to a specified horizontal, vertical or text-flow layout.

You may download the ZIP(s) below and upload it into your own Grooper environment (version 2023). The first contains a Project with resources used in examples throughout this article. The second contains one or more Batches of sample documents.

About

Ordered Array is one of the Collation Providers and can be used to organize data, depending on what you want to extract from your documents. All of the Collation Providers (except for Individual) essentially take multiple results and combine them into one. Ordered Array returns results based on the orientation of the information on the page. Whether the data is lined up horizontally or vertically, you must enable the corresponding layout property for Grooper to return the information.

From this basic information you might think that Ordered Array and Array just do the same thing. They are very similar, but Ordered Array has additional rules. The order of the data has to match the order of the child extractor objects under the object the collation is set on. Also, all of the child extractors must return something for Ordered Array to work.

Essentially, similar to an Array, an Ordered Array collated result is a collection of results that share a layout relationship and are all lined up together (either horizontally, vertically, or in the left/right and top/bottom text flow of the document). However, unlike Array, the order and number of the results matter.

Before continuing with this tutorial it is advised that you have a good understanding of how the Array Collation Provider works. Take a look at our wiki page on Arrays.

Array vs. Ordered Array

First we're going to discuss some of the differences between an Array and Ordered Array collated extractor. There are three main differences you need to be aware of:

  • For an Ordered Array the order of your child extractors matters, whereas with Array it does not.
  • All elements of your extractor must be present in order for Ordered Array to extract. There is no Minimum Elements property like when using Array.
  • Unlike Array, Ordered Array will not collect information that is repeated unless there is a repeated child extractor.

Below we will illustrate what this looks like.

Array

  1. First, let's look at this Data Type being collated as an Array.
  2. The Horizontal Layout is enabled since the information we want extracted is in a horizontal layout on the document.


  1. This Data Type has three Value Readers. Each Value Reader is extracting one word: "documents," "Grooper," and "data."
  2. We get the first line here because all three words are present.
  3. This next line is returned because we have all three words present, even though the words are in a different order.


  1. We get this full line because all of the words are present, even if one of the words is repeated.
  2. Since three of the four words in this line are part of our extraction, we are getting a result here.


  1. Notice that we are collecting only two words here.
  2. This is possible because our Maximum Elements property is set to "2".


Ordered Array

  1. Now we have changed the Collation property to an Ordered Array with the Horizontal Layout enabled.
  2. Notice that we have fewer results returned now.


  1. This first line is returned because all three words are present AND they appear in the same order as the Value Reader children under the Data Type.


  1. We are no longer collecting this line, though, because the words are not in the order established by the child extractors. For Ordered Arrays, order matters.
  2. We are also no longer collecting this line. This is because only two of the three words from the extractor are present. In an Ordered Array, all terms must be present in order for the line to be extracted. There is no Minimum Elements property.


  1. We are collecting this line, but only the first three words. An Array will collect duplicated terms, whereas an Ordered Array will not unless the duplicated term is added as a fourth extractor.
  2. Here we are collecting the same results as the Array. The first three terms are in order of the child extractors and the fourth term is not part of the extraction.
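
The behavior illustrated above can be approximated in a few lines of Python. This is only a conceptual sketch, not Grooper's implementation: the function names `array_like` and `ordered_array_like`, the `minimum_elements` parameter (standing in for the Minimum Elements property), and the simplification that each extractor matches one literal word are all assumptions made for illustration.

```python
from typing import List, Optional

# Child extractor order, as in the example: "documents", "Grooper", "data".
EXTRACTORS = ["documents", "Grooper", "data"]


def hits_on_line(line: str, extractors: List[str]) -> List[str]:
    """Return the words on the line matched by any extractor, in reading order."""
    words = [w.strip('.,"') for w in line.split()]
    return [w for w in words if w in extractors]


def array_like(line: str, extractors: List[str],
               minimum_elements: Optional[int] = None) -> Optional[List[str]]:
    """Order-insensitive collation: keep every hit (duplicates included)
    as long as enough distinct extractors matched."""
    hits = hits_on_line(line, extractors)
    needed = len(extractors) if minimum_elements is None else minimum_elements
    return hits if len(set(hits)) >= needed else None


def ordered_array_like(line: str, extractors: List[str]) -> Optional[List[str]]:
    """Order-sensitive collation: exactly one hit per extractor,
    in extractor order, with no option to relax the element count."""
    hits = hits_on_line(line, extractors)
    n = len(extractors)
    for start in range(len(hits) - n + 1):
        window = hits[start:start + n]
        if window == extractors:
            return window
    return None


if __name__ == "__main__":
    for sample in ["documents Grooper data",
                   "Grooper documents data",
                   "documents Grooper data data",
                   "documents Grooper"]:
        print(sample, "|", array_like(sample, EXTRACTORS),
              "|", ordered_array_like(sample, EXTRACTORS))
```

In this sketch, a reordered line passes `array_like` but fails `ordered_array_like`, a repeated word is kept only by `array_like`, and a line missing one word is returned only when `minimum_elements` is lowered.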


How To

In the Array Wiki article, we began configuring an Array collated extractor for documents containing multiple street addresses. In the last document of the Batch we ran into a problem that the Array provider could not fix. We will continue where we left off in that article and solve this problem with an Ordered Array.

  1. In the Array Wiki article we set the Collation to Array, enabled Vertical Layout, set a Maximum Distance of 0.25 in, and set Enforce Line Boundaries to True.
  2. On the last document in the Batch, we ran into this situation where none of the settings for Array would allow us to collect these addresses.


  1. We are going to reset the Collation to default settings and then configure this as an Ordered Array.
  2. We are still going to have two extractors referenced on this Data Type. Click the ellipsis icon.


  1. These two extractors are selected.
  2. The selected extractors show up in the order you select them on this side of the window.
  3. Remember, order matters when configuring an Ordered Array. If the extractors are not in the desired order, use these buttons to change the order.


  1. Now we need to set the Collation method. Click the hamburger icon to access the drop down.
  2. Select Ordered Array.


  1. These addresses are stacked on top of each other in a "vertical layout".
  2. An Ordered Array, just like Array collation, needs a layout option selected. We're going to enable the Vertical Layout property.


  1. Now we are collecting each address as desired. Since the order of the extractors matters, we do not need to define a Maximum Distance or Enforce Line Boundaries for this example.
  2. The Address Line and the City, State, Zip extractors are collated and returned as one result.


Order Matters

  1. Just a reminder that the extractor order matters. If we were to invert the order of the extractors, we would get a very different result.


  1. Now we only get a result if the City, State, Zip extractor is found before the Address Line extractor. Be careful how you set up your extractors.
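
To see why inverting the extractor order changes the result, here is a rough Python model of a vertical, order-sensitive collation. It is a sketch under assumptions, not Grooper's logic: the two regular expressions are simplified stand-ins for the Address Line and City, State, Zip extractors, `collate_vertical` is a hypothetical helper, and the sample addresses are made up.

```python
import re
from typing import List, Pattern

# Simplified, hypothetical stand-ins for the two referenced extractors.
ADDRESS_LINE = re.compile(r"^\d+\s+\w+(\s\w+)*$")
CITY_STATE_ZIP = re.compile(r"^[A-Za-z ]+,\s*[A-Z]{2}\s+\d{5}$")


def collate_vertical(lines: List[str], ordered_patterns: List[Pattern[str]]) -> List[str]:
    """Combine consecutive lines into one result when each line matches
    the pattern at the same position in ordered_patterns."""
    results = []
    n = len(ordered_patterns)
    i = 0
    while i + n <= len(lines):
        window = lines[i:i + n]
        if all(p.match(text) for p, text in zip(ordered_patterns, window)):
            results.append("\n".join(window))
            i += n  # consume the matched lines
        else:
            i += 1
    return results


# Made-up sample data: two stacked addresses.
SAMPLE_LINES = [
    "123 Main Street",
    "Oklahoma City, OK 73102",
    "456 Elm Avenue",
    "Tulsa, OK 74103",
]

if __name__ == "__main__":
    print(collate_vertical(SAMPLE_LINES, [ADDRESS_LINE, CITY_STATE_ZIP]))
    print(collate_vertical(SAMPLE_LINES, [CITY_STATE_ZIP, ADDRESS_LINE]))
```

With the extractors in the intended order, the sketch returns both addresses. With the order inverted, it returns a single result that pairs the first city line with the second street line, which is the kind of unwanted result the step above warns about.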


Execution Order

It is important to note that there is an Execution Order hierarchy for how different extractors fire.

There are three different ways to set an extractor:

  • A Local Extractor
  • Child extractor objects
  • Referenced Extractors

The priority of this hierarchy is in that order:

  1. The Local Extractor fires off first.
  2. Any child extractors fire off second.
  3. Referenced Extractors are the last extractors to return a result.


Testing Execution Order

To illustrate the order of execution, we have set up a Data Type capturing three pieces of data using the three different ways to set an extractor in the following example.

  1. The Local Extractor has been configured with a List Match to collect the word "documents".


  2. The child object is configured with a List Match to collect the word "Grooper".


  3. In the Data Type's Referenced Extractors, we are referencing an object that is extracting the word "data".


  4. Now, when set to Ordered Array, the first set (documents, Grooper, data) is returned because of the order in which the extractors fire.
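
The hierarchy described in this section can be summarized as a small Python model. The class `DataTypeSketch` and its field names are hypothetical and exist only to illustrate the order; Grooper itself is configured through its UI, not through code like this.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataTypeSketch:
    """Hypothetical model of a Data Type's three extractor sources."""
    local_extractor: Optional[str] = None
    child_extractors: List[str] = field(default_factory=list)
    referenced_extractors: List[str] = field(default_factory=list)

    def execution_order(self) -> List[str]:
        """Local Extractor first, then children, then Referenced Extractors."""
        order: List[str] = []
        if self.local_extractor is not None:
            order.append(self.local_extractor)
        order.extend(self.child_extractors)
        order.extend(self.referenced_extractors)
        return order


if __name__ == "__main__":
    example = DataTypeSketch(
        local_extractor="documents",
        child_extractors=["Grooper"],
        referenced_extractors=["data"],
    )
    print(example.execution_order())
```

Because an Ordered Array expects results in extractor order, the sequence produced by `execution_order` in this sketch is the sequence the words would need to follow on the page.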


Ordered Array and Data Tables

An Ordered Array can also be used to return tables from a page using the Row Match Extract Method.

This works by using an Ordered Array Collation on a Data Type with child objects to find one row of a table. That extractor is then used to detect all of the rows in the table so that it can determine where the table begins and ends. Grooper then can use the child objects of the Data Type to understand where the columns should be. Once the rows and columns are understood, Grooper can then find the values in the individual cells.

  1. In the example below, we have a Data Type with five child objects, each one extracting one part of a row in the table on the page.
  2. With the Collation set to Individual...
  3. ... each item in the table is being returned individually.


  4. When we change the Collation to Ordered Array with the Horizontal Layout enabled...
  5. ... each row in the table is returned as an individual result.


  6. Select the Data Table.
  7. Click the hamburger icon to the right of the Extract Method property.
  8. Select Row Match from the drop down menu.


  9. Set the Row Extractor to a Reference.
  10. Reference the configured Ordered Array extractor.

Notice that the Data Columns are named exactly the same as the child objects of the Data Type we are referencing. This is important for Grooper to be able to determine where the columns in the table are.


  11. Now the table will be extracted properly.
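
The idea behind pairing an ordered row extractor with identically named columns can be sketched in Python. This is a loose analogy rather than how Row Match is implemented: the column names, the per-column patterns, and the helpers `build_row_regex` and `extract_table` are all hypothetical, and the example uses three columns instead of five to stay short.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical columns, listed in the order they appear across a row.
# Each name plays the role of both a child extractor and a Data Column.
ROW_PARTS: List[Tuple[str, str]] = [
    ("Quantity", r"\d+"),
    ("Item", r"[A-Za-z][A-Za-z ]*?"),
    ("Unit Price", r"\d+\.\d{2}"),
]


def build_row_regex(parts: List[Tuple[str, str]]) -> Pattern[str]:
    """Join the column patterns in order, so a line only matches when
    every column value is present and in sequence."""
    groups = [f"(?P<c{i}>{pattern})" for i, (_, pattern) in enumerate(parts)]
    return re.compile(r"^\s*" + r"\s+".join(groups) + r"\s*$")


def extract_table(lines: List[str], parts: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Return one dict per matching row, keyed by column name."""
    row_re = build_row_regex(parts)
    rows = []
    for line in lines:
        m = row_re.match(line)
        if m:
            rows.append({name: m.group(f"c{i}") for i, (name, _) in enumerate(parts)})
    return rows


if __name__ == "__main__":
    page = ["Qty Item Price", "2 Blue Widget 4.50", "10 Bolt 0.25", "Total 9.25"]
    for row in extract_table(page, ROW_PARTS):
        print(row)
```

Lines that do not contain every column value in order, such as the header and the total line in this made-up page, are skipped, and each matched value lands in the column that shares its name.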