2023:Tabular Layout (Table Extract Method): Difference between revisions

From Grooper Wiki
 
(14 intermediate revisions by 2 users not shown)
Line 1: Line 1:
<blockquote style="font-size:14pt">
{{AutoVersion}}
'''''Tabular Layout''''' is one of '''Grooper's''' methods of extracting table data from documents. It is available to '''Data Table''' objects via their '''''Extract Method''''' property. This method uses column header values determined by the '''Data Columns'''' '''''Header Extractor''''' results (or labels collected for the '''Data Columns''' when a '''''Labeling Behavior''''' is enabled), as well as '''Data Column''' '''''Value Extractor''''' results, to model a table's structure and return its values.
</blockquote>


{| class="wikitable" style="margin:left"
<blockquote>{{#lst:Glossary|Tabular Layout}}</blockquote>
! Previous Versions
|-
|
[[Tabular Layout (Table Extraction Method) - 2021|Tabular Layout (Table Extraction Method) - 2021]]
<br>
|}


'''''Tabular Layout''''' is "Label Set aware": you can configure it with or without labels. This article details both methods. For more information on Label Sets, please visit the full [[Label Sets]] article.


 
{|class="download-box"
|
 
[[File:Asset 22@4x.png]]
{|cellpadding="10" cellspacing="5"  
|
|-
|style="font-size:14pt; color:#f89420; border: 2px solid #f89420; width:40px"|[[File:Asset 22@4x.png]]
|style="border: 2px solid #f89420"|
You may download and import the file(s) below into your own Grooper environment (version 2023). There are two '''Batches''' with the example document(s) discussed in this tutorial, as well as two '''Projects''' configured according to its instructions.
<br>
<br>
Please upload the '''Projects''' to your '''Grooper''' environment before uploading the '''Batches'''. This will allow the documents within the '''Batches''' to maintain their classification status.
* [[Media:Tabular Layout - Batch (v2023).zip]]
* [[Media:2023_Wiki_Tabular-Layout_Projects.zip]]
* [[Media:Tabular Layout - Project (v2023).zip]]
* [[Media:2023_Wiki_Tabular-Layout_Batches.zip]]
 
If you are using Label Sets, please download and import these files.
* [[Media:Tabular Layout - Batch - With Label Sets (v2023).zip]]
* [[Media:Tabular Layout - Project - With Label Sets (v2023).zip]]
|}
|}


== About ==
Many tables label the columns so the reader knows what the data in that column corresponds to. How do you know the unit price for an item on an invoice?  Typically, that item is in a table and one of the columns of that table is labeled "Unit Price" or something similar. Once you read the labels for each column (also called "column headers"), you the reader know where the table begins (below the column headers) and can identify the data in each row (by understanding what the column headers refer to).


This is also the basic idea behind the '''''Tabular Layout''''' '''''Extraction Method'''''. It too uses column header labels to "read" tables on documents, at least as the first step in modeling the table's structure. Once '''Grooper''' knows where a column is, as identified by the column's header label, it can extract data from each cell in each row of that column.


The '''''Tabular Layout''''' method can establish column header locations in one of two ways:
Line 42: Line 28:
#* Effectively, the labels take the place of the '''''Header Extractor''''' results (or alternatively the '''''Header Row Extractor''''' results)


Once the column header locations are established, the next thing '''Grooper''' needs to do is figure out where each row is. Tabular data is most often dynamic data. A table on one document might have two rows. The same table on the next might have twenty. How does '''Grooper''' know where each row is?


This is done by configuring at least one '''Data Column's''' '''''Value Extractor''''' property. More than one, or even all, may be configured. Depending on how complicated the table is, you may need to configure extractors for multiple columns.


Generally, there is at least one column in a table that is always present for every row in the table. If you can use an extractor to locate that data below its corresponding column header, that gives you a way of finding each row in the table. This allows '''Grooper''' to form a "row instance" for each row. Once the row instance is established, '''Grooper''' can then collect the various cell values for the various additional columns from the row instance.


If locating column headers and locating rows using column extractors was all that was involved in '''''Tabular Layout''''', that alone would make it a powerful tabular extraction method. What makes the '''''Tabular Layout''''' method even more powerful is its further configurability. Is every row in the table a single line or are the rows "multiline"?  Do you need more fine-tuned data extraction from a cell's value or the row itself once the row instance is detected?  Do you need to establish a table "footer" to limit the number of rows extracted?  We will address these issues and more in the [[#Advanced Setup Considerations]] section of this article.


{|class="fyi-box"
{|class="fyi-box"
Line 55: Line 41:
'''FYI'''
'''FYI'''
|
|
If you're familiar with the '''''Header-Value''''' table extraction method, you should see some similarities between it and the '''''Tabular Layout''''' method. Indeed, both methods use column headers and '''Data Column''' '''''Value Extractors''''' to collect table data.


'''''Tabular Layout''''' should be seen as an improvement on '''''Header-Value''''' for the following reasons:
Line 65: Line 51:
== Basic Setup ==


'''''Tabular Layout''''' can be configured with or without the use of Label Sets. In either case, the basic setup is the same:


# Establish column headers for each '''Data Column'''.
Line 72: Line 58:
# Test extraction and configure further as necessary.


With Label Sets or without, the setup is extremely similar. On top of that, there's nothing about using Label Sets that alters '''''Tabular Layout's''''' extraction logic. '''Grooper''' uses the same logic to model the table's structure and collect data for each cell. The biggest difference is how column headers are determined in step #1.
* Without Label Sets, column headers are established using extractors, defined using the '''Data Columns'''' '''''Header Extractor''''' property (or alternatively using the '''Data Table's''' '''''Header Row Extractor''''' property)
* With Label Sets, column headers are established using labels, defined when collecting labels for each '''Document Type'''. The '''Data Columns'''' labels effectively take the place of the '''''Header Extractor''''' property's results.


=== Tabular Layout Without Label Sets ===


<tabs style="margin:20px">
<tab name="Overview" style="margin:20px">
=== Overview ===
 
This tutorial will cover the basic configuration of the '''''Tabular Layout''''' method ''without'' Label Sets, using extractors to collect column headers instead. We will use invoices for our document set and collect the following data from their tables detailing line item information:
* Item Number - The vendor's id number for the item ordered for each row.
* Description - The description of each item ordered for each row.
Line 89: Line 72:
* Line Total - The total price for the number of items ordered (In other words, the quantity ordered multiplied by the unit price)


{|cellpadding=10 cellspacing=5
 
|valign=top style="width:40%"|
The basic steps will be as follows:


Line 97: Line 79:
#* Alternatively, you may configure a '''''Header Row Extractor''''' set on the '''Data Table''' (This property is found in the '''''Tabular Layout''''' sub-properties).
# Assign a '''''Value Extractor''''' for at least one '''Data Column'''.
#* For example, we may expect to find a quantity for each item shipped on an invoice, regardless of the vendor. There's always a column with a "Quantity" or "QTY" or "Shipped" or some similar header.
#* Since this data is also present on ''every row'', this will provide the information necessary to find each row in the table.
#* While you need ''at least'' one '''Data Column's''' '''''Value Extractor''''' configured to detect rows, multiple columns may be used to detect rows.
#** Furthermore, a '''Data Column's''' '''''Value Extractor''''' will either perform "Primary Extraction" to perform row detection or "Secondary Extraction" to extract data from already detected rows. We will discuss using multiple columns to detect rows and the differences between "Primary" and "Secondary Extraction" in the [[#Advanced Setup Considerations]] section of this article.
|valign=top|
 
[[File:2023_TabularLayout_001_Overview_01.png]]
|-
 
|valign=top|
 
<br>
#<li value=3> Set the '''Data Table''' object's '''''Extract Method''''' property to ''Tabular Layout''.</li>
#* And configure any '''''Tabular Layout''''' properties as needed. We will discuss many of these properties, and why and how to use them, in the [[#Advanced Setup Considerations]] section of this article.
# Test to ensure the table's data is collected.
|valign=top|
 
[[File:2023_TabularLayout_001_Overview_02.png]]
|}
In a perfect world, you're done at that point.  As you can see in this example, we've populated a table.  Data is collected for all four '''Data Columns''' for each row on the document.


However, the world is rarely perfect.  We will discuss some further configuration considerations to help you get the most out of this table extraction method in the [[#Advanced Setup Considerations]] section below.
</tab>
<tab name="1. Configure Header Extractors" style="margin:20px">


=== 1. Configure Header Extractors ===


As far as ''strict'' requirements for the '''''Tabular Layout''''' method go, you must ''at minimum'' establish column headers for each '''Data Column''' you wish to extract.


Line 125: Line 104:
* FYI:  If the invoice lists both a "quantity ordered" and a "quantity shipped" column, we will be collecting the quantity ''shipped''.


{|cellpadding=10 cellspacing=5
 
|valign=top style="width:40%"|
<br>
# Select the '''Data Column'''.
# Select the '''''Header Extractor''''' property.
#* Here you will set an extractor to locate the column header on the document for the selected '''Data Column'''.
# Using the dropdown selector, select the extractor (Extractor Node or Value Extractor) you wish to configure to return the column header.
#* You can use whatever extractor you want to get the job done. You may select ''Reference'' to reference a '''Data Type''' or '''Value Reader''' extractor node you've configured already. Or, you can select one of Grooper's Value Extractors to configure extraction locally.
#* We're going to select ''List Match''.
|
 
[[File:2023_TabularLayout_002_Configure-Header-Extractors_01.png]]
|-
 
|valign=top|
 
<br>
The '''''List Match''''' extractor is well suited for our purposes here. Ultimately, we will enter a list of various ways a "Quantity" column can be labeled.


# For example, this document labels quantities of each item ordered as "HRS / QTY".
# So, we've added <code>HRS / QTY</code> to the '''''Local Entries''''' list.
# Other documents use the label "Quantity" or "Shipped". So, we've added <code>Quantity</code> and <code>Shipped</code> to the list as well.


You would then continue adding variations to the list until all variations of the "Quantity" column's header labels are extracted for every variation of the table.
* Or more generally, until a result for the column header is extracted using whatever extractor you've chosen to configure.
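
As a rough illustration of the idea behind matching a column header against a list of label variations, here is a minimal Python sketch. It is not Grooper's '''''List Match''''' implementation: the function name <code>find_header</code>, the sample header line, and the case-insensitive substring search are all assumptions made for the example.

```python
# Hypothetical illustration only; this is not Grooper's List Match implementation.
from typing import Optional, Tuple

# Header label variations taken from the tutorial's Local Entries list.
QUANTITY_LABELS = ["HRS / QTY", "Quantity", "Shipped"]


def find_header(line: str, labels: list[str]) -> Optional[Tuple[str, int]]:
    """Return (matched label, character offset) for the first label found in the line."""
    lowered = line.lower()
    for label in labels:
        offset = lowered.find(label.lower())
        if offset != -1:
            return label, offset
    return None


# An assumed header line from an invoice table.
header_line = "ITEM  DESCRIPTION  HRS / QTY  RATE  AMOUNT"
match = find_header(header_line, QUANTITY_LABELS)
print(match)
```

Each new document layout that labels the column differently would simply add another entry to the list, which mirrors the "keep adding variations" workflow described above.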
|valign=top|
 
[[File:2023_TabularLayout_002_Configure-Header-Extractors_02.png]]
|-
|valign=top|
=== Pro Tip:  Stacked Labels ===


You will often find "stacked labels" in tables. These are multi-word labels broken up across multiple lines in the table's header.


# For example, this document's "Quantity" column uses "Qty Shp." for its label.
Line 159: Line 133:
# We can add "Qty Shp." to our list of header labels.
# However, we will not get a result returned for the document.
|valign=top|
 
[[File:2023_TabularLayout_002_Configure-Header-Extractors_03.png]]
|-
 
|valign=top|
 
<br>
We can easily resolve this by enabling the '''''Vertical Wrap''''' feature.
* This feature is only available to the '''''List Match''''' extractor. This is one of the reasons why '''''List Match''''' is so useful for extracting column headers.


To enable '''''Vertical Wrap''''':
Line 172: Line 145:
# With '''''Vertical Wrap''''' enabled, the extractor is able to match and return items in the list that wrap vertically on multiple lines.
#* In our case, our stacked label "Qty Shp." is now returned.
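
To see why a stacked label defeats a plain single-line match, and what a vertical-wrap style match has to do instead, consider the following Python sketch. It is a simplified stand-in, not Grooper's '''''Vertical Wrap''''' logic: it assumes fixed-width text lines, one label word per line, and a hypothetical column <code>tolerance</code> parameter.

```python
# Hypothetical illustration only; Grooper's Vertical Wrap logic is not shown in this article.
from typing import Optional, Tuple


def find_stacked_label(lines: list[str], label: str, tolerance: int = 2) -> Optional[Tuple[int, int]]:
    """Find a label whose words are stacked on consecutive lines at about the same column.

    Returns (line index, column) of the first word, or None if no stacked match exists.
    """
    words = label.split()
    for row, line in enumerate(lines):
        col = line.find(words[0])
        if col == -1:
            continue
        matched = True
        # Each following word must appear on the next line, near the same column.
        for step, word in enumerate(words[1:], start=1):
            if row + step >= len(lines):
                matched = False
                break
            next_col = lines[row + step].find(word)
            if next_col == -1 or abs(next_col - col) > tolerance:
                matched = False
                break
        if matched:
            return row, col
    return None


# An assumed two-line table header where "Qty" sits directly above "Shp.".
header_lines = [
    "Item    Qty   Unit",
    "No.     Shp.  Price",
]

# A plain single-line search fails because the label is split across lines.
single_line_hit = any("Qty Shp." in line for line in header_lines)
stacked_hit = find_stacked_label(header_lines, "Qty Shp.")
print(single_line_hit, stacked_hit)
```

The sketch only handles fully stacked labels (one word per line); a real implementation would also need to handle labels that wrap partway through.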
|valign=top|
 
[[File:2023_TabularLayout_002_Configure-Header-Extractors_04.png]]
|-
|valign=top|
=== Repeat Until All Data Columns Are Configured ===


You will repeat the same process for each '''Data Column''' you want to collect.


Line 184: Line 156:




Once the '''''Header Extractor''''' for each '''Data Column''' is configured, Grooper will "know" where our tables "start". However, all the actual data in the table is defined by its ''rows''. How does Grooper know where each row is?  We will discuss that in the next tab.


|valign=top|
[[File:2023_TabularLayout_002_Configure-Header-Extractors_05.png]]
|}


For our document set, we used the following lists of header column labels:
Line 245: Line 215:
You may have noticed <code>Part Number/Description</code> is present in both the "Item Number" and "Description" columns' header lists.


This can happen. Depending on a table's format, what would normally be divided up between two columns on other documents may be jammed into one. '''''Tabular Layout''''' has methods to account for this, using what's called "Secondary Extraction".
:&bull; For more information on Secondary Extraction, please visit the <span style="background-color:white; padding:3px">[[#Primary VS Secondary Extraction]]</span> portion of this article.
|}
|}
</tab>
<tab name="2. Assign a Data Column's Value Extractor" style="margin:20px">


=== 2. Assign a Data Column's Value Extractor ===


This step is all about '''''row detection'''''.


So far, all we've done is establish column header positions on each document. But that's not where the data is. The table's data is in the ''rows''.


As it stands, Grooper doesn't know anything about the rows in the tables.  It doesn't know the size of each row.  It doesn't know what kind of data is supposed to be in the rows.  Maybe most importantly, it doesn't know ''how many'' rows there are. Tables tend to be dynamic.  They may have 3 rows on one document and 300 on the next.  Grooper needs a way of detecting this.


{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
To detect rows, we need at least one '''Data Column's''' '''''Value Extractor''''' property configured.  For each result the extractor produces below the column's header, Grooper will create one row instance.


The key thing to keep in mind is that this data ''must'' be present on every row. You'll want to pick a column whose data is always present for every row, where a row would be considered invalid if the information wasn't in that cell.
 
In our case, we will choose the "Quantity" '''Data Column'''. We always expect (for the time being anyway) there to be a quantity listed for the line item on the invoice.


# We will use this '''Value Reader''' for our demonstration.
#* However, in the real world, the extraction world is your oyster. You'll configure an extractor to best target the data in whatever table column you're trying to extract.
# This is a fairly simple '''''Pattern Match''''' extractor designed to return numeric data (including currency).
# The regex is a fairly simple pattern to match generic quantities.
#* It'll match decimal values from 0 and above with two decimal places optional.
# We've also edited our '''''Prefix''''' and '''''Suffix Patterns''''' so that the pattern must be surrounded by a space character before and after, with an optional dollar sign before the number.
# As you can see, we get five results below the "Quantity" label.
#* When we assign this '''Value Reader''' to the "Quantity" '''Data Column''', we should then get five rows when this table extracts.
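
The exact regular expression used by this '''Value Reader''' is only visible in the screenshot, so the following Python sketch uses an assumed pattern that fits the description above: a non-negative number with an optional two-digit decimal part, surrounded by whitespace, with an optional dollar sign in front. The prefix and suffix conditions are folded into lookaround assertions here, which is a simplification of Grooper's separate '''''Prefix''''' and '''''Suffix Pattern''''' properties.

```python
# Hypothetical illustration only; the pattern below is an assumption, not the tutorial's actual regex.
import re

# Value: whole number, optionally followed by exactly two decimal places.
# Prefix: whitespace, then an optional dollar sign. Suffix: whitespace.
QUANTITY_PATTERN = re.compile(r"(?<=\s)\$?(\d+(?:\.\d{2})?)(?=\s)")

# An assumed line of text from an invoice row (note the leading and trailing spaces).
row_text = " 3 Widget bolts $4.25 $12.75 "

# findall returns the captured number without the dollar sign.
values = QUANTITY_PATTERN.findall(row_text)
print(values)
```

A pattern this generic matches quantities, prices, and totals alike, so it will return hits well outside the "Quantity" column.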
|valign=top|
 
[[File:2023_TabularLayout_003_Assign-a-Data-Column%27s-Value-Extractor_01.png]]
|-
 
|valign=top|
 
<br>
We do get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data.


# Will this result present a problem?  Will we get an extra row for its result?
#* No. That result is ''above'' the header label <code>HRS / QTY</code> established by the '''Data Column's''' '''''Header Extractor'''''.
#* The '''''Tabular Layout''''' method presumes rows are ''below'' column labels. Any and all results above the first instance of the column's headers will be ignored.
# What about these matching results on the same line?  Will the extra results create additional row instances?
#* No. These results are misaligned with the "Quantity" '''Data Column's''' header. They are too far to the right to be considered under the column header. They will be ignored.
#* Only results aligned with the "Quantity" '''Data Column's''' header will create a row instance.
# What about these results?  Will they produce a row?
#* No. These results are also misaligned with the "Quantity" '''Data Column's''' header.
#* That said, if these ''were'' aligned with the "Quantity" '''Data Column's''' header, they ''would'' produce row instances.
#* When you are building your own '''Data Column''' extractors, pay close attention to results below the column's header. They have the most potential to produce false positive results, producing erroneous rows.
#** That said, there are a multitude of ways to avoid false positive row results when using '''Data Columns'''' '''''Value Extractors''''' to detect rows. We will discuss this more in the [[#Advanced Setup Considerations]] portion of this article.
|
 
[[File:2023_TabularLayout_003_Assign-a-Data-Column%27s-Value-Extractor_02.png]]
|-
 
|valign=top|
 
<br>
With our extractor ready to go, all we need to do is assign it to the "Quantity" '''Data Column''' using its '''''Value Extractor''''' property.


Line 313: Line 276:
At a ''bare minimum'', you must configure one '''Data Column's''' '''''Value Extractor''''' to perform row detection.

However, multiple columns may be used to perform row detection by configuring their corresponding '''Data Columns'''' '''''Value Extractor''''' properties. For more information on using multiple columns in row detection (as well as row detection in general) please visit the <span style="background-color:white; padding:3px">[[#Advanced Row Detection]]</span> section of this article.
|}
|}
|valign=top|
 
[[File:2023_TabularLayout_003_Assign-a-Data-Column%27s-Value-Extractor_03.png]]
|}
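The row-detection rules walked through in this step (a hit only creates a row when it sits below the column header and lines up with it) can be sketched in a few lines of Python. This is a minimal illustration, not Grooper's implementation: the <code>Box</code> type, the coordinates, and the use of simple horizontal overlap with the header as the "alignment" test are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in page coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float


def overlaps_horizontally(a: Box, b: Box) -> bool:
    """True when the two boxes share any horizontal extent."""
    return a.left < b.right and b.left < a.right


def detect_rows(header: Box, hits: list[Box]) -> list[Box]:
    """Keep only hits that sit below the header and line up with it."""
    rows = [
        hit for hit in hits
        if hit.top >= header.bottom and overlaps_horizontally(hit, header)
    ]
    return sorted(rows, key=lambda hit: hit.top)


# Hypothetical coordinates for an "HRS / QTY" header and four numeric hits.
header = Box(left=300, top=200, right=380, bottom=215)
hits = [
    Box(320, 120, 350, 135),  # above the header: ignored
    Box(320, 230, 350, 245),  # below and aligned: creates a row
    Box(500, 230, 540, 245),  # same line but too far right: ignored
    Box(320, 260, 350, 275),  # below and aligned: creates a row
]
rows = detect_rows(header, hits)
print(len(rows))
```

Only the two aligned hits below the header survive, which mirrors why the generic numeric extractor above still yields the expected rows despite its extra matches.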
 


So far, we have:
Line 324: Line 287:
# Configured the '''''Value Extractor''''' for at least one '''Data Column'''.


For fairly simple table structures, we now have the two things the '''''Tabular Layout''''' method needs to extract data. Now, all we need to do is tell the '''Data Table''' object we want to use the '''''Tabular Layout''''' method. We do this by setting its '''''Extract Method''''' property to ''Tabular Layout''.
</tab>
<tab name="3. Set Extract Method to Tabular Layout and Test" style="margin:20px">


==== 3. Set Extract Method to Tabular Layout ====
{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
A '''Data Table's''' extraction method is set using the '''''Extract Method''''' property. To enable the '''''Tabular Layout''''' method, do the following.


# Select a '''Data Table''' object in your '''Data Model'''.
Line 338: Line 296:
# Select the '''''Extract Method''''' property.
# Using the dropdown menu, select ''Tabular Layout''.
|
 
[[File:2023_TabularLayout_004_Set-Extract-Method-to-Tabular-Layout-and-Test_01.png]]
|-
|valign=top|
==== 4. Test ====

Now, let's test out what we have and see what we get!


Line 349: Line 306:
# Press the "Test Extraction" button.
# The results show up in the "Data Element Preview" window.
#* Success!  Our table's data is collected!
 
|valign=top|
[[File:2023_TabularLayout_004_Set-Extract-Method-to-Tabular-Layout-and-Test_02.png]]
|-
 
|valign=top|
 
<br>
So, how was Grooper able to do this? For the ''Tabular Layout'' method, the '''Data Table''' is populated using primarily two pieces of information: column header locations established by the '''Data Columns'''' '''''Header Extractors''''' and row locations detected by a '''Data Column's''' '''''Value Extractor'''''.
* Remember, we configured '''''Header Extractors''''' for ''all'' '''Data Columns'''. We configured ''only'' the "Quantity" '''Data Column's''' '''''Value Extractor'''''.

First, it's all about establishing column headers.
# The '''Data Columns'''' '''''Header Extractors''''' establish the location of each column.
# Grooper then determines the ''width'' of these columns.
#* If table lines are present, Grooper can detect those line locations via a '''Line Detection''' (or '''Line Removal''') '''IP Command'''. Grooper will "snap" the column's width to the detected line boundaries, expanding the cell's width (and height) to the boundaries around it.
#** Table lines give human readers an indicator of where the data "lives" (or is contained). If it's in the box, it belongs to the column. If it's out of the box, it belongs to a different column.
#* If table lines are ''not'' present (as is the case for this document), Grooper performs a variety of gutter-detection operations, analyzing the whitespace between columns to determine their widths.
#** ''Most commonly'' Grooper will average the distance between one header label and the next.
|valign=top|
 
[[File:2021-tabular-layout-without-label-sets-14.png]]
|-
 
|valign=top|
 
<br>
Second, it's all about detecting rows. Rows are detected using a '''Data Column's''' '''''Value Extractor'''''.
* In our case, we configured the "Quantity" '''Data Column's''' '''''Value Extractor'''''.
* FYI:  When a '''Data Column's''' extractor is used to detect rows, it is considered "Primary Extraction". A '''Data Column's''' extractor can also be used for "Secondary Extraction", performed ''after'' rows are detected. For more on this, please visit the [[#Primary VS Secondary Extraction]] section of this article.


# Rows are only detected below the detecting '''Data Column's''' header.
Line 378: Line 332:
# For each result returned, Grooper establishes one row instance.
#* Since our extractor was designed to return decimal values, and Grooper found five decimal values below our column header, Grooper detected five rows.
|valign=top|
 
[[File:2021-tabular-layout-without-label-sets-15.png]]
|-
 
|valign=top|
 
<br>
The ''Tabular Layout'' method now has the two pieces of information it needs to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like. Grooper can infer the table's grid-like structure using the column and row positions.


# It has column instances for each '''Data Column'''.
Line 389: Line 342:
# It has row instances for each detected row.
#* Again, established by the detecting '''Data Column's''' '''''Value Extractor'''''.
#** FYI:  More than one '''Data Column''' can be used to detect rows. Please visit the [[#Advanced Row Detection]] section for more information.
|valign=top|
 
[[File:2021-tabular-layout-without-label-sets-16.png]]
|-
 
|valign=top|
 
<br>
With these column and row instances established, '''Grooper''' can form data instances for each cell of the table.

#<li value=3> Each cell's data simply lies where the columns and rows intersect.</li>
#* For '''Data Columns''' ''with'' their '''''Value Extractors''''' configured, values are either collected using "Primary" or "Secondary Extraction". Please see the [[#Primary VS Secondary Extraction]] portion for more information.
#* For '''Data Columns''' ''without'' their '''''Value Extractors''''' configured, values are collected by returning the OCR or native text data within the geometric boundaries of the cell.
#** This is ''extremely'' beneficial for data that is difficult to extract using pattern matching.
#** For example, invoice item numbers and descriptions are notoriously difficult to pattern match. By using something in the table that ''is'' easy to pattern match, like our item quantities, we can use '''''Tabular Layout''''' to model the table structure and collect the other column values that are ''not''.
|
 
[[File:2021-tabular-layout-without-label-sets-17.png]]
|}
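The way column positions and row positions combine into a grid of cells can be sketched as follows. This is an illustration under stated assumptions, not Grooper's actual algorithm: column boundaries are placed at the midpoint of the gap between adjacent header labels (one simple reading of "averaging the distance" between headers), each row runs from its detected top to the next row's top, and a word is assigned to a cell by its center point. All names and coordinates are hypothetical.

```python
def column_bounds(headers, page_width):
    """Split the gap between adjacent header labels at its midpoint.

    headers maps a column name to the (left, right) x-extent of its label.
    """
    ordered = sorted(headers.items(), key=lambda item: item[1][0])
    bounds = {}
    for i, (name, (left, right)) in enumerate(ordered):
        if i == 0:
            start = 0.0
        else:
            start = (ordered[i - 1][1][1] + left) / 2
        if i == len(ordered) - 1:
            end = float(page_width)
        else:
            end = (right + ordered[i + 1][1][0]) / 2
        bounds[name] = (start, end)
    return bounds


def build_table(headers, row_tops, words, page_width, table_bottom):
    """Drop each word (text, x_center, y_center) into its row/column cell."""
    bounds = column_bounds(headers, page_width)
    edges = list(row_tops) + [table_bottom]
    cells = [{name: [] for name in headers} for _ in row_tops]
    for text, x, y in words:
        for i in range(len(row_tops)):
            if edges[i] <= y < edges[i + 1]:
                for name, (start, end) in bounds.items():
                    if start <= x < end:
                        cells[i][name].append(text)
    return [{name: " ".join(parts) for name, parts in row.items()} for row in cells]


# Hypothetical header extents, detected row tops, and word centers.
headers = {"Description": (50, 150), "Quantity": (300, 380), "Line Total": (450, 520)}
words = [
    ("Labor", 70, 237), ("2.00", 335, 237), ("150.00", 480, 237),
    ("Site", 65, 267), ("visit", 95, 267), ("1.50", 335, 267), ("90.00", 480, 267),
]
table = build_table(headers, row_tops=[230, 260], words=words,
                    page_width=600, table_bottom=290)
for row in table:
    print(row)
```

Note that only the row tops come from a value extractor here. The "Description" text is collected purely from its position in the grid, which is the benefit described above for columns that are hard to pattern match.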
</tab>
<tab name = "4. Alternative Configuration:  Header Row Extractor" style="margin:20px">
==== 5. Alternative Configuration: Header Row Extractor ====

You may alternatively establish column headers for the ''entire'' row of header labels, using the '''''Header Row Extractor''''' property. Instead of configuring each '''Data Column's''' '''''Header Extractor''''', you would configure an extractor to return the whole table's row of column headers and use named instances (either Named Groups or child extractors) to establish each '''Data Column's''' header.


There are two reasons using a '''''Header Row Extractor''''' can be beneficial:
Line 419: Line 369:
|⚠
|
Configuring the '''''Header Row Extractor''''' will ''override'' all '''Data Columns'''' '''''Header Extractors'''''.

You should choose to ''either'' establish column headers using the '''''Header Row Extractor''''' ''or'' do so using each '''Data Column's''' '''''Header Extractor'''''.
Line 426: Line 376:
|}
|}


===== Craft the Extractor =====
{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
To configure the '''''Header Row Extractor''''', you will need to craft an extractor (or multiple extractors for multiple table formats). We will choose to do that first by creating a few '''Value Readers''' and '''Data Types'''.


# We've started creating a '''Value Reader''' to use as a '''''Header Row Extractor''''' for the "Fairdeal" '''Document Type''' in our '''Content Model'''.
# We're using a ''Pattern Match'' extractor.
#* We can easily match the header row for "Fairdeal" invoices using a simple regex pattern.
# Your first task will be to extract the ''entire'' row of column headers. The pattern we have here will do just that.
<pre>
DESCRIPTION\t
Line 445: Line 392:
</pre>
#<li value=4> The pattern matches the whole row of column headers.</li>
|valign=top|
 
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_01.png]]
|}


This is only step one. Next, we need some way of breaking up the result into each component column. How does Grooper know what part of the result is the label for the "Description" column or the "Quantity" column?  It doesn't until you break up the result into ''named instances'' that match the names of your '''Data Columns''' in the '''Data Table'''. These named instances can either be:
* Named Groups
* Named Child Extractors


====== Assign Named Instances:  Using Named Groups ======
{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|


When pattern matching a header row, you can do this with Named Groups.
Line 472: Line 417:
#* Since the names match, Grooper will use the Named Group's instance to establish the column header for the "Description" '''Data Column'''.
#* Effectively, the Named Group supplies the result for the '''Data Column's''' '''''Header Extractor'''''.
#** BE AWARE! This also means the Named Group ''replaces'' the result of a '''Data Column's''' '''''Header Extractor'''''. If you configure a '''''Header Row Extractor''''', it will supersede any '''''Header Extractor''''' on any '''Data Column'''.
|valign=top|
 
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_02.png]]
|-
 
|valign=top|
 
<br>
#<li value=4> You would then continue placing Named Groups around the remaining column headers, chunking out the regex and matching each chunk with the corresponding '''Data Column'''.</li>


Line 494: Line 438:
|⚠
|
Please note space characters are not allowed in Named Groups. You must replace a space character <code> </code> with an underscore <code>_</code>.

For example, to match the "Item Number" '''Data Column''', we named the group <code>Item_Number</code>.
|}
|}
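The idea of splitting one header-row match into named instances can be tried outside Grooper with any regex engine that supports named groups. The sketch below uses Python's <code>re</code> module, which writes named groups as <code>(?P&lt;name&gt;...)</code> (the syntax in Grooper's own pattern editor may differ). The header labels, the tab separators, and the <code>header_instances</code> helper are hypothetical, chosen only to illustrate the mechanism.

```python
import re

# Hypothetical header row; labels and tab separators are illustrative only.
HEADER_ROW = re.compile(
    r"(?P<Description>DESCRIPTION)\t"
    r"(?P<Item_Number>ITEM NO)\t"
    r"(?P<Quantity>HRS / QTY)"
)


def header_instances(line):
    """Map each Data Column name to the (start, end) span of its header label."""
    match = HEADER_ROW.search(line)
    if match is None:
        return {}
    return {
        # Group names cannot contain spaces, so "Item_Number" stands in
        # for the "Item Number" Data Column.
        group.replace("_", " "): match.span(group)
        for group in HEADER_ROW.groupindex
    }


instances = header_instances("DESCRIPTION\tITEM NO\tHRS / QTY")
print(instances)
```

Each named group yields its own sub-span of the overall match, which is the piece of information needed to locate an individual column header within the header row.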
|valign=top|
 
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_03.png]]
|-
|valign=top|
====== Assign Named Instances:  Using Named Child Extractors ======
You may also create and use the named instances by naming a '''Data Type's''' child extractors to match the names of your '''Data Columns'''.


Line 511: Line 453:
# Each child extractor's name matches one of our '''Data Columns'''.
# Inspecting the header row's instance (by right-clicking the result in the '''''Results''''' list), we can see more clearly how this result's sub-instances will be supplied as each '''Data Column's''' header.
|valign=top|
 
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_04.png]]
|-
 
|valign=top|
 
<br>
# In the Instance Viewer, we can select any of our sub-instances from our child extractors.
# This result is what will be used for the "Description" column's header.
# Since the name of the child extractor (and therefore also sub-instance) matches the "Description" '''Data Column''', the result will be used in place of its '''''Header Extractor'''''.
|valign=top|
 
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_05.png]]
|-
|valign=top|
===== Assign the Header Row Extractor =====
Now that we have a couple examples of header row extractors, we can assign them using '''''Tabular Layout's''''' '''''Header Row Extractor''''' property.


Line 532: Line 471:
# Using the '''''Header Row Extractor''''' property, configure your header row extractor.
# In our case, we set '''''Header Row Extractor''''' to ''Reference'' and pointed to one of the extractors detailed previously.
# When the '''Data Table''' extracts, the '''''Tabular Layout''''' method will use the '''''Header Row Extractor's''''' named instances to establish each '''Data Column's''' header locations.
 
|valign=top|
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_06.png]]
|-
 
|valign=top|
 
<br>
{|class="attn-box"
{|class="attn-box"
|-
|-
Line 546: Line 483:
|}
|}


The extractor we referenced was very specifically designed with only one table format in mind. It works for invoices assigned the "Fairdeal" '''Document Type''', but no others.
# If we were to test our '''Data Table''' on a different document with a different table structure, we would get no results.
# Because the extractor doesn't match this table format's row of column headers, it can't establish any column headers for this document.
Line 555: Line 492:
* Craft a single extractor that matches multiple row header formats.
* Or, use '''Data Element Overrides''' to configure a unique '''''Header Row Extractor''''' for each '''Document Type'''.
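As a rough sketch of the first option, an extractor that copes with more than one header row format can be modeled as a list of patterns tried in order, each exposing the same named instances. The two layouts, their labels, and the <code>find_header_row</code> helper below are hypothetical. In Grooper this would more likely be built from child extractors than from Python code, so treat it only as an illustration of the fallback logic.

```python
import re

# Hypothetical header rows for two invoice layouts; labels are illustrative only.
HEADER_ROW_FORMATS = [
    re.compile(r"(?P<Description>DESCRIPTION)\t(?P<Quantity>HRS / QTY)\t(?P<Line_Total>SUBTOTAL)"),
    re.compile(r"(?P<Quantity>QTY)\t(?P<Description>ITEM)\t(?P<Line_Total>AMOUNT)"),
]


def find_header_row(text):
    """Return {column name: matched label} from the first format that matches."""
    for pattern in HEADER_ROW_FORMATS:
        match = pattern.search(text)
        if match:
            return {
                name.replace("_", " "): match.group(name)
                for name in pattern.groupindex
            }
    return None  # no known format matched, so no column headers are established


print(find_header_row("QTY\tITEM\tAMOUNT"))
```

Whichever format matches, the same column names come back, so the rest of the table setup does not need to know which layout the document used.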
|valign=top|
 
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_07.png]]
|}


===== Why Bother? =====


There are two main reasons why '''''Header Row Extractors''''' can be beneficial:
Line 565: Line 501:
# To better match column headers with poor OCR using Fuzzy RegEx.


====== To Throw Out False Positives ======
{|cellpadding = 10 cellspacing=5
|valign=top style="width:40%"|


The first reason to use a '''''Header Row Extractor''''' is to help eliminate false positive column header matches.
Line 573: Line 507:
# Take our "Line Total" '''Data Column'''.
# Its '''''Header Extractor''''' is configured with a '''''List Match''''' extractor, matching a variety of possible header labels for this column.
|valign=top|
 
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_08.png]]
|-
 
|valign=top|
 
<br>
# This table format uses the label <code>SUBTOTAL</code> for the "Line Total" column.
# It certainly matches the column header correctly.
# But it also matches an instance on this document where the same term is used to refer to something ''different''.
#* This is a false positive match.
|valign=top|
 
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_09.png]]
|-
 
|valign=top|
 
A row of header labels tends to be more specific (and requires more specific extraction logic).


Line 593: Line 526:




This is, to be sure, a more specific, and therefore more accurate, extractor. However, you shouldn't assume more accurate is always "necessary". In this case, the false positive did not impact our table whatsoever. So, while the '''''Header Row Extractor''''' is technically more accurate, our '''Data Table''' would have returned accurate data using '''Data Column''' headers alone (even with the false positive match).
* While a '''''Header Row Extractor''''' can eliminate false positive column header matches, you only need to go through the trouble of configuring one ''if those false positive matches negatively impact your data extraction''.
|valign=top|
 
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_10.png]]
|}


====== For Fuzzy RegEx ======
The other reason to use a '''''Header Row Extractor''''' has to do with imperfect OCR text data and ''[[Fuzzy RegEx]]''. ''Fuzzy RegEx'' provides a way for regular expression patterns to match in '''Grooper''' when the text data doesn't ''strictly'' match the pattern. The regex pattern <code>Grooper</code> and the character string "Gro0per" differ by just a single character. It is not uncommon for an OCR engine to misread an "o" character as a zero, but a standard regex pattern of <code>Grooper</code> will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.

Using ''Fuzzy RegEx'' instead of regular regex, '''Grooper''' will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) '''Grooper''' will return it as a match.
* FYI "similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. '''Grooper''' is more or less "confident" the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing (One could argue there is a difference between these two terms when '''''Fuzzy Match Weightings''''' come into play, but that's a whole different topic. And you may encounter '''Grooper''' users who use the terms "similarity" and "confidence" interchangeably regardless. Visit the [[Fuzzy RegEx]] article if you would like to learn more).
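To make the notion of a similarity threshold concrete, the sketch below scores two strings by edit distance and accepts a match when the score clears a threshold. This is an assumption-laden illustration: it compares a literal string rather than a true regex pattern, it uses plain Levenshtein distance with equal costs, and the <code>similarity</code> formula is a common convention rather than Grooper's documented scoring.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or swaps."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep when equal)
            ))
        previous = current
    return previous[-1]


def similarity(pattern: str, text: str) -> float:
    """1.0 for identical strings, lower as more edits are needed."""
    longest = max(len(pattern), len(text))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(pattern, text) / longest


def fuzzy_match(pattern: str, text: str, threshold: float) -> bool:
    """Accept the text when its similarity to the pattern meets the threshold."""
    return similarity(pattern, text) >= threshold


score = similarity("Grooper", "Gro0per")
print(round(score, 3))
print(fuzzy_match("Grooper", "Gro0per", threshold=0.90))
print(fuzzy_match("Grooper", "Gro0per", threshold=0.80))
```

With one wrong character out of seven, the score under this formula is roughly 86%, so the string fails a 90% threshold but passes an 80% one. Short labels are penalized more heavily per misread character, since each edit is a larger share of the total length.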


Using ''Fuzzy RegEx'' instead of regular regex, '''Grooper''' will evaluate the difference between the regex pattern and the string.  If it's similar enough (if it falls within a percentage similarity threshold) '''Grooper''' will return it as a match.
* FYI "similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results.  '''Grooper''' is more or less "confident" the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data.  A similarity of 90% and a confidence score of 90% are functionally the same thing (One could argue there is a difference between these two terms when '''''Fuzzy Match Weightings''''' come into play, but that's a whole different topic.  And you may encounter '''Grooper''' users who use the terms "similarity" and "confidence" interchangeably regardless.  Visit the [[Fuzzy RegEx]] article if you would like to learn more).
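The contrast between strict and fuzzy matching can be sketched in a few lines of Python. This is only an illustration, not '''Grooper's''' scoring algorithm: it assumes similarity is one minus the edit distance divided by the pattern length, and the <code>similarity</code> helper and the 80% threshold are hypothetical names and values chosen for the example.

```python
import re

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character inserts, deletes, or swaps."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character from b
                            prev[j - 1] + cost))  # swap (or keep) a character
        prev = curr
    return prev[-1]

def similarity(pattern: str, text: str) -> float:
    """Assumed scoring for this sketch: 1 - (edit distance / pattern length)."""
    return 1 - edit_distance(pattern, text) / len(pattern)

# A strict regex match fails because of the zero.
strict_match = re.fullmatch("Grooper", "Gro0per")

# A fuzzy comparison scores the one-character difference instead of rejecting it.
score = similarity("Grooper", "Gro0per")   # one swap out of seven characters
fuzzy_match = score >= 0.80                # hypothetical similarity threshold
```

Under these assumptions the strict match returns nothing, while the fuzzy comparison scores roughly 86% and clears the hypothetical threshold.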


{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
<br>
Let's go back to the '''''List Match''''' extractor for our "Line Total" '''Data Column's''' '''''Header Extractor'''''.
Let's go back to the '''''List Match''''' extractor for our "Line Total" '''Data Column's''' '''''Header Extractor'''''.


Line 615: Line 544:
# Why not?  This is due to imperfect OCR results.
# Why not?  This is due to imperfect OCR results.
#* The label <code>TOTAL</code> was misrecognized as <code>TOFAL</code>.
#* The label <code>TOTAL</code> was misrecognized as <code>TOFAL</code>.
|valign=top|
 
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_11.png]]
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_11.png]]
|-
 
|valign=top|
 
We can certainly get this label to match with Fuzzy RegEx, but only at a fairly low similarity.
We can certainly get this label to match with Fuzzy RegEx, but only at a fairly low similarity.


Line 624: Line 553:
# We do get our header label returned.
# We do get our header label returned.
# But it's at a confidence score of 86%.
# But it's at a confidence score of 86%.
#* This score may be too low. It's not causing a problem for ''this'' document, but it may pose issues for others.
#* This score may be too low. It's not causing a problem for ''this'' document, but it may pose issues for others.
|valign=top|
 
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_12.png]]
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_12.png]]
|-
|valign=top|
<br>
The similarity score is so low because "TOTAL" is a relatively short word, only five characters long. The more character swaps '''Grooper''' has to make to match a word, the lower its confidence in the match, and a single swap is a large share of a five-character word.


An entire row of headers, on the other hand, has ''many more'' characters in it. The cost of swapping a single character across the entire row of headers is far smaller, to the point of being negligible.

The similarity score is so low because "TOTAL" is a relatively short word, only five characters long. The more character swaps '''Grooper''' has to make to match a word, the lower its confidence in the match, and a single swap is a large share of a five-character word.

An entire row of headers, on the other hand, has ''many more'' characters in it. The cost of swapping a single character across the entire row of headers is far smaller, to the point of being negligible.


# This '''Value Reader''' is designed to match the whole header row for this invoice format.
# This '''Value Reader''' is designed to match the whole header row for this invoice format.
Line 638: Line 566:
# The whole header row matches at a much higher confidence score of 98%.
# The whole header row matches at a much higher confidence score of 98%.


|valign=top|
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_13.png]]
[[File:2023_TabularLayout_005_Alternative-Configuration---Header-Row-Extractor_13.png]]
|}
</tab>
:[[#Tabular Layout Without Label Sets|Click here to return to the top]]
</tabs>


==== Disabling Data Columns for Specific Document Types ====
==== Disabling Data Columns for Specific Document Types ====


Occasionally, you will run into a situation where you want to collect a column that exists for some document formats but not for others. You will need to utilize '''Data Element Overrides''' to account for this.
Occasionally, you will run into a situation where you want to collect a column that exists for some document formats but not for others. You will need to utilize '''Data Element Overrides''' to account for this.


For example, some of these invoices list a "unit of measure". The customer is invoiced for "1 each" of a product or "2 hours" of a service. "Each" or "hours" is the unit of measure. However, not ''all'' invoices have a column for this in their line items. You may want to collect the unit of measure if the column is present. So, you would add a "Unit" '''Data Column''' and configure its '''''Header Extractor''''' and, if necessary, '''''Value Extractor''''' properties.
For example, some of these invoices list a "unit of measure". The customer is invoiced for "1 each" of a product or "2 hours" of a service. "Each" or "hours" is the unit of measure. However, not ''all'' invoices have a column for this in their line items. You may want to collect the unit of measure if the column is present. So, you would add a "Unit" '''Data Column''' and configure its '''''Header Extractor''''' and, if necessary, '''''Value Extractor''''' properties.
 
But obviously, you can't collect it from documents where there is no "unit of measure" column. The "Factura" '''Document Type''' is one such vendor who does ''not'' list a unit of measure. You would need to ''remove'' the '''''Header Extractor''''' in the '''Document Type's''' "Overrides" panel.


But obviously, you can't collect it from documents where there is no "unit of measure" column.  The "Factura" '''Document Type''' is one such vendor who does ''not'' list a unit of measure.  You would need to ''remove'' the '''''Header Extractor''''' in the '''Document Type's''' "Overrides" panel.


{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
<br>
# Here, we've selected the "Factura" '''Document Type'''.
# Here, we've selected the "Factura" '''Document Type'''.
# Navigate to the "Overrides" tab to configure '''Data Element Overrides'''.
# Navigate to the "Overrides" tab to configure '''Data Element Overrides'''.
Line 661: Line 582:
#* In this case, since the "Unit" column does not exist for the "Factura" '''Document Type''' we are removing its '''''Header Extractor'''''.
#* In this case, since the "Unit" column does not exist for the "Factura" '''Document Type''' we are removing its '''''Header Extractor'''''.
# Change the '''''Header Extractor''''' property to ''(none)''.
# Change the '''''Header Extractor''''' property to ''(none)''.
# <span style="color:white; background-color: #36b0a7; padding: 3px">'''FYI'''</span> It would be beneficial to turn this '''Data Column's''' '''''Visible''''' property to ''False'' in this case. This would not affect extraction, but it would remove the column from a data reviewer's sight.
# <span style="color:white; background-color: #36b0a7; padding: 3px">'''FYI'''</span> It would be beneficial to turn this '''Data Column's''' '''''Visible''''' property to ''False'' in this case. This would not affect extraction, but it would remove the column from a data reviewer's sight.




By removing the absent column's '''''Header Extractor''''' Grooper is no longer looking for a header that is not there!  The table will then extract successfully.
By removing the absent column's '''''Header Extractor''''' Grooper is no longer looking for a header that is not there!  The table will then extract successfully.
|valign=top|
 
[[File:2023_TabularLayout_006_Disabling-Data-Columns-for-Specific-Document-Types.png]]
[[File:2023_TabularLayout_006_Disabling-Data-Columns-for-Specific-Document-Types.png]]
|}


=== Tabular Layout With Label Sets ===
=== Tabular Layout With Label Sets ===
 
==== Overview ====
<tabs style="margin:20px">
This tutorial will cover the basic configuration of the '''''Tabular Layout''''' method ''with'' Label Sets, using a '''''Labeling Behavior''''' to collect column headers. We will use invoices for our document set and collect the following data from their tables detailing line item information:
<tab name="Overview" style="margin:20px">
=== Overview ===
 
This tutorial will cover the basic configuration of the '''''Tabular Layout''''' method ''with'' Label Sets, using a '''''Labeling Behavior''''' to collect column headers. We will use invoices for our document set and collect the following data from their tables detailing line item information:
* Item Number - The vendor's id number for the item ordered for each row.
* Item Number - The vendor's id number for the item ordered for each row.
* Description - The description of each item ordered for each row.
* Description - The description of each item ordered for each row.
Line 682: Line 598:
* Line Total - The total price for the number of items ordered (In other words, the quantity ordered multiplied by the unit price)
* Line Total - The total price for the number of items ordered (In other words, the quantity ordered multiplied by the unit price)


{|cellpadding=10 cellspacing=5
 
|valign=top style="width:40%"|
The basic steps will be as follows:
The basic steps will be as follows:


Line 691: Line 606:
#** It is also considered best practice to do so when using Label Sets to configure '''''Tabular Layout'''''.
#** It is also considered best practice to do so when using Label Sets to configure '''''Tabular Layout'''''.
# Assign a '''''Value Extractor''''' for at least one '''Data Column'''.
# Assign a '''''Value Extractor''''' for at least one '''Data Column'''.
#* For example, we may expect to find a quantity for each item shipped on an invoice, regardless of the vendor. There's always a column with a "Quantity" or "QTY" or "Shipped" or some similar header.
#* For example, we may expect to find a quantity for each item shipped on an invoice, regardless of the vendor. There's always a column with a "Quantity" or "QTY" or "Shipped" or some similar header.
#* Since this data is also present on ''every row'', this will provide the information necessary to find each row in the table.
#* Since this data is also present on ''every row'', this will provide the information necessary to find each row in the table.
#* While you need ''at least'' one '''Data Column's''' '''''Value Extractor''''' configured to detect rows, multiple columns may be used to detect rows.
#* While you need ''at least'' one '''Data Column's''' '''''Value Extractor''''' configured to detect rows, multiple columns may be used to detect rows.  
#** Furthermore, a '''Data Column's''' '''''Value Extractor''''' will perform either "Primary Extraction" to detect rows or "Secondary Extraction" to extract data from already detected rows. We will discuss using multiple columns to detect rows and the differences between "Primary" and "Secondary Extraction" in the [[#Advanced Setup Considerations]] section of this article.
#** Furthermore, a '''Data Column's''' '''''Value Extractor''''' will perform either "Primary Extraction" to detect rows or "Secondary Extraction" to extract data from already detected rows. We will discuss using multiple columns to detect rows and the differences between "Primary" and "Secondary Extraction" in the [[#Advanced Setup Considerations]] section of this article.
|valign=top|
 
[[File:2023_TabularLayout_007_Tabular-Layout-with-Label-Sets---Overview_01.png]]
[[File:2023_TabularLayout_007_Tabular-Layout-with-Label-Sets---Overview_01.png]]
|-
 
|valign=top|
 
<br>
#<li value=3> Set the '''Data Table''' object's '''''Extract Method''''' property to ''Tabular Layout''.</li>
#<li value=3> Set the '''Data Table''' object's '''''Extract Method''''' property to ''Tabular Layout''.</li>
#* And configure any '''''Tabular Layout''''' properties as needed. We will discuss many of these properties, and why and how to use them, in the [[#Advanced Setup Considerations]] section of this article.
#* And configure any '''''Tabular Layout''''' properties as needed. We will discuss many of these properties, and why and how to use them, in the [[#Advanced Setup Considerations]] section of this article.
# Test to ensure the table's data is collected.
# Test to ensure the table's data is collected.
|valign=top|
 
[[File:2023_TabularLayout_007_Tabular-Layout-with-Label-Sets---Overview_02.png]]
[[File:2023_TabularLayout_007_Tabular-Layout-with-Label-Sets---Overview_02.png]]
|}


In a perfect world, you're done at that point.  As you can see in this example, we've populated a table.  Data is collected for all four '''Data Columns''' for each row on the document.


However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the [[#Advanced Setup Considerations]] section below.
In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for all four '''Data Columns''' for each row on the document.
</tab>
 
<tab name="1. Collect Column Labels" style="margin:20px">
However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the [[#Advanced Setup Considerations]] section below.


=== 1. Collect Column Labels ===
==== 1. Collect Column Labels ====


''The following tutorial will presume you have general familiarity with collecting labels. See the [[Label Sets]] article for a full explanation of how to collect labels for Document Types in a Content Model.''
''The following tutorial will presume you have general familiarity with collecting labels. See the [[Label Sets]] article for a full explanation of how to collect labels for Document Types in a Content Model.''


As far as ''strict'' requirements go for the '''''Tabular Layout''''' method, you must ''at minimum'' establish column headers for each '''Data Column''' you wish to extract.
As far as ''strict'' requirements go for the '''''Tabular Layout''''' method, you must ''at minimum'' establish column headers for each '''Data Column''' you wish to extract.
Line 722: Line 634:
* FYI:  If the invoice lists both a "quantity ordered" and a "quantity shipped" column, we will be collecting the quantity ''shipped''.
* FYI:  If the invoice lists both a "quantity ordered" and a "quantity shipped" column, we will be collecting the quantity ''shipped''.


{|cellpadding=10 cellspacing=5
 
|valign=top style="width:40%"|
<br>
For this "Fairdeal" '''Document Type''', one column header label has been collected for each of the five '''Data Column''' children of the "Line Items" Data Table.
For this "Fairdeal" '''Document Type''', one column header label has been collected for each of the five '''Data Column''' children of the "Line Items" Data Table.


Line 735: Line 645:




As far as strict requirements go for establishing header columns, you're done at this point. You would then repeat this same process for every '''Document Type''' in your '''Content Model'''.
As far as strict requirements go for establishing header columns, you're done at this point. You would then repeat this same process for every '''Document Type''' in your '''Content Model'''.
|valign=top|
 
[[File:2023_TabularLayout_008_Tabular-Layout-with-Label-Sets---Collect-Column-Labels_01.png]]
[[File:2023_TabularLayout_008_Tabular-Layout-with-Label-Sets---Collect-Column-Labels_01.png]]
|}


=== Best Practice:  Collect a Header Row Label for the Data Table ===
 
===== Best Practice:  Collect a Header Row Label for the Data Table =====


You may optionally collect a label for the entire row of column header labels (aka the "header row label"). This label is collected for the parent '''Data Table''' object's label.
You may optionally collect a label for the entire row of column header labels (aka the "header row label"). This label is collected for the parent '''Data Table''' object's label.
{|cellpadding=10 cellspacing=5
 
|valign=top style="width:40%"|
 
<br>
# We've collected the label <code>DESCRIPTION ITEM NO HRS / QTY PER RATE / PRICE SUBTOTAL</code> for the "Line Items" '''Data Table'''.
# We've collected the label <code>DESCRIPTION ITEM NO HRS / QTY PER RATE / PRICE SUBTOTAL</code> for the "Line Items" '''Data Table'''.




It is considered best practice to capture a header row label for the '''Data Table'''. But if it's optional, why do it? What is the benefit of this label?
It is considered best practice to capture a header row label for the '''Data Table'''. But if it's optional, why do it? What is the benefit of this label?
|valign=top|
 
[[File:2023_TabularLayout_008_Tabular-Layout-with-Label-Sets---Collect-Column-Labels_02.png]]
[[File:2023_TabularLayout_008_Tabular-Layout-with-Label-Sets---Collect-Column-Labels_02.png]]
|-
|valign=top colspan=2|
=== Why Bother? ===


===== Why Bother? =====
There are two main reasons why a header row label can be beneficial:
There are two main reasons why a header row label can be beneficial:


# To throw out false positive column header matches
# To throw out false positive column header matches
# To better match column headers with poor OCR using Fuzzy RegEx.
# To better match column headers with poor OCR using Fuzzy RegEx.
|-
|valign=top|
=== To Throw Out False Positives ===


====== To Throw Out False Positives ======
The first reason to collect a header row label is to help eliminate false positive column header matches.
The first reason to collect a header row label is to help eliminate false positive column header matches.


# Take our "Line Total" '''Data Column's''' label <code>SUBTOTAL</code>.
# Take our "Line Total" '''Data Column's''' label <code>SUBTOTAL</code>.
# Without the '''Data Table's''' header row label, this label would ''also'' produce a match.
# Without the '''Data Table's''' header row label, this label would ''also'' produce a match.
#* This is a false positive match. This is an instance on this document where the same term is used to refer to something different.
#* This is a false positive match. This is an instance on this document where the same term is used to refer to something different.
# With the header row label, only the actual label for the column matches.
# With the header row label, only the actual label for the column matches.
#* Another way of putting it:  The '''Data Column''' header labels will only match if they are part of the larger '''Data Table''' header row label.
#* Another way of putting it:  The '''Data Column''' header labels will only match if they are part of the larger '''Data Table''' header row label.
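The filtering idea in the steps above can be sketched in Python. This is a simplified illustration under stated assumptions: it uses exact substring checks instead of fuzzy matching, the <code>match_column_label</code> function is hypothetical, and the sample page lines other than the header row label are made up for the example.

```python
from typing import List, Optional

def match_column_label(lines: List[str], column_label: str,
                       header_row_label: Optional[str] = None) -> List[int]:
    """Return the indexes of lines where the column label counts as a match.

    When a header row label is supplied, the column label only counts
    if it sits on a line that also contains the whole header row label.
    """
    matches = []
    for index, line in enumerate(lines):
        if column_label not in line:
            continue
        if header_row_label is not None and header_row_label not in line:
            continue  # same word, but not part of the table's header row
        matches.append(index)
    return matches

header_row = "DESCRIPTION ITEM NO HRS / QTY PER RATE / PRICE SUBTOTAL"
page = [
    header_row,                            # the table's header row
    "Consulting 1001 2 hr 50.00 100.00",   # hypothetical line item
    "SUBTOTAL 100.00",                     # hypothetical summary line
]

without_row_label = match_column_label(page, "SUBTOTAL")
with_row_label = match_column_label(page, "SUBTOTAL", header_row)
```

Without the header row label, both the real column header and the summary line match. With it, only the header row survives.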
|valign=top|
 
[[File:2023_TabularLayout_008_Tabular-Layout-with-Label-Sets---Collect-Column-Labels_03.png]]
[[File:2023_TabularLayout_008_Tabular-Layout-with-Label-Sets---Collect-Column-Labels_03.png]]
|-
|valign=top colspan=2|
=== For Fuzzy RegEx ===


The other reason to collect a header row label has to do with imperfect OCR text data and ''[[Fuzzy RegEx]]''. ''Fuzzy RegEx'' provides a way for regular expression patterns to match in '''Grooper''' when the text data doesn't ''strictly'' match the pattern. The difference between the regex pattern <code>Grooper</code> and the character string "Gro0per" is just off by a single character. An OCR engine misreading an "o" character for a zero is not uncommon by any means, but a standard regex pattern of <code>Grooper</code> will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.
====== For Fuzzy RegEx ======
The other reason to collect a header row label has to do with imperfect OCR text data and ''[[Fuzzy RegEx]]''. ''Fuzzy RegEx'' provides a way for regular expression patterns to match in '''Grooper''' when the text data doesn't ''strictly'' match the pattern. The difference between the regex pattern <code>Grooper</code> and the character string "Gro0per" is just off by a single character. An OCR engine misreading an "o" character for a zero is not uncommon by any means, but a standard regex pattern of <code>Grooper</code> will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.
 
Using ''Fuzzy RegEx'' instead of regular regex, '''Grooper''' will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) '''Grooper''' will return it as a match.
* FYI: "Similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. '''Grooper''' is more or less "confident" the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing (One could argue there is a difference between these two terms when '''''Fuzzy Match Weightings''''' come into play, but that's a whole different topic. And you may encounter '''Grooper''' users who use the terms "similarity" and "confidence" interchangeably regardless. Visit the [[Fuzzy RegEx]] article if you would like to learn more).
 


Using ''Fuzzy RegEx'' instead of regular regex, '''Grooper''' will evaluate the difference between the regex pattern and the string.  If it's similar enough (if it falls within a percentage similarity threshold) '''Grooper''' will return it as a match.
* FYI: "Similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results.  '''Grooper''' is more or less "confident" the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data.  A similarity of 90% and a confidence score of 90% are functionally the same thing (One could argue there is a difference between these two terms when '''''Fuzzy Match Weightings''''' come into play, but that's a whole different topic.  And you may encounter '''Grooper''' users who use the terms "similarity" and "confidence" interchangeably regardless.  Visit the [[Fuzzy RegEx]] article if you would like to learn more).
|-
|valign=top|
<br>
So how does this apply to the '''Data Table's''' header row label? The short answer is it provides a way to increase the accuracy of '''Data Column''' header labels by "boosting" the similarity of the label to imperfect OCR results.
So how does this apply to the '''Data Table's''' header row label? The short answer is it provides a way to increase the accuracy of '''Data Column''' header labels by "boosting" the similarity of the label to imperfect OCR results.


Line 793: Line 694:
# OCR made some missteps and recognized that segment as <code>TOFAL</code>.
# OCR made some missteps and recognized that segment as <code>TOFAL</code>.
#* The second "T" in "TOTAL" was recognized as an "F" character.
#* The second "T" in "TOTAL" was recognized as an "F" character.
#* This means "TOTAL" (the expected label) is one character's difference from "TOFAL" (the actual text data). Or, "TOFAL" is 80% similar to "TOTAL".
#* This means "TOTAL" (the expected label) is one character's difference from "TOFAL" (the actual text data). Or, "TOFAL" is 80% similar to "TOTAL".
#* The '''''Labeling Behavior's''''' similarity threshold is set to 90% for this '''Content Model'''. 80% is less than 90%. So, the result is thrown out.
#* The '''''Labeling Behavior's''''' similarity threshold is set to 90% for this '''Content Model'''. 80% is less than 90%. So, the result is thrown out.
#* FYI:  This threshold is configured when the '''''Labeling Behavior''''' is added, using the '''''Behaviors''''' property of a Content Model. The '''''Label Similarity''''' property is set to ''90%'' by default, but can be adjusted at any time.
#* FYI:  This threshold is configured when the '''''Labeling Behavior''''' is added, using the '''''Behaviors''''' property of a Content Model. The '''''Label Similarity''''' property is set to ''90%'' by default, but can be adjusted at any time.
Line 799: Line 700:


As we will see, capturing the full row of column header labels will boost the similarity, allowing the label to match without altering the '''''Labeling Behavior's''''' fuzzy match settings.
As we will see, capturing the full row of column header labels will boost the similarity, allowing the label to match without altering the '''''Labeling Behavior's''''' fuzzy match settings.
|valign=top|
 
[[File:2023_TabularLayout_008_Tabular-Layout-with-Label-Sets---Collect-Column-Labels_04.png]]
[[File:2023_TabularLayout_008_Tabular-Layout-with-Label-Sets---Collect-Column-Labels_04.png]]
|-
 
|valign=top|
 
<br>
# Here, we've collected a header row label for the '''Data Table'''.
# Here, we've collected a header row label for the '''Data Table'''.
#<li value = 2> Now the "Line Total" '''Data Column's''' label matches!  MAGIC!</li>
#<li value = 2> Now the "Line Total" '''Data Column's''' label matches!  MAGIC!</li>


Not magic. Just math.
Not magic. Just math.
Line 815: Line 716:
* If the label can be matched as a part of the larger whole, its confidence score goes up much further than by itself.  
* If the label can be matched as a part of the larger whole, its confidence score goes up much further than by itself.  
* The '''Data Table's''' larger label of the full row of column labels gives extra context to the "Line Total" '''Data Column''' label, providing more information about what is and is not an appropriate match.
* The '''Data Table's''' larger label of the full row of column labels gives extra context to the "Line Total" '''Data Column''' label, providing more information about what is and is not an appropriate match.
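The arithmetic behind this boost can be sketched in Python. The sketch assumes OCR only swapped characters (no characters were added or dropped), so similarity is simply the share of positions that agree. The header row text is hypothetical, and '''Grooper's''' actual scoring may differ.

```python
def substitution_similarity(expected: str, actual: str) -> float:
    """Share of character positions that agree (equal-length strings only)."""
    if len(expected) != len(actual):
        raise ValueError("this sketch only handles equal-length strings")
    same = sum(e == a for e, a in zip(expected, actual))
    return same / len(expected)

THRESHOLD = 0.90  # the 90% Label Similarity setting discussed above

# The column label on its own: 4 of 5 characters agree.
column_score = substitution_similarity("TOTAL", "TOFAL")

# A hypothetical full header row containing the same OCR error:
# 31 of 32 characters agree.
expected_row = "ITEM DESCRIPTION QTY PRICE TOTAL"
ocr_row = "ITEM DESCRIPTION QTY PRICE TOFAL"
row_score = substitution_similarity(expected_row, ocr_row)

column_passes = column_score >= THRESHOLD
row_passes = row_score >= THRESHOLD
```

The same one-character error costs 20% of a five-character label but only about 3% of a 32-character row, which is why the row clears the threshold while the lone label does not.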
|valign=top|
 
[[File:2023_TabularLayout_008_Tabular-Layout-with-Label-Sets---Collect-Column-Labels_05.png]]
[[File:2023_TabularLayout_008_Tabular-Layout-with-Label-Sets---Collect-Column-Labels_05.png]]
|}
 


So why is it considered best practice to capture a header row label for the '''Data Table'''? OCR errors are unpredictable.  
So why is it considered best practice to capture a header row label for the '''Data Table'''? OCR errors are unpredictable.  


The set of examples you worked with when architecting this solution may have been fairly clean with good OCR reads. Maybe it didn't seem like you needed a '''Data Table''' label at the time, but that may not always be the case. Capturing a '''Data Table''' label for the header row will act as a safety net to avoid unforeseen problems in the future.
The set of examples you worked with when architecting this solution may have been fairly clean with good OCR reads. Maybe it didn't seem like you needed a '''Data Table''' label at the time, but that may not always be the case. Capturing a '''Data Table''' label for the header row will act as a safety net to avoid unforeseen problems in the future.
</tab>
<tab name="2. Assign a Data Column's Value Extractor" style="margin:20px">


=== 2. Assign a Data Column's Value Extractor ===
===== 2. Assign a Data Column's Value Extractor =====
This step is all about '''''row detection'''''.


This step is all about '''''row detection'''''.
So far all we've done is established header column positions on each document. But, that's not where the data is. The table's data is in the ''rows''.


So far all we've done is established header column positions on each document. But, that's not where the data is. The table's data is in the ''rows''.
As it stands, Grooper doesn't know anything about the rows in the tables. It doesn't know the size of each row. It doesn't know what kind of data is supposed to be in the rows. Maybe most importantly, it doesn't know ''how many'' rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this.


As it stands, Grooper doesn't know anything about the rows in the tables.  It doesn't know the size of each row.  It doesn't know what kind of data is supposed to be in the rows.  Maybe most importantly, it doesn't know ''how many'' rows there are.  Tables tend to be dynamic.  They may have 3 rows on one document and 300 on the next.  Grooper needs a way of detecting this.


{|cellpadding=10 cellspacing=5
To detect rows, we need at least one '''Data Column's''' '''''Value Extractor''''' property configured. For each result the extractor produces below the column's header, Grooper will create one row instance.
|valign=top style="width:40%"|
To detect rows, we need at least one '''Data Column's''' '''''Value Extractor''''' property configured. For each result the extractor produces below the column's header, Grooper will create one row instance.


The key thing to keep in mind is this data ''must'' be present on every row. You'll want to pick a column whose data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row.
The key thing to keep in mind is this data ''must'' be present on every row. You'll want to pick a column whose data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row.


In our case, we will choose the "Quantity" '''Data Column'''. We always expect (for the time being anyway) there to be a quantity listed for the line item on the invoice.
In our case, we will choose the "Quantity" '''Data Column'''. We always expect (for the time being anyway) there to be a quantity listed for the line item on the invoice.


# We will use this '''Value Reader''' for our demonstration.
# We will use this '''Value Reader''' for our demonstration.
#* However, in the real world, the extraction world is your oyster. You'll configure an extractor to best target the data in whatever table column you're trying to extract.
#* However, in the real world, the extraction world is your oyster. You'll configure an extractor to best target the data in whatever table column you're trying to extract.
# This is a fairly simple '''''Pattern Match''''' extractor designed to return numeric data (including currency).
# This is a fairly simple '''''Pattern Match''''' extractor designed to return numeric data (including currency).
# The regex is a fairly simple pattern to match generic quantities.
# The regex is a fairly simple pattern to match generic quantities.
#* It'll match decimal values from 0 and above with two decimal places optional.
#* It'll match decimal values from 0 and above with two decimal places optional.
# We've also edited our '''''Prefix''''' and '''''Suffix Patterns''''' so that the pattern must be surrounded by a space character before and after, with an optional dollar sign before the number.
# We've also edited our '''''Prefix''''' and '''''Suffix Patterns''''' so that the pattern must be surrounded by a space character before and after, with an optional dollar sign before the number.
# As you can see, we get five results below the "Quantity" label.
# As you can see, we get five results below the "Quantity" label.  
#* When we assign this '''Value Reader''' to the "Quantity" '''Data Column''', we should then get five rows when this table extracts.
#* When we assign this '''Value Reader''' to the "Quantity" '''Data Column''', we should then get five rows when this table extracts.
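For readers who want to see what such a pattern might look like, here is a rough Python approximation. The article does not show the '''Value Reader's''' actual regex, so the pattern below is an assumption built only from the description above: a non-negative number with two optional decimal places, preceded by whitespace and an optional dollar sign, and followed by whitespace. Python lookarounds stand in for the '''''Prefix''''' and '''''Suffix Patterns'''''.

```python
import re
from typing import List

# Assumed stand-in for the Value Reader's patterns, not the real configuration.
QUANTITY = re.compile(
    r"(?<=\s)"            # prefix: a whitespace character must come first
    r"\$?"                # prefix: optional dollar sign, not returned
    r"(\d+(?:\.\d{2})?)"  # value: digits, optionally followed by two decimals
    r"(?=\s)"             # suffix: a whitespace character must follow
)

def find_quantities(line: str) -> List[str]:
    """Return every generic numeric value found on one line of text."""
    return QUANTITY.findall(line)

# Hypothetical line item text.
sample = "Widget A-100 2 $15.00 30.00 "
hits = find_quantities(sample)
```

On the hypothetical sample line, the pattern returns the quantity, the unit price, and the line total, but skips the "100" inside the item number because it is not preceded by whitespace. That breadth is expected from an extractor this generic.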
|valign=top|
 
[[File:2023_TabularLayout_009_Tabular-Layout-with-Label-Sets---Assign-a-Data-Column%27s-Value-Extractor_01.png]]
[[File:2023_TabularLayout_009_Tabular-Layout-with-Label-Sets---Assign-a-Data-Column%27s-Value-Extractor_01.png]]
|-
 
|valign=top|
 
<br>
We do get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data.
We do get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data.


# Will this result present a problem?  Will we get an extra row for its result?
# Will this result present a problem?  Will we get an extra row for its result?
#* No. That result is ''above'' the header label <code>HRS / QTY</code>.
#* No. That result is ''above'' the header label <code>HRS / QTY</code>.  
#* The '''''Tabular Layout''''' method presumes rows are ''below'' column labels. Any and all results above the first instance of the column's headers will be ignored.
#* The '''''Tabular Layout''''' method presumes rows are ''below'' column labels. Any and all results above the first instance of the column's headers will be ignored.
# What about these matching results on the same line?  Will the extra results create additional row instances?
# What about these matching results on the same line?  Will the extra results create additional row instances?
#* No. These results are misaligned with the "Quantity" '''Data Column's''' header. They are too far to the right to be considered under the column header. They will be ignored.
#* No. These results are misaligned with the "Quantity" '''Data Column's''' header. They are too far to the right to be considered under the column header. They will be ignored.
#* Only results aligned with the "Quantity" '''Data Column's''' header will create a row instance.
#* Only results aligned with the "Quantity" '''Data Column's''' header will create a row instance.
# What about these results?  Will they produce a row?
# What about these results?  Will they produce a row?
#* No. These results are also misaligned with the "Quantity" '''Data Column's''' header.
#* No. These results are also misaligned with the "Quantity" '''Data Column's''' header.
#* That said, if these ''were'' aligned with the "Quantity" '''Data Column's''' header, they ''would'' produce row instances.
#* That said, if these ''were'' aligned with the "Quantity" '''Data Column's''' header, they ''would'' produce row instances.
#* When you are building your own '''Data Column''' extractors, pay close attention to results below the column's header. They have the most potential to produce false positive results, producing erroneous rows.
#* When you are building your own '''Data Column''' extractors, pay close attention to results below the column's header. They have the most potential to produce false positive results, producing erroneous rows.
#** That said, there are a multitude of ways to avoid false positive row results when using '''Data Columns'''' '''''Value Extractors''''' to detect rows. We will discuss this more in the [[#Advanced Setup Considerations]] portion of this article.
#** That said, there are a multitude of ways to avoid false positive row results when using '''Data Columns'''' '''''Value Extractors''''' to detect rows. We will discuss this more in the [[#Advanced Setup Considerations]] portion of this article.
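
The two rules above (results must be ''below'' the header, and horizontally aligned with it) can be sketched as a simple geometric filter. This is only an illustration of the idea, not Grooper's implementation. The <code>Box</code> class, the function names, and the coordinates (inches, with <code>top</code> increasing down the page) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical bounding box in inches; 'top' grows down the page."""
    left: float
    top: float
    right: float
    bottom: float


def overlaps_horizontally(a, b):
    """True when the two boxes share any horizontal extent."""
    return a.left < b.right and b.left < a.right


def row_candidates(header, hits):
    """Keep only extractor hits that sit below the header and line up with it."""
    return [
        hit for hit in hits
        if hit.top >= header.bottom and overlaps_horizontally(hit, header)
    ]


header = Box(left=4.0, top=3.0, right=5.0, bottom=3.2)
hits = [
    Box(4.2, 1.0, 4.6, 1.2),  # above the header: ignored
    Box(6.5, 3.5, 7.0, 3.7),  # too far to the right: ignored
    Box(4.3, 3.5, 4.7, 3.7),  # below and aligned: produces a row
]
print(row_candidates(header, hits))
```

Only the last hit survives the filter, which mirrors why the extra numeric results on this document do not create extra rows.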
|
 
[[File:2023_TabularLayout_009_Tabular-Layout-with-Label-Sets---Assign-a-Data-Column%27s-Value-Extractor_02.png]]
[[File:2023_TabularLayout_009_Tabular-Layout-with-Label-Sets---Assign-a-Data-Column%27s-Value-Extractor_02.png]]
|-
 
|valign=top|
 
<br>
With our extractor ready to go, all we need to do is assign it to the "Quantity" '''Data Column''' using its '''''Value Extractor''''' property.
With our extractor ready to go, all we need to do is assign it to the "Quantity" '''Data Column''' using its '''''Value Extractor''''' property.


Line 887: Line 782:
At ''bare minimum'' you must configure at least one '''Data Column's''' '''''Value Extractor''''' to perform row detection.
At ''bare minimum'' you must configure at least one '''Data Column's''' '''''Value Extractor''''' to perform row detection.


However, multiple columns may be used to perform row detection by configuring their corresponding '''Data Columns'''' '''''Value Extractor''''' properties. For more information on using multiple columns in row detection (as well as row detection in general) please visit the <span style="background-color:white; padding:3px">[[#Advanced Row Detection]]</span> section of this article.
However, multiple columns may be used to perform row detection by configuring their corresponding '''Data Columns'''' '''''Value Extractor''''' properties. For more information on using multiple columns in row detection (as well as row detection in general) please visit the <span style="background-color:white; padding:3px">[[#Advanced Row Detection]]</span> section of this article.
|}
|}
|valign=top|
 
[[File:2023_TabularLayout_009_Tabular-Layout-with-Label-Sets---Assign-a-Data-Column%27s-Value-Extractor_03.png]]
[[File:2023_TabularLayout_009_Tabular-Layout-with-Label-Sets---Assign-a-Data-Column%27s-Value-Extractor_03.png]]
|}


So far, we have:
So far, we have:
Line 898: Line 792:
# Configured at least one '''Data Column''' with its '''''Value Extractor''''' configured.
# Configured at least one '''Data Column''' with its '''''Value Extractor''''' configured.


For fairly simple table structures, we now have the two things the '''''Tabular Layout''''' method needs to extract data. Now, all we need to do is tell the '''Data Table''' object we want to use the '''''Tabular Layout''''' method. We do this by setting its '''''Extract Method''''' property to ''Tabular Layout''.
For fairly simple table structures, we now have the two things the '''''Tabular Layout''''' method needs to extract data. Now, all we need to do is tell the '''Data Table''' object we want to use the '''''Tabular Layout''''' method. We do this by setting its '''''Extract Method''''' property to ''Tabular Layout''.
</tab>
<tab name="3. Set Extract Method to Tabular Layout and Test" style="margin:20px">
 
{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
=== 3. Set Extract Method to Tabular Layout ===


A '''Data Table's''' extraction method is set using the '''''Extract Method''''' property. To enable the '''''Tabular Layout''''' method, do the following.
===== 3. Set Extract Method to Tabular Layout =====
A '''Data Table's''' extraction method is set using the '''''Extract Method''''' property. To enable the '''''Tabular Layout''''' method, do the following.


# Select a '''Data Table''' object in your '''Data Model'''.
# Select a '''Data Table''' object in your '''Data Model'''.
Line 912: Line 801:
# Select the '''''Extract Method''''' property.
# Select the '''''Extract Method''''' property.
# Using the dropdown menu, select ''Tabular Layout''.
# Using the dropdown menu, select ''Tabular Layout''.
|
 
[[File:2023_TabularLayout_010_Tabular-Layout-with-Label-Sets---Set-Extract-Method-to-Tabular-Layout-and-Test_01.png]]
[[File:2023_TabularLayout_010_Tabular-Layout-with-Label-Sets---Set-Extract-Method-to-Tabular-Layout-and-Test_01.png]]
|-
|valign=top|
=== 4. Test ===


===== 4. Test =====
Now, let's test out what we have and see what we get!
Now, let's test out what we have and see what we get!


Line 924: Line 811:
# The results show up in the "Data Element Preview" window.
# The results show up in the "Data Element Preview" window.
#* Success!  Our table's data is collected!
#* Success!  Our table's data is collected!
|valign=top|
 
[[File:2023_TabularLayout_010_Tabular-Layout-with-Label-Sets---Set-Extract-Method-to-Tabular-Layout-and-Test_02.png]]
[[File:2023_TabularLayout_010_Tabular-Layout-with-Label-Sets---Set-Extract-Method-to-Tabular-Layout-and-Test_02.png]]
|-
 
|valign=top|
 
<br>
So, how was Grooper able to do this? For the ''Tabular Layout'' method, the '''Data Table''' is populated using primarily two pieces of information: column header locations established by the '''Data Columns'''' labels and row locations detected by a '''Data Column's''' '''''Value Extractor'''''.
So, how was Grooper able to do this? For the ''Tabular Layout'' method, the '''Data Table''' is populated using primarily two pieces of information: column header locations established by the '''Data Columns'''' labels and row locations detected by a '''Data Column's''' '''''Value Extractor'''''.
* Remember, we collected labels for ''all'' '''Data Columns'''. We configured ''only'' the "Quantity" '''Data Column's''' '''''Value Extractor'''''.
* Remember, we collected labels for ''all'' '''Data Columns'''. We configured ''only'' the "Quantity" '''Data Column's''' '''''Value Extractor'''''.


First, it's all about establishing column headers.
First, it's all about establishing column headers.
# The '''Data Columns'''' labels established the column locations for each column.
# The '''Data Columns'''' labels established the column locations for each column.
# Grooper then determines the ''width'' of these columns.
# Grooper then determines the ''width'' of these columns.
#* If table lines are present, Grooper can detect those line locations via a '''Line Detection''' (or '''Line Removal''') '''IP Command'''. Grooper will "snap" the column's width to the detected line boundaries, expanding the cell's width (and height) to the boundaries around it.
#* If table lines are present, Grooper can detect those line locations via a '''Line Detection''' (or '''Line Removal''') '''IP Command'''. Grooper will "snap" the column's width to the detected line boundaries, expanding the cell's width (and height) to the boundaries around it.
#** Table lines give human readers an indicator of where the data "lives" (or is contained). If it's in the box, it belongs to the column. If it's out of the box, it belongs to a different column.
#** Table lines give human readers an indicator of where the data "lives" (or is contained). If it's in the box, it belongs to the column. If it's out of the box, it belongs to a different column.
#* If table lines are ''not'' present (as is the case for this document), Grooper performs a variety of gutter-detection operations, analyzing the whitespace between columns to determine their widths.
#* If table lines are ''not'' present (as is the case for this document), Grooper performs a variety of gutter-detection operations, analyzing the whitespace between columns to determine their widths.
#** ''Most commonly'' Grooper will average the distance between one header label and the next.
 
|valign=top|
[[File:2021-tabular-layout-without-label-sets-14.png]]
[[File:2021-tabular-layout-without-label-sets-14.png]]
|-
 
|valign=top|
 
<br>
Second, it's all about detecting rows. Rows are detected using a '''Data Column's''' '''''Value Extractor'''''.
Second, it's all about detecting rows. Rows are detected using a '''Data Column's''' '''''Value Extractor'''''.
* In our case, we configured the "Quantity" '''Data Column's''' '''''Value Extractor'''''.
* In our case, we configured the "Quantity" '''Data Column's''' '''''Value Extractor'''''.
* FYI:  When a '''Data Column's''' extractor is used to detect rows, it is considered "Primary Extraction". A '''Data Column's''' extractor can also be used for "Secondary Extraction", performed ''after'' rows are detected. For more on this, please visit the [[#Primary VS Secondary Extraction]] section of this article.
* FYI:  When a '''Data Column's''' extractor is used to detect rows, it is considered "Primary Extraction". A '''Data Column's''' extractor can also be used for "Secondary Extraction", performed ''after'' rows are detected. For more on this, please visit the [[#Primary VS Secondary Extraction]] section of this article.


# Rows are only detected below the detecting '''Data Column's''' header.
# Rows are only detected below the detecting '''Data Column's''' header.
Line 952: Line 836:
# For each result returned, Grooper establishes one row instance.
# For each result returned, Grooper establishes one row instance.
#* Since our extractor was designed to return decimal values, and Grooper found five decimal values below our column header, Grooper detected five rows.
#* Since our extractor was designed to return decimal values, and Grooper found five decimal values below our column header, Grooper detected five rows.
|valign=top|
 
[[File:2021-tabular-layout-without-label-sets-15.png]]
[[File:2021-tabular-layout-without-label-sets-15.png]]
|-
 
|valign=top|
 
<br>
The ''Tabular Layout'' method now has the two pieces of information it needs to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like. Grooper can infer the table's grid-like structure using the column and row positions.
The ''Tabular Layout'' method now has the two pieces of information it needs to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like. Grooper can infer the table's grid-like structure using the column and row positions.


# It has column instances for each '''Data Column'''.
# It has column instances for each '''Data Column'''.
Line 963: Line 846:
# It has row instances for each detected row.
# It has row instances for each detected row.
#* Again, established by the detecting '''Data Column's''' '''''Value Extractor'''''.
#* Again, established by the detecting '''Data Column's''' '''''Value Extractor'''''.
#** FYI:  More than one '''Data Column''' can be used to detect rows. Please visit the [[#Advanced Row Detection]] section for more information.
#** FYI:  More than one '''Data Column''' can be used to detect rows. Please visit the [[#Advanced Row Detection]] section for more information.
|valign=top|
 
[[File:2021-tabular-layout-without-label-sets-16.png]]
[[File:2021-tabular-layout-without-label-sets-16.png]]
|-
 
|valign=top|
 
<br>
With these column and row instances established, '''Grooper''' can form data instances for each cell of the table.
With these column and row instances established, '''Grooper''' can form data instances for each cell of the table.


#<li value=3> Each cell's data simply lies where the columns and rows intersect.</li>
#<li value=3> Each cell's data simply lies where the columns and rows intersect.</li>
#* For '''Data Columns''' ''with'' their '''''Value Extractors''''' configured, values are either collected using "Primary" or "Secondary Extraction". Please see the [[#Primary VS Secondary Extraction]] portion for more information.
#* For '''Data Columns''' ''with'' their '''''Value Extractors''''' configured, values are either collected using "Primary" or "Secondary Extraction". Please see the [[#Primary VS Secondary Extraction]] portion for more information.
#* For '''Data Columns''' ''without'' their '''''Value Extractors''''' configured, values are collected by returning the OCR or native text data within the geometric boundaries of the cell.
#* For '''Data Columns''' ''without'' their '''''Value Extractors''''' configured, values are collected by returning the OCR or native text data within the geometric boundaries of the cell.
#** This is ''extremely'' beneficial for data that is difficult to extract using pattern matching.
#** This is ''extremely'' beneficial for data that is difficult to extract using pattern matching.
#** For example, invoice item numbers and descriptions are notoriously difficult to pattern match. By using something in the table that ''is'' easy to pattern match, like our item quantities, we can use '''''Tabular Layout''''' to model the table structure and collect the other column values that are ''not''.
#** For example, invoice item numbers and descriptions are notoriously difficult to pattern match. By using something in the table that ''is'' easy to pattern match, like our item quantities, we can use '''''Tabular Layout''''' to model the table structure and collect the other column values that are ''not''.
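
The idea of reading a cell's text from the intersection of a column and a row can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Grooper's actual logic: words are represented by hypothetical center points in inches, columns by left and right edges, and rows by top and bottom edges. The names <code>cell_text</code> and <code>build_table</code> are invented for the example.

```python
def cell_text(words, col_left, col_right, row_top, row_bottom):
    """Join the words whose center point falls inside the cell's boundaries."""
    inside = [
        w for w in words
        if col_left <= w["x"] < col_right and row_top <= w["y"] < row_bottom
    ]
    inside.sort(key=lambda w: (w["y"], w["x"]))  # reading order
    return " ".join(w["text"] for w in inside)


def build_table(words, columns, rows):
    """Return one dict per row, keyed by column name."""
    return [
        {
            name: cell_text(words, left, right, top, bottom)
            for name, (left, right) in columns.items()
        }
        for top, bottom in rows
    ]


# Hypothetical page text and geometry.
words = [
    {"text": "Blue", "x": 1.2, "y": 3.6}, {"text": "widget", "x": 1.8, "y": 3.6},
    {"text": "3.00", "x": 4.5, "y": 3.6},
    {"text": "Red", "x": 1.2, "y": 3.9}, {"text": "gear", "x": 1.7, "y": 3.9},
    {"text": "1.00", "x": 4.5, "y": 3.9},
]
columns = {"Description": (1.0, 4.0), "Quantity": (4.0, 5.0)}
rows = [(3.5, 3.8), (3.8, 4.1)]  # detected from the "Quantity" hits

for row in build_table(words, columns, rows):
    print(row)
```

Note that the "Description" values are never pattern matched in this sketch. They are picked up purely because they fall inside the cell boundaries, which is the benefit described above for columns that are hard to target with an extractor.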
|
 
[[File:2021-tabular-layout-without-label-sets-17.png]]
[[File:2021-tabular-layout-without-label-sets-17.png]]
|}
</tab>
:[[#Tabular Layout Without Label Sets|Click here to return to the top]]
</tabs>


==== Label Padding ====
==== Label Padding ====
When collecting labels for '''Data Columns''', the physical width of the label will help establish the width of the column. Grooper uses a variety of information on the page to establish the width of a column, such as the distance between column labels, whitespace gutters between the text in columns, and line location data stored in a page's layout data.


When collecting labels for '''Data Columns''', the physical width of the label will help establish the width of the column. Grooper uses a variety of information on the page to establish the width of a column, such as the distance between column labels, whitespace gutters between the text in columns, and line location data stored in a page's layout data.
However, Grooper doesn't ''always'' get things right. In these cases, you can manually adjust the width of a column using the '''''Padding''''' properties of the '''Data Column's''' '''''Header''''' label.


However, Grooper doesn't ''always'' get things right.  In these cases, you can manually adjust the width of a column using the '''''Padding''''' properties of the '''Data Column's''' '''''Header''''' label.


{|cellpadding=10 cellspacing=5
For example, take this line items table. Imagine we're using the "Line Total" column for row detection.
|valign=top style="width:50%"|
<br>
For example, take this line items table. Imagine we're using the "Line Total" column for row detection.


# If the column instance is limited to the width of label <code>Line Total</code>, the "Line Total" '''Data Column's''' extractor will ''never'' return a result. No text falls within the boundaries of the column.
# If the column instance is limited to the width of label <code>Line Total</code>, the "Line Total" '''Data Column's''' extractor will ''never'' return a result. No text falls within the boundaries of the column.
# The values for the column are misaligned with the column's header.
# The values for the column are misaligned with the column's header.
|valign=top|
 
[[File:2021-tabular-layout-padding-01.png]]
[[File:2021-tabular-layout-padding-01.png]]
|-
 
|valign=top|
 
<br>
Under normal circumstances, we simply couldn't use this column for row detection.
Under normal circumstances, we simply couldn't use this column for row detection.


# However, using the '''''Padding''''' property, we can adjust the size of a '''Data Element's''' label (in this case the '''Data Column's''' '''''Header''''' label).
# However, using the '''''Padding''''' property, we can adjust the size of a '''Data Element's''' label (in this case the '''Data Column's''' '''''Header''''' label).
# This will adjust the width of the column instance, aligning the column's values within the boundaries of the column, allowing this column to be used for row detection.
# This will adjust the width of the column instance, aligning the column's values within the boundaries of the column, allowing this column to be used for row detection.
|valign=top|
 
[[File:2021-tabular-layout-padding-02.png]]
[[File:2021-tabular-layout-padding-02.png]]
|}


{|cellpadding=10 cellspacing=5
 
|valign=top style="width:40%"|
<br>
# To adjust a label's '''''Padding''''', first select the label whose width and/or height you wish to adjust.
# To adjust a label's '''''Padding''''', first select the label whose width and/or height you wish to adjust.
#* We have selected the "Anfoneb" '''Document Type's''' "Line Total" '''Data Column's''' label.
#* We have selected the "Anfoneb" '''Document Type's''' "Line Total" '''Data Column's''' label.
# In our case we want to lengthen this <code>Line Total</code> label.
# In our case we want to lengthen this <code>Line Total</code> label.
#* This will lengthen our column width, allowing the Line Total column's values to be used for row detection.
#* This will lengthen our column width, allowing the Line Total column's values to be used for row detection.
|valign=top|
 
[[File:2023_TabularLayout_011_Tabular-Layout-with-Label-Sets---Label-Padding_01.png]]
[[File:2023_TabularLayout_011_Tabular-Layout-with-Label-Sets---Label-Padding_01.png]]
|-
 
|valign=top|
 
<br>
# Expand the '''''Padding''''' property.
# Expand the '''''Padding''''' property.
# Use the '''''Left''''', '''''Right''''', '''''Top''''', and/or '''''Bottom''''' properties to adjust the size of the label.
# Use the '''''Left''''', '''''Right''''', '''''Top''''', and/or '''''Bottom''''' properties to adjust the size of the label.
# We entered ''0.5in'' for the '''''Right''''' padding property.
# We entered ''0.5in'' for the '''''Right''''' padding property.
#* This extended the width of our label 0.5 inches to the right.
#* This extended the width of our label 0.5 inches to the right.
# Our line total values now fall below the "Line Total" label. The "Line Total" column can now be used for row detection.
# Our line total values now fall below the "Line Total" label. The "Line Total" column can now be used for row detection.
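
The effect of padding can be pictured with a small Python sketch. It is an illustration only and does not call any Grooper API. Boxes are hypothetical <code>(left, top, right, bottom)</code> tuples in inches, and <code>pad_label</code> is an invented helper that mimics the '''''Left''''', '''''Right''''', '''''Top''''' and '''''Bottom''''' padding properties.

```python
def pad_label(label, left=0.0, right=0.0, top=0.0, bottom=0.0):
    """Grow a (left, top, right, bottom) box outward by the given padding."""
    l, t, r, b = label
    return (l - left, t - top, r + right, b + bottom)


def overlaps_horizontally(a, b):
    """True when the two boxes share any horizontal extent."""
    return a[0] < b[2] and b[0] < a[2]


# Hypothetical geometry: the value sits to the right of its header label.
line_total_label = (5.0, 3.0, 5.8, 3.2)
line_total_value = (6.0, 3.5, 6.6, 3.7)

print(overlaps_horizontally(line_total_label, line_total_value))  # False

padded = pad_label(line_total_label, right=0.5)
print(overlaps_horizontally(padded, line_total_value))  # True
```

In this made-up geometry the padded label only partly covers the value, which is enough for the value to count as aligned with the header.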
|valign=top|
 
[[File:2023_TabularLayout_011_Tabular-Layout-with-Label-Sets---Label-Padding_02.png]]
[[File:2023_TabularLayout_011_Tabular-Layout-with-Label-Sets---Label-Padding_02.png]]
|-
 
|valign=top|
 
<br>
#<li value=5> Success! Now that we adjusted the width of our "Line Total" '''Data Column's''' label, the table extracts successfully.</li>
#<li value=5> Success! Now that we adjusted the width of our "Line Total" '''Data Column's''' label, the table extracts successfully.</li>


Line 1,038: Line 907:
'''FYI'''
'''FYI'''
|
|
You may have noticed we did not pad the label to reach the true "end" of column. Rather, the width just barely overlapped with the currency values in the column.
You may have noticed we did not pad the label to reach the true "end" of column. Rather, the width just barely overlapped with the currency values in the column.


We were able to get away with this because we were using the column for ''row detection''. The "Line Total" '''Data Column's''' extractor was using Primary Extraction to find these values, collect them, and detect rows all at the same time.
We were able to get away with this because we were using the column for ''row detection''. The "Line Total" '''Data Column's''' extractor was using Primary Extraction to find these values, collect them, and detect rows all at the same time.


Were this column using Secondary Extraction to collect the column's values, it's most likely we ''would'' need to further pad out the column header so that it does extend the full width of the column.
Were this column using Secondary Extraction to collect the column's values, it's most likely we ''would'' need to further pad out the column header so that it does extend the full width of the column.
Line 1,047: Line 916:
|}
|}


|valign=top|
[[File:2023_TabularLayout_011_Tabular-Layout-with-Label-Sets---Label-Padding_03.png]]
[[File:2023_TabularLayout_011_Tabular-Layout-with-Label-Sets---Label-Padding_03.png]]
|}
 


==== Table Labels and Labelset Based Classification ====
==== Table Labels and Labelset Based Classification ====
Table headers are often very useful (even critical) for '''''Labelset-Based''''' classification, and you will generally want to use them as a classification feature. Currently, if you want to use a '''Data Table''' object's labels for classification, you must set the '''Data Table's''' '''''Minimum Row Count''''' property to ''at least'' "1". This is a known issue in the current version of '''Grooper''' and will likely change.


Table headers are often very useful (even critical) for '''''Labelset-Based''''' classification, and you will generally want to use them as a classification feature. Currently, if you want to use a '''Data Table''' object's labels for classification, you must set the '''Data Table's''' '''''Minimum Row Count''''' property to ''at least'' "1". This is a known issue in the current version of '''Grooper''' and will likely change.


{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
<br>
However, if you find '''Data Table''' and/or '''Data Column''' labels are not included in determining document similarity during classification, do the following:
However, if you find '''Data Table''' and/or '''Data Column''' labels are not included in determining document similarity during classification, do the following:


Line 1,066: Line 931:


If you have multiple '''Data Table''' objects in your '''Data Model''', you will need to repeat these steps for each one.
If you have multiple '''Data Table''' objects in your '''Data Model''', you will need to repeat these steps for each one.
|
 
[[File:2023_TabularLayout_012_Tabular-Layout-with-Label-Sets---Table-Labels-and-Labelset-Based-Classification.png]]
[[File:2023_TabularLayout_012_Tabular-Layout-with-Label-Sets---Table-Labels-and-Labelset-Based-Classification.png]]
|}
 


For more information on the '''''Labelset-Based''''' document classification method, visit the [[Label Sets]] article.
For more information on the '''''Labelset-Based''''' document classification method, visit the [[Label Sets]] article.


== Advanced Setup Considerations ==
== Advanced Setup Considerations ==
 
The '''''Tabular Layout''''' method is designed to extract tabular data even with the most basic setup described above. However, sometimes "basic" just isn't enough.
The '''''Tabular Layout''''' method is designed to extract tabular data even with the most basic setup described above. However, sometimes "basic" just isn't enough.


The challenging part of table extraction is the variety of forms a table can take. Columns can be in various orders. Table cells can be spaced well apart or jam-packed tight together. Sometimes data is required to be present for some table formats but it's optional on others. There's little consistency in how columns are labeled. Multiline row data can be challenging to target.
The challenging part of table extraction is the variety of forms a table can take. Columns can be in various orders. Table cells can be spaced well apart or jam-packed tight together. Sometimes data is required to be present for some table formats but it's optional on others. There's little consistency in how columns are labeled. Multiline row data can be challenging to target.


Grooper's '''''Tabular Layout''''' method has ways to overcome these issues, and more. For more complicated table structures, the '''''Tabular Layout''''' method has a robust suite of configurable properties. Understanding these properties will allow you to better extract a wider variety of tabular data.
Grooper's '''''Tabular Layout''''' method has ways to overcome these issues, and more. For more complicated table structures, the '''''Tabular Layout''''' method has a robust suite of configurable properties. Understanding these properties will allow you to better extract a wider variety of tabular data.


In this section, we will discuss the following advanced setup features for '''''Tabular Layout''''':
In this section, we will discuss the following advanced setup features for '''''Tabular Layout''''':
Line 1,090: Line 954:
* We will continue testing table extraction using the "Line Items" '''Data Table''' from the [[#Basic Setup]] instructions.
* We will continue testing table extraction using the "Line Items" '''Data Table''' from the [[#Basic Setup]] instructions.
* Column headers have already been established (either using Label Sets or '''''Header Extractors''''')
* Column headers have already been established (either using Label Sets or '''''Header Extractors''''')
* The "Quantity" '''Data Column''' is performing row detection. Its '''''Value Extractor''''' has been configured as described in the [[#Basic Setup]].
* The "Quantity" '''Data Column''' is performing row detection. Its '''''Value Extractor''''' has been configured as described in the [[#Basic Setup]].
* Line location layout data has been collected for all documents.
* Line location layout data has been collected for all documents.


=== Multiline Rows ===
=== Multiline Rows ===
For many documents, the data in each row of a table occupies a single line.


{|cellpadding=10 cellspacing=5
The table we used in our [[#Basic Setup]] instructions had single-line rows. Indeed, single-line table structures are more basic and are typically the easiest to extract.
|valign=top style="width:40%"|
<br>
For many documents, the data in each row of a table occupies a single line.


The table we used in our [[#Basic Setup]] instructions had single-line rows.  Indeed, single-line table structures are more basic and are typically the easiest to extract.
|valign=top|
[[File:2021-tabular-layout-multiline-about-01.png]]
[[File:2021-tabular-layout-multiline-about-01.png]]
|-
 
|valign=top|
 
<br>
Multiline table structures are a little trickier.
Multiline table structures are a little trickier.


In multiline tables, the data in one or more columns can span multiple lines. For example, the "Description" column in this table spans multiple lines (four to be exact).
In multiline tables, the data in one or more columns can span multiple lines. For example, the "Description" column in this table spans multiple lines (four to be exact).


This can pose a challenge for table extraction, particularly for tables with unpredictable line wrapping where sometimes a row may be single-line and others may be multiline.
This can pose a challenge for table extraction, particularly for tables with unpredictable line wrapping where sometimes a row may be single-line and others may be multiline.
Line 1,114: Line 973:


But, have no fear! The '''''Tabular Layout''''' method can easily detect most multiline table structures by enabling the '''''Multiline Rows''''' property.
But, have no fear! The '''''Tabular Layout''''' method can easily detect most multiline table structures by enabling the '''''Multiline Rows''''' property.
|valign=top|
 
[[File:2021-tabular-layout-multiline-about-02.png]]
[[File:2021-tabular-layout-multiline-about-02.png]]
|}


{|cellpadding=10 cellspacing=5
 
|valign=top style="width:40%"|
<br>
The default '''''Tabular Layout''''' settings presume all rows are single-line.
The default '''''Tabular Layout''''' settings presume all rows are single-line.


Line 1,126: Line 982:
# Upon testing extraction, note ''only'' the first line for each row in the "Description" column is collected.
# Upon testing extraction, note ''only'' the first line for each row in the "Description" column is collected.
# The remaining three lines in the "Description" cells are ignored.
# The remaining three lines in the "Description" cells are ignored.
|
 
[[File:2023_TabularLayout_013_Advanced-Setup-Considerations_01.png]]
[[File:2023_TabularLayout_013_Advanced-Setup-Considerations_01.png]]
|-
 
|valign=top|
 
<br>
This is what the '''''Multiline Rows''''' property is for. Enabling this property will allow you to target table structures like this whose rows extend beyond just a single line on the page.
This is what the '''''Multiline Rows''''' property is for. Enabling this property will allow you to target table structures like this whose rows extend beyond just a single line on the page.


# To enable '''''Multiline Rows''''', first expand the '''''Tabular Layout''''' sub-properties.
# To enable '''''Multiline Rows''''', first expand the '''''Tabular Layout''''' sub-properties.
# Switch the '''''Multiline Rows''''' property to ''Enabled''.
# Switch the '''''Multiline Rows''''' property to ''Enabled''.
|
 
[[File:2023_TabularLayout_013_Advanced-Setup-Considerations_02.png]]
[[File:2023_TabularLayout_013_Advanced-Setup-Considerations_02.png]]
|-
 
|valign=top|
 
<br>
# <li value=3> The '''''Tabular Layout''''' method now appropriately detects that the rows occupy multiple lines on the document.</li>
# <li value=3> The '''''Tabular Layout''''' method now appropriately detects that the rows occupy multiple lines on the document.</li>
# The full line item description is now properly extracted by the '''Data Table'''.
# The full line item description is now properly extracted by the '''Data Table'''.
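
One way to picture multiline row handling is to treat each detected row (here, each "Quantity" hit) as an anchor, and to assign every text line at or below an anchor to that row until the next anchor begins. The Python sketch below illustrates that idea only. It is an assumption made for explanation, not a description of how Grooper implements '''''Multiline Rows''''' internally, and the function name and sample data are hypothetical.

```python
def group_multiline_rows(lines, anchor_tops):
    """Assign each (top, text) line to the closest anchor at or above it.

    lines       : list of (top, text) tuples for one column, in inches
    anchor_tops : top coordinate of each detected row's first line
    """
    anchors = sorted(anchor_tops)
    rows = [[] for _ in anchors]
    for top, text in sorted(lines):
        owner = None
        for index, anchor in enumerate(anchors):
            if top >= anchor:
                owner = index  # keep the last anchor that starts above this line
        if owner is not None:  # lines above the first anchor are ignored
            rows[owner].append(text)
    return [" ".join(parts) for parts in rows]


# Hypothetical "Description" column lines and two detected rows.
description_lines = [
    (3.5, "Widget"), (3.7, "blue, large"), (3.9, "ships flat"),
    (4.3, "Gear"), (4.5, "steel"),
]
print(group_multiline_rows(description_lines, anchor_tops=[3.5, 4.3]))
```

With single-line rows only the first line of each description would be kept. Grouping by anchor is what lets the remaining lines join the same cell.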
|
[[File:2023_TabularLayout_013_Advanced-Setup-Considerations_03.png]]
|-
|valign=top|
The '''''Multiline Rows''''' functionality will even detect multiline rows that start on one page and continue on the next.

# Make sure '''''Multiline Rows''''' is enabled.
# In the sub-properties of '''''Multiline Rows''''', set the '''''Detect Page Wrap''''' property to ''True''.
|
[[File:2023_TabularLayout_013_Advanced-Setup-Considerations_04.png]]
|}
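
The effect of '''''Multiline Rows''''' can be pictured with a small sketch. This is not Grooper code (Grooper is configured through its property grid, not a script). The function name <code>group_multiline_rows</code>, the list-of-lines input and the row indices are all hypothetical, and the sketch assumes the last row simply runs to the end of the table body. It only illustrates how lines that wrap beneath a detected row are folded into that row.

```python
def group_multiline_rows(lines, anchor_indices):
    """Fold wrapped lines into the detected row above them.

    lines          -- text lines of the table body, top to bottom (hypothetical input)
    anchor_indices -- indices of the lines on which a row was detected
    """
    anchors = sorted(anchor_indices)
    rows = []
    for pos, start in enumerate(anchors):
        # Assumption: a row runs until the next detected row,
        # or to the end of the table body for the last row.
        end = anchors[pos + 1] if pos + 1 < len(anchors) else len(lines)
        rows.append(lines[start:end])
    return rows


lines = [
    "2  Widget, blue",
    "   anodized finish",
    "   ships in 3 days",
    "1  Gadget",
    "   spare battery included",
]
# Rows were detected on lines 0 and 3; everything in between is a wrapped line.
rows = group_multiline_rows(lines, anchor_indices=[0, 3])
for row in rows:
    print(row)
```

With single-line rows, each row would hold only its first line, which is the behavior seen before '''''Multiline Rows''''' was enabled.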


==== Detect Stacked Layout ====
There is a special variety of multiline table called a "stacked layout" table. In these tables, two different pieces of information are stacked on top of one another in the same column.
 


{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
<br>
For example, in this table, the "Item Number" and "Description" column headers are both contained within the same column, with "Item Number" stacked on top of "Description".
* "Item Number" is highlighted in orange.
* "Description" is highlighted in yellow.
|valign=top|
[[File:2021-tabular-layout-mulitline-stacked-about-01.png]]
|-
|valign=top|
<br>
Their corresponding values are ''also'' stacked on top of each other in each row. The item number in each row is stacked on top of that item's description.
* The item number values are highlighted in orange.
* The item description values are highlighted in yellow.
|valign=top|
[[File:2021-tabular-layout-mulitline-stacked-about-02.png]]
|-
|valign=top|
<br>
In these situations, the '''''Detect Stacked Layout''''' property can help get the right values in the right columns with no additional extraction configuration.
|valign=top|
[[File:2021-tabular-layout-mulitline-stacked-about-03.png]]
|}


{|cellpadding=10 cellspacing=5
 
|valign=top style="width:40%"|
<br>
With '''''Multiline Rows''''' enabled, you can choose to enable or disable the '''''Detect Stacked Layout''''' property.

# '''''Detect Stacked Layout''''' is ''Disabled'' by default.
|valign=top|
[[File:2023_TabularLayout_014_Advanced-Setup-Considerations_Detect-Stacked-Layout_01.png]]
|-
|valign=top|
<br>
Here, we are using the default configuration with '''''Multiline Rows''''' enabled.
# <li value=2> The "Envoy" '''Document Type''' is a good candidate for the '''''Detect Stacked Layout''''' feature.</li>

These two header labels are stacked on top of each other, as is their data in each row.
|valign=top|
[[File:2023_TabularLayout_014_Advanced-Setup-Considerations_Detect-Stacked-Layout_02.png]]
|-
|valign=top|
<br>
Without '''''Detect Stacked Layout''''' enabled, we've got some problems.

# This is the normal '''''Multiline Rows''''' behavior.
#* Grooper correctly determined that these rows span multiple lines. The cell is populated with ''all'' lines.
#* However, this is not what we want.
# For each row, the first line (and only the first line) should go into the "Item Number" column.

Because the "Item Number" header is stacked on top of the "Description" header, we can presume the first line belongs in the "Item Number" column and the second belongs in the "Description" column.
|valign=top|
[[File:2023_TabularLayout_014_Advanced-Setup-Considerations_Detect-Stacked-Layout_03.png]]
|-
|valign=top|
<br>
The '''''Detect Stacked Layout''''' property will put the data from the appropriate line into the appropriate column according to how the labels are stacked.

# To enable '''''Detect Stacked Layout''''', expand the '''''Multiline Rows''''' sub-properties.
# Change '''''Detect Stacked Layout''''' to ''True''.
|valign=top|
[[File:2023_TabularLayout_014_Advanced-Setup-Considerations_Detect-Stacked-Layout_04.png]]
|-
|valign=top|
<br>
# <li value=3> Now, only the first line is collected for the "Item Number" column.</li>
# And, only the second line is collected for the "Description" column.
|valign=top|
[[File:2023_TabularLayout_014_Advanced-Setup-Considerations_Detect-Stacked-Layout_05.png]]
|-
 
|valign=top|
 
<br>
{|class="fyi-box"
|-
'''FYI'''
|
This would have been a very good situation for '''Data Element Overrides'''. Indeed, given '''''Tabular Layout's''''' multitude of configuration options, most users will find themselves using multiple '''Document Types''' and '''Data Element Overrides''' to fine-tune extraction logic based on a variety of table formats.
|}


Given that this "Envoy" '''Document Type''' is the only one that can make use of the '''''Detect Stacked Layout''''' functionality, we '''''really''''' should have made this configuration using '''Data Element Overrides'''. This prevents unintended consequences on other '''Document Types''' where the '''''Detect Stacked Layout''''' feature does ''not'' provide a benefit (or impedes accurate extraction).

We should have enabled '''''Detect Stacked Layout''''' as an override by performing the following steps:
# Turn the '''''Detect Stacked Layout''''' property to ''True''.

Enabling '''''Detect Stacked Layout''''' in the "Envoy" '''Document Type's''' overrides ensures that ''only'' documents classified as "Envoy" use this configuration.
|valign=top|
[[File:2023_TabularLayout_014_Advanced-Setup-Considerations_Detect-Stacked-Layout_06.png]]
|}
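
As a rough mental model of what '''''Detect Stacked Layout''''' does, consider the sketch below. It is purely illustrative and does not reflect Grooper's internals. The function <code>split_stacked_cell</code> and the sample values are hypothetical, and the sketch assumes each stacked header owns exactly one line of the cell, in top-to-bottom order.

```python
def split_stacked_cell(cell_lines, stacked_headers):
    """Give line N of a multiline cell to the Nth header in the stack.

    cell_lines      -- the lines collected for one row of the shared column
    stacked_headers -- header labels in the order they are stacked, top first
    """
    # zip() pairs the first header with the first line, the second with the
    # second, and ignores any extra lines (an assumption of this sketch).
    return {header: line.strip() for header, line in zip(stacked_headers, cell_lines)}


# Without stacked detection, both lines would land in a single cell.
cell_lines = ["AB-1001", "Stainless steel bracket"]
result = split_stacked_cell(cell_lines, ["Item Number", "Description"])
print(result)
```

Here the first line goes to "Item Number" and the second to "Description", which matches the result shown in the walkthrough above.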


=== Advanced Row Detection ===
A '''Data Column's''' '''''Value Extractor''''' extracts data in one of two ways:
# Primary Extraction
#* Primary Extraction is for '''row detection'''. In this case, the extractor runs at the document level, looking for potential rows beneath the '''Data Column's''' header.
# Secondary Extraction
#* Secondary Extraction happens ''after'' rows are detected: ''after'' row instances are formed and ''after'' cell instances are formed. In this case, the extractor runs at the ''instance'' level to further parse table cell or row data.


This section is all about Primary Extraction, that is, using '''Data Column''' extractors to locate and form row instances. (We'll talk more about the differences between Primary and Secondary Extraction in the [[#Primary VS Secondary Extraction]] section.)


In the [[#Basic Setup]] section, we demonstrated a simple example of how a single '''Data Column's''' extractor detects rows. However, more complicated table structures require more complicated solutions.


In this section we will discuss:


==== Row Detection Using Multiple Columns ====
Going back to our [[#Basic Setup]] example: Why did we use the "Quantity" '''Data Column''' for row detection?

Simple enough answer: There were quantities present on every row. Plus, quantity values are a lot easier to pattern match than something like an item number or a description.

However, we could have used other columns for row detection. For example, you'd expect there to be a "Unit Price" or "Line Total" value in the rows of a line item table as well. And, currency values are about as easy to pattern match as quantity values.

You can use not just one but ''multiple'' column values to form row instances. This can be an effective way to throw out false positive rows. Using multiple columns to detect rows, you're effectively saying you need a value present in Column A '''''and''''' Column B to detect a row.


{|class="attn-box"
|⚠
|
You can use as many columns as you need to detect rows. You can configure table extraction so that a value would need to be present in Column A and Column B and Column C and so on.

You can also configure '''''Tabular Layout''''' in such a way that columns can be optionally used to detect rows. You might have a situation where as long as a value is present in Column A '''''or''''' Column B the row should be considered valid and detected.

In either case, when using multiple columns to detect rows the '''''Minimum Cell Count''''' property becomes extremely important. Once you're finished with this section, please be sure to read the <span style="background-color:white">[[#The Minimum Cell Count Property]]</span> section of this article for more information.
|}


{|cellpadding=10 cellspacing=5
 
|valign=top style="width:40%"|
<br>
# For example, look at our initial results for this "Nama" '''Document Type'''.
# As far as the '''''Tabular Layout''''' settings go, we've enabled '''''Multiline Rows''''' and that's it.
#* However, '''''Multiline Rows''''' is agnostic to row detection. It has nothing to do with ''detecting'' rows, only enlarging them to include wrapped lines ''between'' detected rows.
# Using the "Quantity" column alone for row detection, we have collected a false-positive row instance.
#* This row is not a valid row. We need to throw it out.
|valign=top|
[[File:2023_TabularLayout_015_Advanced-Row-Detection_Row-Detection-Using-Multiple-Columns_01.png]]
|-
|valign=top|
<br>
Why did this happen? It's because we used the "Quantity" column to detect rows.

# When the extractor runs within the boundaries of the "Quantity" column, it certainly matches the three numeric quantities listed in the three table rows.
# However, it also matches this value below the table.
#* ''This'' is the result giving us the false positive row. Because the extractor returns a value within the boundaries of the detecting column, Grooper forms a row instance.
|valign=top|
[[File:2023_TabularLayout_015_Advanced-Row-Detection_Row-Detection-Using-Multiple-Columns_02.png]]
|-
|valign=top|
<br>
If we use multiple columns to detect rows, we can avoid this issue.

* Therefore, if we use ''both'' columns to detect rows, the false-positive result would be thrown out.

Furthermore, for all (or certainly most) invoice table formats, we would expect both unit price values and quantity values listed for each row. Configuring two-column row detection would not only help detect rows for this table format in particular, it's likely to help detect rows from other formats as well.
|valign=top|
[[File:2021-tabular-layout-row-detect-multi-col-03.png]]
|-
|valign=top|
<br>
# All we need to do is configure the "Unit Price" '''Data Column''' to perform row detection.
# We will configure its '''''Value Extractor''''', referencing the '''Value Reader''' we saw earlier matching numeric/currency values.
|valign=top|
[[File:2023_TabularLayout_015_Advanced-Row-Detection_Row-Detection-Using-Multiple-Columns_03.png]]
|-
|valign=top|
<br>
# With both the "Quantity" and "Unit Price" '''Data Columns'''' '''''Value Extractor''''' properties configured, a value is required in ''both'' columns for a row to be detected.
# This throws out our false-positive match from earlier, when only the "Quantity" '''Data Column's''' '''''Value Extractor''''' was configured.
|valign=top|
[[File:2023_TabularLayout_015_Advanced-Row-Detection_Row-Detection-Using-Multiple-Columns_04.png]]
|}
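
The reasoning above can be illustrated with a short sketch. This is not how Grooper is implemented or configured. It only mimics the idea that a line becomes a row when ''every'' detecting column has a match. The regular expressions, the <code>detect_rows</code> function and the sample data are assumptions made for the example, and each text line is modeled as a hypothetical dictionary of the text found within each column's boundaries.

```python
import re

# Hypothetical stand-ins for the extractors referenced by each Data Column.
QUANTITY = re.compile(r"^\d+$")
CURRENCY = re.compile(r"^\$?\d{1,3}(,\d{3})*\.\d{2}$")


def detect_rows(lines, detectors):
    """Return the indices of lines where every detecting column has a match."""
    detected = []
    for index, cells in enumerate(lines):
        if all(pattern.match(cells.get(column, "")) for column, pattern in detectors.items()):
            detected.append(index)
    return detected


lines = [
    {"Quantity": "4", "Unit Price": "12.50"},
    {"Quantity": "10", "Unit Price": "3.00"},
    {"Quantity": "1", "Unit Price": "1,250.00"},
    {"Quantity": "30", "Unit Price": ""},  # stray number below the table
]

one_column = detect_rows(lines, {"Quantity": QUANTITY})
two_columns = detect_rows(lines, {"Quantity": QUANTITY, "Unit Price": CURRENCY})
print(one_column)   # includes the stray line
print(two_columns)  # stray line is dropped
```

Detecting on "Quantity" alone keeps the stray line, while requiring a "Unit Price" match as well discards it, mirroring the false-positive row thrown out in the steps above.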


==== The Minimum Cell Count Property ====
{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
<br>
The '''''Minimum Cell Count''''' property is extremely important when using multiple columns to detect rows.
# In the '''''Tabular Layout''''' sub-properties, this property is located in the '''''Row Detection''''' sub-properties.
#* This means a minimum of 3 column values must be present in order to detect a row.
#* So, if you have 5 '''Data Columns''' whose '''''Value Extractors''''' are configured, only 3 of their values would need to be present to detect the row and form a row instance.
#* So, if you have 5 '''Data Columns''' whose '''''Value Extractors''''' are configured, only 3 of their values would need to be present to detect the row and form a row instance.
|valign=top|
[[File:2023_TabularLayout_016_Advanced-Row-Detection_The-Minimum-Cell-Count-Property_01.png]]
|-
|valign=top|
<br>
There is, however, a caveat if you have ''fewer than'' the minimum value of '''Data Columns''' with configured '''''Value Extractors'''''.
# For example, we currently only have two '''Data Columns''' with configured '''''Value Extractors'''''.
# But, we're still collecting table data.


Since only two '''Data Columns'''' extractors are configured, we don't actually reach the "minimum" of "3". The '''''Tabular Layout''''' method will account for this and still extract the table data, requiring a value in each of the two configured columns rather than in three.
* It's when you go ''over'' the minimum cell count value in terms of the number of '''Data Columns''' with configured '''''Value Extractors''''' that this property really comes into play.
|valign=top|
[[File:2023_TabularLayout_016_Advanced-Row-Detection_The-Minimum-Cell-Count-Property_02.png]]
|-
|valign=top|
<br>
Next, we're going to look at the '''''Minimum Cell Count''''' property where the number of '''Data Columns''' with configured '''''Value Extractors''''' ''does'' exceed the minimum cell count value (or will eventually by the time we're done).

# This invoice ''should'' have four rows.
# However, as configured currently with the "Quantity" and "Unit Price" '''Data Columns''' performing row detection, we're only detecting two rows.
|valign=top|
[[File:2023_TabularLayout_016_Advanced-Row-Detection_The-Minimum-Cell-Count-Property_03.png]]
|-
|valign=top|
<br>
Furthermore, we've got another issue due to '''''Multiline Rows''''' being enabled.

All of this can be resolved with better row detection.
|valign=top|
[[File:2023_TabularLayout_016_Advanced-Row-Detection_The-Minimum-Cell-Count-Property_04.png]]
|-
|valign=top|
<br>
# First, let's fix the problem with the "Line Total" '''Data Column's''' value.
# If we configure the "Line Total" '''Data Column's''' '''''Value Extractor''''', it will match the dollar amount in this row properly.
# Here, we've configured the '''''Value Extractor''''' property to reference that same '''Value Reader''' matching numeric/currency amounts.




Think about it. This is the text cell extracted for the extended price.
<pre>
40,700.00
</pre>


By configuring the "Line Total" '''Data Column's''' extractor, we've added one more rule to detect valid rows. In order for a row to be detected, ''all'' the following conditions must be met:
* You must have a matching result in the "Quantity" column.
* You must have a matching result in the "Unit Price" column.
* You must have a matching result in the "Line Total" column.
|valign=top|
[[File:2023_TabularLayout_016_Advanced-Row-Detection_The-Minimum-Cell-Count-Property_05.png]]
|-
|valign=top|
<br>
# Now we've extracted the correct value for the "Line Total".
# However, we're still only returning two rows.
# We're going to use the '''''Minimum Cell Count''''' property to fix this.
|valign=top|
[[File:2023_TabularLayout_016_Advanced-Row-Detection_The-Minimum-Cell-Count-Property_06.png]]
|-
|valign=top|
Because we now have three '''Data Columns''' whose '''''Value Extractors''''' are configured, we have met the minimum cell count of "3".




The truth is this table structure is a little non-standard in two ways.
* Whereas this table lists a zero dollar amount in the "Line Total" (Extended Price) column, it leaves the cell ''blank'' in the "Unit Price" column. Since there's no value there, Grooper passes it over for row detection.
* While the shipping cost is listed in the table for this invoice, the "Quantity" (Qty Shp.) is left blank.

In both cases, one of the three column values required for detection is missing. However, in ''all'' cases two of the three values ''are'' present for each row. We can use the '''''Minimum Cell Count''''' property to change our detection logic a bit.
|valign=top|
[[File:2021-tabular-layout-row-detect-min-cell-07.png]]
|-
|valign=top|
<br>
# We can successfully extract every row in this table by dropping the '''''Minimum Cell Count''''' value to ''2''.
#* Remember, we have three '''Data Columns'''' extractors configured, meaning three ''can potentially'' be used to detect rows.
#** A row with a "Quantity" value and a "Line Total" value would be detected.
#** A row with a "Quantity" value, a "Unit Price" value, and a "Line Total" value would be detected.
#** A row with a "Quantity" value alone? Nope. Not a valid row. Doesn't meet the minimum of "2".
# With this change to our row detection logic, all four rows are collected.
|valign=top|
[[File:2023_TabularLayout_016_Advanced-Row-Detection_The-Minimum-Cell-Count-Property_07.png]]
|-
 
|valign=top|
 
<br>
{|class="fyi-box"
|-

For '''''most''''' of our '''Document Types''' in this set, using our three '''Data Column''' extractors and a '''''Minimum Cell Count''''' of ''3'' actually works really well as far as row detection goes.
:&bull; The "Factura" '''Document Type''' doesn't fit the normal model. It works better with a '''''Minimum Cell Count''''' of ''2''.
:&bull; Therefore, the adjustment to the '''''Minimum Cell Count''''' should be made in the "Factura" '''Document Type's''' overrides.
|}
|valign=top|
 
[[File:2023_TabularLayout_016_Advanced-Row-Detection_The-Minimum-Cell-Count-Property_08.png]]
|}
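
The counting rule behind '''''Minimum Cell Count''''' can be sketched in a few lines of Python. This is only an illustration of the rule described above, not Grooper's implementation. The function name <code>detect_rows</code> and the sample line data are hypothetical.

```python
def detect_rows(lines, minimum_cell_count):
    """Return the lines that qualify as rows.

    Each line is a dict mapping a Data Column name to the value its
    extractor matched on that line (columns with no match are absent).
    """
    return [line for line in lines if len(line) >= minimum_cell_count]


# Hypothetical extractor results for a four-row table with some blank cells.
lines = [
    {"Quantity": "2", "Unit Price": "5.00", "Line Total": "10.00"},
    {"Quantity": "1", "Line Total": "7.50"},
    {"Quantity": "3", "Unit Price": "1.25", "Line Total": "3.75"},
    {"Quantity": "4", "Unit Price": "2.00"},
    {"Quantity": "9"},  # a lone number elsewhere on the page, not a row
]

rows_strict = detect_rows(lines, minimum_cell_count=3)
rows_relaxed = detect_rows(lines, minimum_cell_count=2)
print(len(rows_strict), len(rows_relaxed))
```

With a minimum of ''3'', only the lines where all three columns matched qualify. Dropping the minimum to ''2'' picks up the two-value lines as well, while the line with a single value is still rejected.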


==== Row Detection Limitations with Multiline Rows ====


There is one strict limitation to Grooper's row detection when you're dealing with multiline rows.  In order to detect a row, ALL values must be present on the ''same line''.


Tables with multiline rows generally exist in two flavors (or a Neapolitan combination of the two):
{|cellpadding=10 cellspacing=5
|valign=top|
<br>
# Rows are multiline because the text within a cell wraps to the next line.
|
 
[[File:2021-tabular-layout-row-detect-limits-01.png]]
|-
 
|valign=top|
 
<br>
#<li value=2> Rows are multiline because the columns have a stacked layout.</li>
|
 
[[File:2021-tabular-layout-row-detect-limits-02.png]]
|}


There are a variety of ways Grooper handles stacked column data in multiline rows. We've already seen the '''''Multiline Rows''''' feature's '''''Detect Stacked Layout''''' option ([[#Detect Stacked Layout|See here for more details]]).
* FYI: We'll see more ways to handle data stacked within a table cell in the [[#Primary VS Secondary Extraction|Secondary Extraction]] portion of this article.




However, you should always keep in mind the '''''Multiline Rows''''' feature has absolutely nothing to do with ''detecting'' rows. Grooper must detect a row ''first'' before it applies the '''''Multiline Rows''''' logic to expand the row instance across multiple lines of text. For tables with a stacked column layout, row detection can prove challenging if you are using multiple '''Data Columns''' to detect rows using data on ''separate lines''.
* '''In order to detect a row, ALL values must be present ''on the same line'''''.


{|cellpadding=10 cellspacing=5
 
|valign=top style="width:40%"|
<br>
For example, take this invoice line items table format with stacked columns.

# The "Unit Price" '''Data Column''', labeled as <code>UNIT PRICE</code> here.
# The "Line Total" '''Data Column''', labeled as <code>TOTAL</code>
|valign=top|
 
[[File:2021-tabular-layout-row-detect-limits-03.png]]
|-
 
|valign=top|
 
<br>
The problem, as far as row detection goes, is two of these column values are on the same line, but one is on a separate line.

Grooper will '''''not''''' be able to detect rows (and therefore won't collect table data) as '''''Tabular Layout''''' is currently configured.
|valign=top|
 
[[File:2021-tabular-layout-row-detect-limits-04.png]]
|-
 
|valign=top|
 
<br>
If we try to extract this table, as configured, we will get no results whatsoever (because no rows are detected).

# <span style="color:white; background-color:#36b0a7; padding-left:5px; padding-right:5px">FYI</span> Enabling '''''Detect Stacked Layout''''' has nothing to do with row detection.
#* These properties will be helpful in modeling the row structure, but won't do anything if we're not detecting rows in the first place!
|valign=top|
 
[[File:2023_TabularLayout_017_Advanced-Row-Detection_Row-Detection-Limitations-with-Multiline-Rows_01.png]]
|}
 


How are we going to fix this?  There are two ways we could approach this problem:
# By adjusting the '''''Row Detection > Minimum Cell Count''''' property.
#* As we've seen before, when you adjust this property such that the number is ''less'' than the number of '''Data Columns''' with configured '''''Value Extractors''''', it makes '''Data Columns''' optional when it comes to row detection.
#* If we lowered this to ''2'', only two of our three columns would be required for row detection. The "Quantity" and "Line Total" columns' values are on the same line. Therefore, we would detect our rows.
# By disabling row detection for the "Unit Price" '''Data Column'''.
#* This may sound wacky, but it will be highly effective for our situation here. What's the problem here?  Row detection due to a stacked column layout. Specifically, one '''Data Column's''' value is on the second line of the row (the "Unit Price" column).
#* However, we have data we can use for detection on the first line (the "Quantity" and "Line Total" columns).
#* All we have to do is tell '''''Tabular Layout''''', "Don't use the 'Unit Price' column's extractor to detect rows," and we will start to collect our table data.
#** FYI: You might already be asking yourself "If we disable the column's extractor for row detection, why don't we just remove it?"  That's because we ''are'' going to use it. For Secondary Extraction. After we talk about disabling a '''Data Column's''' extractor for row detection, this will lead us into a discussion about '''''Tabular Layout's''''' Secondary Extraction capabilities in the [[#Primary VS Secondary Extraction]] section.
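
Both approaches can be compared with a small Python sketch of line-based row detection on a stacked layout. It is a simplified model under stated assumptions, not Grooper's actual logic: the function <code>detect_rows</code>, the sample data, and the idea that the minimum count for the second approach equals the number of columns still participating are all hypothetical.

```python
def detect_rows(lines, detection_columns, minimum_cell_count):
    """Return the indexes of lines with enough matches to count as a row.

    Only columns listed in detection_columns participate in detection.
    """
    detected = []
    for index, matches in enumerate(lines):
        hits = sum(1 for column in detection_columns if column in matches)
        if hits >= minimum_cell_count:
            detected.append(index)
    return detected


# Stacked layout: "Quantity" and "Line Total" share the first line of each
# row, while "Unit Price" sits alone on the second line.
lines = [
    {"Quantity": "2", "Line Total": "10.00"},
    {"Unit Price": "5.00"},
    {"Quantity": "1", "Line Total": "7.50"},
    {"Unit Price": "7.50"},
    {"Quantity": "3", "Line Total": "3.75"},
    {"Unit Price": "1.25"},
]
all_columns = ["Quantity", "Unit Price", "Line Total"]

# As configured: all three values required on one line, so nothing is found.
as_configured = detect_rows(lines, all_columns, minimum_cell_count=3)

# Approach 1: lower the minimum cell count to 2.
lowered_minimum = detect_rows(lines, all_columns, minimum_cell_count=2)

# Approach 2: leave "Unit Price" out of row detection entirely.
unit_price_disabled = detect_rows(
    lines, ["Quantity", "Line Total"], minimum_cell_count=2
)
print(as_configured, lowered_minimum, unit_price_disabled)
```

In this model both approaches find the first line of each of the three rows. The difference is intent: the second approach states outright that the "Unit Price" extractor should never take part in detection.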


==== Disabling Row Detection ====


The previous example is a good one to point out how to ''disable'' row detection for a specific '''Data Column''' (and why you'd want to in the first place).


{|cellpadding=10 cellspacing=5|
|valign=top|
To recap:

Because the values exist on different lines, '''''Tabular Layout''''' ''cannot'' detect the rows.
|valign=top|
 
[[File:2021-tabular-layout-disable-row-detect-01.png]]
|-
 
|valign=top|
 
<br>
However, if we ''only'' used the "Quantity" and "Line Total" columns to detect rows, we would have no issue.
* The "Quantity" and "Line Total" '''Data Columns'''' '''''Value Extractor''''' configurations would detect the rows.

All we need to do is disable row detection for the "Unit Price" '''Data Column''', using the '''''Tabular Layout''''' method's '''''Column Settings''''' properties.
|valign=top|
 
[[File:2021-tabular-layout-disable-row-detect-02.png]]
|}


Generally speaking, once you start configuring the '''''Column Settings''''' properties, you're doing so because you have a large number of table formats represented by a large number of '''Document Types'''. In most cases, you will adjust these properties per '''Document Type''' using '''Data Element Overrides'''.

Going forward, when adjusting the '''''Column Settings''''' in this tutorial, we will do so using a '''Document Type's''' overrides instead of configuring the global '''Data Table''' object.


{|cellpadding=10 cellspacing=5
 
|valign=top style="width:40%"|
<br>
# We will demonstrate disabling row detection by disabling the "Unit Price" '''Data Column's''' row detection for the "Daftari" '''Document Type'''.
# Navigate to the "Overrides" tab to override the '''Data Table's''' configuration for the selected '''Document Type'''.
# Select the '''''Column Settings''''' property.
# Press the ellipsis button at the end of the property.
|valign=top|
 
[[File:2023_TabularLayout_018_Advanced_Row-Detection_Disabling-Row-Detection_01.png]]
|-
 
|valign=top|
 
<br>
# This will bring up the '''''Column Settings''''' editor.
# The '''''Column''''' column lists the '''Data Columns''' in your '''Data Table'''. Select the '''Data Column''' you wish to configure.
#* In our case we want to disable row detection for the "Unit Price" '''Data Column'''.
# To disable row detection for the selected '''Data Column''', change the '''''Row Detection''''' property to ''Disabled''.
#* This will prevent the '''Data Column's''' '''''Value Extractor''''' from performing Primary Extraction, forcing it to use Secondary Extraction instead. For more on Secondary Extraction, visit the [[#Primary VS Secondary Extraction]] portion of the article.
# Press ''OK'' when finished.
|valign=top|
 
[[File:2023_TabularLayout_018_Advanced-Row-Detection_Disabling-Row-Detection_02.png]]
|-
 
|valign=top|
 
<br>
# With '''''Row Detection''''' ''Disabled'' for the "Unit Price" '''Data Column''' in the '''''Column Settings''''', Grooper can now detect rows for this table format.
# Grooper successfully detects the three rows present on the document.
#* This is at least a better problem than the one we had before.
#* Previously, we weren't getting ''any'' data for ''any'' columns in ''any'' rows.
#* Now, we at least have row instances to work with and we're getting most of our table data. Furthermore, the data we want is contained within the cell. We just need a way of extracting it.
#** With the data we want present in each cell, we can extract the data (the listed unit price currency value) using Secondary Extraction.
|valign=top|
 
[[File:2023_TabularLayout_018_Advanced-Row-Detection_Disabling-Row-Detection_03.png]]
|}
 


{|class="fyi-box"
|}


''Optional'' is the default setting. This means the '''Data Column''' will be used for row detection, but is not required.
:&bull; Imagine your '''Data Table's''' '''''Row Detection > Minimum Cell Count''''' property is set to ''3'' and you have ''5'' '''Data Columns''' whose '''''Column Settings > Row Detection''''' properties are set to ''Optional''.
::&bull; If all five of those '''Data Columns'''' extractors produced results on a line, the row would be detected.
::&bull; If ''any'' two of those '''Data Columns'''' extractors failed to produce a result, but the other three did return a result, the row would still be detected.
:::&bull; An optional '''Data Column''' can potentially be used for row detection, but if it fails to return a value, the row can still be detected. As long as enough other '''Data Columns''' produce results (such that the number of '''Data Columns''' returning a result meets the '''''Minimum Cell Count''''' value), the row will be detected.
:::&bull; Refer to [[#The Minimum Cell Count Property|this section of the article]] for more information on how the minimum cell count affects row detection.




=== Primary VS Secondary Extraction ===
Primary Extraction and Secondary Extraction refer to how a '''Data Column's''' '''''Value Extractor''''' extracts table data.

# DATA INSTANCES


{|cellpadding=10 cellspacing=5
 
|valign=top style="width:50%"|
<br>
 
It really all boils down to data instances. The '''''Tabular Layout''''' method subdivides a table into data instances in a variety of ways: first into column instances, second into row instances and third into cell instances. At the end of the process, Grooper has everything it needs to collect data using these sub-instances.
|valign=top|
[[File:2021-tabular-layout-secondary-extract-01.png]]
|-
 
|valign=top|
 
<br>
For '''Primary Extraction''', the '''Data Column's''' extractor executes within the ''column instance''.
* Primary Extraction is utilized for row detection, which is the process of forming row instances.
|valign=top|
 
[[File:2021-tabular-layout-secondary-extract-02.png]]
|-
 
|valign=top|
 
<br>
For '''Secondary Extraction''', data is collected from the table using the instances established ''after'' rows are detected. This is done in one of the following ways:
* The '''Data Column's''' extractor executes within the ''cell instance''.
** Secondary Extraction is employed to parse data within a cell, ''after'' rows are detected and the table's structure is established.
* Less commonly, the '''Data Column's''' extractor executes within the whole row instance.
** Secondary Extraction can also be configured in such a way that extraction occurs at the row-level rather than the cell-level.
|valign=top|
 
[[File:2021-tabular-layout-secondary-extract-03.png]]
|-
 
|valign=top|
 
<br>
Secondary Extraction is useful for further parsing table data once rows have already been detected and cell and row instances are formed.

For example, we had an issue in the [[#Disabling Row Detection|previous section]] where rows were detected but a column's values were not extracted correctly.

Due to an issue with the table's stacked column structure, we ''couldn't'' use the "Unit Price" '''Data Column''' for row detection. So, we disabled row detection for that column in the '''''Column Settings''''' properties. This prevented the '''Data Column''' from performing Primary Extraction.

Instead, it is falling back on Secondary Extraction.

Secondary Extraction will attempt to execute the '''Data Column's''' '''''Value Extractor''''' ''inside the cell instance'' rather than the column instance. If that extractor fails to return a result, the entire text within the geometric boundaries of the cell is returned instead.


{|cellpadding=10 cellspacing=5
|valign=top|
Currently, we're simply returning all the text within each cell of each row for the "Unit Price" column. This isn't what we want to collect.
|valign=top|
However, the value we do want (the dollar amount) is fully contained within the cell. We just need to extract it from the text present in the cell.
|-
|valign=top|
|}


|valign=top|
[[File:2023_TabularLayout_019_Primary-vs-Secondary-Extract_01.png]]
|}


{|cellpadding=10 cellspacing=5
 
|valign=top style="width:40%"|
<br>
# The "Unit Price" '''Data Column''' does currently have its '''''Value Extractor''''' configured.
# It's using the same generic numeric/currency extractor we've been using throughout this article to match numeric values.
#* All we need to do is ensure this extractor can properly extract data from the cell.
|valign=top|
 
[[File:2023_TabularLayout_019_Primary-vs-Secondary-Extract_02.png]]
|-
 
|valign=top|
 
<br>
This brings up a common issue when performing Secondary Extraction. Always be aware of the instance-level you are extracting.

# This is the extractor the "Unit Price" '''Data Column''' references.
# However, there is an issue with this '''''Suffix Pattern''''' when the extractor runs Secondary Extraction in the "Unit Price" column's cells.


When run globally on the document, it would make sense to expect a space character after the number. However, when you get down to the cell instances for the "Unit Price" column, there ''is no space character''.
 
[[File:2021-tabular-layout-secondary-extract-09.png|center]]
<br>
 
The '''''Suffix Pattern''''' doesn't match within the cell. Instead of there being a space character present, there's just nothing. The text data terminates at the end of the number itself. When run using Secondary Extraction, this extractor fails to produce a result.
|valign=top|
 
[[File:2023_TabularLayout_019_Primary-vs-Secondary-Extract_03.png]]
|-
 
|valign=top|
 
<br>
We just need to update this extractor so that it will match within the cell during Secondary Extraction.

#* <code>\s|$</code>
# <span style="color:white; background-color:#36b0a7; padding-left:5px; padding-right:5px">'''FYI'''</span> It's very common to use the end-of-string anchor <code>$</code> in your '''''Suffix Patterns''''' as well as the beginning-of-string anchor <code>^</code> in your '''''Prefix Patterns''''' when relying on Secondary Extraction.
|valign=top|
 
[[File:2023_TabularLayout_019_Primary-vs-Secondary-Extract_04.png]]
|-
 
|valign=top|
 
<br>
With this minor change to the extraction logic, the extractor will now properly execute within the cell whenever Secondary Extraction is performed.

# The "Unit Price" '''Data Column's''' '''''Value Extractor''''' runs during Secondary Extraction, executing against the cell instance ''after'' rows are detected.
#* Now that we adjusted the extractor to match within the cell instance, we get the value we want.
|valign=top|
 
[[File:2023_TabularLayout_019_Primary-vs-Secondary-Extract_05.png]]
|}
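
The '''''Suffix Pattern''''' problem is easy to reproduce with an ordinary regular expression. The Python sketch below is only an approximation: the number pattern, the helper name <code>extract</code>, and the sample text are hypothetical, and Python's <code>re</code> module stands in for Grooper's pattern matching. It shows why a suffix of <code>\s</code> matches in the full document text but fails inside a cell instance, and why <code>\s|$</code> works in both.

```python
import re

# Hypothetical stand-in for the generic numeric/currency value pattern.
NUMBER = r"\d[\d,]*\.\d{2}"


def extract(text, suffix):
    """Return the first number followed by the suffix pattern, or None."""
    match = re.search(f"({NUMBER})(?:{suffix})", text)
    return match.group(1) if match else None


# In the full document text, more text follows the number.
document_text = "PER UNIT 12.50 TOTAL 25.00\n"
# In the cell instance, the text ends right after the number.
cell_text = "PER UNIT\n12.50"

in_document = extract(document_text, r"\s")
in_cell_original = extract(cell_text, r"\s")
in_cell_updated = extract(cell_text, r"\s|$")
print(in_document, in_cell_original, in_cell_updated)
```

The original suffix finds the number in the document text, returns nothing in the cell, and the updated suffix recovers the number from the cell.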


==== Secondary Extract Modes ====
{|cellpadding=10 cellspacing=5
|valign=top style="width:50%"|
There are three ways in which Secondary Extraction can be performed, called '''''Secondary Extract Modes'''''. These modes can be configured '''Data Column''' by '''Data Column''' using the '''''Tabular Layout > Column Settings > Secondary Extract Mode''''' settings.
# ''Cell Extract''
#* For the ''Cell Extract'' mode, the '''Data Column's''' '''''Value Extractor''''' executes within the table cell's text contents.
# ''Row Extract''
#* The ''Row Extract'' mode executes the '''Data Column's''' extractor against the full text of the row instance (''not'' the cell instance).
#* This is the least common '''''Secondary Extract Mode'''''. Typically, this mode is used as a last resort due to atypical table structures.
|valign=top|
 
[[File:2023_TabularLayout_020_Primary-vs-Secondary-Extract_Secondary-Extract-Modes_01.png]]
|}


===== Auto VS Cell Extract VS Geometric =====
 
The default value for the '''''Secondary Extract Mode''''' is ''Auto''. "Auto" will attempt to use the ''Cell Extract'' mode, but will fall back on the ''Geometric'' mode as a failsafe.
The default value for the '''''Secondary Extract Mode''''' is ''Auto''. "Auto" will attempt to use the ''Cell Extract'' mode, but will fall back on the ''Geometric'' mode as a failsafe.
* ''Auto'' first attempts to use ''Cell Extract''. If the '''Data Column's''' extractor returns a match within the cell, its result will be returned.
* ''Auto'' first attempts to use ''Cell Extract''. If the '''Data Column's''' extractor returns a match within the cell, its result will be returned.
* If the '''Data Column's''' extractor ''fails'' to return a match within the cell, ''Auto'' will use the ''Geometric'' mode. All text within the geometric boundaries of the cell will be returned.
* If the '''Data Column's''' extractor ''fails'' to return a match within the cell, ''Auto'' will use the ''Geometric'' mode. All text within the geometric boundaries of the cell will be returned.
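
The fallback behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not Grooper's implementation: the function name <code>secondary_extract_auto</code>, the use of a regular expression as the "extractor", and the sample cell text are all assumptions made for the example.

```python
import re

def secondary_extract_auto(cell_text: str, value_pattern: str) -> str:
    """Hypothetical model of the Auto mode for one cell."""
    match = re.search(value_pattern, cell_text)
    if match:
        # Cell Extract: the extractor matched inside the cell, so return its result.
        return match.group(0)
    # Geometric fallback: return all text found within the cell's boundaries.
    return cell_text.strip()

currency = r"\d+\.\d{2}"  # assumed stand-in for a Data Column's Value Extractor

print(secondary_extract_auto("$ 12.50 ea", currency))    # extractor match is returned
print(secondary_extract_auto("Widget, blue", currency))  # no match, so the full cell text is returned
```

Forcing a column to ''Cell Extract'' or ''Geometric'' would correspond to keeping only one of the two branches in this sketch.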


This is exactly what happened in our previous example.
This is exactly what happened in our previous example.
Line 1,814: Line 1,602:
You can, however, force a '''Data Column''' to only ever use either ''Cell Extract'' or ''Geometric'' by configuring the '''''Tabular Layout > Column Settings > Secondary Extract Mode''''' property for one or more '''Data Columns'''.
You can, however, force a '''Data Column''' to only ever use either ''Cell Extract'' or ''Geometric'' by configuring the '''''Tabular Layout > Column Settings > Secondary Extract Mode''''' property for one or more '''Data Columns'''.


===== <span style="color:#662d91; font-size:115%">Row Extract</span> =====
===== Row Extract =====
 
The ''Row Extract'' mode allows you to execute a '''Data Column's''' extractor against the row instance rather than the cell instance. There are two main reasons to do this:
The ''Row Extract'' mode allows you to execute a '''Data Column's''' extractor against the row instance rather than the cell instance. There are two main reasons to do this:
# The table's structure is atypical and Grooper was not able to appropriately find the divisions between columns.
# The table's structure is atypical and Grooper was not able to appropriately find the divisions between columns.
# You need to extract data that is in each row but not labeled by a column header.
# You need to extract data that is in each row but not labeled by a column header.


In either case, it may be difficult (or even impossible) to extract the data you want out of a specific cell within a row. However, it may be possible to extract the data from the row itself.
In either case, it may be difficult (or even impossible) to extract the data you want out of a specific cell within a row. However, it may be possible to extract the data from the row itself.
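
The difference between running an extractor against a cell instance and against a row instance can be illustrated with a small Python sketch. Everything here is hypothetical: the sample row, the cell dictionary, the unit codes in <code>UNIT_PATTERN</code>, and the helper <code>run_extractor</code> are invented for illustration and do not reflect Grooper's internal data structures.

```python
import re

# Assumed sample data: the full text of one row instance and the text of the
# cells Grooper was able to form under labeled column headers.
row_text = "1001  Steel bracket  4 EA  2.50  10.00"
cells = {"Quantity": "4", "Unit Price": "2.50", "Line Total": "10.00"}

UNIT_PATTERN = r"\b(?:EA|BX|CS)\b"  # assumed unit-of-measure codes

def run_extractor(text, pattern):
    """Return the first match of the pattern in the text, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

# Cell-level attempts: no labeled cell contains the unit, so nothing is found.
from_cells = {name: run_extractor(text, UNIT_PATTERN) for name, text in cells.items()}

# Row-level attempt: the same extractor succeeds against the whole row.
from_row = run_extractor(row_text, UNIT_PATTERN)

print(from_cells)
print(from_row)
```

The extractor itself does not change between the two attempts. Only the text it is given changes, which is the essence of the ''Row Extract'' mode.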
 


{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
<br>
For example, imagine we wanted to find the "Unit" column for our line item tables as well.
For example, imagine we wanted to find the "Unit" column for our line item tables as well.


The unit "EA" is listed clearly for each item in the row. However, there is no column header labeling this column. There's nothing like "Unit" or "Unit of Measure" or "UOM" present to label the column.
The unit "EA" is listed clearly for each item in the row. However, there is no column header labeling this column. There's nothing like "Unit" or "Unit of Measure" or "UOM" present to label the column.
|valign=top|
 
[[File:2021-tabular-layout-row-extract-01.png]]
[[File:2021-tabular-layout-row-extract-01.png]]
|-
 
|valign=top|
 
<br>
Furthermore, because this table has line layout data, neither the "Unit Price" nor the "Line Total" columns would ever contain this value within their cell instances for any row.
Furthermore, because this table has line layout data, neither the "Unit Price" nor the "Line Total" columns would ever contain this value within their cell instances for any row.
* Sometimes you can get away with using a different column's header label, even using the same one that's already been used by another '''Data Column'''. This will not be the case here.
* Sometimes you can get away with using a different column's header label, even using the same one that's already been used by another '''Data Column'''. This will not be the case here.
|valign=top|
 
[[File:2021-tabular-layout-row-extract-02.png]]
[[File:2021-tabular-layout-row-extract-02.png]]
|-
 
|valign=top|
 
<br>
However, the data is always present in each row, and Grooper easily detects each row in this table.
However, the data is always present in each row, and Grooper easily detects each row in this table.


We can still extract the unit of measure from the ''row instance'', using the ''Row Extract'' '''''Secondary Extract Mode'''''.
We can still extract the unit of measure from the ''row instance'', using the ''Row Extract'' '''''Secondary Extract Mode'''''.
|valign=top|
 
[[File:2021-tabular-layout-row-extract-03.png]]
[[File:2021-tabular-layout-row-extract-03.png]]
|-
 
|valign=top|
 
<br>
Next, we're going to configure '''''Tabular Layout''''' so that the "Racun" '''Document Type''' will use the ''Row Extract'' mode to extract the unit of measure value from each row in its invoices' line items tables.
Next, we're going to configure '''''Tabular Layout''''' so that the "Racun" '''Document Type''' will use the ''Row Extract'' mode to extract the unit of measure value from each row in its invoices' line items tables.


Line 1,854: Line 1,636:
# We have added a "Units" '''Data Column''' and assigned this '''Value Reader''' as its '''''Value Extractor'''''.
# We have added a "Units" '''Data Column''' and assigned this '''Value Reader''' as its '''''Value Extractor'''''.
# Ultimately, we will use this to extract the unit values from each row instance.
# Ultimately, we will use this to extract the unit values from each row instance.
|valign=top|
 
[[File:2023_TabularLayout_021_Primary-vs-Secondary-Extract_Row-Extract_01.png]]
[[File:2023_TabularLayout_021_Primary-vs-Secondary-Extract_Row-Extract_01.png]]
|-
 
|valign=top|
 
<br>
If we test extraction against our sample document, we will get everything but the "Unit" column.
If we test extraction against our sample document, we will get everything but the "Unit" column.
* Commonly, you will configure '''''Secondary Extract Modes''''' as override changes for a '''Document Type''', which is what we're choosing to do here.
* Commonly, you will configure '''''Secondary Extract Modes''''' as override changes for a '''Document Type''', which is what we're choosing to do here.
Line 1,865: Line 1,646:
# We've navigated to the "Overrides" tab.
# We've navigated to the "Overrides" tab.
# Testing out extraction, we have nothing populated for the "Unit" column.
# Testing out extraction, we have nothing populated for the "Unit" column.
#* This shouldn't be surprising. We have not established a column header for this '''Data Column''' because there is no header label to collect!
#* This shouldn't be surprising. We have not established a column header for this '''Data Column''' because there is no header label to collect!
# We will use the '''''Column Settings''''' properties to force the "Unit" '''Data Column''' to perform Secondary Extraction, using the ''Row Extract'' mode.
# We will use the '''''Column Settings''''' properties to force the "Unit" '''Data Column''' to perform Secondary Extraction, using the ''Row Extract'' mode.
|valign=top|
 
[[File:2023_TabularLayout_021_Primary-vs-Secondary-Extract_Row-Extract_02.png]]
[[File:2023_TabularLayout_021_Primary-vs-Secondary-Extract_Row-Extract_02.png]]
|-
 
|valign=top|
 
<br>
# Select the '''Data Column''' you wish to configure.
# Select the '''Data Column''' you wish to configure.
#* The "Unit" '''Data Column''', in our case.
#* The "Unit" '''Data Column''', in our case.
Line 1,890: Line 1,670:
:&bull; You should set the '''''Secondary Extract''''' property to ''Always'' in this case.
:&bull; You should set the '''''Secondary Extract''''' property to ''Always'' in this case.
|}
|}
|valign=top|
 
[[File:2023_TabularLayout_021_Primary-vs-Secondary-Extract_Row-Extract_03.png]]
[[File:2023_TabularLayout_021_Primary-vs-Secondary-Extract_Row-Extract_03.png]]
|-
 
|valign=top|
 
<br>
# With the ''Row Extract'' mode enabled, we collect unit values for the "Unit" column.
# With the ''Row Extract'' mode enabled, we collect unit values for the "Unit" column.
# The "Unit" '''Data Column's''' extractor now executes against each full row, when Secondary Extraction is performed.
# The "Unit" '''Data Column's''' extractor now executes against each full row, when Secondary Extraction is performed.
# Click the '''Inspect''' button before moving on.
# Click the '''Inspect''' button before moving on.
|valign=top|
 
[[File:2023_TabularLayout_021_Primary-vs-Secondary-Extract_Row-Extract_04.png]]
[[File:2023_TabularLayout_021_Primary-vs-Secondary-Extract_Row-Extract_04.png]]
|-
 
|valign=top|
 
<br>
{|class="fyi-box"
{|class="fyi-box"
|-
|-
Line 1,910: Line 1,688:
The Instance Viewer is a tool to better understand the instances created and used in table extraction.
The Instance Viewer is a tool to better understand the instances created and used in table extraction.


The Instance Viewer can be extremely beneficial when configuring Secondary Extraction. Whether you're fine-tuning ''Cell Extract'', trying to get a closer look at the ''Geometric'' text data, or trying to set up a ''Row Extract'' extractor, the Instance Viewer will be your best friend.
The Instance Viewer can be extremely beneficial when configuring Secondary Extraction. Whether you're fine-tuning ''Cell Extract'', trying to get a closer look at the ''Geometric'' text data, or trying to set up a ''Row Extract'' extractor, the Instance Viewer will be your best friend.
|}
|}


Line 1,917: Line 1,695:
# The next level in the hierarchy will be cell instances.
# The next level in the hierarchy will be cell instances.
# The "Image View" tab will highlight the selected instance's physical location on the document.
# The "Image View" tab will highlight the selected instance's physical location on the document.
|valign=top|
 
[[File:2023_TabularLayout_021_Primary-vs-Secondary-Extract_Row-Extract_05.png]]
[[File:2023_TabularLayout_021_Primary-vs-Secondary-Extract_Row-Extract_05.png]]
|}


=== Footer Detection ===
=== Footer Detection ===
A "footer" is a text label that indicates where the table ''stops''. Some tables will have footers, and some won't. When the table does have a footer, Grooper can use this information to force-stop row detection.


A "footer" is a text label that indicates where the table ''stops''.  Some tables will have footers, and some won't.  When the table does have a footer, Grooper can use this information to force-stop row detection.
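
As a rough illustration of "force-stopping" row detection, the Python sketch below discards any candidate row that falls after the first line containing the footer text. It is a simplification under stated assumptions: the function <code>rows_before_footer</code>, the line-based page model, and the sample lines are hypothetical and are not how Grooper represents a document internally.

```python
def rows_before_footer(lines, candidate_rows, footer_text):
    """Keep only candidate row indexes found above the first footer line."""
    footer_index = next(
        (i for i, line in enumerate(lines) if footer_text in line), None
    )
    if footer_index is None:
        return list(candidate_rows)  # no footer found: nothing is cut off
    return [i for i in candidate_rows if i < footer_index]

# Assumed sample page: two real line items, a footer phrase, then a summary
# line that happens to look like a row to the column extractors.
page_lines = [
    "QTY  DESCRIPTION  PRICE  TOTAL",
    "2    Hammer       5.00   10.00",
    "1    Wrench       7.50   7.50",
    "THANK YOU FOR YOUR ORDER",
    "3    items        --     17.50",
]
candidate_rows = [1, 2, 4]  # line indexes that matched as rows

print(rows_before_footer(page_lines, candidate_rows, "THANK YOU FOR YOUR ORDER"))
```

In this sketch the false positive at index 4 is dropped because it sits below the footer, while the two rows above the footer are kept.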


{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
<br>
For example, as our '''Data Table''' using '''''Tabular Layout''''' is configured currently, we've collected one row we shouldn't have for the "Sonrasc" '''Document Type's''' line items table.
For example, as our '''Data Table''' using '''''Tabular Layout''''' is configured currently, we've collected one row we shouldn't have for the "Sonrasc" '''Document Type's''' line items table.
# There are only actually two rows on this document.
# There are only actually two rows on this document.
Line 1,935: Line 1,709:
#<li value=3> We have three matching results for all three of those columns on this line.</li>
#<li value=3> We have three matching results for all three of those columns on this line.</li>
#* As far as '''''Tabular Layout''''' is concerned, this counts as a row.
#* As far as '''''Tabular Layout''''' is concerned, this counts as a row.
|valign=top|
 
[[File:2023_TabularLayout_022_Footer-Detection_01.png]]
[[File:2023_TabularLayout_022_Footer-Detection_01.png]]
|-
 
|valign=top|
 
<br>
We will fix this issue using a footer. By defining a footer for this table, we can dictate where the table ''should'' end based on static text labels on the document.
We will fix this issue using a footer. By defining a footer for this table, we can dictate where the table ''should'' end based on static text labels on the document.


#<li value=4> For example, this phrase "THANK YOU FOR YOUR ORDER" is always found at the end of the line items table from this vendor.</li>
#<li value=4> For example, this phrase "THANK YOU FOR YOUR ORDER" is always found at the end of the line items table from this vendor.</li>
Line 1,948: Line 1,721:
* Using an extractor, by configuring the '''''Tabular Layout > Footer Detection''''' property.
* Using an extractor, by configuring the '''''Tabular Layout > Footer Detection''''' property.
* Using Label Sets, by collecting the '''Data Table's''' '''''Footer''''' label for one or more '''Document Types'''.
* Using Label Sets, by collecting the '''Data Table's''' '''''Footer''''' label for one or more '''Document Types'''.
|valign=top|
 
[[File:2023_TabularLayout_022_Footer-Detection_02.png]]
[[File:2023_TabularLayout_022_Footer-Detection_02.png]]
|}
==== <span style="color:#662d91; font-size:115%">Collecting a Footer Using an Extractor</span> ====


{|cellpadding=10 cellspacing=5
==== Collecting a Footer Using an Extractor ====
|valign=top style="width:40%"|
<br>
To establish a footer, using an extractor, you will configure the '''''Footer Detection''''' property.
To establish a footer, using an extractor, you will configure the '''''Footer Detection''''' property.
# Select the '''Data Table''' you wish to configure.
# Select the '''Data Table''' you wish to configure.
# Expand the '''''Tabular Layout''''' sub-properties.
# Expand the '''''Tabular Layout''''' sub-properties.
# Select the '''''Footer Detection''''' property.
# Select the '''''Footer Detection''''' property.
# Using the dropdown list, select the '''''Extractor Type''''' you wish to use.
# Using the dropdown list, select the extractor (Extractor Node or Value Extractor) you wish to use.
#* We're going with a '''''List Match''''' extractor for this tutorial.
#* We're going with a '''''List Match''''' extractor for this tutorial.
|valign=top|
 
[[File:2023_TabularLayout_023_Footer-Detection_Collecting-a-Footer-Using-an-Extractor_01.png]]
[[File:2023_TabularLayout_023_Footer-Detection_Collecting-a-Footer-Using-an-Extractor_01.png]]
|-
 
|valign=top|
 
<br>
Configure the extractor to match something at the foot of the table.
Configure the extractor to match something at the foot of the table.


# Our list entry will be <code>THANK YOU FOR YOUR ORDER</code>
# Our list entry will be <code>THANK YOU FOR YOUR ORDER</code>
# This will match something on the document that's found at the end of this table (at least for this vendor).
# This will match something on the document that's found at the end of this table (at least for this vendor).
|valign=top|
 
[[File:2023_TabularLayout_023_Footer-Detection_Collecting-a-Footer-Using-an-Extractor_02.png]]
[[File:2023_TabularLayout_023_Footer-Detection_Collecting-a-Footer-Using-an-Extractor_02.png]]
|-
 
|valign=top|
 
<br>
# With the '''''Footer Detection''''' extractor configured, we will throw out the false positive row detected after our footer result.
# With the '''''Footer Detection''''' extractor configured, we will throw out the false positive row detected after our footer result.
# This row is after our footer. So, it is no longer detected.
# This row is after our footer. So, it is no longer detected.
# The two valid rows are collected accurately.
# The two valid rows are collected accurately.


Line 1,987: Line 1,752:
'''FYI'''
'''FYI'''
|
|
Keep in mind the '''''Footer Detection''''' property is a global property. It will be applied to all '''Document Types''' (unless overridden using '''Data Element Overrides''').
Keep in mind the '''''Footer Detection''''' property is a global property. It will be applied to all '''Document Types''' (unless overridden using '''Data Element Overrides''').
|}
|}
|valign=top|
 
[[File:2023_TabularLayout_023_Footer-Detection_Collecting-a-Footer-Using-an-Extractor_03.png]]
[[File:2023_TabularLayout_023_Footer-Detection_Collecting-a-Footer-Using-an-Extractor_03.png]]
|}


==== <span style="color:#662d91; font-size:115%">Collecting a Footer Using Label Sets</span> ====
==== Collecting a Footer Using Label Sets ====
 
{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
<br>
Table footers can be established using Label Sets by collecting a '''''Footer''''' label for the '''Data Table'''.
Table footers can be established using Label Sets by collecting a '''''Footer''''' label for the '''Data Table'''.


Line 2,008: Line 1,768:
#* In this case we've lassoed the text <code>THANK YOU FOR YOUR ORDER</code>.
#* In this case we've lassoed the text <code>THANK YOU FOR YOUR ORDER</code>.
# Don't forget to save when finished.
# Don't forget to save when finished.
|valign=top|
 
[[File:2023_TabularLayout_024_Footer-Detection_Collecting-a-Footer-Using-Label-Sets_01.png]]
[[File:2023_TabularLayout_024_Footer-Detection_Collecting-a-Footer-Using-Label-Sets_01.png]]
|-
 
|valign=top|
 
<br>
# With the '''Data Table's''' '''''Footer''''' label collected for this '''Document Type''', we will throw out the false positive row detected after our footer result.
# With the '''Data Table's''' '''''Footer''''' label collected for this '''Document Type''', we will throw out the false positive row detected after our footer result.
# This row is after our footer. So, it is no longer detected.
# This row is after our footer. So, it is no longer detected.
Line 2,024: Line 1,783:
'''FYI'''
'''FYI'''
|
|
The Label Set approach is, in general, a more "templated" approach. You will need to collect a '''''Footer''''' label for each '''Document Type''' that needs one.
The Label Set approach is, in general, a more "templated" approach. You will need to collect a '''''Footer''''' label for each '''Document Type''' that needs one.
|}
|}


|valign=top|
[[File:2023_TabularLayout_024_Footer-Detection_Collecting-a-Footer-Using-Label-Sets_02.png]]
[[File:2023_TabularLayout_024_Footer-Detection_Collecting-a-Footer-Using-Label-Sets_02.png]]
|}


==== <span style="color:#662d91; font-size:115%">Capture Footer Row VS Display Total Row</span> ====
==== Capture Footer Row VS Display Total Row ====
''FYI: The '''Capture Footer Row''' property was introduced in version 2021.0046. Earlier minor versions do not have this property.''
 
The '''''Capture Footer Row''''' property creates a row instance at the bottom of the table, using the footer to establish the row.
* This row is ONLY for the benefit of a document reviewer. This data IS NOT actually collected as part of the table's data.


''FYI: The '''Capture Footer Row''' property was introduced in version 2021.0046.  Earlier minor versions do not have this property.''


The '''''Capture Footer Row''''' property creates a row instance at the bottom of the table, using the footer to establish the row.
* This row is ONLY for the benefit of a document reviewer.  This data IS NOT actually collected as part of the table's data.
{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
<br>
# First Grooper will locate the footer.
# First Grooper will locate the footer.
#* In this case we used a '''''Footer''''' label <code>SUBTOTAL</code>
#* In this case we used a '''''Footer''''' label <code>SUBTOTAL</code>
# Then, Grooper will create a row instance using the footer, ''instead of'' '''''Tabular Layout's''''' normal row detection methods.
# Then, Grooper will create a row instance using the footer, ''instead of'' '''''Tabular Layout's''''' normal row detection methods.
# This is now a row instance. If there is anything that can be extracted by a '''Data Column's''' extractor, it will be.
# This is now a row instance. If there is anything that can be extracted by a '''Data Column's''' extractor, it will be.
#* In our case, the "Line Total" '''Data Column's''' extractor returned the numerical value in this row.
#* In our case, the "Line Total" '''Data Column's''' extractor returned the numerical value in this row.
# Extracted values are then displayed in the "footer row" at the bottom of the table.
# Extracted values are then displayed in the "footer row" at the bottom of the table.


Values in these footer rows may be useful for your data reviewers. Often there are column totals that can be extracted from a footer row and used to validate information in the table rows above it.


Values in these footer rows may be useful for your data reviewers.  Often there are column totals that can be extracted from a footer row and used to validate information in the table rows above it.
|valign=top|
[[File:2023_TabularLayout_025_Footer-Detection_Capture-Footer-Row-vs-Generate-Footer-Row_01.png]]
[[File:2023_TabularLayout_025_Footer-Detection_Capture-Footer-Row-vs-Generate-Footer-Row_01.png]]
|-
 
|valign=top|
 
<br>
{|class="attn-box"
{|class="attn-box"
|-
|-
Line 2,063: Line 1,816:
# The '''''Capture Footer Row''''' is set to ''True'' by default.
# The '''''Capture Footer Row''''' is set to ''True'' by default.
# You will need to set this to ''False'' if you do ''not'' want to display the footer row when reviewing '''''Tabular Layout's''''' extraction results.
# You will need to set this to ''False'' if you do ''not'' want to display the footer row when reviewing '''''Tabular Layout's''''' extraction results.
|valign=top|
 
[[File:2023_TabularLayout_025_Footer-Detection_Capture-Footer-Row-vs-Generate-Footer-Row_02.png]]
[[File:2023_TabularLayout_025_Footer-Detection_Capture-Footer-Row-vs-Generate-Footer-Row_02.png]]
|}
 


Please be aware that '''''Capture Footer Row''''' is similar in some ways to the '''''Display Total Row''''' feature, but differs from it in one major way.
Please be aware that '''''Capture Footer Row''''' is similar in some ways to the '''''Display Total Row''''' feature, but differs from it in one major way.
Line 2,073: Line 1,826:
** The data is generated ''after'' extraction.
** The data is generated ''after'' extraction.


{|cellpadding=10 cellspacing=5
 
|valign=top style="width:40%"|
<br>
The '''''Display Total Rows''''' feature adds a row to the bottom of the table using ''solely'' a mathematical operation.
The '''''Display Total Rows''''' feature adds a row to the bottom of the table using ''solely'' a mathematical operation.


Line 2,083: Line 1,834:




The '''''Display Total Rows''''' feature is also useful for document reviewers. It just gets its results differently than the '''''Capture Footer Row''''' feature.
The '''''Display Total Rows''''' feature is also useful for document reviewers. It just gets its results differently than the '''''Capture Footer Row''''' feature.
|valign=top|
 
[[File:2023_TabularLayout_025_Footer-Detection_Capture-Footer-Row-vs-Generate-Footer-Row_03.png]]
[[File:2023_TabularLayout_025_Footer-Detection_Capture-Footer-Row-vs-Generate-Footer-Row_03.png]]
|-
 
|valign=top|
 
{|class="attn-box"
{|class="attn-box"
|-
|-
Line 2,106: Line 1,857:
* If both properties are enabled (set to ''True''), '''''Capture Footer Row''''' takes priority and a "footer row" will be displayed.
* If both properties are enabled (set to ''True''), '''''Capture Footer Row''''' takes priority and a "footer row" will be displayed.
* If you want to display a "total row", you ''must'' set the '''''Capture Footer Row''''' property to ''False''.
* If you want to display a "total row", you ''must'' set the '''''Capture Footer Row''''' property to ''False''.
|valign=top|
 
[[File:2023_TabularLayout_025_Footer-Detection_Capture-Footer-Row-vs-Generate-Footer-Row_04.png]]
[[File:2023_TabularLayout_025_Footer-Detection_Capture-Footer-Row-vs-Generate-Footer-Row_04.png]]
|-
 
|valign=top|
 
<br>
# When using '''''Generate Footer Row''''' be sure to select a '''Data Column'''...
# When using '''''Generate Footer Row''''' be sure to select a '''Data Column'''...
# ... and set the '''''Footer Mode''''' property to ''Calculate''.
# ... and set the '''''Footer Mode''''' property to ''Calculate''.
|valign=top|
 
[[File:2023_TabularLayout_025_Footer-Detection_Capture-Footer-Row-vs-Generate-Footer-Row_05.png]]
[[File:2023_TabularLayout_025_Footer-Detection_Capture-Footer-Row-vs-Generate-Footer-Row_05.png]]
|}
[[Category:Articles]]
[[Category:Version 2023]]

Latest revision as of 10:58, 2 September 2025

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


The Tabular Layout Table Extract Method uses column header values determined by the Data Columns' Header Extractor results (or labels collected for the Data Columns when a Labeling Behavior is enabled) as well as Data Column Value Extractor results to model a table's structure and return its values.

The Tabular Layout is "Label Set aware". You can configure Tabular Layout with or without labels. This article will detail both methods. For more information on Label Sets, please visit the full Label Sets article.

You may download and import the file(s) below into your own Grooper environment (version 2023). There are two Batches with the example document(s) discussed in this tutorial, as well as two Projects configured according to its instructions.
Please upload the Projects to your Grooper environment before uploading the Batches. This will allow the documents within the Batches to maintain their classification status.

About

Many tables label the columns so the reader knows what the data in that column corresponds to. How do you know the unit price for an item on an invoice? Typically, that item is in a table and one of the columns of that table is labeled "Unit Price" or something similar. Once you read the labels for each column (also called "column headers"), you the reader know where the table begins (below the column headers) and can identify the data in each row (by understanding what the column headers refer to).

This is also the basic idea behind the Tabular Layout Extraction Method. It too utilizes column header labels to "read" tables on documents, or at least as step number one in modeling the table's structure. Once Grooper knows where a column is, identified by the column's header label, Grooper can extract data from each cell in each row of that column.

The Tabular Layout method can establish column header locations in one of two ways:

  1. Using extractors
    • Which are defined on the Data Columns' Header Extractor property (or alternatively on the Data Table's Header Row Extractor property)
  2. Using Label Sets
    • When a Labeling Behavior is enabled, column header locations are defined by labels collected for the Data Columns (and optionally for the Data Table)
    • Effectively, the labels take the place of the Header Extractor results (or alternatively the Header Row Extractor results)

Once the column header locations are established, the next thing Grooper needs to do is figure out where each row is. Tabular data is most often dynamic data. A table on one document might have two rows. The same table on the next might have twenty. How does Grooper know where each row is?

This is done by configuring at least one Data Column's Value Extractor property. (However, more than one, even all, may be configured. Depending on how complicated the table is, you may need to configure extractors for multiple columns.)

Generally, there is at least one column in a table that is always present for every row in the table. If you can use an extractor to locate that data below its corresponding column header, that gives you a way of finding each row in the table. This allows Grooper to form a "row instance" for each row. Once the row instance is established, Grooper can then collect the various cell values for the various additional columns from the row instance.
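
The idea of finding rows from a value that is always present can be sketched in Python. This is a deliberately simplified model: it works on plain text lines rather than on positions on the page, and the function <code>detect_rows</code>, the header label, the sample lines, and the regular expression are all assumptions made for illustration rather than Grooper's actual logic.

```python
import re

def detect_rows(lines, header_label, value_pattern):
    """Return the lines below the header that contain a value match."""
    header_index = next(
        (i for i, line in enumerate(lines) if header_label in line), None
    )
    if header_index is None:
        return []  # no header found, so there is no table to model
    return [
        line for line in lines[header_index + 1:]
        if re.search(value_pattern, line)
    ]

# Assumed sample text: a header line, two line items, and an unrelated note.
document_lines = [
    "Item  Description  QTY  Price",
    "A1  Bolt  10  0.25",
    "Ship via ground",
    "B2  Nut  20  0.10",
]
# Assumed pattern: a whole-number quantity followed by a price at the end of the line.
quantity_then_price = r"\s\d+\s+\d+\.\d{2}$"

for row in detect_rows(document_lines, "QTY", quantity_then_price):
    print(row)
```

Each line returned here plays the role of a "row instance". The other columns' values would then be collected from within that instance.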

If locating column headers and locating rows using column extractors was all that was involved in Tabular Layout, that alone would make it a powerful tabular extraction method. What makes the Tabular Layout method even more powerful is its further configurability. Is every row in the table a single line or are the rows "multiline"? Do you need more fine-tuned data extraction from a cell's value or the row itself once the row instance is detected? Do you need to establish a table "footer" to limit the number of rows extracted? We will address these issues and more in the #Advanced Setup Considerations section of this article.

FYI

If you're familiar with the Header-Value table extraction method, you should see some similarities between it and the Tabular Layout method. Indeed, both methods utilize column headers and Data Column Value Extractors to collect table data.

Tabular Layout should be seen as an improvement on Header-Value for the following reasons:

  1. Tabular Layout is Label Set aware.
  2. Tabular Layout is typically less involved to set up.
  3. Tabular Layout has more configuration options, giving it a better capability to extract data from a large set of disparate table structures (Usually executed through Data Element Overrides).

Basic Setup

Tabular Layout can be configured with or without the use of Label Sets. In either case, the basic setup is the same:

  1. Establish column headers for each Data Column.
  2. Detect row instances by assigning at least one Data Column's Value Extractor.
  3. Set the Data Table's Extract Method property to Tabular Layout.
  4. Test extraction and configure further as necessary.

With Label Sets or without, the setup is extremely similar. On top of that, there's nothing about using Label Sets that alters Tabular Layout's extraction logic. Grooper uses the same logic to model the table's structure and collect data for each cell. The biggest difference is how column headers are determined in step #1.

  • Without Label Sets, column headers are established using extractors, defined using the Data Columns' Header Extractor property (or alternatively using the Data Table's Header Row Extractor property)
  • With Label Sets, column headers are established using labels, defined when collecting labels for each Document Type. The Data Columns' labels effectively take the place of the Header Extractor property's results.

Tabular Layout Without Label Sets

Overview

This tutorial will cover the basic configuration of the Tabular Layout method without Label Sets, using extractors to collect column headers instead. We will use invoices for our document set and collect the following data from their tables detailing line item information:

  • Item Number - The vendor's id number for the item ordered for each row.
  • Description - The description of each item ordered for each row.
  • Quantity - The number of the item ordered for each row.
  • Unit Price - The vendor's price for the item ordered for each row.
  • Line Total - The total price for the number of items ordered (In other words, the quantity ordered multiplied by the unit price)


The basic steps will be as follows:

  1. Establish column headers by configuring the Header Extractor property of each Data Column in the Data Table.
    • You must configure header extractors for each Data Column whose data you want to collect.
    • Alternatively, you may configure a Header Row Extractor set on the Data Table (This property is found in the Tabular Layout sub-properties).
  2. Assign a Value Extractor for at least one Data Column.
    • For example, we may expect to find a quantity for each item shipped on an invoice, regardless of the vendor. There's always a column with a "Quantity" or "QTY" or "Shipped" or some similar header.
    • Since this data is also present on every row, this will provide the information necessary to find each row in the table.
    • While you need at least one Data Column's Value Extractor configured to detect rows, multiple columns may be used to detect rows.
      • Furthermore, a Data Column's Value Extractor will either perform "Primary Extraction" to perform row detection or "Secondary Extraction" to extract data from already detected rows. We will discuss using multiple columns to detect rows and the differences between "Primary" and "Secondary Extraction" in the #Advanced Setup Considerations section of this article.


  3. Set the Data Table object's Extract Method property to Tabular Layout.
    • And configure any Tabular Layout properties as needed. We will discuss many of these properties, and why and how to use them, in the #Advanced Setup Considerations section of this article.
  4. Test to ensure the table's data is collected.


In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for every Data Column for each row on the document.

However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the #Advanced Setup Considerations section below.

1. Configure Header Extractors

As far as strict requirements for the Tabular Layout method go, you must at minimum establish column headers for each Data Column you wish to extract.

We'll start with the "Quantity" Data Column.

  • FYI: If the invoice lists both a "quantity ordered" and a "quantity shipped" column, we will be collecting the quantity shipped.


  1. Select the Data Column.
  2. Select the Header Extractor property.
    • Here you will set an extractor to locate the column header on the document for the selected Data Column.
  3. Using the dropdown selector, select the extractor (Extractor Node or Value Extractor) you wish to configure to return the column header.
    • You can use whatever extractor you want to get the job done. You may select Reference to reference a Data Type or Value Reader extractor node you've configured already. Or, you can select one of Grooper's Value Extractors to configure extraction locally.
    • We're going to select List Match.


The List Match extractor is well suited for our purposes here. Ultimately, we will enter a list of various ways a "Quantity" column can be labeled.

  1. For example, this document labels quantities of each item ordered as "HRS / QTY"
  2. So, we've added HRS / QTY to the Local Entries list.
  3. Other documents use the label "Quantity" or "Shipped". So, we've added Quantity and Shipped to the list as well.

You would then continue adding variations to the list until all variations of the "Quantity" column's header labels are extracted for every variation of the table.

  • Or more generally, until a result for the column header is extracted using whatever extractor you've chosen to configure.
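
The idea of a list of header variations can be sketched outside Grooper. The Python below is only an illustration under our own assumptions, not Grooper's List Match implementation: find_header is a hypothetical helper that searches a line of text for each list entry, ignoring case, and returns the longest entry it finds.

```python
# Illustrative only: a plain-Python stand-in for a list of header variations.
# find_header is a hypothetical helper, not a Grooper API.
QUANTITY_LABELS = ["HRS / QTY", "Quantity", "Shipped"]

def find_header(line, labels):
    """Return (label, start_index) for the longest label found in line, or None."""
    lowered = line.lower()
    hits = []
    for label in labels:
        index = lowered.find(label.lower())
        if index != -1:
            hits.append((label, index))
    if not hits:
        return None
    # Prefer the longest label so a short entry does not shadow a longer one.
    return max(hits, key=lambda hit: len(hit[0]))

header_line = "DESCRIPTION  ITEM NO  HRS / QTY  PER  RATE / PRICE  SUBTOTAL"
print(find_header(header_line, QUANTITY_LABELS))
```

Adding another variation is then just another entry in the list, which mirrors how the Local Entries list grows as new table formats turn up.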

Pro Tip: Stacked Labels

You will often find "stacked labels" in tables. These are multi-word labels broken up across multiple lines in the table's header.

  1. For example, this document's "Quantity" column uses "Qty Shp." for its label.
    • This is a stacked label, with "Qty" on one line and "Shp." on another.
  2. We can add "Qty Shp." to our list of header labels.
  3. However, we will not get a result returned for the document.


We can easily resolve this by enabling the Vertical Wrap feature.

  • This feature is only available to the List Match extractor. This is one of the reasons why List Match is so useful for extracting column headers.

To enable Vertical Wrap:

  1. Switch to the "Properties" tab.
  2. Change the Vertical Wrap property to Enabled.
  3. With Vertical Wrap enabled, the extractor is able to match and return items in the list that wrap vertically on multiple lines.
    • In our case, our stacked label "Qty Shp." is now returned.
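
To see why a stacked label fails to match until its lines are treated as one label, here is a minimal Python sketch. It assumes the lines printed in a single header cell are already known, which sidesteps the layout analysis Grooper actually performs. match_label and its vertical_wrap flag are hypothetical names used only for this illustration.

```python
# Illustrative only: how a stacked header can match once its lines are joined.
# match_label is a hypothetical helper, not Grooper's Vertical Wrap logic.
def match_label(cell_lines, labels, vertical_wrap=False):
    """cell_lines: the text lines printed in one header cell, top to bottom."""
    candidates = list(cell_lines)
    if vertical_wrap:
        # Treat the stacked lines as one label separated by single spaces.
        candidates.append(" ".join(cell_lines))
    for label in labels:
        if label in candidates:
            return label
    return None

stacked = ["Qty", "Shp."]
labels = ["Quantity", "Shipped", "Qty Shp."]
print(match_label(stacked, labels))                      # None: no single line matches
print(match_label(stacked, labels, vertical_wrap=True))  # Qty Shp.
```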


Repeat Until All Data Columns Are Configured

You will repeat the same process for each Data Column you want to collect.

  1. We want to collect data from all these columns.
  2. So, we've configured each Data Column's Header Extractor property.


Once the Header Extractor for each Data Column is configured, Grooper will "know" where our tables "start". However, all the actual data in the table is defined by its rows. How does Grooper know where each row is? We will discuss that in the next step.

For our document set, we used the following lists of header column labels:

"Item Number"

ITEM NO
ITEM #
Item Number
Part Number/Description
PART NUMBER

"Description"

ITEM DESCRIPTION
DESCRIPTION
Part Number/Description

"Quantity"

HRS / QTY
Quantity
Shipped
Qty Shp.
Qty

"Unit Price"

RATE / PRICE
UNIT PRICE
Unit Rate

"Line Total"

SUBTOTAL
TOTAL
Extended Price
Ext. Price
Ext Price
NET AMOUNT
Value
Line Total

FYI

You may have noticed Part Number/Description is present in both the "Item Number" and "Description" columns' header lists.

This can happen. Depending on a table's format, what would normally be divided up between two columns on other documents may be jammed into one. Tabular Layout has methods to account for this, using what's called "Secondary Extraction".

• For more information on Secondary Extraction, please visit the #Primary VS Secondary Extraction portion of this article.

2. Assign a Data Column's Value Extractor

This step is all about row detection.

So far all we've done is established header column positions on each document. But, that's not where the data is. The table's data is in the rows.

As it stands, Grooper doesn't know anything about the rows in the tables. It doesn't know the size of each row. It doesn't know what kind of data is supposed to be in the rows. Maybe most importantly, it doesn't know how many rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this.

To detect rows, we need at least one Data Column's Value Extractor property configured. For each result the extractor produces below the column's header, Grooper will create one row instance.

The key thing to keep in mind is that this data must be present on every row. You'll want to pick a column whose data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row.

In our case, we will choose the "Quantity" Data Column. We always expect (for the time being anyway) there to be a quantity listed for the line item on the invoice.

  1. We will use this Value Reader for our demonstration.
    • However, in the real world, the extraction world is your oyster. You'll configure an extractor to best target the data in whatever table column you're trying to extract.
  2. This is a fairly simple Pattern Match extractor designed to return numeric data (including currency).
  3. The regex is a fairly simple pattern to match generic quantities.
    • It'll match decimal values from 0 and above with two decimal places optional.
  4. We've also edited our Prefix and Suffix Patterns so that the pattern must be surrounded by a space character before and after, with an optional dollar sign before the number.
  5. As you can see, we get five results below the "Quantity" label.
    • When we assign this Value Reader to the "Quantity" Data Column, we should then get five rows when this table extracts.
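
The exact pattern is not reproduced in this article, so the Python below is a sketch under our own assumptions: a whole number with an optional two-digit decimal part, an optional dollar sign in front, and whitespace (or the start or end of the line) on either side. NUMERIC and find_numeric_values are names made up for this example, and Python's re module stands in for Grooper's pattern editor.

```python
import re

# Assumed pattern, written for Python's re module: a whole number with an
# optional two-digit decimal part, preceded by whitespace (or line start) and
# an optional dollar sign, and followed by whitespace (or line end).
NUMERIC = re.compile(r"(?:^|\s)\$?(\d+(?:\.\d{2})?)(?=\s|$)")

def find_numeric_values(line):
    """Return every generic numeric value found on one line of text."""
    return [match.group(1) for match in NUMERIC.finditer(line)]

print(find_numeric_values("Widget-7 assembly 5 $10.00 $50.00"))
```

The "7" in "Widget-7" is skipped because it is not preceded by whitespace, which is the job the Prefix Pattern does in the Value Reader.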


We do get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data.

  1. Will this result present a problem? Will we get an extra row for its result?
    • No. That result is above the header label HRS / QTY established by the Data Column's Header Extractor.
    • The Tabular Layout method presumes rows are below column labels. Any and all results above the first instance of the column's headers will be ignored.
  2. What about these matching results on the same line? Will the extra results create additional row instances?
    • No. These results are misaligned with the "Quantity" Data Column's header. They are too far to the right to be considered under the column header. They will be ignored.
    • Only results aligned with the "Quantity" Data Column's header will create a row instance.
  3. What about these results? Will they produce a row?
    • No. These results are also misaligned with the "Quantity" Data Column's header.
    • That said, if these were aligned with the "Quantity" Data Column's header, they would produce row instances.
    • When you are building your own Data Column extractors, pay close attention to results below the column's header. They have the most potential to produce false positive results, producing erroneous rows.
      • That said, there are a multitude of ways to avoid false positive row results when using Data Columns' Value Extractors to detect rows. We will discuss this more in the #Advanced Setup Considerations portion of this article.
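
The three questions above boil down to two geometric checks: is the result below the header, and is it aligned with the header? The sketch below shows those checks in Python. It is a simplification under our own assumptions: Box and detect_rows are hypothetical, coordinates are made up, and the header's own left and right edges stand in for the column bounds Grooper computes.

```python
from dataclasses import dataclass

@dataclass
class Box:
    text: str
    left: float
    right: float
    top: float     # y grows downward, as in image coordinates
    bottom: float

def detect_rows(header, hits):
    """Keep hits that sit below the header and overlap it horizontally."""
    rows = []
    for hit in hits:
        below = hit.top >= header.bottom
        overlaps = hit.left < header.right and hit.right > header.left
        if below and overlaps:
            rows.append(hit)
    return sorted(rows, key=lambda hit: hit.top)

header = Box("HRS / QTY", 300, 360, 200, 212)
hits = [
    Box("2023", 310, 340, 100, 112),    # above the header: ignored
    Box("5", 320, 330, 230, 242),       # aligned and below: a row
    Box("10.00", 450, 490, 230, 242),   # same line, too far right: ignored
    Box("2", 322, 332, 260, 272),       # aligned and below: a row
]
print([hit.text for hit in detect_rows(header, hits)])
```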


With our extractor ready to go, all we need to do is assign it to the "Quantity" Data Column using its Value Extractor property.

  1. Select the Data Column you wish to configure.
    • In our case, we want to configure the "Quantity" Data Column.
  2. Configure the Value Extractor property.
    • In our case, we've referenced our Value Reader designed to return generic numeric values.


FYI

At bare minimum you must configure at least one Data Column's Value Extractor to perform row detection.

However, multiple columns may be used to perform row detection by configuring their corresponding Data Columns' Value Extractor properties. For more information on using multiple columns in row detection (as well as row detection in general), please visit the #Advanced Row Detection section of this article.


So far, we have:

  1. Established column headers for each Data Column (and optionally for the whole row of column headers on the Data Table).
  2. Configured the Value Extractor of at least one Data Column.

For fairly simple table structures, we now have the two things the Tabular Layout method needs to extract data. Now, all we need to do is tell the Data Table object we want to use the Tabular Layout method. We do this by setting its Extract Method property to Tabular Layout.

3. Set Extract Method to Tabular Layout

A Data Table's extraction method is set using the Extract Method property. To enable the Tabular Layout method, do the following.

  1. Select a Data Table object in your Data Model.
    • Here, we've selected the "Line Items" Data Table.
  2. Select the Extract Method property.
  3. Using the dropdown menu, select Tabular Layout


4. Test

Now, let's test out what we have and see what we get!

  1. For the selected document folder in the "Batch Viewer" window...
  2. Press the "Test Extraction" button.
  3. The results show up in the "Data Element Preview" window.


So, how was Grooper able to do this? For the Tabular Layout method, the Data Table is populated using primarily two pieces of information: column header locations established by the Data Columns' Header Extractors and row locations detected by a Data Column's Value Extractor.

  • Remember, we configured Header Extractors for all Data Columns. We configured only the "Quantity" Data Column's Value Extractor.

First, it's all about establishing column headers.

  1. The Data Columns' Header Extractors established the column locations for each column.
  2. Grooper then determines the width of these columns.
    • If table lines are present, Grooper can detect those line locations via a Line Detection (or Line Removal) IP Command. Grooper will "snap" the column's width to the detected line boundaries, expanding the cell's width (and height) to the boundaries around it.
      • Table lines give human readers an indicator of where the data "lives" (or is contained). If it's in the box, it belongs to the column. If it's out of the box, it belongs to a different column.
    • If table lines are not present (as is the case for this document), Grooper performs a variety of gutter-detection operations, analyzing the whitespace between columns to determine their widths.
      • Most commonly Grooper will average the distance between one header label and the next.
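
As a rough picture of the whitespace case, the sketch below places each column boundary halfway between the right edge of one header label and the left edge of the next. This is only our reading of "averaging the distance" between labels. Grooper's actual gutter detection is more involved, and column_bounds and the coordinates are hypothetical.

```python
def column_bounds(headers, page_left, page_right):
    """headers: list of (name, left, right) sorted left to right.
    Returns {name: (left_bound, right_bound)} using gap midpoints."""
    bounds = {}
    for i, (name, left, right) in enumerate(headers):
        if i == 0:
            lo = page_left
        else:
            lo = (headers[i - 1][2] + left) / 2   # midpoint of the gap on the left
        if i == len(headers) - 1:
            hi = page_right
        else:
            hi = (right + headers[i + 1][1]) / 2  # midpoint of the gap on the right
        bounds[name] = (lo, hi)
    return bounds

headers = [("Description", 50, 150), ("Quantity", 300, 360), ("Unit Price", 420, 500)]
print(column_bounds(headers, 0, 600))
```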


Second, it's all about detecting rows. Rows are detected using a Data Column's Value Extractor.

  • In our case, we configured the "Quantity" Data Column's Value Extractor.
  • FYI: When a Data Column's extractor is used to detect rows, it is considered "Primary Extraction". A Data Column's extractor can also be used for "Secondary Extraction", performed after rows are detected. For more on this, please visit the #Primary VS Secondary Extraction section of this article.
  1. Rows are only detected below the detecting Data Column's header.
  2. Grooper runs the detecting Data Column's Value Extractor, looking for matching results aligned below the column header.
  3. For each result returned, Grooper establishes one row instance.
    • Since our extractor was designed to return decimal values, and Grooper found five decimal values below our column header, Grooper detected five rows.


The Tabular Layout method now has the two pieces of information it needs to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like. Grooper can infer the table's grid-like structure using the column and row positions.

  1. It has column instances for each Data Column.
    • Again, established by each Data Column's Header Extractor.
  2. It has row instances for each detected row.
    • Again, established by the detecting Data Column's Value Extractor.
      • FYI: More than one Data Column can be used to detect rows. Please visit the #Advanced Row Detection section for more information.


With these column and row instances established, Grooper can form data instances for each cell of the table.

  1. Each cell's data simply lies where the columns and rows intersect.
    • For Data Columns with their Value Extractors configured, values are either collected using "Primary" or "Secondary Extraction". Please see the #Primary VS Secondary Extraction portion for more information.
    • For Data Columns without their Value Extractors configured, values are collected by returning the OCR or native text data within the geometric boundaries of the cell.
      • This is extremely beneficial for data that is difficult to extract using pattern matching.
      • For example, invoice item numbers and descriptions are notoriously difficult to pattern match. By using something in the table that is easy to pattern match, like our item quantities, we can use Tabular Layout to model the table structure and collect the other column values that are not.
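
Once column and row bounds are known, reading a cell without a Value Extractor amounts to gathering the words that fall inside its box. The Python below sketches that intersection with made-up coordinates. cell_text, the word tuples, and the bounds are all hypothetical and are not taken from Grooper.

```python
def cell_text(words, column, row):
    """words: list of (text, x_center, y_center).
    column: (left, right) bounds; row: (top, bottom) bounds.
    Returns the words whose centers fall inside the cell, in reading order."""
    left, right = column
    top, bottom = row
    inside = [w for w in words if left <= w[1] < right and top <= w[2] < bottom]
    inside.sort(key=lambda w: (w[2], w[1]))
    return " ".join(w[0] for w in inside)

words = [
    ("Hex", 60, 236), ("bolt", 95, 236), ("5", 325, 236), ("10.00", 460, 236),
    ("Washer", 70, 266), ("2", 327, 266), ("0.50", 462, 266),
]
columns = {"Description": (0, 225), "Quantity": (225, 390), "Unit Price": (390, 600)}
rows = [(225, 255), (255, 285)]
table = [{name: cell_text(words, col, row) for name, col in columns.items()} for row in rows]
print(table)
```

Only the "Quantity" values would need a pattern to find the rows. The "Description" text is picked up purely by position, which is the benefit described above.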

5. Alternative Configuration: Header Row Extractor

You may alternatively establish column headers for the entire row of header labels, using the Header Row Extractor property. Instead of configuring each Data Column's Header Extractor, you would configure an extractor to return the whole table's row of column headers and use named instances (either Named Groups or child extractors) to establish each Data Column's header.

There are two reasons using a Header Row Extractor can be beneficial:

  1. It can be a way to throw out false positive column matches.
  2. It can be a way to better take advantage of Fuzzy RegEx.

Configuring the Header Row Extractor will override all Data Columns' Header Extractors.

You should choose to either establish column headers using the Header Row Extractor or do so using each Data Column's Header Extractors.

You may find it beneficial to configure Data Column Header Extractors as the "default" configuration and use Data Element Overrides to choose which Document Types use the Header Row Extractor instead.

Craft the Extractor

To configure the Header Row Extractor, you will need to craft an extractor (or multiple extractors for multiple table formats). We will choose to do that first by creating a few Value Readers and Data Types.

  1. We've started creating a Value Reader to use as a Header Row Extractor for the "Fairdeal" Document Type in our Content Model.
  2. We're using a Pattern Match extractor.
    • We can easily match the header row for "Fairdeal" invoices using a simple regex pattern.
  3. Your first task will be to extract the entire row of column headers. The pattern we have here will do just that.
DESCRIPTION\t
ITEM NO\t
HRS / QTY\t
PER\t
RATE / PRICE\t
SUBTOTAL
  1. The pattern matches the whole row of column headers.


This is only step one. Next, we need some way of breaking up the result into each component column. How does Grooper know what part of the result is the label for the "Description" column or the "Quantity" column? It doesn't until you break up the result into named instances that match the names of your Data Columns in the Data Table. These named instances can either be:

  • Named Groups
  • Named Child Extractors

Assign Named Instances: Using Named Groups

When pattern matching a header row, you can do this with Named Groups.

  1. We've placed the portion of the regular expression matching the "Description" column's label in a Named Group.
    • (?<Description>DESCRIPTION)
    • The group is created just like any group by placing the regex in parentheses.
      • (regex goes here)
    • The group is named by inserting the ?<> tag immediately after the opening parenthesis.
      • (?<>regex goes here)
    • The name is given by typing it between the angle brackets.
      • (?<name goes here>regex goes here)
  2. This produces a named instance capturing only the regex in the group.
    • In this case, the label for the "Description" column.
  3. The key here is that the name we gave the group, "Description", matches the name of the Data Column, "Description".
    • Since the names match, Grooper will use the Named Group's instance to establish the column header for the "Description" Data Column.
    • Effectively, the Named Group supplies the result for the Data Column's Header Extractor.
      • BE AWARE! This also means the Named Group replaces the result of a Data Column's Header Extractor. If you configure a Header Row Extractor, it will supersede any Header Extractor on any Data Column.


  1. You would then continue placing Named Groups around the remaining column headers, chunking out the regex and matching each chunk with the corresponding Data Column.

The following regex would accomplish this goal in our case.

(?<Description>DESCRIPTION)\t
(?<Item_Number>ITEM NO)\t
(?<Quantity>HRS / QTY)\t
PER\t
(?<Unit_Price>RATE / PRICE)\t
(?<Line_Total>SUBTOTAL)

Please note space characters are not allowed in Named Groups. You must replace a space character with an underscore _.

For example, to match the "Item Number" Data Column, we named the group Item_Number
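
To see how named groups split one header-row match into per-column results, here is a small Python sketch. Two things differ from the pattern above and are our assumptions: Python's re module spells named groups (?P<name>...), and the last group is named Line_Total on the assumption that it should line up with the "Line Total" Data Column. column_headers is a hypothetical helper.

```python
import re

# Python spells named groups (?P<name>...); the structure is otherwise the same.
HEADER_ROW = re.compile(
    r"(?P<Description>DESCRIPTION)\t"
    r"(?P<Item_Number>ITEM NO)\t"
    r"(?P<Quantity>HRS / QTY)\t"
    r"PER\t"
    r"(?P<Unit_Price>RATE / PRICE)\t"
    r"(?P<Line_Total>SUBTOTAL)"
)

def column_headers(line):
    """Map Data Column names to the (start, end) span of their header label."""
    match = HEADER_ROW.search(line)
    if match is None:
        return {}
    # Underscores in group names stand in for spaces in Data Column names.
    return {name.replace("_", " "): match.span(name) for name in match.groupdict()}

line = "DESCRIPTION\tITEM NO\tHRS / QTY\tPER\tRATE / PRICE\tSUBTOTAL"
print(column_headers(line))
```

The unnamed PER portion is still required for the row to match, but it produces no named instance, so it establishes no column header.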

Assign Named Instances: Using Named Child Extractors

You may also create and use the named instances by naming a Data Type's child extractors to match the names of your Data Columns.

  1. For example, this Data Type uses the Ordered Array collation method to return the header row for our "Factura" Document Type.
  2. We still get one complete result for the header row on this invoice format.
  3. Instead of a single regex pattern, we're collating results from its child extractors.
  4. Each child extractor's name matches one of our Data Columns.
  5. Inspecting the header row's instance (by right-clicking the result in the Results list), we can see more clearly how this result's sub-instances will be supplied as each Data Column's header.


  1. In the Instance Viewer, we can select any of our sub-instances from our child extractors.
  2. This result is what will be used for the "Description" column's header.
  3. Since the name of the child extractor (and therefore also sub-instance) matches the "Description" Data Column, the result will be used in place of its Header Extractor.

Assign the Header Row Extractor

Now that we have a couple examples of header row extractors, we can assign them using Tabular Layout's Header Row Extractor property.

  1. To assign a header row extractor, select the Data Table.
  2. Expand the Tabular Layout sub-properties.
  3. Expand the Header Detection sub-properties.
  4. Using the Header Row Extractor property, configure your header row extractor.
  5. In our case, we set Header Row Extractor to Reference and pointed to one of the extractors detailed previously.


But be careful! If you choose this approach, assigning the Header Row Extractor will supplant any Header Extractor configuration for any of your Data Columns.

• If configured, the Header Row Extractor establishes column headers instead of the individual Data Columns' Header Extractors.

The extractor we referenced was very specifically designed with only one table format in mind. It works for invoices assigned the "Fairdeal" Document Type, but no others.

  1. If we were to test our Data Table on a different document with a different table structure, we would get no results.
  2. Because the extractor doesn't match this table format's row of column headers, it can't establish any column headers for this document.
  3. This is despite the fact these Data Columns have their Header Extractor properties configured to do so.


If you take this approach to establish column headers you will either need to:

  • Craft a single extractor that matches multiple row header formats.
  • Or, use Data Element Overrides to configure a unique Header Row Extractor for each Document Type.

Why Bother?

There are two main reasons why Header Row Extractors can be beneficial:

  1. To throw out false positive column header matches
  2. To better match column headers with poor OCR using Fuzzy RegEx.

To Throw Out False Positives

The first reason to use a Header Row Extractor is to help eliminate false positive column header matches.

  1. Take our "Line Total" Data Column.
  2. Its Header Extractor is configured with a List Match extractor, matching a variety of possible header labels for this column.


  1. This table format uses the label SUBTOTAL for the "Line Total" column.
  2. It certainly matches the column header correctly.
  3. But it also matches an instance on this document where the same term is used to refer to something different.
    • This is a false positive match.


A row of header labels tends to be more specific (and requires more specific extraction logic).

  1. If we refer back to our Header Row Extractor for this document, we'll see there is no potential false positive.
  2. The extractor matches the label SUBTOTAL as part of the larger row of headers.
  3. Given that the extractor is now looking for that label within the larger context of a header row, our false positive is no longer returned.


This is, to be sure, a more specific and therefore more accurate extractor. However, you shouldn't assume that more accuracy is always necessary. In this case, the false positive did not impact our table whatsoever. So while the Header Row Extractor is technically more accurate, our Data Table would have returned accurate data using Data Column headers alone (even with the false positive match).

  • While a Header Row Extractor can eliminate false positive column header matches, you only need to go through the trouble of configuring one if those false positive matches negatively impact your data extraction.
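
The effect of context on false positives is easy to reproduce with plain regular expressions. In the sketch below, the text is a hypothetical stand-in for a page's OCR output, and the second pattern only requires the neighboring RATE / PRICE label rather than the full header row, to keep the example short.

```python
import re

# Hypothetical OCR text: the header row plus a summary line that reuses "SUBTOTAL".
text = (
    "DESCRIPTION\tITEM NO\tHRS / QTY\tPER\tRATE / PRICE\tSUBTOTAL\n"
    "Hex bolt\tA-100\t5\tEA\t10.00\t50.00\n"
    "SUBTOTAL\t50.00\n"
)

# Matching the label on its own finds both occurrences.
label_only = [m.start() for m in re.finditer(r"SUBTOTAL", text)]

# Requiring the preceding header label keeps only the one in the header row.
in_header_row = [
    m.start("Line_Total")
    for m in re.finditer(r"RATE / PRICE\t(?P<Line_Total>SUBTOTAL)", text)
]
print(len(label_only), len(in_header_row))  # 2 1
```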

For Fuzzy RegEx

The other reason to use a Header Row Extractor has to do with imperfect OCR text data and Fuzzy RegEx. Fuzzy RegEx provides a way for regular expression patterns to match in Grooper when the text data doesn't strictly match the pattern. The regex pattern Grooper and the character string "Gro0per" differ by just a single character. An OCR engine misreading an "o" character as a zero is not uncommon by any means, but a standard regex pattern of Grooper will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.

Using Fuzzy RegEx instead of regular regex, Grooper will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) Grooper will return it as a match.

  • FYI: "Similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. Grooper is more or less "confident" the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing (One could argue there is a difference between these two terms when Fuzzy Match Weightings come into play, but that's a whole different topic. And you may encounter Grooper users who use the terms "similarity" and "confidence" interchangeably regardless. Visit the Fuzzy RegEx article if you would like to learn more).
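
For intuition about a similarity threshold, the sketch below scores two literal strings with plain Levenshtein edit distance. This is an assumption made for illustration: Grooper's Fuzzy RegEx works on regex patterns and may weight character swaps differently, so the numbers here are not Grooper's scores. edit_distance, similarity and fuzzy_match are hypothetical names.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(pattern, text):
    """1.0 for identical strings, lower as more edits are needed."""
    longest = max(len(pattern), len(text)) or 1
    return 1 - edit_distance(pattern, text) / longest

def fuzzy_match(pattern, text, minimum_similarity):
    return similarity(pattern, text) >= minimum_similarity

print(round(similarity("Grooper", "Gro0per"), 3))  # 0.857
print(fuzzy_match("Grooper", "Gro0per", 0.80))     # True
print(fuzzy_match("Grooper", "Gro0per", 0.90))     # False
```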


Let's go back to the List Match extractor for our "Line Total" Data Column's Header Extractor.

  1. This table format uses the label TOTAL for the "Line Total" column.
  2. However, it does not match the header on the document.
  3. Why not? This is due to imperfect OCR results.
    • The label TOTAL was misrecognized as TOFAL.


We can certainly get this label to match with Fuzzy RegEx, but only at a fairly low similarity.

  1. Here, we've enabled Fuzzy Matching and set the Minimum Similarity to 85%.
  2. We do get our header label returned.
  3. But it's at a confidence score of 86%.
    • This score may be too low. It's not causing a problem for this document, but it may pose issues for others.


The similarity score is so low because "TOTAL" is a relatively short word, only five characters long. Grooper's confidence in a match drops with each character swap it has to make to match the word, and in a short word a single swap accounts for a large share of the characters.

An entire row of headers, on the other hand, has many more characters in it. The cost of swapping a single character in the entire row of headers is much smaller, to the point of being negligible.

  1. This Value Reader is designed to match the whole header row for this invoice format.
  2. Its Fuzzy Matching property is enabled with its Minimum Similarity set to 90%.
  3. The whole header row matches at a much higher confidence score of 98%.
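
The same trend shows up with any length-normalized similarity measure. The sketch below uses Python's difflib as a stand-in, so the absolute numbers will not match the scores Grooper reports. Only the comparison between a short label and a long row is the point. The header row text is hypothetical, since the real row depends on the document.

```python
from difflib import SequenceMatcher

def ratio(expected, actual):
    """Similarity in [0, 1] from difflib; a stand-in, not Grooper's scoring."""
    return SequenceMatcher(None, expected, actual).ratio()

# One misread character in a five-character label.
short_score = ratio("TOTAL", "TOFAL")

# The same misread character inside a whole (hypothetical) header row.
expected_row = "ITEM NO DESCRIPTION QTY UNIT PRICE TOTAL"
ocr_row = "ITEM NO DESCRIPTION QTY UNIT PRICE TOFAL"
long_score = ratio(expected_row, ocr_row)

print(round(short_score, 2), round(long_score, 2))
```

With a stricter minimum similarity such as 90%, the short label would be rejected in this sketch while the full row would still pass.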

Disabling Data Columns for Specific Document Types

Occasionally, you will run into a situation where you want to collect a column that exists for some document formats but not for others. You will need to utilize Data Element Overrides to account for this.

For example, some of these invoices list a "unit of measure". The customer is invoiced for "1 each" of a product or "2 hours" of a service. "Each" or "hours" is the unit of measure. However, not all invoices have a column for this in their line items. You may want to collect the unit of measure if the column is present. So, you would add a "Unit" Data Column and configure its Header Extractor and, if necessary, Value Extractor properties.

But obviously, you can't collect it from documents where there is no "unit of measure" column. The "Factura" Document Type represents one such vendor that does not list a unit of measure. You would need to remove the Header Extractor in the Document Type's "Overrides" panel.


  1. Here, we've selected the "Factura" Document Type.
  2. Navigate to the "Overrides" tab to configure Data Element Overrides.
  3. Select the Data Column you wish to override.
    • In this case, since the "Unit" column does not exist for the "Factura" Document Type we are removing its Header Extractor.
  4. Change the Header Extractor property to (none).
  5. FYI It would be beneficial to turn this Data Column's Visible property to False in this case. This would not affect extraction, but it would remove the column from a data reviewer's sight.


By removing the absent column's Header Extractor, Grooper is no longer looking for a header that is not there! The table will then extract successfully.
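
Conceptually, an override is a per-Document-Type value layered over the default configuration. The Python below models that layering with plain dictionaries. These structures are hypothetical and only illustrate the idea, since Grooper stores this configuration on its own nodes.

```python
# Hypothetical data structures; Grooper stores this configuration on its own nodes.
DEFAULT_HEADER_LABELS = {
    "Quantity": ["HRS / QTY", "Quantity", "Shipped"],
    "Unit": ["PER"],
}

# None plays the role of setting the Header Extractor to (none).
OVERRIDES = {
    "Factura": {"Unit": None},   # no "unit of measure" column on this format
}

def effective_headers(document_type):
    """Merge per-Document-Type overrides over the defaults, dropping disabled columns."""
    merged = dict(DEFAULT_HEADER_LABELS)
    merged.update(OVERRIDES.get(document_type, {}))
    return {column: labels for column, labels in merged.items() if labels is not None}

print(sorted(effective_headers("Fairdeal")))  # ['Quantity', 'Unit']
print(sorted(effective_headers("Factura")))   # ['Quantity']
```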

Tabular Layout With Label Sets

Overview

This tutorial will cover the basic configuration of the Tabular Layout method with Label Sets, using a Labeling Behavior to collect column headers. We will use invoices for our document set and collect the following data from their tables detailing line item information:

  • Item Number - The vendor's id number for the item ordered for each row.
  • Description - The description of each item ordered for each row.
  • Quantity - The number of the item ordered for each row.
  • Unit Price - The vendor's price for the item ordered for each row.
  • Line Total - The total price for the number of items ordered (In other words, the quantity ordered multiplied by the unit price)


The basic steps will be as follows:

  1. Establish column headers by collecting labels for each Data Column in the Data Table.
    • You must collect header labels for each Data Column whose data you want to collect.
    • You may optionally collect a label for the entire row of header labels by collecting a label for the Data Table.
      • It is also considered best practice to do so when using Label Sets to configure Tabular Layout.
  2. Assign a Value Extractor for at least one Data Column.
    • For example, we may expect to find a quantity for each item shipped on an invoice, regardless of the vendor. There's always a column with a "Quantity" or "QTY" or "Shipped" or some similar header.
    • Since this data is also present on every row, this will provide the information necessary to find each row in the table.
    • While you need at least one Data Column's Value Extractor configured to detect rows, multiple columns may be used to detect rows.
      • Furthermore, a Data Column's Value Extractor will either perform "Primary Extraction" to perform row detection or "Secondary Extraction" to extract data from already detected rows. We will discuss using multiple columns to detect rows and the differences between "Primary" and "Secondary Extraction" in the #Advanced Setup Considerations section of this article.


  3. Set the Data Table object's Extract Method property to Tabular Layout.
    • And configure any Tabular Layout properties as needed. We will discuss many of these properties, and why and how to use them, in the #Advanced Setup Considerations section of this article.
  4. Test to ensure the table's data is collected.


In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for every Data Column for each row on the document.

However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the #Advanced Setup Considerations section below.

1. Collect Column Labels

The following tutorial will presume you have general familiarity with collecting labels. See the Label Sets article for a full explanation of how to collect labels for Document Types in a Content Model.

As far as strict requirements for the Tabular Layout method go, you must at minimum establish column headers for each Data Column you wish to extract.

We'll start with the "Quantity" Data Column.

  • FYI: If the invoice lists both a "quantity ordered" and a "quantity shipped" column, we will be collecting the quantity shipped.


For this "Fairdeal" Document Type, one column header label has been collected for each of the six Data Column children of the "Line Items" Data Table.

  1. The label ITEM NO for the "Item Number" Data Column
  2. The label DESCRIPTION for the "Description" Data Column
  3. The label HRS / QTY for the "Quantity" Data Column
  4. The label PER for the "Unit" Data Column
  5. The label RATE / PRICE for the "Unit Price" Data Column
  6. The label SUBTOTAL for the "Line Total" Data Column


As far as strict requirements go for establishing header columns, you're done at this point. You would then repeat this same process for every Document Type in your Content Model.


Best Practice: Collect a Header Row Label for the Data Table

You may optionally collect a label for the entire row of column header labels (aka the "header row label"). This label is collected for the parent Data Table object's label.


  1. We've collected the label DESCRIPTION ITEM NO HRS / QTY PER RATE / PRICE SUBTOTAL for the "Line Items" Data Table.


It is considered best practice to capture a header row label for the Data Table. But if it's optional, why do it? What is the benefit of this label?

Why Bother?

There are two main reasons why a header row label can be beneficial:

  1. To throw out false positive column header matches
  2. To better match column headers with poor OCR using Fuzzy RegEx.

To Throw Out False Positives

The first reason to collect a header row label is to help eliminate false positive column header matches.

  1. Take our "Line Total" Data Column's label SUBTOTAL.
  2. Without the Data Table's header row label, this label would also produce a match.
    • This is a false positive match. This is an instance on this document where the same term is used to refer to something different.
  3. With the header row label, only the actual label for the column matches.
    • Another way of putting it: The Data Column header labels will only match if they are part of the larger Data Table header row label.
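
The containment rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Grooper's implementation: the page text is made up, matching is exact rather than fuzzy, and the names `Match`, `find_all` and `column_header_matches` are hypothetical.

```python
from typing import NamedTuple


class Match(NamedTuple):
    text: str
    start: int  # character offset where the match begins
    end: int    # character offset just past the match


def find_all(label, text):
    """Return every exact occurrence of label in text."""
    matches = []
    pos = text.find(label)
    while pos != -1:
        matches.append(Match(label, pos, pos + len(label)))
        pos = text.find(label, pos + 1)
    return matches


def column_header_matches(column_label, header_row_label, text):
    """Keep only column label matches that fall inside a header row match."""
    rows = find_all(header_row_label, text)
    return [
        m for m in find_all(column_label, text)
        if any(r.start <= m.start and m.end <= r.end for r in rows)
    ]


# Made-up page text: SUBTOTAL appears in the header row and again below the table.
page_text = (
    "ITEM NO DESCRIPTION HRS / QTY PER RATE / PRICE SUBTOTAL\n"
    "A-100 Widget 2 EA 5.00 10.00\n"
    "SUBTOTAL 10.00\n"
)
header_row = "ITEM NO DESCRIPTION HRS / QTY PER RATE / PRICE SUBTOTAL"

without_row = find_all("SUBTOTAL", page_text)
with_row = column_header_matches("SUBTOTAL", header_row, page_text)
print(len(without_row), len(with_row))  # 2 1
```

Without the header row, both occurrences of SUBTOTAL match. With it, only the occurrence inside the header row survives.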

For Fuzzy RegEx

The other reason to collect a header row label has to do with imperfect OCR text data and Fuzzy RegEx. Fuzzy RegEx provides a way for regular expression patterns to match in Grooper when the text data doesn't strictly match the pattern. The regex pattern Grooper and the character string "Gro0per" differ by just a single character. An OCR engine misreading an "o" character for a zero is not uncommon by any means, but a standard regex pattern of Grooper will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.

Using Fuzzy RegEx instead of regular regex, Grooper will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) Grooper will return it as a match.

  • FYI: "Similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. Grooper is more or less "confident" the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing. One could argue there is a difference between these two terms when Fuzzy Match Weightings come into play, but that's a whole different topic, and you may encounter Grooper users who use the terms interchangeably regardless. Visit the Fuzzy RegEx article if you would like to learn more.
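
To make "percentage similarity" concrete, here is a minimal Python sketch that scores a pattern against a string using edit distance. It assumes similarity is one minus the number of single-character edits divided by the longer length. Grooper's actual fuzzy scoring (especially with Fuzzy Match Weightings) may differ, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute (or keep a matching character)
            ))
        previous = current
    return previous[-1]


def similarity(pattern, text):
    """Assumed scoring: 1 - edits / length of the longer string."""
    longest = max(len(pattern), len(text))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(pattern, text) / longest


score = similarity("Grooper", "Gro0per")
print(round(score, 3))  # 0.857
print(score >= 0.80)    # True: accepted at an 80% threshold
print(score >= 0.90)    # False: rejected at a 90% threshold
```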


So how does this apply to the Data Table's header row label? The short answer is it provides a way to increase the accuracy of Data Column header labels by "boosting" the similarity of the label to imperfect OCR results.

  1. We're going to look at labels collected for the "Rechnung" Document Type to illustrate this.
  2. Examine the collected label for the "Line Total" Data Column.
    • Notice the label TOTAL is highlighted red. The label doesn't match the text on the document.
    • This is due to imperfect OCR results.
  3. OCR made some missteps and recognized that segment as TOFAL.
    • The second "T" in "TOTAL" was recognized as an "F" character.
    • This means "TOTAL" (the expected label) is one character's difference from "TOFAL" (the actual text data). Or, "TOFAL" is 80% similar to "TOTAL".
    • The Labeling Behavior's similarity threshold is set to 90% for this Content Model. 80% is less than 90%. So, the result is thrown out.
    • FYI: This threshold is configured when the Labeling Behavior is added, using the Behaviors property of a Content Model. The Label Similarity property is set to 90% by default, but can be adjusted at any time.


As we will see, capturing the full row of column header labels will boost the similarity, allowing the label to match without altering the Labeling Behavior's fuzzy match settings.


  1. Here, we've collected a header row label for the Data Table.
  2. Now the "Line Total" Data Column's label matches! MAGIC!


Not magic. Just math.

The Data Table's column header row label is much longer than a single Data Column's column header label. There are simply more characters in PO ITEM # DESCRIPTION QUANTITY UNIT PRICE TOTAL\r\nLINE # than in TOTAL (55 vs 5).

  • Where the "Line Total" Data Column's label is 80% similar to the text data (4 out of 5 characters), the "Line Items" Data Table's label, made up of the whole row of column labels, is roughly 98% similar to the text data (54 out of 55 characters).

Utilizing a Data Table label allows you to hijack the whole row's similarity score when a single Data Column does not meet the similarity threshold.

  • If the label can be matched as part of the larger whole, its confidence score is much higher than it would be on its own.
  • The Data Table's larger label of the full row of column labels gives extra context to the "Line Total" Data Column label, providing more information about what is and is not an appropriate match.
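
The arithmetic above can be checked with a short Python sketch. It assumes the only OCR error is the single "T" to "F" substitution, so the expected and actual strings have equal length and can be compared position by position. The function name and the 0.90 threshold constant are illustrative.

```python
def positional_similarity(expected, actual):
    """Fraction of character positions that agree (equal-length strings only)."""
    if len(expected) != len(actual):
        raise ValueError("this sketch only compares equal-length strings")
    same = sum(1 for e, a in zip(expected, actual) if e == a)
    return same / len(expected)


THRESHOLD = 0.90  # the Label Similarity setting used in this example

column_label = "TOTAL"
header_row_label = "PO ITEM # DESCRIPTION QUANTITY UNIT PRICE TOTAL\r\nLINE #"

# Simulate the OCR misread: the second "T" in TOTAL comes back as "F".
ocr_column = "TOFAL"
ocr_header_row = header_row_label.replace("TOTAL", "TOFAL")

column_score = positional_similarity(column_label, ocr_column)       # 4 / 5
row_score = positional_similarity(header_row_label, ocr_header_row)  # 54 / 55

print(len(header_row_label))       # 55
print(column_score >= THRESHOLD)   # False: 0.80 is below 0.90
print(row_score >= THRESHOLD)      # True: about 0.98 is above 0.90
```

The same one-character error costs the short label 20% of its score but costs the full header row less than 2%.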


So why is it considered best practice to capture a header row label for the Data Table? OCR errors are unpredictable.

The set of examples you worked with when architecting this solution may have been fairly clean with good OCR reads. Maybe it didn't seem like you needed a Data Table label at the time, but that may not always be the case. Capturing a Data Table label for the header row will act as a safety net to avoid unforeseen problems in the future.

2. Assign a Data Column's Value Extractor

This step is all about row detection.

So far, all we've done is establish column header positions on each document. But that's not where the data is. The table's data is in the rows.

As it stands, Grooper doesn't know anything about the rows in the tables. It doesn't know the size of each row. It doesn't know what kind of data is supposed to be in the rows. Maybe most importantly, it doesn't know how many rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this.


To detect rows, we need at least one Data Column's Value Extractor property configured. For each result the extractor produces below the column's header, Grooper will create one row instance.

The key thing to keep in mind is this data must be present on every row. You'll want to pick a column whose data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row.

In our case, we will choose the "Quantity" Data Column. We always expect (for the time being anyway) there to be a quantity listed for the line item on the invoice.

  1. We will use this Value Reader for our demonstration.
    • However, in the real world, the extraction world is your oyster. You'll configure an extractor to best target the data in whatever table column you're trying to extract.
  2. This is a fairly simple Pattern Match extractor designed to return numeric data (including currency).
  3. The regex is a fairly simple pattern to match generic quantities.
    • It'll match decimal values from 0 and above with two decimal places optional.
  4. We've also edited our Prefix and Suffix Patterns so that the pattern must be surrounded by a space character before and after, with an optional dollar sign before the number.
  5. As you can see, we get five results below the "Quantity" label.
    • When we assign this Value Reader to the "Quantity" Data Column, we should then get five rows when this table extracts.
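
The exact pattern used by this Value Reader is not reproduced here, so the following Python sketch shows a hypothetical equivalent: digits with an optional two-place decimal, required whitespace on either side, and an optional dollar sign before the number. It uses Python's `re` module purely for illustration (Grooper's regex flavor and its Prefix and Suffix Pattern properties are configured separately), and the sample line is made up.

```python
import re

QUANTITY = re.compile(
    r"(?<=\s)"             # prefix: whitespace must come before the value...
    r"\$?"                 # ...optionally followed by a dollar sign
    r"(\d+(?:\.\d{2})?)"   # value: digits with an optional two-place decimal
    r"(?=\s)"              # suffix: whitespace must come after the value
)

line = "A-100 Widget 2 EA 5.00 $10.00\n"
print(QUANTITY.findall(line))  # ['2', '5.00', '10.00']
```

Note that "100" in "A-100" is not returned, because it is preceded by a hyphen rather than whitespace. This simple sketch also does not handle thousands separators such as "40,700.00".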


We do get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data.

  1. Will this result present a problem? Will we get an extra row for its result?
    • No. That result is above the header label HRS / QTY.
    • The Tabular Layout method presumes rows are below column labels. Any and all results above the first instance of the column's headers will be ignored.
  2. What about these matching results on the same line? Will the extra results create additional row instances?
    • No. These results are misaligned with the "Quantity" Data Column's header. They are too far to the right to be considered under the column header. They will be ignored.
    • Only results aligned with the "Quantity" Data Column's header will create a row instance.
  3. What about these results? Will they produce a row?
    • No. These results are also misaligned with the "Quantity" Data Column's header.
    • That said, if these were aligned with the "Quantity" Data Column's header, they would produce row instances.
    • When you are building your own Data Column extractors, pay close attention to results below the column's header. They have the most potential to produce false positive results, producing erroneous rows.
      • That said, there are a multitude of ways to avoid false positive row results when using Data Columns' Value Extractors to detect rows. We will discuss this more in the #Advanced Setup Considerations portion of this article.
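
The two filtering rules above (ignore results above the header, ignore results that are not aligned with it) can be sketched as a simple geometric test. This is an assumption-laden illustration: it uses the header's own bounding box as the column boundary, whereas Grooper determines column widths with additional information. `Box` and `row_candidates` are hypothetical names, and coordinates grow downward and to the right.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float


def overlaps_horizontally(a, b):
    """True when the two boxes share any horizontal extent."""
    return a.left < b.right and b.left < a.right


def row_candidates(header, results):
    """Keep results that sit below the header and line up with it."""
    return [
        r for r in results
        if r.top >= header.bottom and overlaps_horizontally(r, header)
    ]


header = Box(3.0, 2.0, 3.8, 2.2)
results = [
    Box(3.1, 1.0, 3.5, 1.2),  # above the header: ignored
    Box(3.2, 2.5, 3.5, 2.7),  # below and aligned: kept
    Box(6.0, 2.5, 6.5, 2.7),  # same line, too far right: ignored
    Box(3.2, 3.0, 3.5, 3.2),  # below and aligned: kept
]
kept = row_candidates(header, results)
print(len(kept))  # 2
```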


With our extractor ready to go, all we need to do is assign it to the "Quantity" Data Column using its Value Extractor property.

  1. Select the Data Column you wish to configure.
    • In our case, we want to configure the "Quantity" Data Column.
  2. Configure the Value Extractor property.
    • In our case, we've referenced our Value Reader designed to return generic numeric values.


FYI

At bare minimum, you must configure one Data Column's Value Extractor to perform row detection.

However, multiple columns may be used to perform row detection by configuring their corresponding Data Columns' Value Extractor properties. For more information on using multiple columns in row detection (as well as row detection in general) please visit the #Advanced Row Detection section of this article.

So far, we have:

  1. Collected labels for the Data Column labels (and optionally the header row label for the Data Table)
  2. Configured the Value Extractor for at least one Data Column.

For fairly simple table structures, we now have the two things the Tabular Layout method needs to extract data. Now, all we need to do is tell the Data Table object we want to use the Tabular Layout method. We do this by setting its Extract Method property to Tabular Layout.

3. Set Extract Method to Tabular Layout

A Data Table's extraction method is set using the Extract Method property. To enable the Tabular Layout method, do the following.

  1. Select a Data Table object in your Data Model.
    • Here, we've selected the "Line Items" Data Table.
  2. Select the Extract Method property.
  3. Using the dropdown menu, select Tabular Layout

4. Test

Now, let's test out what we have and see what we get!

  1. For the selected document folder in the "Batch Viewer" window...
  2. Press the "Test Extraction" button.
  3. The results show up in the "Data Element Preview" window.
    • Success! Our table's data is collected!


So, how was Grooper able to do this? For the Tabular Layout method, the Data Table is populated using primarily two pieces of information: column header locations established by the Data Columns' labels and row locations detected by a Data Column's Value Extractor.

  • Remember, we collected labels for all Data Columns. We configured only the "Quantity" Data Column's Value Extractor.

First, it's all about establishing column headers.

  1. The Data Columns' labels established the column locations for each column.
  2. Grooper then determines the width of these columns.
    • If table lines are present, Grooper can detect those line locations via a Line Detection (or Line Removal) IP Command. Grooper will "snap" the column's width to the detected line boundaries, expanding the cell's width (and height) to the boundaries around it.
      • Table lines give human readers an indicator of where the data "lives" (or is contained). If it's in the box, it belongs to the column. If it's out of the box, it belongs to a different column.
    • If table lines are not present (as is the case for this document), Grooper performs a variety of gutter-detection operations, analyzing the whitespace between columns to determine their widths.


Second, it's all about detecting rows. Rows are detected using a Data Column's Value Extractor.

  • In our case, we configured the "Quantity" Data Column's Value Extractor.
  • FYI: When a Data Column's extractor is used to detect rows, it is considered "Primary Extraction". A Data Column's extractor can also be used for "Secondary Extraction", performed after rows are detected. For more on this, please visit the #Primary VS Secondary Extraction section of this article.
  1. Rows are only detected below the detecting Data Column's header.
  2. Grooper runs the detecting Data Column's Value Extractor, looking for matching results aligned below the column header.
  3. For each result returned, Grooper establishes one row instance.
    • Since our extractor was designed to return decimal values, and Grooper found five decimal values below our column header, Grooper detected five rows.


The Tabular Layout method now has the two pieces of information it needs to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like. Grooper can infer the table's grid-like structure using the column and row positions.

  1. It has column instances for each Data Column.
    • Again, established by each Data Column's label.
  2. It has row instances for each detected row.
    • Again, established by the detecting Data Column's Value Extractor.
      • FYI: More than one Data Column can be used to detect rows. Please visit the #Advanced Row Detection section for more information.


With these column and row instances established, Grooper can form data instances for each cell of the table.

  1. Each cell's data simply lies where the columns and rows intersect.
    • For Data Columns with their Value Extractors configured, values are either collected using "Primary" or "Secondary Extraction". Please see the #Primary VS Secondary Extraction portion for more information.
    • For Data Columns without their Value Extractors configured, values are collected by returning the OCR or native text data within the geometric boundaries of the cell.
      • This is extremely beneficial for data that is difficult to extract using pattern matching.
      • For example, invoice item numbers and descriptions are notoriously difficult to pattern match. By using something in the table that is easy to pattern match, like our item quantities, we can use Tabular Layout to model the table structure and collect the other column values that are not.
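
As a rough illustration of how cells fall out of column and row boundaries, the Python sketch below assigns words to cells by position. It is a simplification under stated assumptions: column and row boundaries are given as plain ranges, each word is reduced to its centre point, and `Word` and `build_table` are hypothetical names, not Grooper objects.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # horizontal centre of the word
    y: float  # vertical centre of the word


def build_table(columns, rows, words):
    """columns: {name: (left, right)}; rows: [(top, bottom)].

    Returns one dict per row, mapping each column name to the text
    whose centre point falls inside that cell.
    """
    table = []
    for top, bottom in rows:
        record = {}
        for name, (left, right) in columns.items():
            inside = [
                w for w in words
                if left <= w.x < right and top <= w.y < bottom
            ]
            inside.sort(key=lambda w: (w.y, w.x))  # reading order
            record[name] = " ".join(w.text for w in inside)
        table.append(record)
    return table


columns = {"Description": (0.0, 3.0), "Quantity": (3.0, 4.0)}
rows = [(1.0, 1.5), (1.5, 2.0)]  # e.g. one row per detected quantity
words = [
    Word("Blue", 0.5, 1.2), Word("Widget", 1.5, 1.2), Word("2", 3.5, 1.2),
    Word("Gasket", 0.6, 1.7), Word("10", 3.4, 1.7),
]
table = build_table(columns, rows, words)
print(table[0])  # {'Description': 'Blue Widget', 'Quantity': '2'}
print(table[1])  # {'Description': 'Gasket', 'Quantity': '10'}
```

Here the "Description" column needs no pattern of its own. Its text is collected purely because it sits inside the cell boundaries.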

Label Padding

When collecting labels for Data Columns, the physical width of the label will help establish the width of the column. Grooper uses a variety of information on the page, such as the distance between column labels, whitespace gutters between the text in columns, and line location data stored in a page's layout data, to establish the width of a column.

However, Grooper doesn't always get things right. In these cases, you can manually adjust the width of a column using the Padding properties of the Data Column's Header label.


For example, take this line items table. Imagine we're using the "Line Total" column for row detection.

  1. If the column instance is limited to the width of label Line Total, the "Line Total" Data Column's extractor will never return a result. No text falls within the boundaries of the column.
  2. The values for the column are misaligned with the column's header.


Under normal circumstances, we simply couldn't use this column for row detection.

  1. However, using the Padding property, we can adjust the size of a Data Element's label (in this case the Data Column's Header label).
  2. This will adjust the width of the column instance, aligning the column's values within the boundaries of the column, allowing this column to be used for row detection.


  1. To adjust a label's Padding, first select the label whose width and/or height you wish to adjust.
    • We have selected the "Anfoneb" Document Type's "Line Total" Data Column's label.
  2. In our case we want to lengthen this Line Total label.
    • This will lengthen our column width, allowing the Line Total column's values to be used for row detection.


  1. Expand the Padding property.
  2. Use the Left, Right, Top, and/or Bottom properties to adjust the size of the label.
  3. We entered 0.5in for the Right padding property.
    • This extended the width of our label 0.5 inches to the right.
  4. Our line total values now fall below the "Line Total" label. The "Line Total" column can now be used for row detection.


  1. Success! Now that we adjusted the width of our "Line Total" Data Column's label, the table extracts successfully.
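
The effect of padding can be pictured with a small Python sketch. The coordinates are invented for illustration, boxes are plain `(left, top, right, bottom)` tuples in inches, and `pad_label` and `overlaps_column` are hypothetical helpers rather than Grooper functions.

```python
def pad_label(box, left=0.0, right=0.0, top=0.0, bottom=0.0):
    """Grow a (left, top, right, bottom) box outward by the given amounts."""
    l, t, r, b = box
    return (l - left, t - top, r + right, b + bottom)


def overlaps_column(label_box, value_box):
    """True when a value shares horizontal extent with the label's column."""
    return value_box[0] < label_box[2] and label_box[0] < value_box[2]


label = (5.0, 2.0, 5.8, 2.2)  # the header label as collected
value = (6.0, 2.5, 6.6, 2.7)  # a currency value sitting to the right of it

padded = pad_label(label, right=0.5)

print(overlaps_column(label, value))   # False: no text under the column
print(overlaps_column(padded, value))  # True: the padded column reaches the value
```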

FYI

You may have noticed we did not pad the label to reach the true "end" of the column. Rather, the width just barely overlapped with the currency values in the column.

We were able to get away with this because we were using the column for row detection. The "Line Total" Data Column's extractor was using Primary Extraction to find these values, collect them, and detect rows all at the same time.

Were this column using Secondary Extraction to collect the column's values, it's most likely we would need to further pad out the column header so that it does extend the full width of the column.

• For more information on row detection, please visit the #Advanced Row Detection portion of this article.
• For more information on Primary and Secondary Extraction, please visit the #Primary VS Secondary Extraction portion of this article.


Table Labels and Labelset Based Classification

Table headers are often very useful (even critical) for Labelset-Based classification, and it generally is the case you want to use them as a classification feature. Currently, if you want to use a Data Table object's labels for classification, you must set the Data Table's Minimum Row Count property to at least "1". This is a known issue in the current version of Grooper and likely will change.


If you find Data Table and/or Data Column labels are not included in determining document similarity during classification, do the following:

  1. Navigate to the Data Table object in the Node Tree.
  2. Expand the Row Count Range property.
  3. Select the Minimum property.
  4. Enter 1.

If you have multiple Data Table objects in your Data Model, you will need to repeat these steps for each one.


For more information on the Labelset-Based document classification method, visit the Label Sets article.

Advanced Setup Considerations

The Tabular Layout method is designed to extract tabular data even with the most basic setup described above. However, sometimes "basic" just isn't enough.

The challenging part of table extraction is the variety of forms a table can take. Columns can be in various orders. Table cells can be spaced well apart or jam-packed tight together. Sometimes data is required to be present for some table formats but it's optional on others. There's little consistency in how columns are labeled. Multiline row data can be challenging to target.

Grooper's Tabular Layout method has ways to overcome these issues, and more. For more complicated table structures, the Tabular Layout method has a robust suite of configurable properties. Understanding these properties will allow you to better extract a wider variety of tabular data.

In this section, we will discuss the following advanced setup features for Tabular Layout:

  1. #Multiline Rows
  2. #Advanced Row Detection
  3. #Primary VS Secondary Extraction
  4. #Footer Detection

For the following tutorials, you may presume the following unless otherwise told:

  • We will continue testing table extraction using the "Line Items" Data Table from the #Basic Setup instructions.
  • Column headers have already been established (either using Label Sets or Header Extractors)
  • The "Quantity" Data Column is performing row detection. Its Value Extractor has been configured as described in the #Basic Setup.
  • Line location layout data has been collected for all documents.

Multiline Rows

For many documents, the data in each row of a table occupies a single line.

The table we used in our #Basic Setup instructions had single-line rows. Indeed, single-line table structures are more basic and are typically the easiest to extract.


Multiline table structures are a little trickier.

In multiline tables, the data in one or more columns can span multiple lines. For example, the "Description" column in this table spans multiple lines (four to be exact).

This can pose a challenge for table extraction, particularly for tables with unpredictable line wrapping where sometimes a row may be single-line and others may be multiline.


But, have no fear! The Tabular Layout method can easily detect most multiline table structures by enabling the Multiline Rows property.


The default Tabular Layout settings presume all rows are single-line.

  1. This "Rechnung" Document Type has a multiline table.
  2. Upon testing extraction, note only the first line for each row in the "Description" column is collected.
  3. The remaining three lines in the "Description" cells are ignored.


This is what the Multiline Rows property is for. Enabling this property will allow you to target table structures like this whose rows extend beyond just a single line on the page.

  1. To enable Multiline Rows, first expand the Tabular Layout sub-properties.
  2. Switch the Multiline Rows property to Enabled.


  1. The Tabular Layout method now appropriately detects the rows occupy multiple lines on the document.
  2. The full line item description is now properly extracted by the Data Table.
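
One way to picture multiline row grouping is sketched below in Python. It assumes a row starts on any line where the detecting column (here "Quantity") has a value and that following lines without one are wrapped text belonging to the row above. This is a simplified model for illustration only. It does not cover how the end of the last row is found, and the names are hypothetical.

```python
def group_multiline_rows(lines, is_row_start):
    """Group consecutive lines into rows, starting a new row at each anchor line."""
    rows = []
    for line in lines:
        if is_row_start(line):
            rows.append([line])
        elif rows:
            rows[-1].append(line)  # wrapped line belongs to the row above
    return rows


# Each line is (description text, quantity text); wrapped lines have no quantity.
lines = [
    ("Steel bracket", "4"),
    ("galvanized finish", ""),
    ("Hex bolt", "12"),
    ("M8 x 40", ""),
    ("zinc plated", ""),
]

rows = group_multiline_rows(lines, is_row_start=lambda line: line[1] != "")
descriptions = [" ".join(desc for desc, _ in row) for row in rows]
print(descriptions)
# ['Steel bracket galvanized finish', 'Hex bolt M8 x 40 zinc plated']
```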


The Multiline Rows functionality will even detect multiline rows if the lines start on one page and continue to the next.

  1. Make sure Multiline Rows is enabled.
  2. In the subproperties of Multiline Rows, set the Detect Page Wrap property to true.

Detect Stacked Layout

There is a special variety of multiline structured tables called a "stacked layout" table. In these tables, you will find two different pieces of information stacked on top of one another in the same column.


For example, in this table, the "Item Number" and "Description" column headers are both contained within the same column, with "Item Number" stacked on top of "Description".

  • "Item Number" is highlighted in orange.
  • "Description" is highlighted in yellow.


Their corresponding values are also stacked on top of each other in each row. The item numbers in each row are stacked on top of the description for that item.

  • The item number values are highlighted in orange.
  • The item description values are highlighted in yellow.


In these situations, the Detect Stacked Layout property can help get the right values in the right columns with no additional extraction configuration.


With Multiline Rows enabled, you can choose to enable or disable the Detect Stacked Layout property.

  1. Detect Stacked Layout is Disabled by default.


Here, we are using the default configuration with Multiline Rows enabled.

  1. The "Envoy" Document Type is a good candidate for the Detect Stacked Layout feature.
  2. We've collected the header Item Number for the "Item Number" Data Column
  3. We've collected the header Description for the "Description" Data Column


These two header labels are stacked on top of each other, as is their data in each row.


Without Detect Stacked Layout enabled, we've got some problems.

  1. This is the normal Multiline Rows behavior.
    • Grooper determined correctly these rows spanned multiple lines. The cell is populated with all lines.
    • However, this is not what we want.
  2. For each row, the first line (and only the first line) should be in the "Item Number" column.
  3. And, the second line (and only the second line) should be in the "Description" column.


Because the "Item Number" header is stacked on top of the "Description" header, we can presume the first line belongs in the "Item Number" column and the second belongs in the "Description" column.


The Detect Stacked Layout property will put the data from the appropriate line into the appropriate column according to how the labels are stacked.

  1. To enable Detect Stacked Layout expand the Multiline Rows sub-properties.
  2. Change Detect Stacked Layout to True.


  1. Now, only the first line is collected for the "Item Number" column.
  2. And, only the second line is collected for the "Description" column.
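
The stacked assignment can be sketched in Python as pairing cell lines with headers in top-to-bottom order. This is only a model of the behavior described above, with invented sample values and a hypothetical `split_stacked_cell` helper. Any lines beyond the number of stacked headers are simply dropped in this sketch.

```python
def split_stacked_cell(cell_lines, stacked_headers):
    """Give line 1 to the top header, line 2 to the header beneath it, and so on."""
    return dict(zip(stacked_headers, cell_lines))


stacked_headers = ["Item Number", "Description"]  # in top-to-bottom order
cell_lines = ["AB-1001", "Replacement filter cartridge"]

row = split_stacked_cell(cell_lines, stacked_headers)
print(row["Item Number"])  # AB-1001
print(row["Description"])  # Replacement filter cartridge
```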


FYI

This would have been a very good situation for Data Element Overrides. Indeed, given Tabular Layout's multitude of configuration options, most users will find themselves using multiple Document Types and Data Element Overrides to fine tune extraction logic based on a variety of table formats.

Given that this "Envoy" Document Type is the only one that can make use of the Detect Stacked Layout functionality, we really should have made this configuration using Data Element Overrides. This will prevent unintended consequences on other Document Types where the Detect Stacked Layout feature does not provide a benefit (or impedes accurate extraction).

We should have enabled Detect Stacked Layout as an override performing the following steps:

  1. Select the Document Type whose override you want to configure.
    • The "Envoy" Document Type in this case.
  2. Navigate to the "Overrides" tab.
  3. Select the Data Table
    • The "Line Items" Data Table in this case.
  4. Turn the Detect Stacked Layout property to True.

By enabling Detect Stacked Layout using the "Envoy" Document Type's overrides, it will ensure only documents classified as "Envoy" will use the configuration.

Advanced Row Detection

A Data Column's Value Extractor is going to extract data in one of two ways:

  1. Primary Extraction
    • Primary Extraction is for row detection. In this case, the extractor runs at the document level, looking for potential rows beneath the Data Column's header.
  2. Secondary Extraction
    • Secondary Extraction happens after rows are detected, once row instances and cell instances have been formed. In this case, the extractor runs at the instance level to further parse table cell or row data.

This section is all about Primary Extraction (We'll talk more about the differences between Primary and Secondary Extraction in the #Primary VS Secondary Extraction section). This section is all about using Data Column extractors to locate and form row instances.

In the #Basic Setup section, we demonstrated a simple example of how a single Data Column's extractor detects rows. However, more complicated table structures require more complicated solutions.

In this section we will discuss:

Row Detection Using Multiple Columns

Going back to our #Basic Setup example: Why did we use the "Quantity" Data Column for row detection?

Simple enough answer: There were quantities present on every row. Plus, quantity values are a lot easier to pattern match than something like an item number or a description.

However, we could have used other columns for row detection. For example, you'd expect there to be a "Unit Price" or "Line Total" value in the rows of a line item table as well. And, currency values are about as easy to pattern match as quantity values.

You can use not just one but multiple column values to form row instances. This can be an effective way to throw out false positive rows. Using multiple columns to detect rows, you're effectively saying you need a value present in Column A and Column B to detect a row.

You can use as many columns as you need to detect rows. You can configure table extraction so that a value would need to be present in Column A and Column B and Column C and so on.

You can also configure Tabular Layout in such a way that columns can be optionally used to detect rows. You might have a situation where as long as a value is present in Column A or Column B the row should be considered valid and detected.

In either case, when using multiple columns to detect rows the Minimum Cell Count property becomes extremely important. Once you're finished with this section, please be sure to read #The Minimum Cell Count Property section of this article for more information.


  1. For example, look at our initial results for this "Nama" Document Type.
  2. As far as the Tabular Layout settings go, we've enabled Multiline Rows and that's it.
    • However, Multiline Rows is agnostic to row detection. It has nothing to do with detecting rows, only enlarging them to include wrapped lines between detected rows.
  3. Using the "Quantity" column alone for row detection, we have collected a false-positive row instance.
    • This row is not a valid row. We need to throw it out.


Why did this happen? It's because we used the "Quantity" column to detect rows.

  1. The "Quantity" Data Column's Value Extractor is a very generic extractor.
    • It will match most numeric as well as currency values.
  2. When the extractor runs within the boundaries of the "Quantity" column, it certainly matches the three numeric quantities listed in the three table rows.
  3. However, it also matches this value below the table.
    • This is the result giving us the false positive row. Because the extractor returns a value within the boundaries of the detecting column, Grooper forms a row instance.


If we use multiple columns to detect rows, we can avoid this issue.

For this table, every row has both a "Unit Price" and a "Quantity" value.

  • It's just the "Quantity" column giving us the issue on this document.
  • There is no matching false positive value in the "Unit Price" column.

If we used both columns to detect rows, we're effectively saying each row must have both a "Quantity" value and a "Unit Price" value to be considered valid.

  • Even though there is a matching "Quantity" result in the false-positive row, there is not a "Unit Price" result.
  • Therefore, if we use both columns to detect rows, the false-positive result would be thrown out.

Furthermore, for all (or certainly most) invoice table formats, we would expect both unit price values and quantity values listed for each row. Configuring two-column row detection would not only help detect rows for this table format in particular, it's likely to help detect rows from other formats as well.


  1. All we need to do is configure the "Unit Price" Data Column to perform row detection.
  2. We will configure its Value Extractor, referencing the Value Reader we saw earlier matching numeric/currency values.


  1. With both the "Quantity" and "Unit Price" Data Columns' Value Extractor properties configured, a value is required in both columns for a row to be detected.
  2. This throws out our false-positive match from earlier, when only the "Quantity" Data Column's Value Extractor was configured.
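
The "value in Column A and Column B" logic amounts to a set intersection, sketched below in Python. The line numbers are invented, rows are identified by the text line they sit on purely for illustration, and `detect_rows` is a hypothetical name.

```python
def detect_rows(hits_by_column):
    """Return the lines where every detecting column produced a result.

    hits_by_column maps a column name to the set of line numbers on which
    its extractor matched beneath that column's header.
    """
    line_sets = list(hits_by_column.values())
    if not line_sets:
        return []
    return sorted(set.intersection(*line_sets))


quantity_hits = {10, 11, 12, 20}  # line 20 is the stray value below the table
unit_price_hits = {10, 11, 12}

print(detect_rows({"Quantity": quantity_hits}))
# [10, 11, 12, 20]  -> includes the false positive row
print(detect_rows({"Quantity": quantity_hits, "Unit Price": unit_price_hits}))
# [10, 11, 12]      -> the false positive is thrown out
```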

The Minimum Cell Count Property

The Minimum Cell Count property is extremely important when using multiple columns to detect rows.

  1. In the Tabular Layout sub-properties, this property is located in Row Detection sub-properties.
  2. The Minimum Cell Count property's default value is 3.
    • This means a minimum of 3 column values must be present in order to detect a row.
    • So, if you have 5 Data Columns whose Value Extractors are configured, only 3 of their values would need to be present to detect the row and form a row instance.


There is, however, a caveat if you have fewer Data Columns with configured Value Extractors than the Minimum Cell Count value.

  1. For example, we currently only have two Data Columns with configured Value Extractors.
    • The "Quantity" and "Unit Price" Data Columns.
  2. Two is less than three (the default Minimum Cell Count).
  3. But, we're still collecting table data.

Since only two Data Columns' extractors are configured, we don't actually reach the "minimum" of "3". The Tabular Layout method will account for this and still extract the table data, presuming a value from the two columns must be present out of the three possible "minimum" cells.

  • It's when you go over the minimum cell count value in terms of the number of Data Columns with configured Value Extractors that this property really comes into play.
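
The caveat above can be sketched in a few lines of Python. This is a simplified model of the behavior described in this article, not Grooper's implementation; the function name `effective_minimum` and its parameters are hypothetical.

```python
def effective_minimum(min_cell_count: int, configured_columns: int) -> int:
    """Number of column values a line needs before it counts as a row.

    Assumption: when fewer Data Columns have Value Extractors than the
    Minimum Cell Count, every configured column must return a value.
    """
    return min(min_cell_count, configured_columns)


# Two configured columns ("Quantity", "Unit Price"), default minimum of 3:
print(effective_minimum(3, 2))  # 2: both columns must match

# Five configured columns, default minimum of 3:
print(effective_minimum(3, 5))  # 3: any three of the five will do
```

Under this model, the property only starts making columns optional once the number of configured columns exceeds its value.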


Next, we're going to look at the Minimum Cell Count property where the number of Data Columns with configured Value Extractors does exceed the minimum cell count value (or will eventually by the time we're done).

Correctly manipulating the Minimum Cell Count property can be critical to establishing your row detection logic.

  1. Let's look at the table from the "Factura" Document Type.
  2. This invoice should have four rows.
  3. However, as configured currently with the "Quantity" and "Unit Price" Data Columns performing row detection, we're only detecting two rows.


Furthermore, we've got another issue due to Multiline Rows being enabled.

  1. Our extended price ("Line Total") value for the first row should be "40,700.00" not "40,700.000.00"
  2. This cell is consuming the "0.00" text from what should be the second row.


All of this can be resolved with better row detection.


  1. First, let's fix the problem with the "Line Total" Data Column's value.
  2. If we configure the "Line Total" Data Column's Value Extractor, it will match the dollar amount in this row properly.
  3. Here, we've configured the Value Extractor property to reference that same Value Reader matching numeric/currency amounts.


Think about it. This is the text cell extracted for the extended price.

40,700.00
0.00

This is not a valid currency value (or, technically, it's two currency values stacked on top of each other).

This, however, is a valid currency value:

40,700.00

By configuring the "Line Total" Data Column's extractor, we've added one more rule to detect valid rows. In order for a row to be detected, all the following conditions must be met:

  • You must have a matching result in the "Quantity" column.
  • You must have a matching result in the "Unit Price" column.
  • You must have a matching result in the "Line Total" column.


  1. Now we've extracted the correct value for the "Line Total".
  2. However, we're still only returning two rows.
  3. We're going to use the Minimum Cell Count Property to fix this.


Because we now have three Data Columns whose Value Extractors are configured, we have met the minimum cell count of "3".

That means only the two rows where values from all three columns ("Quantity", "Unit Price" and "Line Total") are present are being detected as valid rows.


The truth is this table structure is a little non-standard in two ways.

  • Whereas this table lists a zero dollar amount in the "Line Total" (Extended Price) column, it leaves the cell blank in the "Unit Price" column. Since there's no value there, Grooper passes it over for row detection.
  • While the shipping cost is listed in the table for this invoice, the "Quantity" (Qty Shp.) is left blank.

In both cases, one of the three column values required for detection is missing. However, in all cases two of the three values are present for each row. We can use the Minimum Cell Count property to change our detection logic a bit.


  1. We can successfully extract every row in this table by dropping the Minimum Cell Count value to 2.
    • Remember, we have three Data Columns' extractors configured, meaning three can potentially be used to detect rows.
    • With the Minimum Cell Count set to 3, all three values from all three columns must be present to detect a row.
    • By dropping it to 2, only two of the Value Extractors from configured Data Columns must return values to detect a row.
      • A row with a "Quantity" value and a "Unit Price" value would be detected.
      • A row with a "Unit Price" value and a "Line Total" value would be detected.
      • A row with a "Quantity" value and a "Line Total" value would be detected.
      • A row with a "Quantity" value, a "Unit Price" value, and a "Line Total" value would be detected.
      • A row with a "Quantity" value alone? Nope. Not a valid row. Doesn't meet the minimum of "2".
  2. With this change to our row detection logic, all four rows are collected.
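
As a rough illustration of why the change from 3 to 2 recovers the missing rows, the sketch below models each table line as the set of configured columns that returned a value. The row order and the helper name `detect_rows` are assumptions made for the example; only the pattern of missing values comes from the description above.

```python
CONFIGURED = {"Quantity", "Unit Price", "Line Total"}

# Which configured columns returned a value on each line of the table.
# Two lines are complete, one has a blank unit price, one a blank quantity.
rows = [
    {"Quantity", "Unit Price", "Line Total"},  # ordinary line item
    {"Quantity", "Line Total"},                # zero-dollar item, blank unit price
    {"Quantity", "Unit Price", "Line Total"},  # ordinary line item
    {"Unit Price", "Line Total"},              # shipping line, blank quantity
]


def detect_rows(rows, min_cell_count):
    """Keep the lines where enough configured columns returned a value."""
    needed = min(min_cell_count, len(CONFIGURED))
    return [r for r in rows if len(r & CONFIGURED) >= needed]


print(len(detect_rows(rows, 3)))  # 2
print(len(detect_rows(rows, 2)))  # 4
```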


FYI

This would be another good example of when to implement Tabular Layout adjustments via Data Element Overrides, rather than using the globally extracted Data Table.

For most of our Document Types in this set, using our three Data Column extractors and a Minimum Cell Count of 3 actually works really well as far as row detection goes.

• The "Factura" Document Type doesn't fit the normal model. It works better with a Minimum Cell Count of 2.
• Therefore, the adjustment to the Minimum Cell Count should be made in the "Factura" Document Type's overrides.

Row Detection Limitations with Multiline Rows

There is one strict limitation to Grooper's row detection when you're dealing with multiline rows. In order to detect a row, ALL values must be present on the same line.

Tables with multiline rows generally exist in two flavors (or a Neapolitan combination of the two):


  1. Rows are multiline because the text within a cell wraps to the next line.


  2. Rows are multiline because the columns have a stacked layout.


There are a variety of ways Grooper handles stacked column data in multiline rows. We've already seen the Multiline Rows feature's Detect Stacked Layout option (See here for more details).

  • FYI: We'll see more ways to handle data stacked within a table cell in the Secondary Extraction portion of this article.


However, you should always keep in mind the Multiline Rows feature has absolutely nothing to do with detecting rows. Grooper must detect a row first before it implements the Multiline Row logic to expand the row instance across multiple lines of text. For tables with a stacked column layout, row detection can prove challenging if you are using multiple Data Columns to detect rows using data on separate lines.

  • In order to detect a row, ALL values must be present on the same line.


For example, take this invoice line items table format with stacked columns.

In our Data Table, three of our Data Columns' extractors are performing row detection.

  1. The "Quantity" Data Column, labeled as QUANTITY here.
  2. The "Unit Price" Data Column, labeled as UNIT PRICE here.
  3. The "Line Total" Data Column, labeled as TOTAL here.


The problem, as far as row detection goes, is two of these column values are on the same line, but one is on a separate line.

  • The "Quantity" and "Line Total" values are on the first line of the row.
  • The "Unit Price" value is on the second line of the row.


Grooper will not be able to detect rows (and therefore won't collect table data) as we have Tabular Layout configured currently.


If we try to extract this table, as configured, we will get no results whatsoever (because no rows are detected).

  1. Testing extraction.
  2. We get no result.
  3. FYI Enabling Multiline Rows has nothing to do with row detection.
  4. FYI Enabling Detect Stacked Layout has nothing to do with row detection.
    • These properties will be helpful in modeling the row structure, but won't do anything if we're not detecting rows in the first place!


How are we going to fix this? There are two ways we could approach this problem:

  1. By adjusting the Row Detection > Minimum Cell Count property.
    • As we've seen before, when you adjust this property, such that the number is less than the number of Data Columns with configured Value Extractors, it makes Data Columns optional when it comes to row detection.
    • If we lowered this to 2, only two of our three columns would be required for row detection. The "Quantity" and "Line Total" columns' values are on the same line. Therefore, we would detect our rows.
  2. By disabling row detection for the "Unit Price" Data Column.
    • This may sound wacky, but it will be highly effective for our situation here. What's the problem here? Row detection fails due to a stacked column layout. Specifically, one Data Column's value is on the second line of the row (the "Unit Price" column).
    • However, we have data we can use for detection on the first line (the "Quantity" and "Line Total" columns).
    • All we have to do is tell Tabular Layout "Don't use the 'Unit Price' column's extractor to detect rows," and we will start to collect our table data.
      • FYI: You might already be asking yourself "If we disable the column's extractor for row detection, why don't we just remove it?" That's because we are going to use it. For Secondary Extraction. After we talk about disabling a Data Column's extractor for row detection, this will lead us into a discussion about Tabular Layout's Secondary Extraction capabilities in the #Primary VS Secondary Extraction section.
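
The two approaches can be compared with a small Python sketch. It assumes, as stated above, that detection is evaluated one text line at a time; the line data and the function `row_start_lines` are hypothetical and only mirror the stacked layout described here.

```python
# Each text line maps to the set of columns whose Value Extractor matched
# on that line. Stacked layout: "Unit Price" sits one line lower.
lines = [
    {"Quantity", "Line Total"},  # first line of row 1
    {"Unit Price"},              # second line of row 1
    {"Quantity", "Line Total"},  # first line of row 2
    {"Unit Price"},              # second line of row 2
    {"Quantity", "Line Total"},  # first line of row 3
    {"Unit Price"},              # second line of row 3
]


def row_start_lines(lines, detecting_columns, min_cell_count):
    """Return the indexes of lines that qualify as the start of a row."""
    needed = min(min_cell_count, len(detecting_columns))
    return [
        i for i, matched in enumerate(lines)
        if len(matched & detecting_columns) >= needed
    ]


all_three = {"Quantity", "Unit Price", "Line Total"}

# As configured: three columns, minimum of 3, values never share a line.
print(row_start_lines(lines, all_three, 3))  # []

# Approach 1: lower the Minimum Cell Count to 2.
print(row_start_lines(lines, all_three, 2))  # [0, 2, 4]

# Approach 2: stop using "Unit Price" for row detection.
print(row_start_lines(lines, all_three - {"Unit Price"}, 3))  # [0, 2, 4]
```

Both approaches find the same three rows in this model. The second keeps the stricter rule that every detecting column must match, which is why it is the one pursued below.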

Disabling Row Detection

The previous example is a good one to point out how to disable row detection for a specific Data Column (and why you'd want to in the first place).


To recap:

This table presents a problem for row detection due to its stacked column layout.

  • Our Data Table's "Quantity", "Line Total" and "Unit Price" Data Columns are configured to perform row detection.
  • The "Quantity" and "Line Total" values exist on the first line of each row.
  • Whereas, the "Unit Price" values exist on the second line of each row.

Because the values exist on different lines, Tabular Layout cannot detect the rows.


However, if we only used the "Quantity" and "Line Total" columns to detect rows, we would have no issue.

  • The "Quantity" and "Line Total" Data Columns' Value Extractor configurations would detect the rows.
  • With Multiline Rows enabled, the detected row would then be extended to capture the second line.


All we need to do is disable row detection for the "Unit Price" Data Column, using the Tabular Layout method's Column Settings properties.


Generally speaking, once you start configuring the Column Settings properties, you're doing so because you have a large number of table formats represented by a large number of Document Types. In most cases, you will adjust these properties per Document Type using Data Element Overrides.

Going forward, when adjusting the Column Settings in this tutorial, we will do so using a Document Type's overrides instead of configuring the global Data Table object.


  1. We will demonstrate disabling row detection by disabling the "Unit Price" Data Column's row detection for the "Daftari" Document Type.
  2. Navigate to the "Overrides" tab to override the Data Table's configuration for the selected Document Type.
  3. Select the Data Table.
  4. Expand the Tabular Layout sub-properties.
  5. Select the Column Settings property.
  6. Press the ellipsis button at the end of the property.


  1. This will bring up the Column Settings editor.
  2. The Column column lists the Data Columns in your Data Table. Select the Data Column you wish to configure.
    • In our case we want to disable row detection for the "Unit Price" Data Column.
  3. To disable row detection for the selected Data Column, change the Row Detection property to Disabled.
    • This will prevent the Data Column's Value Extractor from performing Primary Extraction, forcing it to use Secondary Extraction instead. For more on Secondary Extraction, visit the #Primary VS Secondary Extraction portion of the article.
  4. Press OK when finished.


  1. With Row Detection Disabled for the "Unit Price" Data Column in the Column Settings, Grooper can now detect rows for this table format.
  2. Grooper successfully detects the three rows present on the document.
  3. There is however an issue with the extracted data in our "Unit Price" Data Column.
  4. The entire cell's text is collected, not the unit price listed inside the cell.
    • This is at least a better problem than the one we had before.
    • Previously, we weren't getting any data for any columns in any rows.
    • Now, we at least have row instances to work with and we're getting most of our table data. Furthermore, the data we want is contained within the cell. We just need a way of extracting it.
      • With the data we want present in each cell, we can extract the data (the unit price currency listed) using Secondary Extraction.


FYI

The Column Settings > Row Detection property can be set to one of the following values:

Optional
Required
Disabled

Optional is the default setting. This means the Data Column will be used for row detection, but is not required.

• Imagine your Data Table's Row Detection > Minimum Cell Count property is set to 3 and you have 5 Data Columns whose Column Settings > Row Detection properties are set to Optional.
• If all five of those Data Columns' extractors produced results on a line, the row would be detected.
• If any two of those Data Columns' extractors failed to produce a result, but the other three did return a result, the row would still be detected.
• An optional Data Column can potentially be used for row detection, but if it fails to return a value, the row can still be detected. As long as enough other Data Columns produce results (such that the number of Data Columns returning a result meets the Minimum Cell Count value), the row will be detected.
• Refer to this section of the article for more information on how the minimum cell count affects row detection.

Required will strictly force a Data Column to be used for row detection.

• Imagine your Data Table's Row Detection > Minimum Cell Count property is set to 3 and you have 5 Data Columns: 4 whose Column Settings > Row Detection properties are set to Optional, and one ("Column A") set to Required.
• If all five of those Data Columns' extractors produced results on a line, the row would be detected.
• If two of the optional Data Columns fail to produce a result, but the required Data Column and the remaining two optional Data Columns do, the row would be detected.
• If all four of the optional Data Columns' extractors produced results, but the required Data Column's extractor did not, no row would be detected.
• The required Data Column(s) must return results in order to detect a row.

Disabled will exempt a Data Column from row detection.

• Instead of using Primary Extraction, it will use Secondary Extraction.
• We will discuss Secondary Extraction in the next section of this article.
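
The three settings can be modeled as a small decision function. This is a sketch of the rules as described in this article, not Grooper's code; the `RowDetection` enum, the `is_row` function, and the assumption that a required column also counts toward the Minimum Cell Count are all illustrative.

```python
from enum import Enum


class RowDetection(Enum):
    OPTIONAL = "Optional"
    REQUIRED = "Required"
    DISABLED = "Disabled"


def is_row(matched, settings, min_cell_count):
    """Decide whether a line is a row.

    matched:  names of the columns whose extractor matched on the line
    settings: column name -> RowDetection value
    """
    active = {c for c, mode in settings.items() if mode is not RowDetection.DISABLED}
    required = {c for c, mode in settings.items() if mode is RowDetection.REQUIRED}
    if not active:
        return False  # nothing is allowed to detect rows
    hits = matched & active  # disabled columns never count
    if not required <= hits:
        return False  # a required column is missing
    return len(hits) >= min(min_cell_count, len(active))


settings = {
    "Column A": RowDetection.REQUIRED,
    "Column B": RowDetection.OPTIONAL,
    "Column C": RowDetection.OPTIONAL,
    "Column D": RowDetection.OPTIONAL,
    "Column E": RowDetection.OPTIONAL,
}

# Required column plus two optional columns: detected.
print(is_row({"Column A", "Column B", "Column C"}, settings, 3))  # True

# All four optional columns but not the required one: not detected.
print(is_row({"Column B", "Column C", "Column D", "Column E"}, settings, 3))  # False
```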

Primary VS Secondary Extraction

Primary Extraction and Secondary Extraction refer to how a Data Column's Value Extractor extracts table data.

There are three things you need to be clear on to understand the differences between Primary Extraction and Secondary Extraction.

  1. data instances
  2. What a data instance is
  3. DATA INSTANCES


It really all boils down to data instances. The Tabular Layout method subdivides a table into data instances in a variety of ways: first into column instances, second into row instances and third into cell instances. At the end of the process, Grooper has everything it needs to collect data using these sub-instances.


For Primary Extraction, the Data Column's extractor executes within the column instance.

  • Primary Extraction is utilized for row detection, which is the process of forming row instances.


For Secondary Extraction, data is collected from the table using the instances established after rows are detected. This is done in one of the following ways:

  • The Data Column's extractor executes within the cell instance.
    • Secondary Extraction is employed to parse data within a cell, after rows are detected and the table's structure is established.
  • The entire text within the cell is collected.
    • When Secondary Extraction isn't used to parse data within a cell, Secondary Extraction can simply collect all data for the cell instance.
  • Less commonly, the Data Column's extractor executes within the whole row instance.
    • Secondary Extraction can also be configured in such a way that extraction occurs at the row-level rather than the cell-level.


Secondary Extraction is useful for further parsing table data once rows have already been detected and cell and row instances are formed.

For example, we had an issue in the previous section where rows were detected but one column's values were not extracted correctly.

Due to an issue with the table's stacked column structure, we couldn't use the "Unit Price" Data Column for row detection. So, we disabled row detection for that column in the Column Settings properties. This prevented the Data Column from performing Primary Extraction.

Instead, it is falling back on Secondary Extraction.

Secondary Extraction will attempt to execute the Data Column's Value Extractor inside the cell instance rather than the column instance. If that extractor fails to return a result, the entire text within the geometric boundaries of the cell is returned instead.

Currently, we're simply returning all the text within each cell of each row for the "Unit Price" column. This isn't what we want to collect.

However, the value we do want (the dollar amount) is fully encapsulated within the cell. We just need to extract it from the text present in the cell.


  1. The "Unit Price" Data Column does currently have its Value Extractor configured.
  2. It's using the same generic numeric/currency extractor we've been using throughout this article to match numeric values.
    • All we need to do is ensure this extractor can properly extract data from the cell.


This brings up a common issue when performing Secondary Extraction. Always be aware of the instance-level you are extracting.

  1. This is the extractor the "Unit Price" Data Column references.
  2. It certainly seems like it's matching the dollar amounts in the "Unit Price" column.
  3. However, there is an issue with this Suffix Pattern when the extractor runs Secondary Extraction in the "Unit Price" column's cells.

When run globally on the document, it would make sense to expect a space character after the number. However, when you get down to the cell instances for the "Unit Price" column, there is no space character.


The Suffix Pattern doesn't match within the cell. Instead of there being a space character present, there's just nothing. The text data terminates at the end of the number itself. When run using Secondary Extraction, this extractor fails to produce a result.


We just need to update this extractor so that it will match within the cell during Secondary Extraction.

  1. In this case, we just need to ensure our numeric regex pattern will match whenever it is suffixed by a space character or the end of string anchor character $
    • \s|$
  2. FYI It's very common to use end of string characters $ in your Suffix Patterns as well as beginning of string characters ^ in your Prefix Patterns when relying on Secondary Extraction.
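
The effect of the suffix change can be reproduced with Python's `re` module. This is only an approximation: Grooper's Suffix Pattern is a separate property, which is modeled here as a lookahead, and the number pattern and sample strings are made up for the example.

```python
import re

# Hypothetical stand-in for the numeric/currency value pattern.
NUMBER = r"\d{1,3}(?:,\d{3})*\.\d{2}"

# Suffix written for full-page text: whitespace must follow the number.
page_style = re.compile(NUMBER + r"(?=\s)")
# Suffix that also accepts the end of the string: \s|$
cell_style = re.compile(NUMBER + r"(?=\s|$)")

page_text = "2 WIDGET 12.50 25.00\n"
cell_text = "each 12.50"  # a cell instance ends right after the number

print(page_style.search(page_text).group())  # 12.50
print(page_style.search(cell_text))          # None
print(cell_style.search(cell_text).group())  # 12.50
```

The first pattern works on the page because something always follows the number there. Inside the cell instance the text simply stops, so only the pattern that accepts `$` matches.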


With this minor change to the extraction logic, the extractor will now properly execute within the cell whenever Secondary Extraction is performed.

  1. After testing extraction, you can see we are accurately extracting the values in each row for the "Unit Price" column.
  2. The "Unit Price" Data Column's Value Extractor runs during Secondary Extraction, executing against the cell instance after rows are detected.
    • Now that we adjusted the extractor to match within the cell instance, we get the value we want.

Secondary Extract Modes

There are three ways in which Secondary Extraction can be performed, called Secondary Extract Modes. These modes can be configured Data Column by Data Column using the Tabular Layout > Column Settings > Secondary Extract Mode settings.

  1. Cell Extract
    • For the Cell Extract mode, the Data Column's Value Extractor executes within the table cell's text contents.
    • This is useful to parse a smaller amount of data from a larger amount of data within a table cell.
    • Or, you may use an extractor to manipulate text within a cell, such as to cleanse the data using Fuzzy RegEx.
  2. Geometric
    • The Geometric mode extracts all text within the physical boundaries of the cell.
    • Data Columns with no Value Extractor configured are using the Geometric method to collect data for the cell.
    • This is useful to collect data that is difficult to pattern match.
  3. Row Extract
    • The Row Extract mode executes the Data Column's extractor against the full text of the row instance (not the cell instance).
    • This is the least common Secondary Extract Mode. Typically, this mode is used as a last resort due to atypical table structures.

Auto VS Cell Extract VS Geometric

The default value for the Secondary Extract Mode is Auto. "Auto" will attempt to use the Cell Extract mode, but will fall back on the Geometric mode as a failsafe.

  • Auto first attempts to use Cell Extract. If the Data Column's extractor returns a match within the cell, its result will be returned.
  • If the Data Column's extractor fails to return a match within the cell, Auto will use the Geometric mode. All text within the geometric boundaries of the cell will be returned.

This is exactly what happened in our previous example.


At first, the text data was returned using Geometric mode.

  1. The "Unit Price" Data Column's extractor executes against the cell.
  2. The extractor did not match anything in the cell's text data.
  3. So, Geometric mode was used, returning all text within the physical boundaries of the cell.


Then, we fixed the "Unit Price" Data Column's extractor so it would match within the cell.

  1. The "Unit Price" Data Column's extractor executes against the cell.
  2. The extractor does match text in the cell.
  3. So, Cell Extract mode was used, returning only the extracted result.
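
The Auto behavior amounts to "try the extractor, otherwise return everything". A minimal sketch follows, assuming a regular expression stands in for the Data Column's Value Extractor; the function names and the sample cell text are hypothetical.

```python
import re
from typing import Optional


def cell_extract(pattern: str, cell_text: str) -> Optional[str]:
    """Run the column's extractor against the cell text only."""
    match = re.search(pattern, cell_text)
    return match.group() if match else None


def geometric(cell_text: str) -> str:
    """Return all text inside the cell's boundaries."""
    return cell_text


def auto(pattern: Optional[str], cell_text: str) -> str:
    """Cell Extract first, Geometric as the fallback."""
    if pattern is not None:
        result = cell_extract(pattern, cell_text)
        if result is not None:
            return result
    return geometric(cell_text)


cell = "each 12.50"
print(auto(r"\d+\.\d{2}(?=\s)", cell))    # each 12.50 (no match, Geometric fallback)
print(auto(r"\d+\.\d{2}(?=\s|$)", cell))  # 12.50 (Cell Extract result)
print(auto(None, cell))                   # each 12.50 (no extractor configured)
```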


You can, however, force a Data Column to only ever use either Cell Extract or Geometric by configuring the Tabular Layout > Column Settings > Secondary Extract Mode property for one or more Data Columns.

Row Extract

The Row Extract mode allows you to execute a Data Column's extractor against the row instance rather than the cell instance. There are two main reasons to do this:

  1. The table's structure is atypical and Grooper was not able to appropriately find the divisions between columns.
  2. You need to extract data that is in each row but not labeled by a column header.

In either case, it may be difficult (or even impossible) to extract the data you want out of a specific cell within a row. However, it may be possible to extract the data from the row itself.


For example, imagine we wanted to find the "Unit" column for our line item tables as well.

The unit "EA" is listed clearly for each item in the row. However, there is no column header labeling this column. There's nothing like "Unit" or "Unit of Measure" or "UOM" present to label the column.


Furthermore, because this table has line layout data, neither the "Unit Price" nor the "Line Total" columns would ever contain this value within their cell instances for any row.

  • Sometimes you can get away with using a different column's header label, even using the same one that's already been used by another Data Column. This will not be the case here.


However, the data is always present in each row, and Grooper easily detects each row in this table.

We can still extract the unit of measure from the row instance, using the Row Extract Secondary Extract Mode.
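
To make the difference concrete, the sketch below runs the same unit-of-measure matcher against an empty cell and against the full row text. The unit list, the sample row, and the `extract_unit` helper are invented for illustration; the regular expression is only a stand-in for a List Match extractor.

```python
import re

# Stand-in for a List Match extractor; the entries are examples only.
UNITS = ["EA", "each", "BOX", "CASE"]
unit_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(u) for u in UNITS) + r")\b",
    re.IGNORECASE,
)

# Hypothetical text of one detected row instance.
row_text = "4 EA WIDGET, BLUE 12.50 50.00"
# With no header label, there is no cell text for the "Unit" column.
unit_cell_text = ""


def extract_unit(text):
    match = unit_pattern.search(text)
    return match.group() if match else None


print(extract_unit(unit_cell_text))  # None
print(extract_unit(row_text))        # EA
```

Because the extractor sees the whole row, it should be specific enough not to match text that belongs to other columns.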


Next, we're going to configure Tabular Layout so that the "Racun" Document Type will use the Row Extract mode to extract the unit of measure value from each row in its invoices' line items tables.

  1. We have created a Value Reader to match units of measure.
  2. This is a very basic List Match extractor, matching common units like "each" or "EA".
  3. We have added a "Unit" Data Column and assigned this Value Reader as its Value Extractor.
  4. Ultimately, we will use this to extract the unit values from each row instance.


If we test extraction against our sample document, we will get everything but the "Unit" column.

  • Commonly, you will configure Secondary Extract Modes as override changes for a Document Type, which is what we're choosing to do here.
  1. We've selected the "Racun" Document Type.
  2. We've navigated to the "Overrides" tab.
  3. Testing out extraction, we have nothing populated for the "Unit" column.
    • This shouldn't be surprising. We have not established a column header for this Data Column because there is no header label to collect!
  4. We will use the Column Settings properties to force the "Unit" Data Column to perform Secondary Extraction, using the Row Extract mode.


  1. Select the Data Column you wish to configure.
    • The "Unit" Data Column, in our case.
  2. To enable the Row Extract mode, change the Secondary Extract Mode to RowExtract.
  3. Press OK when finished.

FYI

You should also consider editing the Row Detection and Secondary Extract properties at this point.

Are you ever going to use this column to detect a row for the Document Type? NO

• You should set the Row Detection property to Disabled in this case.

Do you always expect to use the Row Extract mode to find the units for this Document Type? YES

• You should set the Secondary Extract property to Always in this case.


  1. With the Row Extract mode enabled, we collect unit values for the "Unit" column.
  2. The "Unit" Data Column's extractor now executes against each full row, when Secondary Extraction is performed.
  3. Click the Inspect button before moving on.


FYI

The Instance Viewer is a tool to better understand the instances created and used in table extraction.

The Instance Viewer can be extremely beneficial when configuring Secondary Extraction. Whether you're fine-tuning Cell Extract, trying to get a closer look at the Geometric text data, or trying to set up a Row Extract extractor, the Instance Viewer will be your best friend.

  1. Expand your Data Table to view the various instances created during table extraction.
  2. The first level in the hierarchy will be row instances.
  3. The next level in the hierarchy will be cell instances.
  4. The "Image View" tab will highlight the selected instance's physical location on the document.

Footer Detection

A "footer" is a text label that indicates where the table stops. Some tables will have footers, and some won't. When the table does have a footer, Grooper can use this information to force-stop row detection.


For example, as our Data Table using Tabular Layout is configured currently, we've collected one row we shouldn't have for the "Sonrasc" Document Type's line items table.

  1. There are only actually two rows on this document.
  2. However, we collected three.

Why? The "Quantity", "Unit Price" and "Line Total" Data Columns are all being utilized for row detection.

  1. We have matching results for all three of those columns on this line.
    • As far as Tabular Layout is concerned, this counts as a row.


We will fix this issue using a footer. By defining a footer for this table, we can dictate where the table should end based on static text labels on the document.

  1. For example, this phrase "THANK YOU FOR YOUR ORDER" is always found at the end of the line items table from this vendor.
    • Once we assign this phrase as the table's footer, Grooper will stop detecting rows once it reaches this point.

You can define a footer in one of two ways:

  • Using an extractor, by configuring the Tabular Layout > Footer Detection property.
  • Using Label Sets, by collecting the Data Table's Footer label for one or more Document Types.

Collecting a Footer Using an Extractor

To establish a footer, using an extractor, you will configure the Footer Detection property.

  1. Select the Data Table you wish to configure.
  2. Expand the Tabular Layout sub-properties.
  3. Select the Footer Detection property.
  4. Using the dropdown list, select the extractor (Extractor Node or Value Extractor) you wish to use.
    • We're going with a List Match extractor for this tutorial.


Configure the extractor to match something at the foot of the table.

  1. Our list entry will be THANK YOU FOR YOUR ORDER
  2. This will match something on the document that's found at the end of this table (at least for this vendor).


  1. With the Footer Detection extractor configured, we will throw out the false positive row detected after our footer result.
  2. This row is after our footer. So, it is no longer detected.
  3. The two valid rows are collected accurately.
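
The footer's role as a hard stop can be sketched as a loop that breaks at the footer text. The sample lines and the `looks_like_row` check are hypothetical simplifications; real row detection uses the Data Columns' extractors rather than a count of numbers.

```python
import re


def looks_like_row(line):
    # Stand-in for row detection: at least three numeric values on the line.
    return len(re.findall(r"\d+(?:\.\d{2})?", line)) >= 3


def detect_rows(lines, footer=None):
    """Collect rows top to bottom, stopping at the footer if one is given."""
    rows = []
    for line in lines:
        if footer is not None and footer in line:
            break  # nothing after the footer is considered
        if looks_like_row(line):
            rows.append(line)
    return rows


lines = [
    "2 WIDGET 12.50 25.00",
    "1 GADGET 40.00 40.00",
    "THANK YOU FOR YOUR ORDER",
    "3 65.00 5.20 70.20",  # summary line that happens to look like a row
]

print(len(detect_rows(lines)))                                     # 3
print(len(detect_rows(lines, footer="THANK YOU FOR YOUR ORDER")))  # 2
```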

FYI

Keep in mind the Footer Detection property is a global property. It will be applied to all Document Types (unless overridden using Data Element Overrides).

Collecting a Footer Using Label Sets

Table footers can be established using Label Sets by collecting a Footer label for the Data Table.

  1. Navigate to the "Labels" tab of your Content Model.
  2. Select a sample document assigned the Document Type whose labels you want to collect.
    • Or manually assign it the Document Type if not done so already.
  3. Select the Data Table in the list of Data Elements.
  4. Select the Footer tab.
  5. Collect the text label you wish to use as the footer.
    • In this case we've lassoed the text THANK YOU FOR YOUR ORDER.
  6. Don't forget to save when finished.


  1. With the Data Table's Footer label collected for this Document Type, we will throw out the false positive row detected after our footer result.
  2. This row is after our footer. So, it is no longer detected.
  3. The two valid rows are collected accurately.
  4. There is no need to configure the Footer Detection property when using Label Sets.
    • The collected Footer label effectively supplants the Footer Detection property.

FYI

The Label Set approach is, in general, a more "templated" approach. You will need to collect a Footer label for each Document Type that needs one.

Capture Footer Row VS Display Total Row

FYI: The Capture Footer Row property was introduced in version 2021.0046. Earlier minor versions do not have this property.

The Capture Footer Row property creates a row instance at the bottom of the table, using the footer to establish the row.

  • This row is ONLY for the benefit of a document reviewer. This data IS NOT actually collected as part of the table's data.


  1. First Grooper will locate the footer.
    • In this case we used a Footer label SUBTOTAL
  2. Then, Grooper will create a row instance using the footer, instead of Tabular Layout's normal row detection methods.
  3. This is now a row instance. If there is anything that can be extracted by a Data Column's extractor, it will be.
    • In our case, the "Line Total" Data Column's extractor returned the numerical value in this row.
  4. Extracted values are then displayed in the "footer row" at the bottom of the table.

Values in these footer rows may be useful for your data reviewers. Often there are column totals that can be extracted from a footer row and used to validate information in the table rows above it.


Be aware, the Capture Footer Row is enabled by default.

  1. The Capture Footer Row is set to True by default.
  2. You will need to set this to False if you do not want to display the footer row when reviewing Tabular Layout's extraction results.


Please be aware the Capture Footer Row is in some ways similar to the Display Total Row feature, but is exceptionally different from it in one major way.

  • The Capture Footer Row creates a row instance that is actually extracted against, using the document's text data.
    • The data is generated during and as a part of extraction.
  • The Display Total Row displays a row, adding up numerical values collected for one or more columns.
    • The data is generated after extraction.


The Display Total Row feature adds a row to the bottom of the table using solely a mathematical operation.

  1. No document extraction is performed to populate the row.
  2. Instead all the column values for one or more defined Total Columns are added together.
  3. The result is displayed in the "total row" at the bottom of the table.


The Display Total Row feature is also useful for document reviewers. It just gets its results differently than the Capture Footer Row feature.


Capture Footer Row will supersede Display Total Row if both are enabled.

  1. For this Document Type the following Footer label was collected:
    • SHIP Shipping
    • The idea being the shipping value would always be listed on the last line of the table and should not be collected as part of the line items table data.
  2. The Capture Footer Row property is set to True
  3. The footer row instance is generated using the Footer Label.
  4. Extracted values are displayed in the "footer row" at the bottom of the table.
  5. However, Display Total Row property is also set to True.


You can only have either a "footer row" or a "total row", not both.

  • If both properties are enabled (set to True), Capture Footer Row takes priority and a "footer row" will be displayed.
  • If you want to display a "total row", you must set the Capture Footer Row property to False.
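
The precedence rule, and the difference in where each row's value comes from, can be summarized in a short sketch. It assumes a footer was found and its value extracted; the `bottom_row` function, its parameters, and the sample amounts are hypothetical.

```python
from decimal import Decimal


def bottom_row(line_totals, footer_value, capture_footer_row, display_total_row):
    """Pick the single extra row shown at the bottom of the table.

    footer_value: value extracted from the footer line on the document
    line_totals:  values already collected for the totaled column
    """
    if capture_footer_row:
        # Extracted from the document text during extraction.
        return ("footer row", footer_value)
    if display_total_row:
        # Calculated after extraction; nothing is read from the document.
        return ("total row", sum(line_totals, Decimal("0")))
    return None


totals = [Decimal("25.00"), Decimal("40.00")]
footer = Decimal("65.00")

print(bottom_row(totals, footer, True, True)[0])    # footer row
print(bottom_row(totals, footer, False, True)[0])   # total row
print(bottom_row(totals, footer, False, False))     # None
```

When the extracted footer value and the calculated total agree, as they do in this made-up data, a reviewer has a quick check that the rows above were read correctly.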


  1. When using Generate Footer Row be sure to select a Data Column...
  2. ... and set the Footer Mode property to Calculate.