Tabular Layout (Table Extract Method)

From Grooper Wiki

This article was migrated from an older version and has not been updated for the current version of Grooper.

This tag will be removed upon article review and update.

This article is about the current version of Grooper.

Note that some content may still need to be updated.


The Tabular Layout Table Extract Method uses column header values determined by the Data Columns' Header Extractor results (or labels collected for the Data Columns when a Labeling Behavior is enabled), as well as Data Column Value Extractor results, to model a table's structure and return its values.

The Tabular Layout is "Label Set aware". You can configure Tabular Layout with or without labels. This article will detail both methods. For more information on Label Sets, please visit the full Label Sets article.

You may download and import the file(s) below into your own Grooper environment (version 2024). There are two Batches with the example document(s) discussed in this tutorial, as well as two Projects configured according to its instructions.
Please upload the Projects to your Grooper environment before uploading the Batches. This will allow the documents within the Batches to maintain their classification status.

About

Many tables label the columns so the reader knows what the data in that column corresponds to. How do you know the unit price for an item on an invoice? Typically, that item is in a table and one of the columns of that table is labeled "Unit Price" or something similar. Once you read the labels for each column (also called "column headers"), you the reader know where the table begins (below the column headers) and can identify the data in each row (by understanding what the column headers refer to).

This is also the basic idea behind the Tabular Layout Extraction Method. It too utilizes column header labels to "read" tables on documents, or at least as step number one in modeling the table's structure. Once Grooper knows where a column is, identified by the column's header label, Grooper can extract data from each cell in each row of that column.

The Tabular Layout method can establish column header locations in one of two ways:

  1. Using extractors
    • Which are defined on the Data Columns' Header Extractor property (or alternatively on the Data Table's Header Row Extractor property)
  2. Using Label Sets
    • When a Labeling Behavior is enabled, column header locations are defined by labels collected for the Data Columns (and optionally for the Data Table)
    • Effectively, the labels take the place of the Header Extractor results (or alternatively the Header Row Extractor results)

Once the column header locations are established, the next thing Grooper needs to do is figure out where each row is. Tabular data is most often dynamic data. A table on one document might have two rows. The same table on the next might have twenty. How does Grooper know where each row is?

This is done by configuring at least one Data Column's Value Extractor property. However, more than one (even all) may be configured. Depending on how complicated the table is, you may need to configure extractors for multiple columns.

Generally, there is at least one column in a table that is always present for every row in the table. If you can use an extractor to locate that data below its corresponding column header, that gives you a way of finding each row in the table. This allows Grooper to form a "row instance" for each row. Once the row instance is established, Grooper can then collect the various cell values for the various additional columns from the row instance.

If locating column headers and locating rows using column extractors was all that was involved in Tabular Layout, that alone would make it a powerful tabular extraction method. What makes the Tabular Layout method even more powerful is its further configurability. Is every row in the table a single line or are the rows "multiline"? Do you need more fine-tuned data extraction from a cell's value or the row itself once the row instance is detected? Do you need to establish a table "footer" to limit the number of rows extracted? We will address these issues and more in the #Advanced Setup Considerations section of this article.

FYI

If you're familiar with the Header-Value table extraction method, you should see some similarities between it and the Tabular Layout method. Indeed, both methods utilize column headers and Data Column Value Extractors to collect table data.

Tabular Layout should be seen as an improvement on Header-Value for the following reasons:

  1. Tabular Layout is Label Set aware.
  2. Tabular Layout is typically less involved to set up.
  3. Tabular Layout has more configuration options, giving it a better capability to extract data from a large set of disparate table structures (Usually executed through Data Element Overrides).

Basic Setup

Tabular Layout can be configured with or without the use of Label Sets. In either case, the basic setup is the same:

  1. Establish column headers for each Data Column.
  2. Detect row instances by assigning at least one Data Column's Value Extractor.
  3. Set the Data Table's Extract Method property to Tabular Layout.
  4. Test extraction and configure further as necessary.

With Label Sets or without, the setup is extremely similar. On top of that, there's nothing about using Label Sets that alters Tabular Layout's extraction logic. Grooper uses the same logic to model the table's structure and collect data for each cell. The biggest difference is how column headers are determined in step #1.

  • Without Label Sets, column headers are established using extractors, defined using the Data Columns' Header Extractor property (or alternatively using the Data Table's Header Row Extractor property)
  • With Label Sets, column headers are established using labels, defined when collecting labels for each Document Type. The Data Columns' labels effectively take the place of the Header Extractor property's results.

Tabular Layout Without Label Sets

Overview

This tutorial will cover the basic configuration of the Tabular Layout method without Label Sets, using extractors to collect column headers instead. We will use invoices for our document set and collect the following data from their tables detailing line item information:

  • Item Number - The vendor's id number for the item ordered for each row.
  • Description - The description of each item ordered for each row.
  • Quantity - The number of the item ordered for each row.
  • Unit Price - The vendor's price for the item ordered for each row.
  • Line Total - The total price for the number of items ordered (In other words, the quantity ordered multiplied by the unit price)


The basic steps will be as follows:

  1. Establish column headers by configuring the Header Extractor property of each Data Column in the Data Table.
    • You must configure header extractors for each Data Column whose data you want to collect.
    • Alternatively, you may configure a Header Row Extractor set on the Data Table (This property is found in the Tabular Layout sub-properties).
  2. Assign a Value Extractor for at least one Data Column.
    • For example, we may expect to find a quantity for each item shipped on an invoice, regardless of the vendor. There's always a column with a "Quantity" or "QTY" or "Shipped" or some similar header.
    • Since this data is also present on every row, this will provide the information necessary to find each row in the table.
    • While you need at least one Data Column's Value Extractor configured to detect rows, multiple columns may be used to detect rows.
      • Furthermore, a Data Column's Value Extractor will perform either "Primary Extraction" to detect rows or "Secondary Extraction" to extract data from already detected rows. We will discuss using multiple columns to detect rows and the differences between "Primary" and "Secondary Extraction" in the #Advanced Setup Considerations section of this article.


  3. Set the Data Table object's Extract Method property to Tabular Layout.
    • And configure any Tabular Layout properties as needed. We will discuss many of these properties, and why and how to use them, in the #Advanced Setup Considerations section of this article.
  4. Test to ensure the table's data is collected.


In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for every Data Column for each row on the document.

However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the #Advanced Setup Considerations section below.

1. Configure Header Extractors

As far as strict requirements for the Tabular Layout method go, you must at minimum establish column headers for each Data Column you wish to extract.

We'll start with the "Quantity" Data Column.

  • FYI: If the invoice lists both a "quantity ordered" and a "quantity shipped" column, we will be collecting the quantity shipped.


  1. Select the Data Column.
  2. Select the Header Extractor property.
    • Here you will set an extractor to locate the column header on the document for the selected Data Column.
  3. Using the dropdown selector, select the extractor (Extractor Node or Value Extractor) you wish to configure to return the column header.
    • You can use whatever extractor you want to get the job done. You may select Reference to reference a Data Type or Value Reader node you've configured already. Or, you can select a Value Extractor to configure extraction locally.
    • We're going to select List Match.


The List Match extractor is well suited for our purposes here. Ultimately, we will enter a list of various ways a "Quantity" column can be labeled.

  1. For example, this document labels quantities of each item ordered as "HRS / QTY"
  2. So, we've added HRS / QTY to the Local Entries list.
  3. Other documents use the label "Quantity" or "Shipped". So, we've added Quantity and Shipped to the list as well.

You would then continue adding variations to the list until all variations of the "Quantity" column's header labels are extracted for every variation of the table.

  • Or more generally, until a result for the column header is extracted using whatever extractor you've chosen to configure.
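The idea of a label list can be illustrated outside of Grooper with a minimal Python sketch. This is not how List Match is implemented; the function name `find_header`, the case-insensitive matching, and the sample header line are all assumptions made for illustration.

```python
import re

# Header variants taken from the "Quantity" examples in this article.
QUANTITY_LABELS = ["HRS / QTY", "Quantity", "Shipped"]

def find_header(line, labels):
    """Return (label, start, end) for the first listed label found in the line, or None."""
    for label in labels:
        # Escape the label so characters such as "/" and "." are matched literally.
        # Case-insensitive matching is an assumption of this sketch.
        match = re.search(re.escape(label), line, flags=re.IGNORECASE)
        if match:
            return label, match.start(), match.end()
    return None

# Hypothetical tab-separated header line for one invoice format.
header_line = "DESCRIPTION\tITEM NO\tHRS / QTY\tPER\tRATE / PRICE\tSUBTOTAL"
print(find_header(header_line, QUANTITY_LABELS))  # ('HRS / QTY', 20, 29)
```

Adding a new variant to the list is all that is needed to cover a new table format, which mirrors the workflow described above.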

Pro Tip: Stacked Labels

You will often find "stacked labels" in tables. These are multi-word labels broken up across multiple lines in the table's header.

  1. For example, this document's "Quantity" column uses "Qty Shp." for its label.
    • This is a stacked label, with "Qty" on one line and "Shp." on another.
  2. We can add "Qty Shp." to our list of header labels.
  3. However, we will not get a result returned for the document.


We can easily resolve this by enabling the Vertical Wrap feature.

  • This feature is only available to the List Match extractor. This is one of the reasons why List Match is so useful for extracting column headers.

To enable Vertical Wrap:

  1. Switch to the "Properties" tab.
  2. Change the Vertical Wrap property to Enabled.
  3. With Vertical Wrap enabled, the extractor is able to match and return items in the list that wrap vertically on multiple lines.
    • In our case, our stacked label "Qty Shp." is now returned.
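To make the idea of vertical wrapping concrete, here is a rough Python sketch. It assumes each OCR word comes with a line index and left/right pixel coordinates, and it treats two words as "stacked" when the lower word sits on the next line and overlaps the upper word horizontally. These rules and all names are assumptions for illustration, not a description of Grooper's internals.

```python
def horizontally_overlaps(a, b):
    """True when the horizontal extents of two words overlap."""
    return a["left"] < b["right"] and b["left"] < a["right"]

def stacked_candidates(words):
    """Join each word with any word on the next line that sits directly beneath it."""
    candidates = []
    for upper in words:
        for lower in words:
            if lower["line"] == upper["line"] + 1 and horizontally_overlaps(upper, lower):
                candidates.append(upper["text"] + " " + lower["text"])
    return candidates

# Hypothetical word positions for a two-line table header.
words = [
    {"text": "Qty", "line": 0, "left": 300, "right": 330},
    {"text": "Unit", "line": 0, "left": 400, "right": 440},
    {"text": "Shp.", "line": 1, "left": 298, "right": 335},
    {"text": "Price", "line": 1, "left": 398, "right": 450},
]
labels = ["Qty Shp.", "Unit Price"]
matches = [c for c in stacked_candidates(words) if c in labels]
print(matches)  # ['Qty Shp.', 'Unit Price']
```

Read line by line, the text would be "Qty Unit" followed by "Shp. Price", so neither label would match without joining the words vertically first.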


Repeat Until All Data Columns Are Configured

You will repeat the same process for each Data Column you want to collect.

  1. We want to collect data from all these columns.
  2. So, we've configured each Data Column's Header Extractor property.


Once the Header Extractor for each Data Column is configured, Grooper will "know" where our tables "start". However, all the actual data in the table is defined by its rows. How does Grooper know where each row is? We will discuss that in the next step.

For our document set, we used the following lists of column header labels:

"Item Number"

ITEM NO
ITEM #
Item Number
Part Number/Description
PART NUMBER

"Description"

ITEM DESCRIPTION
DESCRIPTION
Part Number/Description

"Quantity"

HRS / QTY
Quantity
Shipped
Qty Shp.
Qty

"Unit Price"

RATE / PRICE
UNIT PRICE
Unit Rate

"Line Total"

SUBTOTAL
TOTAL
Extended Price
Ext. Price
Ext Price
NET AMOUNT
Value
Line Total

FYI

You may have noticed Part Number/Description is present in both the "Item Number" and "Description" columns' header lists.

This can happen. Depending on a table's format, what would normally be divided up between two columns on other documents may be jammed into one. Tabular Layout has methods to account for this, using what's called "Secondary Extraction".

• For more information on Secondary Extraction, please visit the #Primary VS Secondary Extraction portion of this article.

2. Assign a Data Column's Value Extractor

This step is all about row detection.

So far, all we've done is establish column header positions on each document. But that's not where the data is. The table's data is in the rows.

As it stands, Grooper doesn't know anything about the rows in the tables. It doesn't know the size of each row. It doesn't know what kind of data is supposed to be in the rows. Maybe most importantly, it doesn't know how many rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this.

To detect rows, we need at least one Data Column's Value Extractor property configured. For each result the extractor produces below the column's header, Grooper will create one row instance.

The key thing to keep in mind is this data must be present on every row. You'll want to pick a column whose data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row.

In our case, we will choose the "Quantity" Data Column. We always expect (for the time being anyway) there to be a quantity listed for the line item on the invoice.

  1. We will use this Value Reader for our demonstration.
    • However, in the real world, the extraction world is your oyster. You'll configure an extractor to best target the data in whatever table column you're trying to extract.
  2. This is a fairly simple Pattern Match extractor designed to return numeric data (including currency).
  3. The regex is a fairly simple pattern to match generic quantities.
    • It'll match decimal values from 0 and above with two decimal places optional.
  4. We've also edited our Prefix and Suffix Patterns so that the pattern must be surrounded by a space character before and after, with an optional dollar sign before the number.
  5. As you can see, we get five results below the "Quantity" label.
    • When we assign this Value Reader to the "Quantity" Data Column, we should then get five rows when this table extracts.
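The article describes this pattern without showing it, so the following Python sketch is only an approximation. It assumes a value of one or more digits with an optional two-digit decimal part, and it emulates the Prefix and Suffix Patterns with a lookbehind and lookahead for whitespace plus an optional dollar sign. The sample line is made up.

```python
import re

# Whitespace before (lookbehind), optional "$", digits with an optional
# two-digit decimal part (captured), whitespace after (lookahead).
QUANTITY_PATTERN = re.compile(r"(?<=\s)\$?(\d+(?:\.\d{2})?)(?=\s)")

# Hypothetical line item: item number, description, quantity, unit price, total.
line = "AB-100 Widget 3 $12.50 37.50 "
print(QUANTITY_PATTERN.findall(line))  # ['3', '12.50', '37.50']
```

Note that the generic pattern returns the unit price and line total as well as the quantity, and skips "100" because it is preceded by a hyphen rather than whitespace. A pattern this generic only becomes useful for row detection once results are filtered by their position relative to the column header.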


We do get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data.

  1. Will this result present a problem? Will we get an extra row for its result?
    • No. That result is above the header label HRS / QTY established by the Data Column's Header Extractor.
    • The Tabular Layout method presumes rows are below column labels. Any and all results above the first instance of the column's headers will be ignored.
  2. What about these matching results on the same line? Will the extra results create additional row instances?
    • No. These results are misaligned with the "Quantity" Data Column's header. They are too far to the right to be considered under the column header. They will be ignored.
    • Only results aligned with the "Quantity" Data Column's header will create a row instance.
  3. What about these results? Will they produce a row?
    • No. These results are also misaligned with the "Quantity" Data Column's header.
    • That said, if these were aligned with the "Quantity" Data Column's header, they would produce row instances.
    • When you are building your own Data Column extractors, pay close attention to results below the column's header. They have the most potential to produce false positive results, producing erroneous rows.
      • That said, there are a multitude of ways to avoid false positive row results when using Data Columns' Value Extractors to detect rows. We will discuss this more in the #Advanced Setup Considerations portion of this article.
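The two filtering rules above (results must be below the header and aligned with it) can be sketched in Python. The `Box` class, the overlap test used for "aligned", and the coordinates are assumptions for illustration; Grooper's actual alignment logic is not documented here.

```python
from dataclasses import dataclass

@dataclass
class Box:
    text: str
    left: float
    right: float
    top: float  # distance from the top of the page; larger values are lower down

def detect_rows(header, hits):
    """Keep only extractor hits that are below the header and horizontally aligned with it."""
    rows = []
    for hit in hits:
        below = hit.top > header.top
        aligned = hit.left < header.right and header.left < hit.right
        if below and aligned:
            rows.append(hit)
    return sorted(rows, key=lambda h: h.top)

header = Box("HRS / QTY", 300, 380, 200)
hits = [
    Box("2024", 320, 360, 50),    # above the header: ignored
    Box("1.00", 330, 360, 230),   # below and aligned: first row
    Box("95.00", 480, 530, 230),  # same line but under another column: ignored
    Box("4.00", 330, 360, 260),   # below and aligned: second row
]
print([h.text for h in detect_rows(header, hits)])  # ['1.00', '4.00']
```

Each surviving hit corresponds to one row instance, which is why a stray aligned number below the header would produce an erroneous row.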


With our extractor ready to go, all we need to do is assign it to the "Quantity" Data Column using its Value Extractor property.

  1. Select the Data Column you wish to configure.
    • In our case, we want to configure the "Quantity" Data Column.
  2. Configure the Value Extractor property.
    • In our case, we've referenced our Value Reader designed to return generic numeric values.


FYI

At bare minimum you must configure at least one Data Column's Value Extractor to perform row detection.

However, multiple columns may be used to perform row detection by configuring their corresponding Data Columns' Value Extractor properties. For more information on using multiple columns in row detection (as well as row detection in general) please visit the #Advanced Row Detection section of this article.


So far, we have:

  1. Established column headers by configuring each Data Column's Header Extractor (or alternatively the Data Table's Header Row Extractor).
  2. Configured at least one Data Column's Value Extractor.

For fairly simple table structures, we now have the two things the Tabular Layout method needs to extract data. Now, all we need to do is tell the Data Table object we want to use the Tabular Layout method. We do this by setting its Extract Method property to Tabular Layout.

3. Set Extract Method to Tabular Layout

A Data Table's extraction method is set using the Extract Method property. To enable the Tabular Layout method, do the following.

  1. Select a Data Table object in your Data Model.
    • Here, we've selected the "Line Items" Data Table.
  2. Select the Extract Method property.
  3. Using the dropdown menu, select Tabular Layout.


4. Test

Now, let's test out what we have and see what we get!

  1. For the selected document folder in the "Batch Viewer" window...
  2. Press the "Test Extraction" button.
  3. The results show up in the "Data Element Preview" window.


So, how was Grooper able to do this? For the Tabular Layout method, the Data Table is populated using primarily two pieces of information: column header locations established by the Data Columns' Header Extractors and row locations detected by a Data Column's Value Extractor.

  • Remember, we configured Header Extractors for all Data Columns. We configured only the "Quantity" Data Column's Value Extractor.

First, it's all about establishing column headers.

  1. The Data Columns' Header Extractors established the column locations for each column.
  2. Grooper then determines the width of these columns.
    • If table lines are present, Grooper can detect those line locations via a Line Detection (or Line Removal) IP Command. Grooper will "snap" the column's width to the detected line boundaries, expanding the cell's width (and height) to the boundaries around it.
      • Table lines give human readers an indicator of where the data "lives" (or is contained). If it's in the box, it belongs to the column. If it's out of the box, it belongs to a different column.
    • If table lines are not present (as is the case for this document), Grooper performs a variety of gutter-detection operations, analyzing the whitespace between columns to determine their widths.
      • Most commonly Grooper will average the distance between one header label and the next.
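As a simplified illustration of the no-lines case, the sketch below places each column boundary halfway between the right edge of one header label and the left edge of the next. This midpoint rule is one reading of "averaging the distance" and is an assumption; Grooper's gutter detection involves more than this. All names and coordinates are hypothetical.

```python
def column_bounds(headers, page_left, page_right):
    """Estimate (left, right) bounds per column from header label positions.

    headers: list of (name, left, right) tuples in page coordinates.
    """
    ordered = sorted(headers, key=lambda h: h[1])
    bounds = {}
    left_edge = page_left
    for i, (name, left, right) in enumerate(ordered):
        if i + 1 < len(ordered):
            next_left = ordered[i + 1][1]
            # Boundary sits midway across the gap between adjacent labels.
            right_edge = (right + next_left) / 2
        else:
            right_edge = page_right
        bounds[name] = (left_edge, right_edge)
        left_edge = right_edge
    return bounds

headers = [("Description", 50, 150), ("Quantity", 300, 380), ("Unit Price", 420, 500)]
print(column_bounds(headers, 0, 600))
# {'Description': (0, 225.0), 'Quantity': (225.0, 400.0), 'Unit Price': (400.0, 600)}
```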


Second, it's all about detecting rows. Rows are detected using a Data Column's Value Extractor.

  • In our case, we configured the "Quantity" Data Column's Value Extractor.
  • FYI: When a Data Column's extractor is used to detect rows, it is considered "Primary Extraction". A Data Column's extractor can also be used for "Secondary Extraction", performed after rows are detected. For more on this, please visit the #Primary VS Secondary Extraction section of this article.
  1. Rows are only detected below the detecting Data Column's header.
  2. Grooper runs the detecting Data Column's Value Extractor, looking for matching results aligned below the column header.
  3. For each result returned, Grooper establishes one row instance.
    • Since our extractor was designed to return decimal values, and Grooper found five decimal values below our column header, Grooper detected five rows.


The Tabular Layout method now has the two pieces of information it needs to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like. Grooper can infer the table's grid-like structure using the column and row positions.

  1. It has column instances for each Data Column.
    • Again, established by each Data Column's Header Extractor.
  2. It has row instances for each detected row.
    • Again, established by the detecting Data Column's Value Extractor.
      • FYI: More than one Data Column can be used to detect rows. Please visit the #Advanced Row Detection section for more information.


With these column and row instances established, Grooper can form data instances for each cell of the table.

  1. Each cell's data simply lies where the columns and rows intersect.
    • For Data Columns with their Value Extractors configured, values are collected using either "Primary" or "Secondary Extraction". Please see the #Primary VS Secondary Extraction portion for more information.
    • For Data Columns without their Value Extractors configured, values are collected by returning the OCR or native text data within the geometric boundaries of the cell.
      • This is extremely beneficial for data that is difficult to extract using pattern matching.
      • For example, invoice item numbers and descriptions are notoriously difficult to pattern match. By using something in the table that is easy to pattern match, like our item quantities, we can use Tabular Layout to model the table structure and collect the other column values that are not.
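The "intersection" idea can be sketched as follows. This is a simplified model, assuming each word is reduced to a center point and that column and row bounds are already known; the function name, data layout, and sample values are hypothetical.

```python
def cell_text(words, column, row):
    """Join the text of all words whose center falls inside the column/row intersection."""
    col_left, col_right = column
    row_top, row_bottom = row
    inside = [
        w for w in words
        if col_left <= w["x"] < col_right and row_top <= w["y"] < row_bottom
    ]
    # Reading order: top to bottom, then left to right.
    inside.sort(key=lambda w: (w["y"], w["x"]))
    return " ".join(w["text"] for w in inside)

# Hypothetical OCR words with center coordinates.
words = [
    {"text": "Blue", "x": 60, "y": 235},
    {"text": "Widget", "x": 100, "y": 235},
    {"text": "1.00", "x": 345, "y": 235},
    {"text": "Red", "x": 60, "y": 265},
    {"text": "Gadget", "x": 95, "y": 265},
    {"text": "4.00", "x": 345, "y": 265},
]
columns = {"Description": (0, 225), "Quantity": (225, 400)}
rows = [(220, 250), (250, 280)]  # (top, bottom) of each detected row

table = [
    {name: cell_text(words, bounds, row) for name, bounds in columns.items()}
    for row in rows
]
print(table)
# [{'Description': 'Blue Widget', 'Quantity': '1.00'},
#  {'Description': 'Red Gadget', 'Quantity': '4.00'}]
```

The "Description" values are collected purely by position, with no pattern written for them, which is the benefit described above.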

5. Alternative Configuration: Header Row Extractor

You may alternatively establish column headers for the entire row of header labels, using the Header Row Extractor property. Instead of configuring each Data Column's Header Extractor, you would configure an extractor to return the whole table's row of column headers and use named instances (either Named Groups or child extractors) to establish each Data Column's header.

There are two reasons using a Header Row Extractor can be beneficial:

  1. It can be a way to throw out false positive column matches.
  2. It can be a way to better take advantage of Fuzzy RegEx.

Configuring the Header Row Extractor will override all Data Columns' Header Extractors.

You should choose to either establish column headers using the Header Row Extractor or do so using each Data Column's Header Extractors.

You may find it beneficial to configure Data Column Header Extractors as the "default" configuration and use Data Element Overrides to pick and choose which Document Types you want to use the Header Row Extractor instead.

Craft the Extractor

To configure the Header Row Extractor, you will need to craft an extractor (or multiple extractors for multiple table formats). We will choose to do that first by creating a few Value Readers and Data Types.

  1. We've started creating a Value Reader to use as a Header Row Extractor for the "Fairdeal" Document Type in our Content Model.
  2. We're using a Pattern Match extractor.
    • We can easily match the header row for "Fairdeal" invoices using a simple regex pattern.
  3. Your first task will be to extract the entire row of column headers. The pattern we have here will do just that.
DESCRIPTION\t
ITEM NO\t
HRS / QTY\t
PER\t
RATE / PRICE\t
SUBTOTAL
  4. The pattern matches the whole row of column headers.


This is only step one. Next, we need some way of breaking up the result into each component column. How does Grooper know what part of the result is the label for the "Description" column or the "Quantity" column? It doesn't until you break up the result into named instances that match the names of your Data Columns in the Data Table. These named instances can either be:

  • Named Groups
  • Named Child Extractors

Assign Named Instances: Using Named Groups

When pattern matching a header row, you can do this with Named Groups.

  1. We've placed the portion of the regular expression matching the "Description" column's label in a Named Group.
    • (?<Description>DESCRIPTION)
    • The group is created just like any group by placing the regex in parentheses
      • (regex goes here)
    • The group is named by inserting the ?<> tag immediately after the opening parenthesis.
      • (?<>regex goes here)
    • The name is given by typing it between the angle brackets.
      • (?<name goes here>regex goes here)
  2. This produces a named instance capturing only the regex in the group.
    • In this case, the label for the "Description" column.
  3. The key here is that the name we gave the group, "Description", matches the name of the Data Column, "Description".
    • Since the names match, Grooper will use the Named Group's instance to establish the column header for the "Description" Data Column.
    • Effectively, the Named Group supplies the result for the Data Column's Header Extractor.
      • BE AWARE! This also means the Named Group replaces the result of a Data Column's Header Extractor. If you configure a Header Row Extractor, it will supersede any Header Extractor on any Data Column.


  4. You would then continue placing Named Groups around the remaining column headers, chunking out the regex and matching each chunk with the corresponding Data Column.

The following regex would accomplish this goal in our case.

(?<Description>DESCRIPTION)\t
(?<Item_Number>ITEM NO)\t
(?<Quantity>HRS / QTY)\t
PER\t
(?<Unit_Price>RATE / PRICE)\t
(?<Line_Total>SUBTOTAL)

Please note space characters are not allowed in Named Groups. You must replace a space character with an underscore _.

For example, to match the "Item Number" Data Column, we named the group Item_Number
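The mapping from Named Groups to Data Columns can be tried out in Python. Note that Python's `re` module spells named groups `(?P<name>...)` rather than the `(?<name>...)` form used in the pattern above. The helper `header_spans` is hypothetical, and the last group is named Line_Total here so that it maps onto the "Line Total" Data Column.

```python
import re

HEADER_ROW = re.compile(
    r"(?P<Description>DESCRIPTION)\t"
    r"(?P<Item_Number>ITEM NO)\t"
    r"(?P<Quantity>HRS / QTY)\t"
    r"PER\t"  # matched as part of the row, but not captured for any column
    r"(?P<Unit_Price>RATE / PRICE)\t"
    r"(?P<Line_Total>SUBTOTAL)"
)

def header_spans(text):
    """Map each Data Column name to the (start, end) span of its header label."""
    match = HEADER_ROW.search(text)
    if match is None:
        return {}
    # Underscores in group names stand in for spaces in Data Column names.
    return {name.replace("_", " "): match.span(name) for name in match.groupdict()}

text = "DESCRIPTION\tITEM NO\tHRS / QTY\tPER\tRATE / PRICE\tSUBTOTAL"
print(header_spans(text))
# {'Description': (0, 11), 'Item Number': (12, 19), 'Quantity': (20, 29),
#  'Unit Price': (34, 46), 'Line Total': (47, 55)}
```

One match of the whole row yields a separate named span per column, which is the role the named instances play for the Header Row Extractor.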

Assign Named Instances: Using Named Child Extractors

You may also create and use the named instances by naming a Data Type's child extractors to match the names of your Data Columns.

  1. For example, this Data Type uses the Ordered Array collation method to return the header row for our "Factura" Document Type.
  2. We still get one complete result for the header row on this invoice format.
  3. Instead of a single regex pattern, we're collating results from its child extractors.
  4. Each child extractor's name matches one of our Data Columns.
  5. Inspecting the header row's instance (by right-clicking the result in the Results list), we can see more clearly how this result's sub-instances will be supplied as each Data Column's header.


  1. In the Instance Viewer, we can select any of our sub-instances from our child extractors.
  2. This result is what will be used for the "Description" column's header.
  3. Since the name of the child extractor (and therefore also sub-instance) matches the "Description" Data Column, the result will be used in place of its Header Extractor.

Assign the Header Row Extractor

Now that we have a couple examples of header row extractors, we can assign them using Tabular Layout's Header Row Extractor property.

  1. To assign a header row extractor, select the Data Table.
  2. Expand the Tabular Layout sub-properties.
  3. Expand the Header Detection sub-properties.
  4. Using the Header Row Extractor property, configure your header row extractor.
  5. In our case, we set Header Row Extractor to Reference and pointed to one of the extractors detailed previously.


But be careful! If you choose this approach, assigning the Header Row Extractor will supplant any Header Extractor configuration for any of your Data Columns.

• If configured, the Header Row Extractor establishes column headers instead of the Data Columns' Header Extractors.

The extractor we referenced was very specifically designed with only one table format in mind. It works for invoices assigned the "Fairdeal" Document Type, but no others.

  1. If we were to test our Data Table on a different document with a different table structure, we would get no results.
  2. Because the extractor doesn't match this table format's row of column headers, it can't establish any column headers for this document.
  3. This is despite the fact these Data Columns have their Header Extractor properties configured to do so.


If you take this approach to establish column headers you will either need to:

  • Craft a single extractor that matches multiple header row formats.
  • Or, use Data Element Overrides to configure a unique Header Row Extractor for each Document Type.

Why Bother?

There are two main reasons why Header Row Extractors can be beneficial:

  1. To throw out false positive column header matches.
  2. To better match column headers with poor OCR using Fuzzy RegEx.

To Throw Out False Positives

The first reason to use a Header Row Extractor is to help eliminate false positive column header matches.

  1. Take our "Line Total" Data Column.
  2. Its Header Extractor is configured with a List Match extractor, matching a variety of possible header labels for this column.


  1. This table format uses the label SUBTOTAL for the "Line Total" column.
  2. It certainly matches the column header correctly.
  3. But it also matches an instance on this document where the same term is used to refer to something different.
    • This is a false positive match.


A row of header labels tends to be more specific (and requires more specific extraction logic).

  1. If we refer back to our Header Row Extractor for this document, we'll see there is no potential false positive.
  2. The extractor matches the label SUBTOTAL as part of the larger row of headers.
  3. Given that the extractor is now looking for that label within the larger context of a header row, our false positive is no longer returned.


This is, to be sure, a more specific and therefore more accurate extractor. However, you shouldn't assume the extra accuracy is always necessary. In this case, the false positive did not impact our table whatsoever. So while the Header Row Extractor is technically more accurate, our Data Table would have returned accurate data using the Data Column Header Extractors alone (even with the false positive match).

  • While a Header Row Extractor can eliminate false positive column header matches, you only need to go through the trouble of configuring one if those false positive matches negatively impact your data extraction.
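The difference between matching a single label and matching it in the context of its header row can be seen in a short Python sketch. The page text is made up, and the two helper functions are hypothetical stand-ins for a per-column label match and a header row match.

```python
import re

# Hypothetical page text: "SUBTOTAL" appears in the header row and again
# as the label for the invoice's summary amount.
page_text = (
    "DESCRIPTION\tITEM NO\tHRS / QTY\tPER\tRATE / PRICE\tSUBTOTAL\n"
    "Consulting\tC-01\t2.00\tHR\t50.00\t100.00\n"
    "SUBTOTAL\t100.00\n"
)

def label_hits(text):
    """Positions of every occurrence of the bare label."""
    return [m.start() for m in re.finditer(r"SUBTOTAL", text)]

def header_row_hits(text):
    """Positions of the label only where it follows the preceding column header."""
    pattern = r"RATE / PRICE\t(?P<Line_Total>SUBTOTAL)"
    return [m.start("Line_Total") for m in re.finditer(pattern, text)]

print(len(label_hits(page_text)), len(header_row_hits(page_text)))  # 2 1
```

The bare label matches twice, while the contextual pattern matches only the occurrence inside the header row.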

For Fuzzy RegEx

The other reason to use a Header Row Extractor has to do with imperfect OCR text data and Fuzzy RegEx. Fuzzy RegEx provides a way for regular expression patterns to match in Grooper when the text data doesn't strictly match the pattern. For example, the regex pattern Grooper and the character string "Gro0per" differ by just a single character. An OCR engine misreading an "o" character as a zero is not uncommon by any means, but a standard regex pattern of Grooper will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.

Using Fuzzy RegEx instead of regular regex, Grooper will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) Grooper will return it as a match.

  • FYI: "Similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. Grooper is more or less "confident" the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing. One could argue there is a difference between these two terms when Fuzzy Match Weightings come into play, but that's a whole different topic, and you may encounter Grooper users who use the terms "similarity" and "confidence" interchangeably regardless. Visit the Fuzzy RegEx article if you would like to learn more.


Let's go back to the List Match extractor for our "Line Total" Data Column's Header Extractor.

  1. This table format uses the label TOTAL for the "Line Total" column.
  2. However, it does not match the header on the document.
  3. Why not? This is due to imperfect OCR results.
    • The label TOTAL was misrecognized as TOFAL.


We can certainly get this label to match with Fuzzy RegEx, but only at a fairly low similarity.

  1. Here, we've enabled Fuzzy Matching and set the Minimum Similarity to 85%.
  2. We do get our header label returned.
  3. But it's at a confidence score of 86%.
    • This score may be too low. It's not causing a problem for this document, but it may pose issues for others.


The similarity score is so low because "TOTAL" is a relatively short word, only five characters long. Grooper's confidence in a match drops with every character swap it has to make to match the word, and in a short word each swap counts for a lot.

An entire row of headers, on the other hand, has many more characters in it. The cost of swapping a single character in the entire row of headers is much smaller, often negligible.

  1. This Value Reader is designed to match the whole header row for this invoice format.
  2. Its Fuzzy Matching property is enabled with its Minimum Similarity set to 90%.
  3. The whole header row matches at a much higher confidence score of 98%.

Disabling Data Columns for Specific Document Types

Occasionally, you will run into a situation where you want to collect a column that exists for some document formats but not for others. You will need to utilize Data Element Overrides to account for this.

For example, some of these invoices list a "unit of measure". The customer is invoiced for "1 each" of a product or "2 hours" of a service. "Each" or "hours" is the unit of measure. However, not all invoices have a column for this in their line items. You may want to collect the unit of measure if the column is present. So, you would add a "Unit" Data Column and configure its Header Extractor and, if necessary, Value Extractor properties.

But obviously, you can't collect it from documents where there is no "unit of measure" column. The "Factura" Document Type represents one such vendor that does not list a unit of measure. You would need to remove the Header Extractor in the Document Type's "Overrides" panel.


  1. Here, we've selected the "Factura" Document Type.
  2. Navigate to the "Overrides" tab to configure Data Element Overrides.
  3. Select the Data Column you wish to override.
    • In this case, since the "Unit" column does not exist for the "Factura" Document Type we are removing its Header Extractor.
  4. Change the Header Extractor property to (none).
  5. FYI It would be beneficial to turn this Data Column's Visible property to False in this case. This would not affect extraction, but it would remove the column from a data reviewer's sight.


By removing the absent column's Header Extractor, Grooper is no longer looking for a header that is not there! The table will then extract successfully.
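
If it helps to picture what a Data Element Override does, here is a minimal sketch in Python. It is only a mental model, not Grooper's API: the dictionary keys (`header_extractor`, `visible`), the extractor names, and the `effective_columns` function are all hypothetical. The idea is that each Document Type starts from the shared Data Column settings and then applies only its own changes.

```python
import copy

# Shared Data Column settings (hypothetical names, for illustration only).
base_columns = {
    "Quantity": {"header_extractor": "Quantity Header", "visible": True},
    "Unit": {"header_extractor": "Unit Header", "visible": True},
}

# Per Document Type overrides. "Factura" has no unit of measure column,
# so its "Unit" Header Extractor is removed and the column is hidden.
overrides = {
    "Factura": {"Unit": {"header_extractor": None, "visible": False}},
}


def effective_columns(document_type):
    """Return the column settings a given Document Type would end up with."""
    columns = copy.deepcopy(base_columns)  # never mutate the shared settings
    for column_name, changes in overrides.get(document_type, {}).items():
        columns[column_name].update(changes)
    return columns


print(effective_columns("Factura")["Unit"])
print(effective_columns("Fairdeal")["Unit"])
```

Document Types without an entry in `overrides` keep the shared settings untouched, which is the point of configuring the change as an override rather than on the Data Column itself.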

Tabular Layout With Label Sets

Overview

This tutorial will cover the basic configuration of the Tabular Layout method with Label Sets, using a Labeling Behavior to collect column headers. We will use invoices for our document set and collect the following data from their tables detailing line item information:

  • Item Number - The vendor's id number for the item ordered for each row.
  • Description - The description of each item ordered for each row.
  • Quantity - The number of the item ordered for each row.
  • Unit Price - The vendor's price for the item ordered for each row.
  • Line Total - The total price for the number of items ordered (In other words, the quantity ordered multiplied by the unit price)


The basic steps will be as follows:

  1. Establish column headers by collecting labels for each Data Column in the Data Table.
    • You must collect header labels for each Data Column whose data you want to collect.
    • You may optionally collect a label for the entire row of header labels by collecting a label for the Data Table.
      • It is also considered best practice to do so when using Label Sets to configure Tabular Layout.
  2. Assign a Value Extractor for at least one Data Column.
    • For example, we may expect to find a quantity for each item shipped on an invoice, regardless of the vendor. There's always a column with a "Quantity" or "QTY" or "Shipped" or some similar header.
    • Since this data is also present on every row, this will provide the information necessary to find each row in the table.
    • While you need at least one Data Column's Value Extractor configured to detect rows, multiple columns may be used to detect rows.
      • Furthermore, a Data Column's Value Extractor will either perform "Primary Extraction" to perform row detection or "Secondary Extraction" to extract data from already detected rows. We will discuss using multiple columns to detect rows and the differences between "Primary" and "Secondary Extraction" in the #Advanced Setup Considerations section of this article.


  3. Set the Data Table object's Extract Method property to Tabular Layout.
    • And configure any Tabular Layout properties as needed. We will discuss many of these properties, and why and how to use them, in the #Advanced Setup Considerations section of this article.
  4. Test to ensure the table's data is collected.


In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for all four Data Columns for each row on the document.

However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the #Advanced Setup Considerations section below.

1. Collect Column Labels

The following tutorial will presume you have general familiarity with collecting labels. See the Label Sets article for a full explanation of how to collect labels for Document Types in a Content Model.

As far as strict requirements for the Tabular Layout method go, you must at minimum establish column headers for each Data Column you wish to extract.

We'll start with the "Quantity" Data Column.

  • FYI: If the invoice lists both a "quantity ordered" and a "quantity shipped" column, we will be collecting the quantity shipped.


For this "Fairdeal" Document Type, one column header label has been collected for each of the six Data Column children of the "Line Items" Data Table.

  1. The label ITEM NO for the "Item Number" Data Column
  2. The label DESCRIPTION for the "Description" Data Column
  3. The label HRS / QTY for the "Quantity" Data Column
  4. The label PER for the "Unit" Data Column
  5. The label RATE / PRICE for the "Unit Price" Data Column
  6. The label SUBTOTAL for the "Line Total" Data Column


As far as strict requirements go for establishing column headers, you're done at this point. You would then repeat this same process for every Document Type in your Content Model.


Best Practice: Collect a Header Row Label for the Data Table

You may optionally collect a label for the entire row of column header labels (aka the "header row label"). This label is collected for the parent Data Table object's label.


  1. We've collected the label DESCRIPTION ITEM NO HRS / QTY PER RATE / PRICE SUBTOTAL for the "Line Items" Data Table.


It is considered best practice to capture a header row label for the Data Table. But if it's optional, why do it? What is the benefit of this label?

Why Bother?

There are two main reasons why collecting a header row label can be beneficial:

  1. To throw out false positive column header matches
  2. To better match column headers with poor OCR using Fuzzy RegEx.

To Throw Out False Positives

The first reason to collect a header row label is to help eliminate false positive column header matches.

  1. Take our "Line Total" Data Column's label SUBTOTAL.
  2. Without the Data Table's header row label, this label would also produce a match.
    • This is a false positive match. This is an instance on this document where the same term is used to refer to something different.
  3. With the header row label, only the actual label for the column matches.
    • Another way of putting it: The Data Column header labels will only match if they are part of the larger Data Table header row label.
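
The filtering idea above can be sketched in a few lines of Python. This is an illustration under assumptions, not Grooper's implementation: the `Box` class, the `keep_header_matches` function, and all coordinates are made up, and "part of the larger header row label" is modeled simply as "the column label's rectangle sits inside the header row's rectangle".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle in inches (hypothetical model, not a Grooper type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Box") -> bool:
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


def keep_header_matches(label_hits, header_row):
    """Keep only the column label hits that fall inside the matched header row."""
    return [hit for hit in label_hits if header_row.contains(hit)]


# Two places on the page where SUBTOTAL appears (made-up coordinates).
column_header_hit = Box(6.8, 3.0, 7.6, 3.2)  # inside the table's header row
summary_hit = Box(5.9, 8.1, 6.7, 8.3)        # near the invoice totals: a false positive

header_row = Box(0.5, 2.95, 8.0, 3.25)       # where the header row label matched

kept = keep_header_matches([column_header_hit, summary_hit], header_row)
print(kept)  # only the hit inside the header row survives
```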

For Fuzzy RegEx

The other reason to collect a header row label has to do with imperfect OCR text data and Fuzzy RegEx. Fuzzy RegEx provides a way for regular expression patterns to match in Grooper when the text data doesn't strictly match the pattern. The regex pattern Grooper and the character string "Gro0per" differ by just a single character. An OCR engine misreading an "o" character as a zero is not uncommon by any means, but a standard regex pattern of Grooper will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.

Using Fuzzy RegEx instead of regular regex, Grooper will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) Grooper will return it as a match.

  • FYI: "Similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. Grooper is more or less "confident" the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing. One could argue there is a difference between these two terms when Fuzzy Match Weightings come into play, but that's a whole different topic, and you may encounter Grooper users who use the terms "similarity" and "confidence" interchangeably regardless. Visit the Fuzzy RegEx article if you would like to learn more.


So how does this apply to the Data Table's header row label? The short answer is it provides a way to increase the accuracy of Data Column header labels by "boosting" the similarity of the label to imperfect OCR results.

  1. We're going to look at labels collected for the "Rechnung" Document Type to illustrate this.
  2. Examine the collected label for the "Line Total" Data Column.
    • Notice the label TOTAL is highlighted red. The label doesn't match the text on the document.
    • This is due to imperfect OCR results.
  3. OCR made some missteps and recognized that segment as TOFAL.
    • The second "T" in "TOTAL" was recognized as an "F" character.
    • This means "TOTAL" (the expected label) is one character's difference from "TOFAL" (the actual text data). Or, "TOFAL" is 80% similar to "TOTAL".
    • The Labeling Behavior's similarity threshold is set to 90% for this Content Model. 80% is less than 90%. So, the result is thrown out.
    • FYI: This threshold is configured when the Labeling Behavior is added, using the Behaviors property of a Content Model. The Label Similarity property is set to 90% by default, but can be adjusted at any time.


As we will see, capturing the full row of column header labels will boost the similarity, allowing the label to match without altering the Labeling Behavior's fuzzy match settings.


  1. Here, we've collected a header row label for the Data Table.
  2. Now the "Line Total" Data Column's label matches! MAGIC!


Not magic. Just math.

The Data Table's column header row label is much, much longer than a single Data Column's column header label. There are simply more characters in PO ITEM # DESCRIPTION QUANTITY UNIT PRICE TOTAL\r\nLINE # than in TOTAL (55 vs 5).

  • Where the "Line Total" Data Column's label is 80% similar to the text data (4 out of 5 characters), the "Line Items" Data Table's label, comprised of the whole row of column labels, is roughly 98% similar to the text data (54 out of 55 characters).

Utilizing a Data Table label allows you to hijack the whole row's similarity score when a single Data Column does not meet the similarity threshold.

  • If the label can be matched as part of the larger whole, its confidence score is much higher than when it is matched by itself.
  • The Data Table's larger label of the full row of column labels gives extra context to the "Line Total" Data Column label, providing more information about what is and is not an appropriate match.
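
To see the math for yourself, here is a small Python sketch. It assumes similarity is computed as one minus the edit distance divided by the label's length, which reproduces the 80% and 98% figures above. Grooper's actual Fuzzy RegEx scoring may differ (for example, when Fuzzy Match Weightings are involved), and the `edit_distance` and `similarity` functions are written here for illustration only.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(label: str, text: str) -> float:
    """Assumed scoring: share of the label's characters that did not need an edit."""
    return 1 - edit_distance(label, text) / len(label)


column_label = "TOTAL"
header_row_label = "PO ITEM # DESCRIPTION QUANTITY UNIT PRICE TOTAL\r\nLINE #"

# Simulate the OCR misread: the second "T" in TOTAL becomes an "F".
ocr_column_text = "TOFAL"
ocr_header_row_text = header_row_label.replace("PRICE TOTAL", "PRICE TOFAL")

print(len(column_label), len(header_row_label))                         # 5 55
print(f"{similarity(column_label, ocr_column_text):.0%}")               # 80%
print(f"{similarity(header_row_label, ocr_header_row_text):.0%}")       # 98%
```

The same single misread character costs 20 points of similarity on the five-character label but less than 2 points on the 55-character header row.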


So why is it considered best practice to capture a header row label for the Data Table? OCR errors are unpredictable.

The set of examples you worked with when architecting this solution may have been fairly clean with good OCR reads. Maybe it didn't seem like you needed a Data Table label at the time, but that may not always be the case. Capturing a Data Table label for the header row will act as a safety net to avoid unforeseen problems in the future.

2. Assign a Data Column's Value Extractor

This step is all about row detection.

So far, all we've done is establish column header positions on each document. But that's not where the data is. The table's data is in the rows.

As it stands, Grooper doesn't know anything about the rows in the tables. It doesn't know the size of each row. It doesn't know what kind of data is supposed to be in the rows. Maybe most importantly, it doesn't know how many rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this.


To detect rows, we need at least one Data Column's Value Extractor property configured. For each result the extractor produces below the column's header, Grooper will create one row instance.

The key thing to keep in mind is this data must be present on every row. You'll want to pick a column whose data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row.

In our case, we will choose the "Quantity" Data Column. We always expect (for the time being anyway) there to be a quantity listed for the line item on the invoice.

  1. We will use this Value Reader for our demonstration.
    • However, in the real world, the extraction world is your oyster. You'll configure an extractor to best target the data in whatever table column you're trying to extract.
  2. This is a fairly simple Pattern Match extractor designed to return numeric data (including currency).
  3. The regex is a fairly simple pattern to match generic quantities.
    • It'll match decimal values from 0 and above with two decimal places optional.
  4. We've also edited our Prefix and Suffix Patterns so that the pattern must be surrounded by a space character before and after, with an optional dollar sign before the number.
  5. As you can see, we get five results below the "Quantity" label.
    • When we assign this Value Reader to the "Quantity" Data Column, we should then get five rows when this table extracts.
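
For readers who want to experiment outside of Grooper, the extractor described above can be approximated with Python's `re` module. This is a sketch under assumptions: the exact pattern used in the tutorial is not shown in the text, so the regex below is our own reading of the description (whole or two-decimal numbers, a space on either side, an optional dollar sign in front), and the sample text is made up. The Prefix and Suffix Patterns are modeled as a lookbehind and a lookahead.

```python
import re

# Value: one or more digits, optionally followed by exactly two decimal places.
# Prefix: whitespace before, then an optional dollar sign (not part of the value).
# Suffix: whitespace after.
quantity_pattern = re.compile(r"(?<=\s)\$?(\d+(?:\.\d{2})?)(?=\s)")

# Made-up OCR text for a two-row table.
sample_text = (
    "Order 1042 shipped\n"
    " 2 Widget $15.00 $30.00 \n"
    " 10 Gasket $1.50 $15.00 \n"
)

print(quantity_pattern.findall(sample_text))
# ['1042', '2', '15.00', '30.00', '10', '1.50', '15.00']
```

Notice the pattern returns far more than just the quantities. That is expected for such a generic extractor; Tabular Layout narrows the results down using the column header's position.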


We do get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data.

  1. Will this result present a problem? Will we get an extra row for its result?
    • No. That result is above the header label HRS / QTY.
    • The Tabular Layout method presumes rows are below column labels. Any and all results above the first instance of the column's headers will be ignored.
  2. What about these matching results on the same line? Will the extra results create additional row instances?
    • No. These results are misaligned with the "Quantity" Data Column's header. They are too far to the right to be considered under the column header. They will be ignored.
    • Only results aligned with the "Quantity" Data Column's header will create a row instance.
  3. What about these results? Will they produce a row?
    • No. These results are also misaligned with the "Quantity" Data Column's header.
    • That said, if these were aligned with the "Quantity" Data Column's header, they would produce row instances.
    • When you are building your own Data Column extractors, pay close attention to results below the column's header. They have the most potential to produce false positive results, producing erroneous rows.
      • That said, there are a multitude of ways to avoid false positive row results when using Data Columns' Value Extractors to detect rows. We will discuss this more in the #Advanced Setup Considerations portion of this article.
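
The three checks above (below the header, aligned with the header, one row per surviving result) can be sketched like this. It is a simplified model, not Grooper's algorithm: the `Hit` class and `detect_rows` function are hypothetical, coordinates are made up, and "aligned" is approximated as "horizontally overlaps the header label".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One extractor result with its horizontal span and top edge (inches)."""
    text: str
    left: float
    right: float
    top: float


def detect_rows(hits, header_left, header_right, header_bottom):
    """Return one row anchor per hit that is below and aligned with the header."""
    rows = []
    for hit in hits:
        below_header = hit.top >= header_bottom
        overlaps_column = hit.left < header_right and hit.right > header_left
        if below_header and overlaps_column:
            rows.append(hit)
    return sorted(rows, key=lambda h: h.top)


hits = [
    Hit("1042", 5.1, 5.5, 1.0),   # above the header: ignored
    Hit("2", 5.2, 5.3, 3.5),      # aligned, below the header: row 1
    Hit("15.00", 6.4, 6.9, 3.5),  # same line, but too far right: ignored
    Hit("10", 5.2, 5.4, 3.8),     # row 2
    Hit("1", 5.2, 5.3, 4.1),      # row 3
]

# Made-up position of the HRS / QTY header label.
rows = detect_rows(hits, header_left=5.0, header_right=5.8, header_bottom=3.2)
print([row.text for row in rows])  # ['2', '10', '1']
```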


With our extractor ready to go, all we need to do is assign it to the "Quantity" Data Column using its Value Extractor property.

  1. Select the Data Column you wish to configure.
    • In our case, we want to configure the "Quantity" Data Column.
  2. Configure the Value Extractor property.
    • In our case, we've referenced our Value Reader designed to return generic numeric values.


FYI

At bare minimum you must configure at least one Data Column's Value Extractor to perform row detection.

However, multiple columns may be used to perform row detection by configuring their corresponding Data Columns' Value Extractor properties. For more information on using multiple columns in row detection (as well as row detection in general), please visit the #Advanced Row Detection section of this article.

So far, we have:

  1. Collected labels for the Data Columns (and optionally the header row label for the Data Table).
  2. Configured at least one Data Column's Value Extractor.

For fairly simple table structures, we now have the two things the Tabular Layout method needs to extract data. Now, all we need to do is tell the Data Table object we want to use the Tabular Layout method. We do this by setting its Extract Method property to Tabular Layout.

3. Set Extract Method to Tabular Layout

A Data Table's extraction method is set using the Extract Method property. To enable the Tabular Layout method, do the following.

  1. Select a Data Table object in your Data Model.
    • Here, we've selected the "Line Items" Data Table.
  2. Select the Extract Method property.
  3. Using the dropdown menu, select Tabular Layout.

4. Test

Now, let's test out what we have and see what we get!

  1. For the selected document folder in the "Batch Viewer" window...
  2. Press the "Test Extraction" button.
  3. The results show up in the "Data Element Preview" window.
    • Success! Our table's data is collected!


So, how was Grooper able to do this? For the Tabular Layout method, the Data Table is populated using primarily two pieces of information: column header locations established by the Data Columns' labels and row locations detected by a Data Column's Value Extractor.

  • Remember, we collected labels for all Data Columns. We configured only the "Quantity" Data Column's Value Extractor.

First, it's all about establishing column headers.

  1. The Data Columns' labels established the column locations for each column.
  2. Grooper then determines the width of these columns.
    • If table lines are present, Grooper can detect those line locations via a Line Detection (or Line Removal) IP Command. Grooper will "snap" the column's width to the detected line boundaries, expanding the cell's width (and height) to the boundaries around it.
      • Table lines give human readers an indicator of where the data "lives" (or is contained). If it's in the box, it belongs to the column. If it's out of the box, it belongs to a different column.
    • If table lines are not present (as is the case for this document), Grooper performs a variety of gutter-detection operations, analyzing the whitespace between columns to determine their widths.


Second, it's all about detecting rows. Rows are detected using a Data Column's Value Extractor.

  • In our case, we configured the "Quantity" Data Column's Value Extractor.
  • FYI: When a Data Column's extractor is used to detect rows, it is considered "Primary Extraction". A Data Column's extractor can also be used for "Secondary Extraction", performed after rows are detected. For more on this, please visit the #Primary VS Secondary Extraction section of this article.
  1. Rows are only detected below the detecting Data Column's header.
  2. Grooper runs the detecting Data Column's Value Extractor, looking for matching results aligned below the column header.
  3. For each result returned, Grooper establishes one row instance.
    • Since our extractor was designed to return decimal values, and Grooper found five decimal values below our column header, Grooper detected five rows.


The Tabular Layout method now has the two pieces of information it needs to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like. Grooper can infer the table's grid-like structure using the column and row positions.

  1. It has column instances for each Data Column.
    • Again, established by each Data Column's label.
  2. It has row instances for each detected row.
    • Again, established by the detecting Data Column's Value Extractor.
      • FYI: More than one Data Column can be used to detect rows. Please visit the #Advanced Row Detection section for more information.


With these column and row instances established, Grooper can form data instances for each cell of the table.

  1. Each cell's data simply lies where the columns and rows intersect.
    • For Data Columns with their Value Extractors configured, values are either collected using "Primary" or "Secondary Extraction". Please see the #Primary VS Secondary Extraction portion for more information.
    • For Data Columns without their Value Extractors configured, values are collected by returning the OCR or native text data within the geometric boundaries of the cell.
      • This is extremely beneficial for data that is difficult to extract using pattern matching.
      • For example, invoice item numbers and descriptions are notoriously difficult to pattern match. By using something in the table that is easy to pattern match, like our item quantities, we can use Tabular Layout to model the table structure and collect the other column values that are not.
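
Here is a rough sketch of the "cells are where columns and rows intersect" idea. It is a simplified model under assumptions, not Grooper's implementation: every word is reduced to a center point, columns are horizontal ranges, rows are vertical ranges, and all names and coordinates are made up. Words are assumed to be listed in reading order.

```python
def build_table(words, columns, rows):
    """Collect the text whose center falls inside each column/row intersection."""
    table = []
    for row_top, row_bottom in rows:
        record = {}
        for name, (col_left, col_right) in columns.items():
            cell_words = [
                text for text, x, y in words
                if col_left <= x < col_right and row_top <= y < row_bottom
            ]
            record[name] = " ".join(cell_words)
        table.append(record)
    return table


# (text, x center, y center) in inches, listed in reading order.
words = [
    ("A-100", 0.8, 3.5), ("Steel", 2.0, 3.5), ("bracket", 2.6, 3.5), ("2", 5.2, 3.5),
    ("B-220", 0.8, 3.8), ("Hex", 2.0, 3.8), ("bolt", 2.4, 3.8), ("10", 5.2, 3.8),
]

# Column spans come from the header labels; row spans come from row detection.
columns = {"Item Number": (0.5, 1.5), "Description": (1.5, 4.5), "Quantity": (4.5, 6.0)}
rows = [(3.4, 3.7), (3.7, 4.0)]

for record in build_table(words, columns, rows):
    print(record)
```

Only the "Quantity" values needed a pattern to be found. The item numbers and descriptions were collected purely by position, which is why this approach works well for data that is hard to pattern match.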

Label Padding

When collecting labels for Data Columns, the physical width of the label will help establish the width of the column. Grooper uses a variety of information on the page, such as the distance between column labels, whitespace gutters between the text in columns, and line location data stored in a page's layout data, to establish the width of a column.

However, Grooper doesn't always get things right. In these cases, you can manually adjust the width of a column using the Padding properties of the Data Column's Header label.


For example, take this line items table. Imagine we're using the "Line Total" column for row detection.

  1. If the column instance is limited to the width of label Line Total, the "Line Total" Data Column's extractor will never return a result. No text falls within the boundaries of the column.
  2. The values for the column are misaligned with the column's header.


Under normal circumstances, we simply couldn't use this column for row detection.

  1. However, using the Padding property, we can adjust the size of a Data Element's label (in this case the Data Column's Header label).
  2. This will adjust the width of the column instance, aligning the column's values within the boundaries of the column, allowing this column to be used for row detection.


  1. To adjust a label's Padding, first select the label whose width and/or height you wish to adjust.
    • We have selected the "Anfoneb" Document Type's "Line Total" Data Column's label.
  2. In our case we want to lengthen this Line Total label.
    • This will lengthen our column width, allowing the Line Total column's values to be used for row detection.


  1. Expand the Padding property.
  2. Use the Left, Right, Top, and/or Bottom properties to adjust the size of the label.
  3. We entered 0.5in for the Right padding property.
    • This extended the width of our label 0.5 inches to the right.
  4. Our line total values now fall below the "Line Total" label. The "Line Total" column can now be used for row detection.


  1. Success! Now that we adjusted the width of our "Line Total" Data Column's label, the table extracts successfully.
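
The effect of padding can be pictured with a couple of horizontal spans. This is a minimal sketch with made-up coordinates and hypothetical helper functions (`pad_right`, `overlaps`), not Grooper's geometry code. It assumes a value only counts as "under" a header if their horizontal spans overlap.

```python
def pad_right(span, inches):
    """Extend a (left, right) span to the right by the given padding."""
    left, right = span
    return (left, right + inches)


def overlaps(a, b):
    """True if two (left, right) spans share any horizontal space."""
    return a[0] < b[1] and b[0] < a[1]


label_span = (6.0, 6.9)  # the Line Total header label, in inches
value_span = (7.2, 7.8)  # a currency value sitting to the right of the label

print(overlaps(label_span, value_span))   # False: no row can be detected

padded_label = pad_right(label_span, 0.5)  # Right padding of 0.5in
print(overlaps(padded_label, value_span))  # True: the value now falls under the label
```

Note that the padded label only barely overlaps the value in this example. That is enough for row detection, as discussed below.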

FYI

You may have noticed we did not pad the label to reach the true "end" of column. Rather, the width just barely overlapped with the currency values in the column.

We were able to get away with this because we were using the column for row detection. The "Line Total" Data Column's extractor was using Primary Extraction to find these values, collect them, and detect rows all at the same time.

Were this column using Secondary Extraction to collect the column's values, we would most likely need to pad out the column header further so that it extends the full width of the column.

• For more information on row detection, please visit the #Advanced Row Detection portion of this article.
• For more information on Primary and Secondary Extraction, please visit the #Primary VS Secondary Extraction portion of this article.


Table Labels and Labelset Based Classification

Table headers are often very useful (even critical) for Labelset-Based classification, and it generally is the case you want to use them as a classification feature. Currently, if you want to use a Data Table object's labels for classification, you must set the Data Table's Minimum Row Count property to at least "1". This is a known issue in the current version of Grooper and likely will change.


In the meantime, if you find Data Table and/or Data Column labels are not included in determining document similarity during classification, do the following:

  1. Navigate to the Data Table object in the Node Tree.
  2. Expand the Row Count Range property.
  3. Select the Minimum property.
  4. Enter 1.

If you have multiple Data Table objects in your Data Model, you will need to repeat these steps for each one.


For more information on the Labelset-Based document classification method, visit the Label Sets article.

Advanced Setup Considerations

The Tabular Layout method is designed to extract tabular data even with the most basic setup described above. However, sometimes "basic" just isn't enough.

The challenging part of table extraction is the variety of forms a table can take. Columns can be in various orders. Table cells can be spaced well apart or jam-packed tight together. Sometimes data is required to be present for some table formats but it's optional on others. There's little consistency in how columns are labeled. Multiline row data can be challenging to target.

Grooper's Tabular Layout method has ways to overcome these issues, and more. For more complicated table structures, the Tabular Layout method has a robust suite of configurable properties. Understanding these properties will allow you to better extract a wider variety of tabular data.

In this section, we will discuss the following advanced setup features for Tabular Layout:

  1. #Multiline Rows
  2. #Advanced Row Detection
  3. #Primary VS Secondary Extraction
  4. #Footer Detection

For the following tutorials, you may presume the following unless otherwise told:

  • We will continue testing table extraction using the "Line Items" Data Table from the #Basic Setup instructions.
  • Column headers have already been established (either using Label Sets or Header Extractors)
  • The "Quantity" Data Column is performing row detection. Its Value Extractor has been configured as described in the #Basic Setup instructions.
  • Line location layout data has been collected for all documents.

Multiline Rows

For many documents, the data in each row of a table occupies a single line.

The table we used in our #Basic Setup instructions had single-line rows. Indeed, single-line table structures are more basic and are typically the easiest to extract.


Multiline table structures are a little trickier.

In multiline tables, the data in one or more columns can span multiple lines. For example, the "Description" column in this table spans multiple lines (four to be exact).

This can pose a challenge for table extraction, particularly for tables with unpredictable line wrapping where sometimes a row may be single-line and others may be multiline.


But, have no fear! The Tabular Layout method can easily detect most multiline table structures by enabling the Multiline Rows property.


The default Tabular Layout settings presume all rows are single-line.

  1. This "Rechnung" Document Type has a multiline table.
  2. Upon testing extraction, note only the first line for each row in the "Description" column is collected.
  3. The remaining three lines in the "Description" cells are ignored.


This is what the Multiline Rows property is for. Enabling this property will allow you to target table structures like this whose rows extend beyond just a single line on the page.

  1. To enable Multiline Rows, first expand the Tabular Layout sub-properties.
  2. Switch the Multiline Rows property to Enabled.


  1. The Tabular Layout method now appropriately detects that the rows occupy multiple lines on the document.
  2. The full line item description is now properly extracted by the Data Table.
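
One simple way to picture the difference is sketched below. This is not a description of Grooper's algorithm: it assumes, purely for illustration, that a multiline row runs from one detected "Quantity" value down to the next, and the function name, text, and coordinates are all made up.

```python
def group_lines_into_rows(lines, row_tops, table_bottom):
    """Assign every text line to the detected row whose vertical band contains it."""
    boundaries = sorted(row_tops) + [table_bottom]
    rows = []
    for top, next_top in zip(boundaries, boundaries[1:]):
        rows.append([text for y, text in lines if top <= y < next_top])
    return rows


# (y position, text) for the lines found in the Description column.
description_lines = [
    (3.0, "Hydraulic pump assembly"),
    (3.2, "Model HP-20"),
    (3.4, "Includes mounting kit"),
    (3.6, "2 year warranty"),
    (3.8, "Replacement hose"),
    (4.0, "10 ft, reinforced"),
]
quantity_tops = [3.0, 3.8]  # where row detection found a quantity

# Single-line behavior: only the line level with each detected value is kept.
single_line = [[text for y, text in description_lines if y == top]
               for top in quantity_tops]

# Multiline behavior: every line down to the next detected value is kept.
multiline = group_lines_into_rows(description_lines, quantity_tops, table_bottom=4.4)

print(single_line)
print(multiline)
```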


The Multiline Rows functionality will even detect multiline rows if the lines start on one page and continue to the next.

  1. Make sure Multiline Rows is enabled.
  2. In the subproperties of Multiline Rows, set the Detect Page Wrap property to true.

Detect Stacked Layout

There is a special variety of multiline structured tables called a "stacked layout" table. In these tables, you will find two different pieces of information stacked on top of one another in the same column.


For example, in this table, the "Item Number" and "Description" column headers are both contained within the same column, with "Item Number" stacked on top of "Description".

  • "Item Number" is highlighted in orange.
  • "Description" is highlighted in yellow.


Their corresponding values are also stacked on top of each other in each row. The item number in each row is stacked on top of the description for that item.

  • The item number values are highlighted in orange.
  • The item description values are highlighted in yellow.


In these situations, the Detect Stacked Layout property can help get the right values in the right columns with no additional extraction configuration.


With Multiline Rows enabled, you can choose to enable or disable the Detect Stacked Layout property.

  1. Detect Stacked Layout is Disabled by default.


Here, we are using the default configuration with Multiline Rows enabled.

  1. The "Envoy" Document Type is a good candidate for the Detect Stacked Layout feature.
  2. We've collected the header Item Number for the "Item Number" Data Column
  3. We've collected the header Description for the "Description" Data Column


These two header labels are stacked on top of each other, as is their data in each row.


Without Detect Stacked Layout enabled, we've got some problems.

  1. This is the normal Multiline Rows behavior.
    • Grooper determined correctly these rows spanned multiple lines. The cell is populated with all lines.
    • However, this is not what we want.
  2. For each row, the first line (and only the first line) should be in the "Item Number" column.
  3. And, the second line (and only the second line) should be in the "Description" column.


Because the "Item Number" header is stacked on top of the "Description" header, we can presume the first line belongs in the "Item Number" column and the second belongs in the "Description" column.


The Detect Stacked Layout property will put the data from the appropriate line into the appropriate column according to how the labels are stacked.

  1. To enable Detect Stacked Layout expand the Multiline Rows sub-properties.
  2. Change Detect Stacked Layout to True.


  1. Now, only the first line is collected for the "Item Number" column.
  2. And, only the second line is collected for the "Description" column.
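
The idea behind this behavior can be sketched in a few lines of Python. This is only an illustration of the concept, not Grooper's implementation. The function name `split_stacked_cell` and the sample row values are hypothetical.

```python
def split_stacked_cell(cell_lines, stacked_headers):
    """Assign each text line of a multiline cell to the column whose
    header sits at the same position in the stacked header block."""
    # zip() pairs the first header with the first line, the second
    # header with the second line, and so on. Extra lines are ignored.
    return {header: line for header, line in zip(stacked_headers, cell_lines)}


# Hypothetical row text: item number on line 1, description on line 2.
row_cell = ["AB-1001", "Widget, stainless steel"]
result = split_stacked_cell(row_cell, ["Item Number", "Description"])
print(result)
```

Because "Item Number" is the top header, it receives the top line of each row, and "Description" receives the line beneath it.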


FYI

This would have been a very good situation for Data Element Overrides. Indeed, given Tabular Layout's multitude of configuration options, most users will find themselves using multiple Document Types and Data Element Overrides to fine-tune extraction logic based on a variety of table formats.

Given that this "Envoy" Document Type is the only one that can make use of the Detect Stacked Layout functionality, we really should have made this configuration using Data Element Overrides. This will prevent unintended consequences on other Document Types where the Detect Stacked Layout feature does not provide a benefit (or impedes accurate extraction).

We should have enabled Detect Stacked Layout as an override by performing the following steps:

  1. Select the Document Type whose override you want to configure.
    • The "Envoy" Document Type in this case.
  2. Navigate to the "Overrides" tab.
  3. Select the Data Table.
    • The "Line Items" Data Table in this case.
  4. Turn the Detect Stacked Layout property to True.

Enabling Detect Stacked Layout using the "Envoy" Document Type's overrides ensures only documents classified as "Envoy" will use the configuration.
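
Conceptually, an override layers a Document Type's settings on top of the shared Data Table settings. The sketch below models that idea with plain dictionaries. The dictionary layout and the function name `effective_settings` are assumptions made for illustration and do not reflect how Grooper stores its configuration.

```python
# Shared (global) Data Table settings, as configured on the Data Table itself.
BASE_TABLE_SETTINGS = {"Multiline Rows": True, "Detect Stacked Layout": False}

# Hypothetical per-Document-Type overrides. Only "Envoy" changes anything.
OVERRIDES = {"Envoy": {"Detect Stacked Layout": True}}


def effective_settings(document_type):
    """Return the settings a document of the given type would use."""
    settings = dict(BASE_TABLE_SETTINGS)                 # start from the shared settings
    settings.update(OVERRIDES.get(document_type, {}))    # apply overrides, if any
    return settings


print(effective_settings("Envoy"))
print(effective_settings("Nama"))
```

Only the "Envoy" Document Type sees Detect Stacked Layout enabled. Every other Document Type keeps the shared default.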

Advanced Row Detection

A Data Column's Value Extractor is going to extract data in one of two ways:

  1. Primary Extraction
    • Primary Extraction is for row detection. In this case, the extractor runs at the document level, looking for potential rows beneath the Data Column's header.
  2. Secondary Extraction
    • Secondary Extraction happens after rows are detected, that is, after row instances and cell instances are formed. In this case, the extractor runs at the instance level to further parse table cell or row data.

This section is all about Primary Extraction, that is, using Data Column extractors to locate and form row instances. (We'll talk more about the differences between Primary and Secondary Extraction in the #Primary VS Secondary Extraction section.)

In the #Basic Setup section, we demonstrated a simple example of how a single Data Column's extractor detects rows. However, more complicated table structures require more complicated solutions.

In this section we will discuss:

Row Detection Using Multiple Columns

Going back to our #Basic Setup example: Why did we use the "Quantity" Data Column for row detection?

Simple enough answer: There were quantities present on every row. Plus, quantity values are a lot easier to pattern match than something like an item number or a description.

However, we could have used other columns for row detection. For example, you'd expect there to be a "Unit Price" or "Line Total" value in the rows of a line item table as well. And, currency values are about as easy to pattern match as quantity values.

You can use not just one but multiple column values to form row instances. This can be an effective way to throw out false positive rows. Using multiple columns to detect rows, you're effectively saying you need a value present in Column A and Column B to detect a row.

You can use as many columns as you need to detect rows. You can configure table extraction so that a value would need to be present in Column A and Column B and Column C and so on.

You can also configure Tabular Layout in such a way that columns can be optionally used to detect rows. You might have a situation where as long as a value is present in Column A or Column B the row should be considered valid and detected.

In either case, when using multiple columns to detect rows the Minimum Cell Count property becomes extremely important. Once you're finished with this section, please be sure to read #The Minimum Cell Count Property section of this article for more information.


  1. For example, look at our initial results for this "Nama" Document Type.
  2. As far as the Tabular Layout settings go, we've enabled Multiline Rows and that's it.
    • However, Multiline Rows is agnostic to row detection. It has nothing to do with detecting rows, only enlarging them to include wrapped lines between detected rows.
  3. Using the "Quantity" column alone for row detection, we have collected a false-positive row instance.
    • This row is not a valid row. We need to throw it out.


Why did this happen? It's because we used the "Quantity" column to detect rows.

  1. The "Quantity" Data Column's Value Extractor is a very generic extractor.
    • It will match most numeric as well as currency values.
  2. When the extractor runs within the boundaries of the "Quantity" column, it certainly matches the three numeric quantities listed in the three table rows.
  3. However, it also matches this value below the table.
    • This is the result giving us the false positive row. Because the extractor returns a value within the boundaries of the detecting column, Grooper forms a row instance.


If we use multiple columns to detect rows, we can avoid this issue.

In this table, every row has both a "Unit Price" and a "Quantity" value.

  • It's just the "Quantity" column giving us the issue on this document.
  • There is no matching false positive value in the "Unit Price" column.

If we used both columns to detect rows, we're effectively saying each row must have both a "Quantity" value and a "Unit Price" value to be considered valid.

  • Even though there is a matching "Quantity" result in the false-positive row, there is not a "Unit Price" result.
  • Therefore, if we use both columns to detect rows, the false-positive result would be thrown out.

Furthermore, for all (or certainly most) invoice table formats, we would expect both unit price values and quantity values listed for each row. Configuring two-column row detection would not only help detect rows for this table format in particular, it's likely to help detect rows from other formats as well.


  1. All we need to do is configure the "Unit Price" Data Column to perform row detection.
  2. We will configure its Value Extractor, referencing the Value Reader we saw earlier matching numeric/currency values.


  1. With both the "Quantity" and "Unit Price" Data Columns' Value Extractor properties configured, a value is required in both columns for a row to be detected.
  2. This throws out our false-positive match from earlier, when only the "Quantity" Data Column's Value Extractor was configured.
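
The "Column A and Column B" logic can be sketched as follows. This is a simplified model, not Grooper's implementation: each text line is represented as a dictionary of the columns whose extractor matched on that line, and the function name `detect_rows` and all sample values are hypothetical.

```python
def detect_rows(lines, detecting_columns):
    """Return the lines that have a match in every detecting column."""
    return [
        line for line in lines
        if all(line.get(col) is not None for col in detecting_columns)
    ]


# Hypothetical matches per text line. The last line is a stray number
# below the table that the generic "Quantity" extractor also matches.
lines = [
    {"Quantity": "2", "Unit Price": "10.00"},
    {"Quantity": "1", "Unit Price": "4.50"},
    {"Quantity": "5", "Unit Price": "1.25"},
    {"Quantity": "3", "Unit Price": None},
]

quantity_only = detect_rows(lines, ["Quantity"])
both_columns = detect_rows(lines, ["Quantity", "Unit Price"])
print(len(quantity_only), len(both_columns))
```

Using "Quantity" alone keeps the stray line as a fourth row. Requiring "Unit Price" as well leaves only the three real rows.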

The Minimum Cell Count Property

The Minimum Cell Count property is extremely important when using multiple columns to detect rows.

  1. In the Tabular Layout sub-properties, this property is located in the Row Detection sub-properties.
  2. The Minimum Cell Count property's default value is 3.
    • This means a minimum of 3 column values must be present in order to detect a row.
    • So, if you have 5 Data Columns whose Value Extractors are configured, only 3 of their values would need to be present to detect the row and form a row instance.


There is, however, a caveat if you have fewer Data Columns with configured Value Extractors than the Minimum Cell Count value.

  1. For example, we currently only have two Data Columns with configured Value Extractors.
    • The "Quantity" and "Unit Price" Data Columns.
  2. Two is less than three (the default Minimum Cell Count).
  3. But, we're still collecting table data.

Since only two Data Columns' extractors are configured, we can't actually reach the "minimum" of "3". The Tabular Layout method accounts for this and still extracts the table data, requiring a value from each of the two configured columns instead.

  • This property really comes into play once the number of Data Columns with configured Value Extractors exceeds the Minimum Cell Count value.


Next, we're going to look at the Minimum Cell Count property where the number of Data Columns with configured Value Extractors does exceed the minimum cell count value (or will eventually by the time we're done).

Correctly manipulating the Minimum Cell Count property can be critical to establishing your row detection logic.

  1. Let's look at the table from the "Factura" Document Type.
  2. This invoice should have four rows.
  3. However, as configured currently with the "Quantity" and "Unit Price" Data Columns performing row detection, we're only detecting two rows.


Furthermore, we've got another issue due to Multiline Rows being enabled.

  1. Our extended price ("Line Total") value for the first row should be "40,700.00", not "40,700.000.00".
  2. This cell is consuming the "0.00" text from what should be the second row.


All of this can be resolved with better row detection.


  1. First, let's fix the problem with the "Line Total" Data Column's value.
  2. If we configure the "Line Total" Data Column's Value Extractor, it will match the dollar amount in this row properly.
  3. Here, we've configured the Value Extractor property to reference that same Value Reader matching numeric/currency amounts.


Think about it. This is the cell text extracted for the extended price.

40,700.00
0.00

This is not a valid currency value (Or technically, it's two currency values stacked on top of each other).

This, however, is a valid currency value:

40,700.00

By configuring the "Line Total" Data Column's extractor, we've added one more rule to detect valid rows. In order for a row to be detected, all the following conditions must be met:

  • You must have a matching result in the "Quantity" column.
  • You must have a matching result in the "Unit Price" column.
  • You must have a matching result in the "Line Total" column.


  1. Now we've extracted the correct value for the "Line Total".
  2. However, we're still only returning two rows.
  3. We're going to use the Minimum Cell Count Property to fix this.


Because we now have three Data Columns whose Value Extractors are configured, we have met the minimum cell count of "3".

That means only the two rows where values from the "Quantity", "Unit Price" and "Line Total" columns are all present are being detected as valid rows.


The truth is this table structure is a little non-standard in two ways.

  • Whereas this table lists a zero dollar amount in the "Line Total" (Extended Price) column, it leaves the cell blank in the "Unit Price" column. Since there's no value there, Grooper passes it over for row detection.
  • While the shipping cost is listed in the table for this invoice, the "Quantity" (Qty Shp.) is left blank.

In both cases, one of the three column values required for detection is missing. However, in all cases two of the three values are present for each row. We can use the Minimum Cell Count property to change our detection logic a bit.


  1. We can successfully extract every row in this table by dropping the Minimum Cell Count value to 2.
    • Remember, we have three Data Columns' extractors configured, meaning three can potentially be used to detect rows.
    • With the Minimum Cell Count set to 3, all three values from all three columns must be present to detect a row.
    • By dropping it to 2, only two of the Value Extractors from configured Data Columns must return values to detect a row.
      • A row with a "Quantity" value and a "Unit Price" value would be detected.
      • A row with a "Unit Price" value and a "Line Total" value would be detected.
      • A row with a "Quantity" value and a "Line Total" value would be detected.
      • A row with a "Quantity" value, a "Unit Price" value, and a "Line Total" value would be detected.
      • A row with a "Quantity" value alone? Nope. Not a valid row. Doesn't meet the minimum of "2".
  2. With this change to our row detection logic, all four rows are collected.
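
The counting rule described in this section can be sketched as below. It is a simplified model under stated assumptions, not Grooper's code: each line is a dictionary of the columns that produced a match, the function name `is_row` is hypothetical, and most sample values are invented stand-ins for the "Factura" table (only "40,700.00" and "0.00" come from the example above).

```python
def is_row(line_matches, configured_columns, minimum_cell_count=3):
    """Decide whether one text line forms a row.

    line_matches: dict of column name -> matched text (missing key = no match).
    configured_columns: columns whose Value Extractor is configured.
    """
    hits = sum(1 for col in configured_columns if col in line_matches)
    # Per the caveat described earlier: with fewer configured columns than
    # the minimum, every configured column must match.
    needed = min(minimum_cell_count, len(configured_columns))
    return hits >= needed


COLUMNS = ["Quantity", "Unit Price", "Line Total"]

factura_lines = [
    {"Quantity": "10", "Unit Price": "4,070.00", "Line Total": "40,700.00"},
    {"Quantity": "1", "Line Total": "0.00"},            # unit price left blank
    {"Quantity": "2", "Unit Price": "15.00", "Line Total": "30.00"},
    {"Unit Price": "25.00", "Line Total": "25.00"},     # shipping, quantity left blank
]

rows_at_3 = [line for line in factura_lines if is_row(line, COLUMNS, 3)]
rows_at_2 = [line for line in factura_lines if is_row(line, COLUMNS, 2)]
print(len(rows_at_3), len(rows_at_2))
```

With a minimum of 3, only the two fully populated lines qualify. Dropping the minimum to 2 picks up all four, while a line with a "Quantity" match alone would still be rejected.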


FYI

This would be another good example of when to implement Tabular Layout adjustments via Data Element Overrides, rather than using the globally extracted Data Table.

For most of our Document Types in this set, using our three Data Column extractors and a Minimum Cell Count of 3 actually works really well as far as row detection goes.

• The "Factura" Document Type doesn't fit the normal model. It works better with a Minimum Cell Count of 2.
• Therefore, the adjustment to the Minimum Cell Count should be made in the "Factura" Document Type's overrides.

Row Detection Limitations with Multiline Rows

There is one strict limitation to Grooper's row detection when you're dealing with multiline rows. In order to detect a row, ALL values must be present on the same line.

Tables with multiline rows generally exist in two flavors (or a Neapolitan combination of the two):


  1. Rows are multiline because the text within a cell wraps to the next line.


  1. Rows are multiline because the columns have a stacked layout.


Grooper handles stacked column data in multiline rows in a variety of ways. We've already seen the Multiline Rows feature's Detect Stacked Layout option earlier in this article.

  • FYI: We'll see more ways to handle data stacked within a table cell in the Secondary Extraction portion of this article.


However, you should always keep in mind the Multiline Rows feature has absolutely nothing to do with detecting rows. Grooper must detect a row first before it implements the Multiline Row logic to expand the row instance across multiple lines of text. For tables with a stacked column layout, row detection can prove challenging if you are using multiple Data Columns to detect rows using data on separate lines.

  • In order to detect a row, ALL values must be present on the same line.


For example, take this invoice line items table format with stacked columns.

In our Data Table, three of our Data Columns' extractors are performing row detection.

  1. The "Quantity" Data Column, labeled as QUANTITY here.
  2. The "Unit Price" Data Column, labeled as UNIT PRICE here.
  3. The "Line Total" Data Column, labeled as TOTAL here.


The problem, as far as row detection goes, is two of these column values are on the same line, but one is on a separate line.

  • The "Quantity" and "Line Total" values are on the first line of the row.
  • The "Unit Price" value is on the second line of the row.


Grooper will not be able to detect rows (and therefore won't collect table data) as we have Tabular Layout configured currently.


If we try to extract this table, as configured, we will get no results whatsoever (because no rows are detected).

  1. Testing extraction.
  2. We get no result.
  3. FYI: Enabling Multiline Rows has nothing to do with row detection.
  4. FYI: Enabling Detect Stacked Layout has nothing to do with row detection.
    • These properties will be helpful in modeling the row structure, but won't do anything if we're not detecting rows in the first place!


How are we going to fix this? There are two ways we could approach this problem:

  1. By adjusting the Row Detection > Minimum Cell Count property.
    • As we've seen before, when you adjust this property, such that the number is less than the number of Data Columns with configured Value Extractors, it makes Data Columns optional when it comes to row detection.
    • If we lowered this to 2, only two of our three columns would be required for row detection. The "Quantity" and "Line Total" columns' values are on the same line. Therefore, we would detect our rows.
  2. By disabling row detection for the "Unit Price" Data Column.
    • This may sound wacky, but it will be highly effective for our situation here. What's the problem here? Row detection due to a stacked column layout. Specifically, one Data Column's value is on the second line of the row (the "Unit Price" column).
    • However, we have data we can use for detection on the first line (the "Quantity" and "Line Total" columns).
    • All we have to do is tell Tabular Layout not to use the "Unit Price" column's extractor to detect rows, and we will start to collect our table data.
      • FYI: You might already be asking yourself "If we disable the column's extractor for row detection, why don't we just remove it?" That's because we are going to use it. For Secondary Extraction. After we talk about disabling a Data Column's extractor for row detection, this will lead us into a discussion about Tabular Layout's Secondary Extraction capabilities in the #Primary VS Secondary Extraction section.

Disabling Row Detection

The previous example is a good one to point out how to disable row detection for a specific Data Column (and why you'd want to in the first place).


To recap:

This table presents a problem for row detection due to its stacked column layout.

  • Our Data Table's "Quantity", "Line Total" and "Unit Price" Data Columns are configured to perform row detection.
  • The "Quantity" and "Line Total" values exist on the first line of each row.
  • Whereas, the "Unit Price" values exist on the second line of each row.

Because the values exist on different lines, Tabular Layout cannot detect the rows.


However, if we only used the "Quantity" and "Line Total" columns to detect rows, we would have no issue.

  • The "Quantity" and "Line Total" Data Columns' Value Extractor configurations would detect the rows.
  • With Multiline Rows enabled, the detected row would then be extended to capture the second line.


All we need to do is disable row detection for the "Unit Price" Data Column, using the Tabular Layout method's Column Settings properties.


Generally speaking, once you start configuring the Column Settings properties, you're doing so because you have a large number of table formats represented by a large number of Document Types. In most cases, you will adjust these properties per Document Type using Data Element Overrides.

Going forward, when adjusting the Column Settings in this tutorial, we will do so using a Document Type's overrides instead of configuring the global Data Table object.


  1. We will demonstrate by disabling row detection for the "Unit Price" Data Column on the "Daftari" Document Type.
  2. Navigate to the "Overrides" tab to override the Data Table's configuration for the selected Document Type.
  3. Select the Data Table.
  4. Expand the Tabular Layout sub-properties.
  5. Select the Column Settings property.
  6. Press the ellipsis button at the end of the property.


  1. This will bring up the Column Settings editor.
  2. The Column column lists the Data Columns in your Data Table. Select the Data Column you wish to configure.
    • In our case we want to disable row detection for the "Unit Price" Data Column.
  3. To disable row detection for the selected Data Column, change the Row Detection property to Disabled.
    • This will prevent the Data Column's Value Extractor from performing Primary Extraction, forcing it to use Secondary Extraction instead. For more on Secondary Extraction, visit the #Primary VS Secondary Extraction portion of the article.
  4. Press OK when finished.


  1. With Row Detection Disabled for the "Unit Price" Data Column in the Column Settings, Grooper can now detect rows for this table format.
  2. Grooper successfully detects the three rows present on the document.
  3. There is however an issue with the extracted data in our "Unit Price" Data Column.
  4. The entire cell's text is collected, not the unit price listed inside the cell.
    • This is at least a better problem than the one we had before.
    • Previously, we weren't getting any data for any columns in any rows.
    • Now, we at least have row instances to work with and we're getting most of our table data. Furthermore, the data we want is contained within the cell. We just need a way of extracting it.
      • With the data we want present in each cell, we can extract the data (the unit price currency listed) using Secondary Extraction.


FYI

The Column Settings > Row Detection property can be set to one of the following values:

Optional
Required
Disabled

Optional is the default setting. This means the Data Column will be used for row detection, but is not required.

• Imagine your Data Table's Row Detection > Minimum Cell Count property is set to 3 and you have 5 Data Columns whose Column Settings > Row Detection properties are set to Optional.
• If all five of those Data Columns' extractors produced results on a line, the row would be detected.
• If any two of those Data Columns' extractors failed to produce a result, but the other three did return a result, the row would still be detected.
• An optional Data Column can potentially be used for row detection, but if it fails to return a value, the row can still be detected. As long as enough other Data Columns produce results (such that the number of Data Columns returning a result meets the Minimum Cell Count value), the row will be detected.
• Refer to #The Minimum Cell Count Property section of this article for more information on how the minimum cell count affects row detection.

Required will strictly force a Data Column to be used for row detection.

• Imagine your Data Table's Row Detection > Minimum Cell Count property is set to 3 and you have 5 Data Columns: 4 whose Column Settings > Row Detection properties are set to Optional, and one ("Column A") that is set to Required.
• If all five of those Data Columns' extractors produced results on a line, the row would be detected.
• If two of the optional Data Columns fail to produce a result, but the required Data Column and the remaining two Data Columns do, the row would be detected.
• If all four of the optional Data Columns' extractors produced results, but the required Data Column's extractor did not, no row would be detected.
• The required Data Column(s) must return results in order to detect a row.

Disabled will exempt a Data Column from row detection.

• Instead of using Primary Extraction, it will use Secondary Extraction.
• We will discuss Secondary Extraction in the next section of this article.
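
The three Row Detection values can be modeled with a short sketch. It is not Grooper's implementation. The function name `row_detected` and the column names are hypothetical, and the sketch assumes a Required column's match also counts toward the Minimum Cell Count, which is consistent with the examples above.

```python
def row_detected(matched, column_settings, minimum_cell_count=3):
    """matched: set of column names whose extractor matched on the line.
    column_settings: column name -> "Optional", "Required" or "Disabled"."""
    # Disabled columns take no part in row detection.
    active = [c for c, mode in column_settings.items() if mode != "Disabled"]
    required = [c for c, mode in column_settings.items() if mode == "Required"]
    # Every Required column must match, no matter how many others do.
    if any(col not in matched for col in required):
        return False
    hits = sum(1 for col in active if col in matched)
    return hits >= min(minimum_cell_count, len(active))


settings = {
    "Column A": "Required",
    "Column B": "Optional",
    "Column C": "Optional",
    "Column D": "Optional",
    "Column E": "Optional",
}

all_five = row_detected({"Column A", "Column B", "Column C", "Column D", "Column E"}, settings)
required_plus_two = row_detected({"Column A", "Column B", "Column C"}, settings)
missing_required = row_detected({"Column B", "Column C", "Column D", "Column E"}, settings)
print(all_five, required_plus_two, missing_required)
```

The first two calls detect a row. The third does not, because "Column A" is Required and returned nothing, even though four other columns matched.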

Primary VS Secondary Extraction

Primary Extraction and Secondary Extraction refer to how a Data Column's Value Extractor extracts table data.

There are three things you need to be clear on to understand the differences between Primary Extraction and Secondary Extraction.

  1. data instances
  2. What a data instance is
  3. DATA INSTANCES


It really all boils down to data instances. The Tabular Layout method subdivides a table into data instances in a variety of ways: first into column instances, second into row instances and third into cell instances. At the end of the process, Grooper has everything it needs to collect data using these sub-instances.


For Primary Extraction, the Data Column's extractor executes within the column instance.

  • Primary Extraction is utilized for row detection, which is the process of forming row instances.


For Secondary Extraction, data is collected from the table using the instances established after rows are detected. This is done in one of the following ways:

  • The Data Column's extractor executes within the cell instance.
    • Secondary Extraction is employed to parse data within a cell, after rows are detected and the table's structure is established.
  • The entire text within the cell is collected.
    • When Secondary Extraction isn't used to parse data within a cell, Secondary Extraction can simply collect all data for the cell instance.
  • Less commonly, the Data Column's extractor executes within the whole row instance.
    • Secondary Extraction can also be configured in such a way that extraction occurs at the row-level rather than the cell-level.


Secondary Extraction is useful for further parsing table data once rows have already been detected and cell and row instances are formed.

For example, we had an issue in the previous section where rows were detected but one column's values were not extracted correctly.

Due to an issue with the table's stacked column structure, we couldn't use the "Unit Price" Data Column for row detection. So, we disabled row detection for that column in the Column Settings properties. This prevented the Data Column from performing Primary Extraction.

Instead, it is falling back on Secondary Extraction.

Secondary Extraction will attempt to execute the Data Column's Value Extractor inside the cell instance rather than the column instance. If that extractor fails to return a result, the entire text within the geometric boundaries of the cell is returned instead.

Currently, we're simply returning all the text within each cell of each row for the "Unit Price" column. This isn't what we want to collect.

However, the value we do want (the dollar amount) is fully encapsulated within the cell. We just need to extract it from the text present in the cell.


  1. The "Unit Price" Data Column does currently have its Value Extractor configured.
  2. It's using the same generic numeric/currency extractor we've been using throughout this article to match numeric values.
    • All we need to do is ensure this extractor can properly extract data from the cell.


This brings up a common issue when performing Secondary Extraction. Always be aware of the instance level you are extracting from.

  1. This is the extractor the "Unit Price" Data Column references.
  2. It certainly seems like it's matching the dollar amounts in the "Unit Price" column.
  3. However, there is an issue with this Suffix Pattern when the extractor runs Secondary Extraction in the "Unit Price" column's cells.

When run globally on the document, it would make sense to expect a space character after the number. However, when you get down to the cell instances for the "Unit Price" column, there is no space character.


The Suffix Pattern doesn't match within the cell. Instead of there being a space character present, there's just nothing. The text data terminates at the end of the number itself. When run using Secondary Extraction, this extractor fails to produce a result.


We just need to update this extractor so that it will match within the cell during Secondary Extraction.

  1. In this case, we just need to ensure our numeric regex pattern will match whenever it is followed by either a space character or the end-of-string anchor $.
    • \s|$
  2. FYI: It's very common to use the end-of-string anchor $ in your Suffix Patterns, as well as the beginning-of-string anchor ^ in your Prefix Patterns, when relying on Secondary Extraction.
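
The difference is easy to reproduce with Python's `re` module. In this sketch the Suffix Pattern is modeled as a lookahead, which is an assumption about how a suffix behaves rather than a statement about Grooper's regex engine. The `NUMBER` pattern and both sample strings are hypothetical.

```python
import re

# Hypothetical currency pattern, e.g. "12.50" or "40,700.00".
NUMBER = r"\d{1,3}(?:,\d{3})*\.\d{2}"

# Suffix modeled as a lookahead: the value must be followed by the suffix,
# but the suffix is not part of the returned value.
space_only = re.compile(rf"(?P<value>{NUMBER})(?=\s)")
space_or_end = re.compile(rf"(?P<value>{NUMBER})(?=\s|$)")

document_text = "2 12.50 25.00\n"   # document level: a space follows the unit price
cell_text = "Each\n12.50"           # cell level: the text ends right after the number

doc_match = space_only.search(document_text)
cell_match_before = space_only.search(cell_text)
cell_match_after = space_or_end.search(cell_text)

print(doc_match.group("value"))
print(cell_match_before)
print(cell_match_after.group("value"))
```

The space-only suffix finds the unit price at the document level but returns nothing inside the cell. Adding the `$` alternative lets the same pattern match when the number is the last thing in the cell's text.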


With this minor change to the extraction logic, the extractor will now properly execute within the cell whenever Secondary Extraction is performed.

  1. After testing extraction, you can see we are accurately extracting the values in each row for the "Unit Price" column.
  2. The "Unit Price" Data Column's Value Extractor runs during Secondary Extraction, executing against the cell instance after rows are detected.
    • Now that we adjusted the extractor to match within the cell instance, we get the value we want.

Secondary Extract Modes

There are three ways in which Secondary Extraction can be performed, called Secondary Extract Modes. These modes can be configured Data Column by Data Column using the Tabular Layout > Column Settings > Secondary Extract Mode settings.

  1. Cell Extract
    • For the Cell Extract mode, the Data Column's Value Extractor executes within the table cell's text contents.
    • This is useful to parse a smaller amount of data from a larger amount of data within a table cell.
    • Or, you may use an extractor to manipulate text within a cell, such as to cleanse the data using Fuzzy RegEx.
  2. Geometric
    • The Geometric mode extracts all text within the physical boundaries of the cell.
    • Data Columns with no Value Extractor configured use the Geometric mode to collect data for the cell.
    • This is useful to collect data that is difficult to pattern match.
  3. Row Extract
    • The Row Extract mode executes the Data Column's extractor against the full text of the row instance (not the cell instance).
    • This is the least common Secondary Extract Mode. Typically, this mode is used as a last resort due to atypical table structures.

Auto VS Cell Extract VS Geometric

The default value for the Secondary Extract Mode is Auto. "Auto" will attempt to use the Cell Extract mode, but will fall back on the Geometric mode as a failsafe.

  • Auto first attempts to use Cell Extract. If the Data Column's extractor returns a match within the cell, its result will be returned.
  • If the Data Column's extractor fails to return a match within the cell, Auto will use the Geometric mode. All text within the geometric boundaries of the cell will be returned.

This is exactly what happened in our previous example.


At first, the text data was returned using Geometric mode.

  1. The "Unit Price" Data Column's extractor executes against the cell.
  2. The extractor did not match anything in the cell's text data.
  3. So, Geometric mode was used, returning all text within the physical boundaries of the cell.


Then, we fixed the "Unit Price" Data Column's extractor so it would match within the cell.

  1. The "Unit Price" Data Column's extractor executes against the cell.
  2. The extractor does match text in the cell.
  3. So, Cell Extract mode was used, returning only the extracted result.
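
The Auto fallback can be sketched as a simple "try the extractor, otherwise return everything" function. This is an illustration only, not Grooper's implementation. The function name `secondary_extract_auto`, the `NUMBER` pattern and the cell text are hypothetical.

```python
import re

# Hypothetical currency pattern and cell contents.
NUMBER = r"\d{1,3}(?:,\d{3})*\.\d{2}"
cell_text = "Each\n12.50"


def secondary_extract_auto(text, pattern):
    """Try Cell Extract first, then fall back to Geometric."""
    match = re.search(pattern, text)
    if match:
        return ("Cell Extract", match.group(0))   # only the extracted result
    return ("Geometric", text)                     # all text inside the cell


# Before the fix: the suffix requires a space, which the cell does not have.
before_fix = secondary_extract_auto(cell_text, NUMBER + r"(?=\s)")
# After the fix: the suffix also accepts the end of the cell's text.
after_fix = secondary_extract_auto(cell_text, NUMBER + r"(?=\s|$)")

print(before_fix)
print(after_fix)
```

Before the fix the whole cell text comes back, just as it did in the "Unit Price" column. After the fix only the number is returned.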


You can, however, force a Data Column to only ever use either Cell Extract or Geometric by configuring the Tabular Layout > Column Settings > Secondary Extract Mode property for one or more Data Columns.

Row Extract

The Row Extract mode allows you to execute a Data Column's extractor against the row instance rather than the cell instance. There are two main reasons to do this:

  1. The table's structure is atypical and Grooper was not able to appropriately find the divisions between columns.
  2. You need to extract data that is in each row but not labeled by a column header.

In either case, it may be difficult (or even impossible) to extract the data you want out of a specific cell within a row. However, it may be possible to extract the data from the row itself.


For example, imagine we wanted to find the "Unit" column for our line item tables as well.

The unit "EA" is listed clearly for each item in the row. However, there is no column header labeling this column. There's nothing like "Unit" or "Unit of Measure" or "UOM" present to label the column.


Furthermore, because this table has line layout data, neither the "Unit Price" nor the "Line Total" columns would ever contain this value within their cell instances for any row.

  • Sometimes you can get away with using a different column's header label, even using the same one that's already been used by another Data Column. This will not be the case here.


However, the data is always present in each row, and Grooper easily detects each row in this table.

We can still extract the unit of measure from the row instance, using the Row Extract Secondary Extract Mode.
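
As a rough sketch of what row-level extraction means, the snippet below searches the full text of a row for a unit of measure instead of looking inside one cell. It is not Grooper's implementation. The function name `extract_unit_from_row`, the unit list and the sample row text are hypothetical.

```python
import re

# Hypothetical list of units to look for.
UNITS = ["each", "EA"]


def extract_unit_from_row(row_text, units=UNITS):
    """Return the first listed unit found anywhere in the row's text."""
    # \b keeps "EA" from matching inside longer words such as "Teak".
    pattern = r"\b(?:" + "|".join(re.escape(u) for u in units) + r")\b"
    match = re.search(pattern, row_text, flags=re.IGNORECASE)
    return match.group(0) if match else None


# Hypothetical row text: the unit sits in the row but under no column header.
with_unit = extract_unit_from_row("4 EA 1001 Widget 10.00 40.00")
without_unit = extract_unit_from_row("2 Teak shelf 5.00 10.00")
print(with_unit, without_unit)
```

Because the search runs against the whole row, it does not matter that no column boundary was ever established for the unit value.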


Next, we're going to configure Tabular Layout so that the "Racun" Document Type will use the Row Extract mode to extract the unit of measure value from each row in its invoices' line items tables.

  1. We have created a Value Reader to match units of measure.
  2. This is a very basic List Match extractor, matching common units like "each" or "EA".
  3. We have added a "Unit" Data Column and assigned this Value Reader as its Value Extractor.
  4. Ultimately, we will use this to extract the unit values from each row instance.


If we test extraction against our sample document, we will get everything but the "Unit" column.

  • Commonly, you will configure Secondary Extract Modes as override changes for a Document Type, which is what we're choosing to do here.
  1. We've selected the "Racun" Document Type.
  2. We've navigated to the "Overrides" tab.
  3. Testing out extraction, we have nothing populated for the "Unit" column.
    • This shouldn't be surprising. We have not established a column header for this Data Column because there is no header label to collect!
  4. We will use the 'Column Settings' properties to force the "Unit" Data Column to perform Secondary Extraction, using the Row Extract mode.


  1. Select the Data Column you wish to configure.
    • The "Unit" Data Column, in our case.
  2. To enable the Row Extract mode, change the Secondary Extract Mode to RowExtract.
  3. Press OK when finished.

FYI

You should also consider editing the Row Detection and Secondary Extract properties at this point.

Are you ever going to use this column to detect a row for the Document Type? NO

• You should set the Row Detection property to Disabled in this case.

Do you always expect to use the Row Extract mode to find the units for this Document Type? YES

• You should set the Secondary Extract property to Always in this case.


  1. With the Row Extract mode enabled, we collect unit values for the "Unit" column.
  2. The "Unit" Data Column's extractor now executes against each full row, when Secondary Extraction is performed.
  3. Click the Inspect button before moving on.


FYI

The Instance Viewer is a tool to better understand the instances created and used in table extraction.

The Instance Viewer can be extremely beneficial when configuring Secondary Extraction. Whether you're fine-tuning Cell Extract, trying to get a closer look at the Geometric text data, or trying to set up a Row Extract extractor, the Instance Viewer will be your best friend.

  1. Expand your Data Table to view the various instances created during table extraction.
  2. The first level in the hierarchy will be row instances.
  3. The next level in the hierarchy will be cell instances.
  4. The "Image View" tab will highlight the selected instance's physical location on the document.
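
The two-level hierarchy described above (row instances containing cell instances) can be pictured as a small nested data structure. The Python sketch below is purely illustrative: the class names, fields, and sample values are assumptions for this example and do not reflect Grooper's internal object model.

```python
from dataclasses import dataclass, field


@dataclass
class CellInstance:
    column: str  # name of the Data Column the cell belongs to
    text: str    # text found inside the cell


@dataclass
class RowInstance:
    text: str                                   # full text of the row
    cells: list = field(default_factory=list)   # child cell instances


def outline(rows):
    """Render rows and their cells as an indented outline, one string per line."""
    lines = []
    for number, row in enumerate(rows, start=1):
        lines.append(f"Row {number}: {row.text}")
        for cell in row.cells:
            lines.append(f"  {cell.column}: {cell.text}")
    return lines


# Made-up table content, for illustration only.
table = [
    RowInstance(
        "2 EA Widget 4.50 9.00",
        [CellInstance("Quantity", "2"), CellInstance("Line Total", "9.00")],
    ),
]
tree = outline(table)
```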

Footer Detection

A "footer" is a text label that indicates where the table stops. Some tables will have footers, and some won't. When the table does have a footer, Grooper can use this information to force-stop row detection.


For example, as our Data Table using Tabular Layout is configured currently, we've collected one row we shouldn't have for the "Sonrasc" Document Type's line items table.

  1. There are only actually two rows on this document.
  2. However, we collected three.

Why? The "Quantity", "Unit Price" and "Line Total" Data Columns are all being utilized for row detection.

  1. We have three matching results for all three of those columns on this line.
    • As far as Tabular Layout is concerned, this counts as a row.


We will fix this issue using a footer. By defining a footer for this table, we can dictate where the table should end based on static text labels on the document.

  1. For example, this phrase "THANK YOU FOR YOUR ORDER" is always found at the end of the line items table from this vendor.
    • Once we assign this phrase as the table's footer, Grooper will stop detecting rows once it reaches this point.
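
As a rough illustration of how a footer cuts off row detection, the Python sketch below discards any candidate row positioned below the footer text. It is a simplified model under stated assumptions, not Grooper's algorithm: the `Line` class, the `top` coordinate, the function name, and the sample page are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    top: float  # vertical position on the page (larger = lower); hypothetical units
    text: str


def rows_above_footer(candidate_rows, page_lines, footer_text):
    """Keep only candidate rows located above the first line containing footer_text."""
    footer_tops = [ln.top for ln in page_lines if footer_text in ln.text]
    if not footer_tops:
        return list(candidate_rows)  # no footer found, so nothing is cut off
    footer_top = min(footer_tops)
    return [row for row in candidate_rows if row.top < footer_top]


# Made-up page content: two real rows, a footer, and a look-alike line below it.
page_lines = [
    Line(100, "QTY DESCRIPTION UNIT PRICE LINE TOTAL"),
    Line(120, "2 Widget 4.50 9.00"),
    Line(140, "1 Bracket 1.25 1.25"),
    Line(160, "THANK YOU FOR YOUR ORDER"),
    Line(180, "3 payments of 5.00 15.00"),
]
candidates = [page_lines[1], page_lines[2], page_lines[4]]
kept = rows_above_footer(candidates, page_lines, "THANK YOU FOR YOUR ORDER")
```

In this sketch the look-alike line still matches as a candidate row, but it is dropped because it falls after the footer, leaving the two valid rows.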

You can define a footer in one of two ways:

  • Using an extractor, by configuring the Tabular Layout > Footer Detection property.
  • Using Label Sets, by collecting the Data Table's Footer label for one or more Document Types.

Collecting a Footer Using an Extractor

To establish a footer, using an extractor, you will configure the Footer Detection property.

  1. Select the Data Table you wish to configure.
  2. Expand the Tabular Layout sub-properties.
  3. Select the Footer Detection property.
  4. Using the dropdown list, select the extractor (Extractor Node or Value Extractor) you wish to use.
    • We're going with a List Match extractor for this tutorial.


Configure the extractor to match something at the foot of the table.

  1. Our list entry will be THANK YOU FOR YOUR ORDER
  2. This will match something on the document that's found at the end of this table (at least for this vendor).


  1. With the Footer Detection extractor configured, we will throw out the false positive row detected after our footer result.
  2. This row is after our footer. So, it is no longer detected.
  3. The two valid rows are collected accurately.

FYI

Keep in mind the Footer Detection property is a global property. It will be applied to all Document Types (unless overridden using Data Element Overrides).

Collecting a Footer Using Label Sets

Table footers can be established using Label Sets by collecting a Footer label for the Data Table.

  1. Navigate to the "Labels" tab of your Content Model.
  2. Select a sample document assigned the Document Type whose labels you want to collect.
    • Or manually assign it the Document Type, if you have not done so already.
  3. Select the Data Table in the list of Data Elements.
  4. Select the Footer tab.
  5. Collect the text label you wish to use as the footer.
    • In this case we've lassoed the text THANK YOU FOR YOUR ORDER.
  6. Don't forget to save when finished.


  1. With the Data Table's Footer label collected for this Document Type, we will throw out the false positive row detected after our footer result.
  2. This row is after our footer. So, it is no longer detected.
  3. The two valid rows are collected accurately.
  4. There is no need to configure the Footer Detection property when using Label Sets.
    • The collected Footer label effectively supplants the Footer Detection property.

FYI

The Label Set approach is, in general, a more "templated" approach. You will need to collect a Footer label for each Document Type that needs one.

Capture Footer Row VS Display Total Row

FYI: The Capture Footer Row property was introduced in version 2021.0046. Earlier minor versions do not have this property.

The Capture Footer Row property creates a row instance at the bottom of the table, using the footer to establish the row.

  • This row is ONLY for the benefit of a document reviewer. This data IS NOT actually collected as part of the table's data.


  1. First Grooper will locate the footer.
    • In this case we used a Footer label SUBTOTAL
  2. Then, Grooper will create a row instance using the footer, instead of Tabular Layout's normal row detection methods.
  3. This is now a row instance. If there is anything that can be extracted by a Data Column's extractor, it will be.
    • In our case, the "Line Total" Data Column's extractor returned the numerical value in this row.
  4. Extracted values are then displayed in the "footer row" at the bottom of the table.

Values in these footer rows may be useful for your data reviewers. Often there are column totals that can be extracted from a footer row and used to validate information in the table rows above it.


Be aware that the Capture Footer Row property is enabled by default.

  1. The Capture Footer Row is set to True by default.
  2. You will need to set this to False if you do not want to display the footer row when reviewing Tabular Layout's extraction results.


Please be aware that the Capture Footer Row feature is in some ways similar to the Display Total Row feature, but differs from it in one major way.

  • The Capture Footer Row creates a row instance that is actually extracted against, using the document's text data.
    • The data is generated during and as a part of extraction.
  • The Display Total Row displays a row, adding up numerical values collected for one or more columns.
    • The data is generated after extraction.


The Display Total Rows feature adds a row to the bottom of the table using solely a mathematical operation.

  1. No document extraction is performed to populate the row.
  2. Instead all the column values for one or more defined Total Columns are added together.
  3. The result is displayed in the "total row" at the bottom of the table.
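
The arithmetic behind a "total row" can be sketched in a few lines of Python. This is a minimal sketch assuming extracted values arrive as numeric strings. The function name, the dictionary layout, and the sample values are hypothetical and are not part of Grooper.

```python
from decimal import Decimal


def total_row(rows, total_columns):
    """Sum already-extracted values for each column named in total_columns."""
    totals = {}
    for column in total_columns:
        totals[column] = sum(
            (Decimal(row[column]) for row in rows if row.get(column)),
            Decimal("0"),
        )
    return totals


# Made-up extracted rows, for illustration only.
rows = [
    {"Quantity": "2", "Unit Price": "4.50", "Line Total": "9.00"},
    {"Quantity": "1", "Unit Price": "1.25", "Line Total": "1.25"},
]
totals = total_row(rows, ["Line Total"])
```

Note that nothing here reads the document: the total is computed only from values that extraction has already produced, which is the key difference from a captured footer row.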


The Display Total Rows feature is also useful for document reviewers. It just gets its results differently than the Capture Footer Row feature.


Capture Footer Row will supersede the Display Total Rows if both are enabled.

  1. For this Document Type the following Footer label was collected:
    • SHIP Shipping
    • The idea being the shipping value would always be listed on the last line of the table and should not be collected as part of the line items table data.
  2. The Capture Footer Row property is set to True
  3. The footer row instance is generated using the Footer Label.
  4. Extracted values are displayed in the "footer row" at the bottom of the table.
  5. However, the Display Total Row property is also set to True.


You can only have either a "footer row" or a "total row", not both.

  • If both properties are enabled (set to True), Capture Footer Row takes priority and a "footer row" will be displayed.
  • If you want to display a "total row", you must set the Capture Footer Row property to False.
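
The precedence rule above amounts to a simple decision, sketched here in Python. The function name and return strings are hypothetical and only model the behavior this section describes.

```python
def summary_row_kind(capture_footer_row, display_total_row):
    """Return which extra row is shown at the bottom of the table, if any."""
    if capture_footer_row:
        return "footer row"  # takes priority whenever it is enabled
    if display_total_row:
        return "total row"
    return None


both_enabled = summary_row_kind(True, True)
total_only = summary_row_kind(False, True)
```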


  1. When using Generate Footer Row be sure to select a Data Column...
  2. ... and set the Footer Mode property to Calculate.