2021:Tabular Layout (Table Extract Method): Difference between revisions

Revision as of 16:14, 16 May 2022

WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Tabular Layout is one of Grooper's methods of extracting table data from documents available to Data Table objects (via its Extract Method property). This method uses column header values determined by the Data Columns Header Extractor results (or labels collected for the Data Columns when a Labeling Behavior is enabled) as well as Data Column Value Extractor results to model a table's structure and return its values.

The Tabular Layout is "Label Set aware". You can configure Tabular Layout with or without labels. This article will detail both methods. For more information on Label Sets, please visit the full Label Sets article.

About

Many tables label the columns so the reader knows what the data in that column corresponds to. How do you know the unit price for an item on an invoice? Typically, that item is in a table and one of the columns of that table is labeled "Unit Price" or something similar. Once you read the labels for each column (also called "column headers"), you the reader know where the table begins (below the column headers) and can identify the data in each row (by understanding what the column headers refer to).

This is also the basic idea behind the Tabular Layout Extraction Method. It too utilizes column header labels to "read" tables on documents, or at least as step number one in modeling the table's structure. Once Grooper knows where a column is, identified by the column's header label, Grooper can extract data from each cell in each row of that column.

The Tabular Layout method can establish column header locations in one of two ways:

Using extractors
- Which are defined on the Data Columns' Header Extractor property (or alternatively on the Data Table's Header Row Extractor property)
Using Label Sets
- When a Labeling Behavior is enabled, column header locations are defined by labels collected for the Data Columns (and optionally for the Data Table)
- Effectively, the labels take the place of the Header Extractor results (or alternatively the Header Row Extractor results)

Once the column header locations are established, the next thing Grooper needs to do is figure out where each row is. Tabular data is most often dynamic data. A table on one document might have two rows. The same table on the next might have twenty. How does Grooper know where each row is?

This is done by configuring at least one Data Column's Value Extractor property (However, more than one, even all, may be configured. Depending on how complicated the table is, you may need to configure extractors for multiple columns.)

Generally, there is at least one column in a table that is always present for every row in the table. If you can use an extractor to locate that data below its corresponding column header, that gives you a way of finding each row in the table. This allows Grooper to form a "row instance" for each row. Once the row instance is established, Grooper can then collect the various cell values for the various additional columns from the row instance.

If locating column headers and locating rows using column extractors was was all that was involved in Tabular Layout, that alone would make it a powerful tabular extraction method. What makes the Tabular Layout method even more powerful is its further configurability. Is every row in the table a single line or are the rows "multiline"? Do you need more fine-tuned data extraction from a cell's value or the row itself once the row instance is detected? Do you need to establish a table "footer" to limit the number of rows extracted? We will address these issues and more in the #Advanced Setup Considerations section of this article.

FYI

If your familiar with the Header-Value table extraction method, you should see some similarities between it and the Tabular Layout method. Indeed both methods utilize column headers and Data Column Value Extractors to collect table data.

Tabular Layout should be seen as an improvement on Header-Value for the following reasons:

Tabular Layout is Label Set aware.
Tabular Layout is typically less involved to set up.
Tabular Layout has more configuration options, giving it a better capability to extract data from a large set of disparate table structures (Usually executed through Data Element Overrides).

Basic Setup

Tabular Layout can be configured with or without the use of Label Sets. In either case, the basic setup is the same:

Establish column headers for each Data Column.
Detect row instances by assigning at least one Data Column's Value Extractor.
Set the Data Table's Extract Method property to Tabular Layout.
Test extraction and configure further as necessary.

With Label Sets or without, the setup is extremely similar. On top of that, there's nothing about using Label Sets that alters Tabular Layout's extraction logic. Grooper uses the same logic to model the table's structure and collect data for each cell. The biggest difference is how column headers are determined in step #1.

Without Label Sets, column headers are established using extractors, defined using the Data Columns' Header Extractor property (or alternatively using the Data Table's Header Row Extractor property)
With Label Sets, column headers are established using labels, defined when collecting labels for each Document Type. The Data Columns' labels effectively take the place of the Header Extractor property's results.

Tabular Layout Without Label Sets

The Tabular Layout Method1. Configure Header Extractors2. Assign a Data Column's Value Extractor3. Set Extract Method to Tabular Layout and 4. TestAlternative Configuration: Header Row Extractor

The Tabular Layout Method

This tutorial will cover the basic configuration of the Tabular Layout method without Label Sets, using extractors to collect column headers instead. We will use invoices for our document set and collect the following data from their tables detailing line item information:

Item Number - The vendor's id number for the item ordered for each row.
Description - The description of each item ordered for each row.
Quantity - The number of the item ordered for each row.
Unit Price - The vendor's price for the item ordered for each row.
Line Total - The total price for the number of items ordered (In other words, the quantity ordered multiplied by the unit price)

The basic steps will be as follows:

Establish column headers by configuring the Header Extractor property of each Data Column in the Data Table.
- You must configure header extractors for each Data Column whose data you want to collect.
- Alternatively, you may configure a Header Row Extractor set on the Data Table (This property is found in the Tabular Layout sub-properties).
Assign a Value Extractor for at least one Data Column.
- For example, we may expect to find a quantity for each item shipped on an invoice, regardless of the vendor. There's always a column with a "Quantity" or "QTY" or "Shipped" or some similar header.
- Since this data is also present on every row, this will provide the information necessary to find each row in the table.
- While you need at least one Data Column's Value Extractor configured to detect rows, multiple columns may be used to detect rows.
  - Furthermore, a Data Column's Value Extractor will either perform "Primary Extraction" to perform row detection or "Secondary Extraction" to extract data from already detected rows. We will discus using multiple columns to detect rows and the differences between "Primary" and "Secondary Extraction" in the #Advanced Setup Considerations section of this article.
Set the Data Table object's Extract Method property to Tabular Layout.
- And configure any Tabular Layout properties as needed. We will discuss many of these properties, why and how to to use them in the #Advanced Setup Considerations section of this article.
Test to ensure the table's data is collected.

In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for all four Data Columns for each row on the document.

However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the #Advanced Setup Considerations section below.

1. Configure Header Extractors

As far as strict requirements go for the Tabular Layout method goes, you must at minimum establish column headers for each Data Column you wish to extract.

We'll start with the "Quantity" Data Column.

FYI: If the invoice lists both a "quantity ordered" and a "quantity shipped" column, we will be collecting the quantity shipped.

Select the Data Column. Select the Header Extractor property. Here you will set an extractor to locate the column header on the document for the selected Data Column. Using the dropdown selector, select the Extractor Type you wish to configure to return the column header. You can use whatever Extractor Type you want to get the job done. You may select Reference to reference a Data Type or Value Reader extractor object you've configured already. Or, you can select one of the other Extractor Types to configure extraction locally. We're going to select List Match.
The List Match extractor is well suited for our purposes here. Ultimately, we will enter a list of various ways a "Quantity" column can be labeled. For example, this document labels quantities of each item ordered as "HRS / QTY" So, we've added `HRS / QTY` to the Local Entries list. Other documents use the label "Quantity" or "Shipped". So, we've added `Quantity` and `Shipped` to the list as well. You would then continue adding variations to the list until all variations of the "Quantity" column's header labels are extracted for every variation of the table. Or more generally, until a result for the column header is extracted using whatever Extractor Type you've chosen to configure.
Pro Tip: Stacked Labels You will often find "stacked labels" in tables. These are multi-word labels broken up across multiple lines in the table's header. For example, this document's "Quantity" column uses "Qty Shp." for its label. This is a stacked label, with "Qty" on one line and "Shp." on another. We can add "Qty Ship." to our list of header labels. However, we will not get a result returned for the document.
We can easily resolve this by enabling the Vertical Wrap feature. This feature is only available to the List Match extractor. This is one of the reasons why List Match is so useful for extracting column headers. To enable Vertical Wrap: Switch to the "Properties" tab. Change the Vertical Wrap property to Enabled. With Vertical Wrap enabled, the extractor is able to match and return items in the list that wrap vertically on multiple lines. In our case, our stacked label "Qty Shp." is now returned.
Repeat Until All Data Columns Are Configured You will repeat the same process for each Data Column you want to collect. We want to collect data from all these columns. So, we've configured each Data Column's Header Extractor property. Once the Header Extractor for each Data Column is configured, Grooper will "know" where our tables "start". However, all the actual data in the table is defined by its rows. How does Grooper know where each row is? We will discuss that in the next tab.

For our document set, we used the following lists of header column labels:

"Item Number"

ITEM NO
ITEM #
Item Number
Part Number/Description

"Description"

DESCRIPTION
Part Number/Description

"Quantity"

HRS / QTY
Quantity
Shipped
Qty Shp.

"Unit Price"

RATE / PRICE
UNIT PRICE

"Line Total"

SUBTOTAL
TOTAL
Extended Price
Ext. Price
Ext Price
NET AMOUNT

FYI

You may have noticed Part Number/Description is present in both the "Item Number" and "Description" columns' header lists.

This can happen. Depending on a table's format, what would normally be divided up between two columns on other documents may be jammed into one. There are methods to account for this using Tabular Layout. We will address this in the #SECTION NAME TBD section of this article.

2. Assign a Data Column's Value Extractor

This step is all about row detection.

So far all we've done is established header column positions on each document. But, that's not where the data is. The table's data is in the rows.

As it stands, Grooper doesn't know anything about the rows in the tables. It doesn't know the size of each row. It doesn't know what kind of data is supposed to be in the rows. Maybe most importantly, it doesn't know how many rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this.

To detect rows, we need at least one Data Column's Value Extractor property configured. For each result the extractor produces below the column's header, Grooper will create one row instance.

The key thing to keep in mind is this data must be present on every row. You'll want to pick a column whos data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row.

In our case, we will choose the "Quantity" Data Column. We always expect (for the time being anyway) there to be a quantity listed for the line item on the invoice.

We will use this Value Reader for our demonstration.
- However, in the real world, the extraction world is your oyster. You'll configure an extractor to best target the data in whatever table column you're trying to extract.
This is a fairly simple Pattern Match extractor designed to return numeric data (including currency).
The regex is a fairly simple pattern to match generic quantities.
- It'll match decimal values from 0 and above with two decimal places optional.
We've also edited our Prefix and Suffix Patterns so that the pattern must be surrounded by a space character before and after, with an optional dollar sign before the number.
As you can see, we get five results below the "Quantity" label.
- When we assign this Value Reader to the "Quantity" Data Column, we should then get five rows when this table extracts.

We do get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data.

Will this result present a problem? Will we get an extra row for its result?
- No. That result is above the header label HRS / QTY established by the Data Column's Header Extractor.
- The Tabular Layout method presumes rows are below column labels. Any and all results above the first instance of the column's headers will be ignored.
What about these matching results on the same line? Will the extra results create additional row instances?
- No. These results are misaligned with the "Quantity" Data Column's header. They are too far to the right to be considered under the column header. They will be ignored.
- Only results aligned with the "Quantity" Data Column's header will create a row instance.
What about these results? Will they produce a row?
- No. These results are also misaligned with the "Quantity" Data Column's header.
- That said, if these were aligned with the "Quantity" Data Column's header, they would produce row instances.
- When you are building your own Data Column extractors, pay close attention to results below the column's header. They have the most potential to produce false positive results, producing erroneous rows.
  - That said, there are a multitude of ways to avoid false positive row results when using Data Columns' Value Extractors to detect rows. We will discuss this more in the #Advanced Setup Considerations portion of this article.

With our extractor ready to go, all we need to do is assign it to the "Quantity" Data Column using its Value Extractor property.

Select the Data Column you wish to configure.
- In our case, we want to configure the "Quantity" Data Column.
Configure the Value Extractor property.
- In our case, we've referenced our Value Reader designed to return generic numeric values.

FYI

At bare minimum you must configure at least one Data Column's Value Extractor to perform row detection.

However, multiple columns may be used to perform row detection by configuring their corresponding Data Columns Value Extractor properties. For more information on using multiple columns in row detection (as well as row detection in general) please visit the #Advanced Row Detection section of this article.

So far, we have:

Collected labels for the Data Column labels (and optionally the whole row of column labels for the Data Table)
Configured at least one Data Column with its Value Extractor configured.

For fairly simple table structures, we now have the two things the Tabular Layout method needs to extract data. Now, all we need to do is tell the Data Table object we want to use the Tabular Layout method. We do this by setting its Extract Method property to Tabular Layout.

3. Set Extract Method to Tabular Layout A Data Table's extraction method is set using the Extract Method property. To enable the Tabular Layout method, do the following. Select a Data Table object in your Data Model. Here, we've selected the "Line Items" Data Table. Select the Extract Method property. Using the dropdown menu, select Tabular Layout
4. Test Now, let's test out what we have and see what we get! For the selected document folder in the "Batch Viewer" window... Press the "Test Extraction" button. The results show up in the "Data Element Preview" window. Success! Our table's data is collected!
So, how was Grooper able to do this? For the Tabular Layout method, the Data Table is populated using primarily two pieces of information: column header locations established by the Data Columns' Header Extractors and rows locations detected by a Data Column's Value Extractor. Remember, we configured Header Extractors for all Data Columns. We configured only the "Quantity" Data Column's Value Extractor'. First, it's all about establishing column headers. The Data Columns' Header Extractors established the column locations for each column. Grooper then determines the width of these columns. If table lines are present, Grooper can detect those line locations via a Line Detection (or Line Removal) IP Command. Grooper will "snap" the column's width to the detected line boundaries, expanding the cell's width (and height) to the boundaries around it. Table lines give human readers an indicator of where the data "lives" (or is contained). If it's in the box, it belongs to the column. If it's out of the box, it belongs to a different column. If table lines are not present (as is the case for this document), Grooper will average the distance between one header label and the next.
Second, it's all about detecting rows. Rows are detected using a Data Column's Value Extractor. In our case, we configured the "Quantity" Data Column's Value Extractor. FYI: When a Data Column's extractor is used to detect rows, it is considered "Primary Extraction". A Data Column's extractor can also be used for "Secondary Extraction", performed after rows are detected. For more on this, please visit the #Primary VS Secondary Extraction section of this article. Rows are only detected below the detecting Data Column's header. Grooper runs the detecting Data Column's Value Extractor, looking for matching results aligned below the column header. For each result returned, Grooper establishes one row instance. Since our extractor was designed to return decimal values, and Grooper found five decimal values below our column header, Grooper detected five rows.
The Tabular Layout method now has the two pieces of information it needs to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like. It has column instances for each Data Column. Again, established by each Data Column's Header Extractor'. It has row instances for each detected row. Again, established by the detecting Data Column's Value Extractor. FYI: More than one Data Column can be used to detect rows. Please visit the #Advanced Row Detection section for more information.
With these column and row instances established, Grooper can form data instances for each cell of the table. Each cell's data simply lays where the columns and rows intersect. For Data Columns with their Value Extractors configured, values are either collected using "Primary" or "Secondary Extraction". Please see the #Primary VS Secondary Extraction portion for more information. For Data Columns without their Value Extractors configured, values are collected by returning the OCR or native text data within the geometric boundaries of the cell.

Alternative Configuration: Header Row Extractor

You may alternatively establish column headers for the entire row of header labels, using the Header Row Extractor property. Instead of configuring each Data Column's Header Extractor, you would configure an extractor to return the whole table's row of column headers and use named instances (either Named Groups or child extractors) to establish each Data Column's header.

There are two reasons using a Header Row Extractor can be beneficial:

It can be a way to throw out false positive column matches.
It can be a way to better take advantage of Fuzzy RegEx.

⚠

Configuring the Header Row Extractor will override all Data Columns Header Extractors.

You should choose to either establish column headers using the Header Row Extractor or do so using each Data Column's Header Extractors.

You may find it beneficial to configure Data Column Header Extractors as the "default" configuration and use Data Element Overrides to pick and choose which Document Types you want to use the Header Row Extractor instead.

Craft the Extractor

To configure the Header Row Extractor, you will need to craft an extractor (or multiple extractors for multiple table formats). We will choose to do that first by creating a few Value Reader and Data Types.

We've started creating a Value Reader to use as a Header Row Extractor for the "Fairdeal" Document Type in our Content Model.
We're using a Pattern Match Extractor Type.
- We can easily match the header row for "Fairdeal" invoices using a simple regex pattern.
Your first task will be to extract the entire row of column headers. The pattern we have here will do just that.

DESCRIPTION\t
ITEM NO\t
HRS / QTY\t
PER\t
RATE / PRICE\t
SUBTOTAL

The pattern matches the whole row of column headers.

This is only step one. Next, we need some way of breaking up the result into each component column. How does Grooper know what part of the result is the label for the "Description" column or the "Quantity" column? It doesn't until you break up the result into named instances that match the names of your Data Columns in the Data Table. These named instances can either be:

Named Groups
Named Child Extractors

Assign Named Instances: Using Named Groups

When pattern matching a header row, you can do this with Named Groups.

We've placed the portion of the regular expression matching the "Description" column's label in a Named Group.
- (?<Description>DESCRIPTION)
- The group is created just like any group by placing the regex in parenthesis
  - (regex goes here)
- The group is named by inserting the ?<> tag.
  - ?<>(regex goes here)
- The name is given by typing it between the angle brackets.
  - (?<name goes here>regex goes here)
This produces a named instance capturing only the regex in the group.
- In this case, the label for the "Description" column.
The key here is that the name we gave the group, "Description", matches the name of the Data Column, "Description".
- Since the names match, Grooper will use the Named Group's instance to establish the column header for the "Description" Data Column.
- Effectively, the Named Group supplies the result for the Data Column's Header Extractor.
  - BE AWARE! This also means the Named Group replaces the result of a Data Column's Header Extractor. If you configure a Header Row Extractor, it will supersede any Header Extractor on any Data Column.

You would then continue placing Named Groups around the remaining column headers, chunking out the regex and matching each chunk with the corresponding Data Column.

The following regex would accomplish this goal in our case.

(?<Description>DESCRIPTION)\t
(?<Item_Number>ITEM NO)\t
(?<Quantity>HRS / QTY)\t
PER\t
(?<Unit_Price>RATE / PRICE)\t
(?<Subtotal>SUBTOTAL)

⚠	Please note space characters are not allowed in Named Groups. You must replace a space character with an underscore `_`. For example, to match the "Item Number" Data Column, we named the group `Item_Number`

Assign Named Instances: Using Named Child Extractors

You may also create and use the named instances by naming a Data Type's child extractors to match the names of your Data Columns.

For example, this Data Type uses the Ordered Array collation method to return the header row for our "Factura" Document Type.
We still get one complete result for the header row on this invoice format.
Instead of a single regex pattern, we're collating results from its child extractors.
Each child extractor's name matches one of our Data Columns.
Inspecting the header row's instance (by right-clicking the result in the Results list), we can see more clearly how these results sub-instances will be supplied as each Data Column's header.

In the Instance Viewer, we can select any of our sub-instances from our child extractors.
This result is what will be used for the "Description" column's header.
Since the name of the child extractor (and therefore also sub-instance) matches the "Description" Data Column, the result will be used in place of its Header Extractor.

Assign the Header Row Extractor

Now that we have a couple examples of header row extractors, we can assign them using Tabular Layout's Header Row Extractor property.

To assign a header row extractor, select the Data Table.
Expand the Tabular Layout sub-properties.
Expand the Header Detection sub-properties.
Using the Header Row Extractor property, configure your header row extractor.
In our case, we set Header Row Extractor to Reference and pointed to one of the extractors detailed previously.
When the Data Table extracts, the Tabular Layout method will use the Header Row Extractor's named instances to establish each Data Column's header locations.

⚠

But be careful! If you choose this approach, assigning the Header Row Extractor will supplant any Header Extractor configuration for any of your Data Columns.

• If configured, the Header Row Extractor establishes column headers instead of multiple Data Columns Header Extractors.

The extractor we referenced was very specifically designed with only one table format in mind. It works for invoices assigned the "Fairdeal" Document Type, but no others.

If we were to test our Data Table on a different document with a different table structure, we would get no results.
Because the extractor doesn't match this table format's row of column headers, it can't establish any column headers for this document.
This is despite the fact these Data Columns have their Header Extractor properties configured to do so.

If you take this approach to establish column headers you will either need to:

Craft a single extractor that matches multiple row header formats.
Or, use Data Element Overrides to configure a unique Header Row Extractor for each Document Type.

Why Bother?

There are two main reasons why Header Row Extractors can be beneficial:

To throw out false positive column header matches
To better match column headers with poor OCR using Fuzzy RegEx.

To Throw Out False Positives

The first reason to use a Header Row Extractor is to help eliminate false positive column header matches.

Take our "Line Total" Data Column.
Its Header Extractor is configured with List Match extractor, matching a variety of possible header labels for this column

This table format uses the label SUBTOTAL for the "Line Total" column.
It certainly matches the column header correctly.
But it also matches an instance on this document where the same term is used to refer to something different.
- This is a false positive match.

A row of header labels tends to be more specific (and requires more specific extraction logic).

If we refer back to our Header Row Extractor for this document, we'll see there is no potential false positive.
The extractor matches the label SUBTOTAL as part of the larger row of headers.
Given that the extractor is now looking for that label within the larger context of a header row, our false positive is no longer returned.

This is to be sure a more specific, and therefore more accurate extractor. However, you shouldn't always assume more accurate is necessarily "necessary". In this case, the false positive did not impact our table whatsoever. So, while, yes, the Header Row Extractor is technically more accurate, our Data Table would have returned accurate data using Data Column headers alone (even with the false positive match).

While a Header Row Extractor can eliminate false positive column header matches, you only need to go through the trouble of configuring one if those false positive matches poorly impact your data extraction.

For Fuzzy RegEx

The other reason to use a Header Row Extractor has to do with imperfect OCR text data and Fuzzy RegEx. Fuzzy RegEx provides a way for regular expression patterns to match in Grooper when the text data doesn't strictly match the pattern. The difference between the regex pattern Grooper and the character string "Gro0per" is just off by a single character. An OCR engine misreading an "o" character for a zero is not uncommon by any means, but a standard regex pattern of Grooper will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.

Using Fuzzy RegEx instead of regular regex, Grooper will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) Grooper will return it as a match.

FYI "similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. Grooper is more or less "confident" the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing (One could argue there is a difference between these two terms when Fuzzy Match Weightings come into play, but that's a whole different topic. And you may encounter Grooper users who use the terms "similarity" and "confidence" interchangeably regardless. Visit the Fuzzy RegEx article if you would like to learn more).

Let's go back to the List Match extractor for our "Line Total" Data Column's Header Extractor.

This table format uses the label TOTAL for the "Line Total" column.
However, it does not match the header on the document.
Why not? This is due to imperfect OCR results.
- The label TOTAL was misrecognized as TOFAL.

We can certainly get this label to match with Fuzzy RegEx, but only at a fairly low similarity.

Here, we've enabled Fuzzy Matching and set the Minimum Similarity to 85%.
We do get our header label returned.
But it's at a confidence score of 86%.
- This score may be too low. It's not causing a problem for this document, but it may pose issues for others.

The reason why the similarity score is so low is because "TOTAL" is a relatively small word, five characters long. Grooper's confidence rating in a match lessens, the more character swaps it has to make to match the word.

An entire row of headers, on the other hand, has much more characters in it. The cost to swap a single character in the entire row of headers will be much less, and much more negligible.

This Value Reader is designed to match the whole header row for this invoice format.
Its Fuzzy Matching property is enabled with its Minimum Similarity set to 90%.
The whole header row matches at a much higher confidence score of 98%.

Click here to return to the top

Tabular Layout With Label Sets

The Tabular Layout Method1. Collect Column Labels2. Assign a Data Column's Value Extractor3. Set Extract Method to Tabular Layout and 4. Test

The Tabular Layout Method

This tutorial will cover the basic configuration of the Tabular Layout method with Label Sets, using a Labeling Behavior to collect column headers. We will use invoices for our document set and collect the following data from their tables detailing line item information:

Item Number - The vendor's id number for the item ordered for each row.
Description - The description of each item ordered for each row.
Quantity - The number of the item ordered for each row.
Unit Price - The vendor's price for the item ordered for each row.
Line Total - The total price for the number of items ordered (In other words, the quantity ordered multiplied by the unit price)

The basic steps will be as follows:

Establish column headers by collecting labels for each Data Column in the Data Table.
- You must collect header labels for each Data Column whose data you want to collect.
- You may optionally collect a label for the entire row of header labels by collecting a label for the Data Table.
  - It is also considered best practice to do so when using Label Sets to configure Tabular Layout.
Assign a Value Extractor for at least one Data Column.
- For example, we may expect to find a quantity for each item shipped on an invoice, regardless of the vendor. There's always a column with a "Quantity" or "QTY" or "Shipped" or some similar header.
- Since this data is also present on every row, this will provide the information necessary to find each row in the table.
- While you need at least one Data Column's Value Extractor configured to detect rows, multiple columns may be used to detect rows.
  - Furthermore, a Data Column's Value Extractor will either perform "Primary Extraction" to perform row detection or "Secondary Extraction" to extract data from already detected rows. We will discus using multiple columns to detect rows and the differences between "Primary" and "Secondary Extraction" in the #Advanced Setup Considerations section of this article.
Set the Data Table object's Extract Method property to Tabular Layout.
- And configure any Tabular Layout properties as needed. We will discuss many of these properties, why and how to to use them in the #Advanced Setup Considerations section of this article.
Test to ensure the table's data is collected.

In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for all four Data Columns for each row on the document.

However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the #Advanced Setup Considerations section below.

1. Collect Column Labels

The following tutorial will presume you have general familiarity with collecting labels. See the Label Sets article for a full explanation of how to collect labels for Document Types in a Content Model.

As far as strict requirements go for the Tabular Layout method goes, you must at minimum establish column headers for each Data Column you wish to extract.

We'll start with the "Quantity" Data Column.

FYI: If the invoice lists both a "quantity ordered" and a "quantity shipped" column, we will be collecting the quantity shipped.

For this "Fairdeal" Document Type, one column header label has been collected for each of the five Data Column children of the "Line Items" Data Table.

The label ITEM NO for the "Item Number" Data Column
The label DESCRIPTION for the "Description" Data Column
The label HRS / QTY for the "Quantity" Data Column
The label RATE / PRICE for the "Unit Price" Data Column
The label SUBTOTAL for the "Line Total" Data Column

As far as strict requirements go for establishing header columns, you're done at this point. You would then repeat this same process for every Document Type in your Content Model.

Best Practice: Collect a Header Row Label for the Data Table

You may optionally collect a label for the entire row of column header labels (aka the "header row label"). This label is collected for the parent Data Table object's label.

We've collected the label DESCRIPTION ITEM NO HRS / QTY PER RATE / PRICE SUBTOTAL for the "Line Items" Data Table.

It is considered best practice to capture a header row label for the Data Table. But if it's optional, why do it? What is the benefit of this label?

Why Bother?

There are two main reasons why Header Row Extractors can be beneficial:

To throw out false positive column header matches
To better match column headers with poor OCR using Fuzzy RegEx.

To Throw Out False Positives

The first reason to collect a header row label is to help eliminate false positive column header matches.

Take our "Line Total" Data Column's label SUBTOTAL.
Without the Data Table's header row label, this label would also produce a match.
- This is a false positive match. This is an instance on this document where the same term is used to refer to something different.
With the header row label, only the actual label for the column matches.
- Another way of putting it: The Data Column header labels will only match if they are part of the larger Data Table header row label.

For Fuzzy RegEx

The other reason to collect a header row label has to do with imperfect OCR text data and Fuzzy RegEx. Fuzzy RegEx provides a way for regular expression patterns to match in Grooper when the text data doesn't strictly match the pattern. The difference between the regex pattern Grooper and the character string "Gro0per" is just off by a single character. An OCR engine misreading an "o" character for a zero is not uncommon by any means, but a standard regex pattern of Grooper will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.

Using Fuzzy RegEx instead of regular regex, Grooper will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) Grooper will return it as a match.

FYI "similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. Grooper is more or less "confident" the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing (One could argue there is a difference between these two terms when Fuzzy Match Weightings come into play, but that's a whole different topic. And you may encounter Grooper users who use the terms "similarity" and "confidence" interchangeably regardless. Visit the Fuzzy RegEx article if you would like to learn more).

So how does this apply to the Data Table's header row label? The short answer is it provides a way to increase the accuracy of Data Column header labels by "boosting" the similarity of the label to imperfect OCR results.

We're going to look at labels collected for the "Rechnung" Document Type to illustrate this.
Examine the collected label for the "Line Total" Data Column.
- Notice the label TOTAL is highlighted red. The label doesn't match the text on the document.
- This is due to imperfect OCR results.
OCR made some missteps and recognized that segment as TOFAL.
- The second "T" in "TOTAL" was recognized as an "F" character.
- This means "TOTAL" (the expected label) is one character's difference from "TOFAL" (the actual text data). Or, "TOFAL" is 80% similar to "TOTAL".
- The Labeling Behavior's similarity threshold is set to 90% for this Content Model. 80% is less than 90%. So, the result is thrown out.
- FYI: This threshold is configured when the Labeling Behavior is added, using the Behaviors property of a Content Model. The Label Similarity property is set to 90% by default, but can be adjusted at any time.

As we will see, capturing the full row of column header labels will boost the similarity, allowing the label to match without altering the Labeling Behavior's fuzzy match settings.

Here, we've collected a header row label for the Data Column.
Now the "Line Total" Data Column's label matches! MAGIC!

Not magic. Just math.

The Data Table's column header row label is much much longer than a single Data Column's column header label. There are just more characters in PO ITEM # DESCRIPTION QUANTITY UNIT PRICE TOTAL\r\nLINE # than TOTAL (55 vs 5).

Where the "Line Total" Data Column's label is 80% similar to the text data (4 out of 5 characters), the "Line Item" Data Table's label, comprised of the whole row of column labels, is roughly 98% similar to the text data (54 out of 55 characters).

Utilizing a Data Table label allows you to hijack the whole row's similarity score when a single Data Column does not meet the similarity threshold.

If the label can be matched as a part of the larger whole, its confidence score goes up much further than by itself.
The Data Table's larger label of the full row of column labels gives extra context to the "Line Items" Data Column label, providing more information about what is and is not an appropriate match.

So why is it considered best practice to capture a header row label for the Data Table? OCR errors are unpredictable.

The set of examples you worked with when architecting this solution may have been fairly clean with good OCR reads. Maybe it didn't seem like you needed a Data Table label at the time, but that may not always be the case. Capturing a Data Table label for the header row will act as a safety net to avoid unforeseen problems in the future.

2. Assign a Data Column's Value Extractor

This step is all about row detection.

So far all we've done is established header column positions on each document. But, that's not where the data is. The table's data is in the rows.

As it stands, Grooper doesn't know anything about the rows in the tables. It doesn't know the size of each row. It doesn't know what kind of data is supposed to be in the rows. Maybe most importantly, it doesn't know how many rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this.

To detect rows, we need at least one Data Column's Value Extractor property configured. For each result the extractor produces below the column's header, Grooper will create one row instance.

The key thing to keep in mind is this data must be present on every row. You'll want to pick a column whos data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row.

In our case, we will choose the "Quantity" Data Column. We always expect (for the time being anyway) there to be a quantity listed for the line item on the invoice.

We will use this Value Reader for our demonstration.
- However, in the real world, the extraction world is your oyster. You'll configure an extractor to best target the data in whatever table column you're trying to extract.
This is a fairly simple Pattern Match extractor designed to return numeric data (including currency).
The regex is a fairly simple pattern to match generic quantities.
- It'll match decimal values from 0 and above with two decimal places optional.
We've also edited our Prefix and Suffix Patterns so that the pattern must be surrounded by a space character before and after, with an optional dollar sign before the number.
As you can see, we get five results below the "Quantity" label.
- When we assign this Value Reader to the "Quantity" Data Column, we should then get five rows when this table extracts.

We do get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data.

Will this result present a problem? Will we get an extra row for its result?
- No. That result is above the header label HRS / QTY.
- The Tabular Layout method presumes rows are below column labels. Any and all results above the first instance of the column's headers will be ignored.
What about these matching results on the same line? Will the extra results create additional row instances?
- No. These results are misaligned with the "Quantity" Data Column's header. They are too far to the right to be considered under the column header. They will be ignored.
- Only results aligned with the "Quantity" Data Column's header will create a row instance.
What about these results? Will they produce a row?
- No. These results are also misaligned with the "Quantity" Data Column's header.
- That said, if these were aligned with the "Quantity" Data Column's header, they would produce row instances.
- When you are building your own Data Column extractors, pay close attention to results below the column's header. They have the most potential to produce false positive results, producing erroneous rows.
  - That said, there are a multitude of ways to avoid false positive row results when using Data Columns' Value Extractors to detect rows. We will discuss this more in the #Advanced Setup Considerations portion of this article.

With our extractor ready to go, all we need to do is assign it to the "Quantity" Data Column using its Value Extractor property.

Select the Data Column you wish to configure.
- In our case, we want to configure the "Quantity" Data Column.
Configure the Value Extractor property.
- In our case, we've referenced our Value Reader designed to return generic numeric values.

FYI

At bare minimum you must configure at least one Data Column's Value Extractor to perform row detection.

However, multiple columns may be used to perform row detection by configuring their corresponding Data Columns Value Extractor properties. For more information on using multiple columns in row detection (as well as row detection in general) please visit the #Advanced Row Detection section of this article.

So far, we have:

Collected labels for the Data Column labels (and optionally the header row label for the Data Table)
Configured at least one Data Column with its Value Extractor configured.

For fairly simple table structures, we now have the two things the Tabular Layout method needs to extract data. Now, all we need to do is tell the Data Table object we want to use the Tabular Layout method. We do this by setting its Extract Method property to Tabular Layout.

3. Set Extract Method to Tabular Layout A Data Table's extraction method is set using the Extract Method property. To enable the Tabular Layout method, do the following. Select a Data Table object in your Data Model. Here, we've selected the "Line Items" Data Table. Select the Extract Method property. Using the dropdown menu, select Tabular Layout
4. Test Now, let's test out what we have and see what we get! For the selected document folder in the "Batch Viewer" window... Press the "Test Extraction" button. The results show up in the "Data Element Preview" window. Success! Our table's data is collected!
So, how was Grooper able to do this? For the Tabular Layout method, the Data Table is populated using primarily two pieces of information: column header locations established by the Data Columns' labels and rows locations detected by a Data Column's Value Extractor. Remember, we collected labels for all Data Columns. We configured only the "Quantity" Data Column's Value Extractor'. First, it's all about establishing column headers. The Data Columns' labels established the column locations for each column. Grooper then determines the width of these columns. If table lines are present, Grooper can detect those line locations via a Line Detection (or Line Removal) IP Command. Grooper will "snap" the column's width to the detected line boundaries, expanding the cell's width (and height) to the boundaries around it. Table lines give human readers an indicator of where the data "lives" (or is contained). If it's in the box, it belongs to the column. If it's out of the box, it belongs to a different column. If table lines are not present (as is the case for this document), Grooper will average the distance between one header label and the next.
Second, it's all about detecting rows. Rows are detected using a Data Column's Value Extractor. In our case, we configured the "Quantity" Data Column's Value Extractor. FYI: When a Data Column's extractor is used to detect rows, it is considered "Primary Extraction". A Data Column's extractor can also be used for "Secondary Extraction", performed after rows are detected. For more on this, please visit the #Primary VS Secondary Extraction section of this article. Rows are only detected below the detecting Data Column's header. Grooper runs the detecting Data Column's Value Extractor, looking for matching results aligned below the column header. For each result returned, Grooper establishes one row instance. Since our extractor was designed to return decimal values, and Grooper found five decimal values below our column header, Grooper detected five rows.
The Tabular Layout method now has the two pieces of information it needs to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like. It has column instances for each Data Column. Again, established by each Data Column's label. It has row instances for each detected row. Again, established by the detecting Data Column's Value Extractor. FYI: More than one Data Column can be used to detect rows. Please visit the #Advanced Row Detection section for more information.
With these column and row instances established, Grooper can form data instances for each cell of the table. Each cell's data simply lays where the columns and rows intersect. For Data Columns with their Value Extractors configured, values are either collected using "Primary" or "Secondary Extraction". Please see the #Primary VS Secondary Extraction portion for more information. For Data Columns without their Value Extractors configured, values are collected by returning the OCR or native text data within the geometric boundaries of the cell.

Click here to return to the top

Advanced Setup Considerations

The Tabular Layout method is designed to extract tabular data even with the most basic setup described above.

However, sometimes "basic" just isn't enough. For more complicated table structures, the Tabular Layout method has a robust suite of configurable properties. Understanding these properties will allow you to better extract a wider variety of tabular data.

In this section, we will discus the following advanced setup features for Tabular Layout:

#Multiline Rows
#Advanced Row Detection
#Primary VS Secondary Extraction
#Footer Detection

Multiline Rows

The table from the previous example had a single-line table structure. Each row occupied one line. Table extraction can get a little trickier when tables have a multiline row structure, especially if sometimes the table rows occupy a single line and sometimes multiple.

For example, most the rows on the line items table for the "Arve" Document Type are single-line rows. However, for some rows, the "Description" column spans multiple lines.

We have selected the "Line Items" Data Table in our Node Tree.
The Extract Method is set to Tabular Layout.
- The default settings are configured. The labels were collected according to the previous tutorials.
With just the default settings, Grooper presumes your tables have a single-line structure. This works fine until you come across rows like this. Grooper only extracts the first line in the "Description" column.
The rest of the description on the second line in this case is not extracted.

This is what the Multiline Rows property is for. Enabling this property will allow you to target table structures like this whose rows extend beyond just a single line on the page.

Here, we have enabled the Multiline Rows property by switching it from Disabled to Enabled.
The Tabular Layout method now appropriately detects this row as occupying multiple lines on the document.
The full line item description is now property extracted by this Data Table.

The Multiline Rows functionality will even detect multiline rows if the lines start on one page and continue to another.

Such is the case here where the row starts at the bottom of the first page.
However, the "Description" column continues to the next page.
All four rows are collected when the Data Table extracts the document.

Detect Stacked Layout

Advanced Row Detection

Row Detection Using Multiple Columns

The Minimum Cell Count Property

Disabling Row Detection

Footer Detection

Generally speaking, a "footer" is something on a document indicating the end of a portion of text. It could be the end of a page, end of a paragraph, end of a chapter, or even the end of something like a table.

If you can find some sort of text label that signifies the end of the table, you can often avoid extraction errors where the Tabular Layout method overextends, capturing extra rows of "junk data".

Such is the case for the "Factura" Document Types.
The Data Table properly extracts the first three rows but also extracts an extra one erroneously.
It extended the table way down to form a row on the bottom of the page.
- All the technical requirements to form a row were satisfied. Remember, we set a Value Extractor for the "Quantity" Data Column in order to form the row instances of this table. The "91" in the address line "91 Vahlen Plaza" just happens to be below the label for the "Quantity" Data Column for this Document Type.

We need a way to remove this row. We need some way of telling Grooper this row is invalid.

We can do that very easily in this case with a Footer Label.

We would not expect there to be a row after this text. Every valid row for this Document Type should be above the words "Due Date" and "Amount Due".
- We will add a Footer Label to the "Line Items" Data Table's labels to force the Tabular Layout method to quit populating table rows after this segment of text.
- It indicates a hard stopping point. Do not pass footer. Do not collect 200 extra rows.

To do this, we will need to update the "Factura" Document Type's Label Set.

Navigate to the Content Model with Labeling Behavior enabled in the Node Tree.
Switch to the "Labels" tab.
In the "Batch Selector", select the Document Type whose Label Set you wish to edit.
- In our case, the "Factura" Document Type.
- Remember, the document folder you select must be classified. It must have the Document Type assigned to it (even if you do so manually using the "Set Type..." button).
Select the Data Element whose labels you wish to edit.
- In our case we want to add a Footer label to aid in the Tabular Layout extraction method's table extraction. We will need to select the "Line Items" Data Table.
Select the Footer tab.
Collect the label.
- Visit the Collect Label Sets section of this article if you need more information how.

When we test extraction for this Data Table now, we will get a different (and better) result.
With just that Footer label added and no other configuration. The Tabular Layout method now extracts the table data correctly. The superfluous row is no longer returned.
Any potential rows below the Footer label will not be collected for the Data Table.

Capture Footer Row VS Display Total Row

Primary VS Secondary Extraction

As we've seen before at least one Data Column must have its Value Extractor property configured. This is (part of) how the Tabular Layout method models the row structure of the table. Grooper needs some piece of information to tell where each row is (and how many rows to the table there are). In other words, it needs at least one Data Column's Value Extractor results to detect each row. In our Content Model we've been working with, we configured the "Quantity" Data Column's Value Extractor. The Value Extractor is set to reference a Data Type extractor returning simple quantities. One row is produced for each result encountered below the label for the "Quantity" Data Column in the Document Type's Label Set.
As we've seen before, we can select the parent Data Table object and test extraction. Sure enough, we return three rows for the line items table, one for each result the "Quantity" Data Column's Value Extractor returned. However, check out the results in the "Line Total" column. These are not valid currencies. The result "40.700.00" should be "40,700.00" with a comma instead of a decimal.
A Data Column's Value Extractor may also be used for data validation and/or cleansing purposes as well as finding the rows in a table. Here, we have selected a Data Type extractor that returns currency values using fuzzy matching. Notice how this extractor returns this result. Rather than returning the text data, "40.700.00" it returns "40,700.00" with a confidence score of 98%. The fuzzy match swapped the decimal for a comma. Now we have a valid currency amount.
If we assign this extractor to the "Line Total" Data Column's Value Extractor, upon extraction the Data Table will collect the extractor's result, ultimately cleansing the imperfect text data and returning the valid currency amount. Here, we have selected the "Line Total" Data Column in the Node Tree. For the Value Extractor property, we have selected Reference. Using the Extractor sub-property we have pointed to the Data Type we were looking at earlier, which will using fuzzy matching to return currency values.
Now, with the "Line Total" Data Column utilizing an extractor, we will get a more accurate result. Here, we've gone back up to the "Line Items" Data Table to test extraction. We're still getting three rows, one for each line of the table on the document. And we're getting accurate values for the "Line Item" column. The bad text data "40.700.00" was cleansed to "40,700.00" We used an extractor using fuzzy matching to return currency values. The decimal was swapped for a comma. (For more information on fuzzy matching, visit the Fuzzy RegEx article)

Secondary Extract Modes: Auto VS Cell Extract VS Geometric Extract

Secondary Extract Modes: Row Extract

How and Why to Disable Row Detection (DELETE ME)

Continuing from the example in the previous tab, let's play "What if...?"

What if for rows with a zero dollar "Unit Price" amount and zero dollar "Line Total" amount, those cells were totally blank instead of "0.00"?

If that were the case, by configuring the "Line Total" Data Column's Value Extractor, we actually would end up throwing out the row entirely, preventing us from collecting data we want.

Remember, a Data Column's Value Extractor results are used to detect each row in the table. If there's no result, there's no row. Furthermore, by having two Data Columns' Value Extractors configured, both pieces of information are going to be required to find and return the row. Here, we have an extractible value for the "Quantity" Data Column's Value Extractor. However, we do not for the "Line Item" Data Column's Value Extractor. There is just blank space. There is no longer a currency value for the extractor to match on this row. Oh oh! This means this is no longer a valid row according to the default use of the two Data Columns' Value Extractor results.
Upon testing extraction, the "Line Item" Data Table only collects two rows, instead of three. The entire second row of this table is thrown out. Because the "Line Total" Data Column now has its Value Extractor configured, it must match a result in order to collect a table row. This is obviously not want we want. There's still data we want on that row, namely the "Quantity" and "Description" column data.
Truly, we only want the "Quantity" Data Column to be used for row detection (We're presuming every row has a value for the "Quantity" column). We don't really want or need the "Line Items" Data Column's Value Extractor to detect the rows (The "Quantity" Data Column's Value Extractor is already doing that job). Instead, we want to perform secondary extraction for the "Line Items" column, once a row is already located (and a row instance is established for each row's data). This is what the Tabular Layout' method's Column Settings properties are for. This set of properties will allow you to choose if and how a Data Column's Value Extractor is used for row detection or secondary extraction after a row instance is formed. Expand the Tabular Layout sub-properties. Select the Column Settings property and press the ellipsis button at the end. This will bring up a collection editor to manipulate the Column Settings properties for each Data Column.
From here, you select the Data Column you wish to configure. These settings will allow you to override the default settings for how a Data Column's Value Extractor is used for row detection and secondary extraction. In our case, we're selecting the "Line Total" Data Column. We've changed the Row Detection property to Disabled. This will prevent the "Line Total" Data Column's Value Extractor from being used for row detection. Our table rows will not be thrown out if it fails to produce a result. Instead, the extractor will be used for secondary extraction against the column's cell. Press "OK" when finished.

FYI

The Secondary Extract property will control if secondary extraction is performed at all (Always or Never).

The Secondary Extract Mode will control how secondary extraction is performed. The Data Column's Value Extractor can run against the cell in the row's text (CellExtract), it can run against the full row's text (RowExtract), or the extractor can be ignored entirely and the text falling within the geometric boundaries of the cell will be returned (Geometric).

The default of Auto will execute the CellExtract mode first, then Geometric if it fails.

In other words, it will attempt to run the Data Column's Value Extractor against the cell first. But, if that extractor fails, returning no result, whatever text data falls within the cell's boundary is returned.

With these Column Settings updated, we will see a different (better) result.
Upon testing extraction, the second row of the table is now properly collected.
Because, the "Line Total" Data Column's Value Extractor is no longer used for row detection.
But, it's still used for secondary extraction against the table cell, returning "40,700.00" instead of "40.700.00".

By manipulating the extraction logic of the Data Columns, we have more control over how those results are used for row detection or secondary extraction for data validation and cleansing.

Revision as of 16:14, 16 May 2022

About

Basic Setup

Tabular Layout Without Label Sets

The Tabular Layout Method

1. Configure Header Extractors

Pro Tip: Stacked Labels

Repeat Until All Data Columns Are Configured

2. Assign a Data Column's Value Extractor

3. Set Extract Method to Tabular Layout

4. Test

Alternative Configuration: Header Row Extractor

Craft the Extractor

Assign Named Instances: Using Named Groups

Assign Named Instances: Using Named Child Extractors

Assign the Header Row Extractor

Why Bother?

To Throw Out False Positives

For Fuzzy RegEx

Tabular Layout With Label Sets

The Tabular Layout Method

1. Collect Column Labels

Best Practice: Collect a Header Row Label for the Data Table

Why Bother?

To Throw Out False Positives

For Fuzzy RegEx

2. Assign a Data Column's Value Extractor

3. Set Extract Method to Tabular Layout

4. Test

Advanced Setup Considerations

Multiline Rows

Detect Stacked Layout

Advanced Row Detection

Row Detection Using Multiple Columns

The Minimum Cell Count Property

Disabling Row Detection

Footer Detection

Capture Footer Row VS Display Total Row

Primary VS Secondary Extraction

Secondary Extract Modes: Auto VS Cell Extract VS Geometric Extract

Secondary Extract Modes: Row Extract

How and Why to Disable Row Detection (DELETE ME)