2021:Tabular Layout (Table Extract Method)
| WIP | This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.
This tag will be removed upon draft completion. |
Tabular Layout is one of Grooper's methods of extracting table data from documents available to Data Table objects (via its Extract Method property). This method uses column header values determined by the Data Columns Header Extractor results (or labels collected for the Data Columns when a Labeling Behavior is enabled) as well as Data Column Value Extractor results to model a table's structure and return its values.
The Tabular Layout is "Label Set aware". You can configure Tabular Layout with or without labels. This article will detail both methods. For more information on Label Sets, please visit the full Label Sets article.
About
Many tables label the columns so the reader knows what the data in that column corresponds to. How do you know the unit price for an item on an invoice? Typically, that item is in a table and one of the columns of that table is labeled "Unit Price" or something similar. Once you read the labels for each column (also called "column headers"), you the reader know where the table begins (below the column headers) and can identify the data in each row (by understanding what the column headers refer to).
This is also the basic idea behind the Tabular Layout Extraction Method. It too utilizes column header labels to "read" tables on documents, or at least as step number one in modeling the table's structure. Once Grooper knows where a column is, identified by the column's header label, Grooper can extract data from each cell in each row of that column.
The Tabular Layout method can establish column header locations in one of two ways:
- Using extractors
- Which are defined on the Data Columns' Header Extractor property (and optionally on the Data Table's Header Row Extractor property)
- Using Label Sets
- When a Labeling Behavior is enabled, column header locations are defined by labels collected for the Data Columns (and optionally for the Data Table)
- Effectively, the labels take the place of the Header Extractor results (and optionally the Header Row Extractor results)
Once the column header locations are established, the next thing Grooper needs to do is figure out where each row is. Tabular data is most often dynamic data. A table on one document might have two rows. The same table on the next might have twenty. How does Grooper know where each row is?
This is done by configuring at least one Data Column's Value Extractor property (However, more than one, even all, may be configured. Depending on how complicated the table is, you may need to configure extractors for multiple columns.)
Generally, there is at least one column in a table that is always present for every row in the table. If you can use an extractor to locate that data below its corresponding column header, that gives you a way of finding each row in the table. This allows Grooper to form a "row instance" for each row. Once the row instance is established, Grooper can then collect the various cell values for the various additional columns from the row instance.
If locating column headers and locating rows using column extractors was was all that was involved in Tabular Layout, that alone would make it a powerful tabular extraction method. What makes the Tabular Layout method even more powerful is its further configurability. Is every row in the table a single line or are the rows "multiline"? Do you need more fine-tuned data extraction from a cell's value or the row itself once the row instance is detected? Do you need to establish a table "footer" to limit the number of rows extracted? We will address these issues and more in the #Advanced Setup Considerations section of this article.
| FYI | If your familiar with the Header-Value table extraction method, you should see some similarities between it and the Tabular Layout method. Indeed both methods utilize column headers and Data Column Value Extractors to collect table data.
Tabular Layout should be seen as an improvement on Header-Value for the following reasons:
|
Basic Setup
Tabular Layout can be configured with or without the use of Label Sets. In either case, the basic setup is the same:
- Establish column headers for each Data Column.
- Detect row instances by assigning at least one Data Column's Value Extractor.
- Set the Data Table's Extract Method property to Tabular Layout.
- Test extraction and configure further as necessary.
With Label Sets or without, the setup is extremely similar. On top of that, there's nothing about using Label Sets that alters Tabular Layout's extraction logic. Grooper uses the same logic to model the table's structure and collect data for each cell. The biggest difference is how column headers are determined in step #1.
- Without Label Sets, column headers are established using extractors, defined using the Data Columns' Header Extractor property (and optionally using the Data Table's Header Row Extractor property)
- With Label Sets, column headers are established using labels, defined when collecting labels for each Document Type. The Data Columns' labels effectively take the place of the Header Extractor property's results.
Tabular Layout Without Label Sets
The Tabular Layout Method
This tutorial will cover the basic configuration of the Tabular Layout method without Label Sets, using extractors to collect column headers instead. We will use invoices for our document set and collect the following data from their tables detailing line item information:
- Item Number - The vendor's id number for the item ordered for each row.
- Description - The description of each item ordered for each row.
- Quantity - The number of the item ordered for each row.
- Unit Price - The vendor's price for the item ordered for each row.
- Line Total - The total price for the number of items ordered (In other words, the quantity ordered multiplied by the unit price)
|
The basic steps will be as follows:
In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for all four Data Columns for each row on the document. However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the #Advanced Setup Considerations section below. |
1. Configure Header Extractors
As far as strict requirements go for the Tabular Layout method goes, you must at minimum establish column headers for each Data Column you wish to extract.
We'll start with the "Quantity" Data Column.
- FYI: If the invoice lists both a "quantity ordered" and a "quantity shipped" column, we will be collecting the quantity shipped.
|
|
|
|
You would then continue adding variations to the list until all variations of the "Quantity" column's header labels are extracted for every variation of the table.
|
|
Pro Tip: Stacked LabelsYou will often find "stacked labels" in tables. These are multi-word labels broken up across multiple lines in the table's header.
|
|
|
To enable Vertical Wrap:
|
|
Repeat Until All Data Columns Are ConfiguredYou will repeat the same process for each Data Column you want to collect.
For our document set, we collected the following labels for the following columns: "Item Number" ITEM NO ITEM # Item Number Part Number/Description "Description" DESCRIPTION Part Number/Description
|
Optional/Alternative Configuration: Header Row Extractor
You may optionally establish column headers for the entire row of header labels, using the Header Row Extractor property. Instead of configuring each Data Column's Header Extractor, you would configure an extractor to return the whole table's row of column headers and use named instances (either Named Groups or child extractors) to establish each Data Column's header.
The answer has to do with imperfect OCR text data and Fuzzy RegEx. Fuzzy RegEx provides a way for regular expression patterns to match in Grooper when the text data doesn't strictly match the pattern. The difference between the regex pattern Grooper and the character string "Gro0per" is just off by a single character. An OCR engine misreading an "o" character for a zero is not uncommon by any means, but a standard regex pattern of Grooper will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.
Using Fuzzy RegEx instead of regular regex, Grooper will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) Grooper will return it as a match.
- FYI "similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. Grooper is more or less confident the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing (One could argue there is a difference between these two terms when Fuzzy Match Weightings come into play, but that's a whole different topic. And you may encounter Grooper users who use the terms "similarity" and "confidence" interchangeably regardless. Visit the Fuzzy RegEx article if you would like to learn more).
|
So how does this apply to the Data Table's column header row label? The short answer is it provides a way to increase the accuracy of Data Column column header labels by "boosting" the similarity of the label to imperfect OCR results.
As we will see, capturing the full row of column header labels will boost the similarity, allowing the label to match without altering the Label Behavior's fuzzy match settings. |
|
|
First, notice what's happened when we lassoed the row of column header labels.
|
|
Not magic. Just math. The Data Table's column header row label is much much longer than a single Data Column's column header label. There are just more characters in "Qty. Qty. Item Number Description Unit Price Extended Price\r\nOrd. Shp." than "Description" (70 vs 11). Where the "Description" Data Column's label is roughly 82% similar to the text data (9 out of 11 characters), the "Line Item" Data Table's label, comprised of the whole row of column labels, is roughly 96% similar to the text data (67 out of 70 characters). Utilizing a Data Table label allows you to hijack the whole row's similarity score when a single Data Column does not meet the similarity threshold. If the label can be matched as a part of the larger whole, its confidence score goes up much further than by itself. The Data Table's larger label of the full row of column labels gives extra context to the "Description" Data Column's label, providing more information about what is and is not an appropriate match. So why is it considered best practice to capture a label for the Data Table? OCR errors are unpredictable. The set of examples you worked with when architecting this solution may have been fairly clean with good OCR reads. That may not always be the case. Capturing a Data Table label for the column label row will act as a safety net to avoid unforeseen problems in the future. |
Assign a Data Column's Value Extractor
Step 1 is done. We've collected labels for the "Line Item" Data Table and its Data Columns for each Document Type in this Content Model. Step 2 is configuring and assigning a Value Extractor for at least one Data Column.
Why is this necessary? Think about what we've done so far. We've collected labels for the Data Columns. Grooper now has a way to figure out where the columns are on the document. But what does it know about the rows?
Rows come under columns. We know that much. So, Grooper at least knows to look for rows underneath the collected Data Column labels. But that's about it. It doesn't know the size of each row. It doesn't know the spacing between the rows. Probably most importantly, it doesn't know how many rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this.
|
Indeed, if we were to test extraction with just labels collected, we would not get any result whatsoever.
|
|
|
This is why we need a Data Column's Value Extractor property configured, to give the Extract activity an awareness of the rows beneath the column labels. The key thing to keep in mind is this data must be present on every row. You'll want to pick a column whos data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row. In our case, we will choose the "Quantity" Data Column. We always expect there to be a quantity listed for the line item on the invoice, even if that quantity is just "1".
|
|
|
This is the pattern we will use for the "Quantity" Data Column's Value Extractor.
We get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data.
|
For fairly simple table structures we now have the two things the Tabular Layout method needs to extract data.
So far, we have:
- Collected labels for the Data Column labels (and optionally the whole row of column labels for the Data Table)
- Configured at least one Data Column with its Value Extractor configured.
Now, all we need to do is tell the Data Table object we want to use the Tabular Layout method. We do this by setting its Extract Method property to Tabular Layout.
Set Extract Method to Tabular Layout and Test
|
A Data Table's extraction method is set using the Extract Method property. To enable the Tabular Layout method, do the following.
|
|
|
Now, let's test out what we have and see what we get!
For the Tabular Layout method, the Data Table is populated using primarily two pieces of information.
|
|
|
With these pieces of information, the Tabular Layout method can start to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like. This allows Grooper to create data instances for each cell in the table.
|
Tabular Layout With Label Sets
Label Sets and Tabular Layout
This tutorial will cover the basic configuration of the Tabular Layout method using collected Label Sets. We will use invoices for our document set and collect the following data from their tables detailing line item information:
- Description - The description of each item ordered for each row.
- Quantity - The number of the item ordered for each row.
- Unit Price - The vendor's price for the item ordered for each row.
- Line Total - The total price for the number of items ordered (In other words, the quantity ordered multiplied by the unit price)
|
The basic steps will be as follows:
In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for all four Data Columns for each row on the document. However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the "Additional Considerations" section below. |
Collect Labels
See the above how to (Collect Label Sets) for a full explanation of how to collect labels for Document Types in a Content Model. The following tutorial will presume you have general familiarity with collecting labels.
|
As far as strict requirements for collecting labels for tabular data extraction goes, you must at minimum collect a label for each Data Column you wish to extract. For this "Stuff and Things" Document Type, one column header label has been collected for each of the four Data Column children of the "Line Items" Data Table.
|
|
|
You may optionally collect a label for the entire row of column header labels. This label is collected for the parent Data Table object's label.
It is generally considered best practice to capture a header row label for the Data Table. But if it's optional, why do it? What is the benefit of this label? |
The answer has to do with imperfect OCR text data and Fuzzy RegEx. Fuzzy RegEx provides a way for regular expression patterns to match in Grooper when the text data doesn't strictly match the pattern. The difference between the regex pattern Grooper and the character string "Gro0per" is just off by a single character. An OCR engine misreading an "o" character for a zero is not uncommon by any means, but a standard regex pattern of Grooper will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.
Using Fuzzy RegEx instead of regular regex, Grooper will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) Grooper will return it as a match.
- FYI "similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. Grooper is more or less confident the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing (One could argue there is a difference between these two terms when Fuzzy Match Weightings come into play, but that's a whole different topic. And you may encounter Grooper users who use the terms "similarity" and "confidence" interchangeably regardless. Visit the Fuzzy RegEx article if you would like to learn more).
|
So how does this apply to the Data Table's column header row label? The short answer is it provides a way to increase the accuracy of Data Column column header labels by "boosting" the similarity of the label to imperfect OCR results.
As we will see, capturing the full row of column header labels will boost the similarity, allowing the label to match without altering the Label Behavior's fuzzy match settings. |
|
|
First, notice what's happened when we lassoed the row of column header labels.
|
|
Not magic. Just math. The Data Table's column header row label is much much longer than a single Data Column's column header label. There are just more characters in "Qty. Qty. Item Number Description Unit Price Extended Price\r\nOrd. Shp." than "Description" (70 vs 11). Where the "Description" Data Column's label is roughly 82% similar to the text data (9 out of 11 characters), the "Line Item" Data Table's label, comprised of the whole row of column labels, is roughly 96% similar to the text data (67 out of 70 characters). Utilizing a Data Table label allows you to hijack the whole row's similarity score when a single Data Column does not meet the similarity threshold. If the label can be matched as a part of the larger whole, its confidence score goes up much further than by itself. The Data Table's larger label of the full row of column labels gives extra context to the "Description" Data Column's label, providing more information about what is and is not an appropriate match. So why is it considered best practice to capture a label for the Data Table? OCR errors are unpredictable. The set of examples you worked with when architecting this solution may have been fairly clean with good OCR reads. That may not always be the case. Capturing a Data Table label for the column label row will act as a safety net to avoid unforeseen problems in the future. |
Assign a Data Column's Value Extractor
Step 1 is done. We've collected labels for the "Line Item" Data Table and its Data Columns for each Document Type in this Content Model. Step 2 is configuring and assigning a Value Extractor for at least one Data Column.
Why is this necessary? Think about what we've done so far. We've collected labels for the Data Columns. Grooper now has a way to figure out where the columns are on the document. But what does it know about the rows?
Rows come under columns. We know that much. So, Grooper at least knows to look for rows underneath the collected Data Column labels. But that's about it. It doesn't know the size of each row. It doesn't know the spacing between the rows. Probably most importantly, it doesn't know how many rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this.
|
Indeed, if we were to test extraction with just labels collected, we would not get any result whatsoever.
|
|
|
This is why we need a Data Column's Value Extractor property configured, to give the Extract activity an awareness of the rows beneath the column labels. The key thing to keep in mind is this data must be present on every row. You'll want to pick a column whos data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row. In our case, we will choose the "Quantity" Data Column. We always expect there to be a quantity listed for the line item on the invoice, even if that quantity is just "1".
|
|
|
This is the pattern we will use for the "Quantity" Data Column's Value Extractor.
We get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data.
|
For fairly simple table structures we now have the two things the Tabular Layout method needs to extract data.
So far, we have:
- Collected labels for the Data Column labels (and optionally the whole row of column labels for the Data Table)
- Configured at least one Data Column with its Value Extractor configured.
Now, all we need to do is tell the Data Table object we want to use the Tabular Layout method. We do this by setting its Extract Method property to Tabular Layout.
Set Extract Method to Tabular Layout and Test
|
A Data Table's extraction method is set using the Extract Method property. To enable the Tabular Layout method, do the following.
|
|
|
Now, let's test out what we have and see what we get!
For the Tabular Layout method, the Data Table is populated using primarily two pieces of information.
|
|
|
With these pieces of information, the Tabular Layout method can start to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like. This allows Grooper to create data instances for each cell in the table.
|
Advanced Setup Considerations
Multiline Rows
|
The table from the previous example had a single-line table structure. Each row occupied one line. Table extraction can get a little trickier when tables have a multiline row structure, especially if sometimes the table rows occupy a single line and sometimes multiple. For example, most the rows on the line items table for the "Arve" Document Type are single-line rows. However, for some rows, the "Description" column spans multiple lines.
|
|
|
This is what the Multiline Rows property is for. Enabling this property will allow you to target table structures like this whose rows extend beyond just a single line on the page.
|
|
|
The Multiline Rows functionality will even detect multiline rows if the lines start on one page and continue to another.
|
|
Generally speaking, a "footer" is something on a document indicating the end of a portion of text. It could be the end of a page, end of a paragraph, end of a chapter, or even the end of something like a table. If you can find some sort of text label that signifies the end of the table, you can often avoid extraction errors where the Tabular Layout method overextends, capturing extra rows of "junk data".
We need a way to remove this row. We need some way of telling Grooper this row is invalid. We can do that very easily in this case with a Footer Label.
|
|
|
To do this, we will need to update the "Factura" Document Type's Label Set.
|
|
|
Primary VS Secondary Extraction
|
As we've seen before at least one Data Column must have its Value Extractor property configured. This is (part of) how the Tabular Layout method models the row structure of the table. Grooper needs some piece of information to tell where each row is (and how many rows to the table there are). In other words, it needs at least one Data Column's Value Extractor results to detect each row.
|
|
|
|
|
A Data Column's Value Extractor may also be used for data validation and/or cleansing purposes as well as finding the rows in a table.
|
|
|
If we assign this extractor to the "Line Total" Data Column's Value Extractor, upon extraction the Data Table will collect the extractor's result, ultimately cleansing the imperfect text data and returning the valid currency amount.
|
|
|
Now, with the "Line Total" Data Column utilizing an extractor, we will get a more accurate result.
|
How and Why to Disable Row Detection
Continuing from the example in the previous tab, let's play "What if...?"
What if for rows with a zero dollar "Unit Price" amount and zero dollar "Line Total" amount, those cells were totally blank instead of "0.00"?
If that were the case, by configuring the "Line Total" Data Column's Value Extractor, we actually would end up throwing out the row entirely, preventing us from collecting data we want.
|
Remember, a Data Column's Value Extractor results are used to detect each row in the table. If there's no result, there's no row. Furthermore, by having two Data Columns' Value Extractors configured, both pieces of information are going to be required to find and return the row.
Oh oh! This means this is no longer a valid row according to the default use of the two Data Columns' Value Extractor results. |
|
Because the "Line Total" Data Column now has its Value Extractor configured, it must match a result in order to collect a table row. This is obviously not want we want. There's still data we want on that row, namely the "Quantity" and "Description" column data. |
|
|
Truly, we only want the "Quantity" Data Column to be used for row detection (We're presuming every row has a value for the "Quantity" column). We don't really want or need the "Line Items" Data Column's Value Extractor to detect the rows (The "Quantity" Data Column's Value Extractor is already doing that job). Instead, we want to perform secondary extraction for the "Line Items" column, once a row is already located (and a row instance is established for each row's data). This is what the Tabular Layout' method's Column Settings properties are for. This set of properties will allow you to choose if and how a Data Column's Value Extractor is used for row detection or secondary extraction after a row instance is formed.
|
|
|
From here, you select the Data Column you wish to configure. These settings will allow you to override the default settings for how a Data Column's Value Extractor is used for row detection and secondary extraction.
|
|
FYI |
The Secondary Extract property will control if secondary extraction is performed at all (Always or Never). The Secondary Extract Mode will control how secondary extraction is performed. The Data Column's Value Extractor can run against the cell in the row's text (CellExtract), it can run against the full row's text (RowExtract), or the extractor can be ignored entirely and the text falling within the geometric boundaries of the cell will be returned (Geometric). |
By manipulating the extraction logic of the Data Columns, we have more control over how those results are used for row detection or secondary extraction for data validation and cleansing. |
































