2021:Tabular Layout (Table Extract Method): Difference between revisions
Dgreenwood (talk | contribs) |
Dgreenwood (talk | contribs) |
||
| Line 603: | Line 603: | ||
<tabs style="margin:20px"> | <tabs style="margin:20px"> | ||
<tab name=" | <tab name="The Tabular Layout Method" style="margin:20px"> | ||
=== | === The Tabular Layout Method === | ||
This tutorial will cover the basic configuration of the '''''Tabular Layout''''' method | This tutorial will cover the basic configuration of the '''''Tabular Layout''''' method ''with'' Label Sets, using a '''''Labeling Behavior''''' to collect column headers. We will use invoices for our document set and collect the following data from their tables detailing line item information: | ||
* Item Number - The vendor's id number for the item ordered for each row. | |||
* Description - The description of each item ordered for each row. | * Description - The description of each item ordered for each row. | ||
* Quantity - The number of the item ordered for each row. | * Quantity - The number of the item ordered for each row. | ||
| Line 616: | Line 617: | ||
The basic steps will be as follows: | The basic steps will be as follows: | ||
# | # Establish column headers by collecting labels for each '''Data Column''' in the '''Data Table'''. | ||
#* | #* You must collect header labels for each '''Data Column''' whose data you want to collect. | ||
#* | #* You may optionally collect a label for the entire row of header labels by collecting a label for the '''Data Table'''. | ||
#** It is also considered best practice to do so when using Label Sets to configure '''''Tabular Layout'''''. | |||
# Assign a '''''Value Extractor''''' for at least one '''Data Column'''. | # Assign a '''''Value Extractor''''' for at least one '''Data Column'''. | ||
#* | #* For example, we may expect to find a quantity for each item shipped on an invoice, regardless of the vendor. There's always a column with a "Quantity" or "QTY" or "Shipped" or some similar header. | ||
#* | #* Since this data is also present on ''every row'', this will provide the information necessary to find each row in the table. | ||
# Set the '''Data Table''' object's '''''Extract Method''''' property to ''Tabular Layout'' | #* While you need ''at least'' one '''Data Column's''' '''''Value Extractor''''' configured to detect rows, multiple columns may be used to detect rows. | ||
#** Furthermore, a '''Data Column's''' '''''Value Extractor''''' will either perform "Primary Extraction" to perform row detection or "Secondary Extraction" to extract data from already detected rows. We will discus using multiple columns to detect rows and the differences between "Primary" and "Secondary Extraction" in the [[#Advanced Setup Considerations]] section of this article. | |||
# Set the '''Data Table''' object's '''''Extract Method''''' property to ''Tabular Layout''. | |||
#* And configure any '''''Tabular Layout''''' properties as needed. We will discuss many of these properties, why and how to to use them in the [[#Advanced Setup Considerations]] section of this article. | |||
# Test to ensure the table's data is collected. | # Test to ensure the table's data is collected. | ||
|valign=top| | |||
[[File:2021-tabular-layout-without-label-sets-01.png]] | |||
|} | |||
In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for all four '''Data Columns''' for each row on the document. | In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for all four '''Data Columns''' for each row on the document. | ||
However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the | However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the [[#Advanced Setup Considerations]] section below. | ||
</tab> | |||
<tab name="1. Collect Column Labels" style="margin:20px"> | |||
=== 1. Collect Column Labels === | |||
[[ | ''The following tutorial will presume you have general familiarity with collecting labels. See the [[Label Sets]] article for a full explanation of how to collect labels for Document Types in a Content Model.'' | ||
As far as ''strict'' requirements go for the '''''Tabular Layout''''' method goes, you must ''at minimum'' establish column headers for each '''Data Column''' you wish to extract. | |||
' | We'll start with the "Quantity" '''Data Column'''. | ||
* FYI: If the invoice lists both a "quantity ordered" and a "quantity shipped" column, we will be collecting the quantity ''shipped''. | |||
{|cellpadding=10 cellspacing=5 | {|cellpadding=10 cellspacing=5 | ||
|valign=top style="width:40%"| | |valign=top style="width:40%"| | ||
<br> | |||
For this "Fairdeal" '''Document Type''', one column header label has been collected for each of the five '''Data Column''' children of the "Line Items" Data Table. | |||
# The label <code>ITEM NO</code> for the "Item Number" '''Data Column''' | |||
# The label <code>DESCRIPTION</code> for the "Description" '''Data Column''' | |||
# The label <code>HRS / QTY</code> for the "Quantity" '''Data Column''' | |||
# The label <code>RATE / PRICE</code> for the "Unit Price" '''Data Column''' | |||
# The label <code>SUBTOTAL</code> for the "Line Total" '''Data Column''' | |||
As far as strict requirements go for establishing header columns, you're done at this point. You would then repeat this same process for every '''Document Type''' in your '''Content Model'''. | |||
|valign=top| | |valign=top| | ||
You may optionally collect a label for the | [[File:2021-tabular-layout-with-label-sets-01.png]] | ||
|} | |||
=== Best Practice: Collect a Header Row Label for the Data Table === | |||
You may optionally collect a label for the entire row of column header labels (aka the "header row label"). This label is collected for the parent '''Data Table''' object's label. | |||
{|cellpadding=10 cellspacing=5 | |||
|valign=top style="width:40%"| | |||
<br> | |||
# We've collected the label <code>DESCRIPTION ITEM NO HRS / QTY PER RATE / PRICE SUBTOTAL</code> for the "Line Items" '''Data Table'''. | |||
It is | It is considered best practice to capture a header row label for the '''Data Table'''. But if it's optional, why do it? What is the benefit of this label? | ||
| | |valign=top| | ||
[[File: | [[File:2021-tabular-layout-with-label-sets-02.png]] | ||
| | |- | ||
|valign=top colspan=2| | |||
=== Why Bother? === | |||
There are two main reasons why Header Row Extractors can be beneficial: | |||
# To throw out false positive column header matches | |||
|valign=top | # To better match column headers with poor OCR using Fuzzy RegEx. | ||
|- | |||
|valign=top| | |||
=== To Throw Out False Positives === | |||
The first reason to collect a header row label is to help eliminate false positive column header matches. | |||
# Take our "Line Total" '''Data Column's''' label <code>SUBTOTAL</code>. | |||
# Without the '''Data Table's''' header row label, this label would ''also'' produce a match. | |||
#* This is a false positive match. This is an instance on this document where the same term is used to refer to something different. | |||
# With the header row label, only the actual label for the column matches. | |||
#* Another way of putting it: The '''Data Column''' header labels will only match if they are part of the larger '''Data Table''' header row label. | |||
|valign=top| | |valign=top| | ||
[[File: | [[File:2021-tabular-layout-with-label-sets-03.png]] | ||
|- | |||
|valign=top colspan=2| | |||
=== For Fuzzy RegEx === | |||
The other reason to collect a header row label has to do with imperfect OCR text data and ''[[Fuzzy RegEx]]''. ''Fuzzy RegEx'' provides a way for regular expression patterns to match in '''Grooper''' when the text data doesn't ''strictly'' match the pattern. The difference between the regex pattern <code>Grooper</code> and the character string "Gro0per" is just off by a single character. An OCR engine misreading an "o" character for a zero is not uncommon by any means, but a standard regex pattern of <code>Grooper</code> will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero. | |||
Using ''Fuzzy RegEx'' instead of regular regex, '''Grooper''' will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) '''Grooper''' will return it as a match. | |||
* FYI "similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. '''Grooper''' is more or less "confident" the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing (One could argue there is a difference between these two terms when '''''Fuzzy Match Weightings''''' come into play, but that's a whole different topic. And you may encounter '''Grooper''' users who use the terms "similarity" and "confidence" interchangeably regardless. Visit the [[Fuzzy RegEx]] article if you would like to learn more). | |||
|- | |- | ||
|valign=top| | |valign=top| | ||
<br> | |||
So how does this apply to the '''Data Table's''' header row label? The short answer is it provides a way to increase the accuracy of '''Data Column''' header labels by "boosting" the similarity of the label to imperfect OCR results. | |||
# We're going to look at labels collected for the "Rechnung" '''Document Type''' to illustrate this. | |||
# Examine the collected label for the "Line Total" '''Data Column'''. | |||
#* Notice the label <code>TOTAL</code> is highlighted red. The label doesn't match the text on the document. | |||
#* This is due to imperfect OCR results. | |||
# OCR made some missteps and recognized that segment as <code>TOFAL</code>. | |||
#* The second "T" in "TOTAL" was recognized as an "F" character. | |||
#* This means "TOTAL" (the expected label) is one character's difference from "TOFAL" (the actual text data). Or, "TOFAL" is 80% similar to "TOTAL". | |||
#* The '''''Labeling Behavior's''''' similarity threshold is set to 90% for this '''Content Model'''. 80% is less than 90%. So, the result is thrown out. | |||
#* FYI: This threshold is configured when the '''''Labeling Behavior''''' is added, using the '''''Behaviors''''' property of a Content Model. The '''''Label Similarity''''' property is set to ''90%'' by default, but can be adjusted at any time. | |||
As we will see, capturing the full row of column header labels will boost the similarity, allowing the label to match without altering the '''''Labeling Behavior's''''' fuzzy match settings. | |||
|valign=top| | |||
[[File:2021-tabular-layout-with-label-sets-04.png]] | |||
| | |||
[[File: | |||
|- | |- | ||
|valign=top| | |valign=top| | ||
# | <br> | ||
# Now the label matches | # Here, we've collected a header row label for the '''Data Column'''. | ||
#<li value = 2> Now the "Line Total" '''Data Column's''' label matches! MAGIC! | |||
Not magic. Just math. | Not magic. Just math. | ||
The '''Data Table's''' column header row label is much much longer than a single '''Data Column's''' column header label. | The '''Data Table's''' column header row label is much much longer than a single '''Data Column's''' column header label. There are just more characters in <code>PO ITEM # DESCRIPTION QUANTITY UNIT PRICE TOTAL\r\nLINE #</code> than <code>TOTAL</code> (55 vs 5). | ||
* Where the "Line Total" '''Data Column's''' label is 80% similar to the text data (4 out of 5 characters), the "Line Item" '''Data Table's''' label, comprised of the whole row of column labels, is roughly 98% similar to the text data (54 out of 55 characters). | |||
Utilizing a '''Data Table''' label allows you to hijack the whole row's similarity score when a single '''Data Column''' does not meet the similarity threshold. | Utilizing a '''Data Table''' label allows you to hijack the whole row's similarity score when a single '''Data Column''' does not meet the similarity threshold. | ||
* If the label can be matched as a part of the larger whole, its confidence score goes up much further than by itself. | |||
* The '''Data Table's''' larger label of the full row of column labels gives extra context to the "Line Items" '''Data Column''' label, providing more information about what is and is not an appropriate match. | |||
|valign=top| | |||
[[File:2021-tabular-layout-with-label-sets-05.png]] | |||
|} | |||
So why is it considered best practice to capture a label for the '''Data Table'''? | So why is it considered best practice to capture a header row label for the '''Data Table'''? OCR errors are unpredictable. | ||
The set of examples you worked with when architecting this solution may have been fairly clean with good OCR reads. Maybe it didn't seem like you needed a '''Data Table''' label at the time, but that may not always be the case. Capturing a '''Data Table''' label for the header row will act as a safety net to avoid unforeseen problems in the future. | |||
</tab> | </tab> | ||
<tab name="2. Assign a Data Column's Value Extractor" style="margin:20px"> | <tab name="2. Assign a Data Column's Value Extractor" style="margin:20px"> | ||
=== Assign a Data Column's Value Extractor === | === 2. Assign a Data Column's Value Extractor === | ||
This step is all about '''''row detection'''''. | |||
So far all we've done is established header column positions on each document. But, that's not where the data is. The table's data is in the ''rows''. | |||
As it stands, Grooper doesn't know anything about the rows in the tables. It doesn't know the size of each row. It doesn't know what kind of data is supposed to be in the rows. Maybe most importantly, it doesn't know ''how many'' rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this. | |||
{|cellpadding=10 cellspacing=5 | {|cellpadding=10 cellspacing=5 | ||
|valign=top style="width:40%"| | |valign=top style="width:40%"| | ||
To detect rows, we need at least one '''Data Column's''' '''''Value Extractor''''' property configured. For each result the extractor produces below the column's header, Grooper will create one row instance. | |||
The key thing to keep in mind is this data ''must'' be present on every row. You'll want to pick a column whos data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row. | |||
In our case, we will choose the "Quantity" '''Data Column'''. We always expect (for the time being anyway) there to be a quantity listed for the line item on the invoice. | |||
# | # We will use this '''Value Reader''' for our demonstration. | ||
# This | #* However, in the real world, the extraction world is your oyster. You'll configure an extractor to best target the data in whatever table column you're trying to extract. | ||
# As you can see, we get | # This is a fairly simple '''''Pattern Match''''' extractor designed to return numeric data (including currency). | ||
| | # The regex is a fairly simple pattern to match generic quantities. | ||
[[File: | #* It'll match decimal values from 0 and above with two decimal places optional. | ||
# We've also edited our '''''Prefix''''' and '''''Suffix Patterns''''' so that the pattern must be surrounded by a space character before and after, with an optional dollar sign before the number. | |||
# As you can see, we get five results below the "Quantity" label. | |||
#* When we assign this '''Value Reader''' to the "Quantity" '''Data Column''', we should then get five rows when this table extracts. | |||
|valign=top| | |||
[[File:2021-tabular-layout-without-label-sets-09.png]] | |||
|- | |- | ||
|valign=top| | |valign=top| | ||
<br> | |||
We do get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data. | |||
# | # Will this result present a problem? Will we get an extra row for its result? | ||
# | #* No. That result is ''above'' the header label <code>HRS / QTY</code>. | ||
#* | #* The '''''Tabular Layout''''' method presumes rows are ''below'' column labels. Any and all results above the first instance of the column's headers will be ignored. | ||
# What about these matching results on the same line? Will the extra results create additional row instances? | |||
#* No. These results are misaligned with the "Quantity" '''Data Column's''' header. They are too far to the right to be considered under the column header. They will be ignored. | |||
#* Only results aligned with the "Quantity" '''Data Column's''' header will create a row instance. | |||
# What about these results? Will they produce a row? | |||
#* No. These results are also misaligned with the "Quantity" '''Data Column's''' header. | |||
#* That said, if these ''were'' aligned with the "Quantity" '''Data Column's''' header, they ''would'' produce row instances. | |||
#* When you are building your own '''Data Column''' extractors, pay close attention to results below the column's header. They have the most potential to produce false positive results, producing erroneous rows. | |||
#** That said, there are a multitude of ways to avoid false positive row results when using '''Data Columns'''' '''''Value Extractors''''' to detect rows. We will discuss this more in the [[#Advanced Setup Considerations]] portion of this article. | |||
| | | | ||
[[File: | [[File:2021-tabular-layout-without-label-sets-10.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
<br> | |||
With our extractor ready to go, all we need to do is assign it to the "Quantity" '''Data Column''' using its '''''Value Extractor''''' property. | |||
# Select the '''Data Column''' you wish to configure. | |||
#* In our case, we want to configure the "Quantity" '''Data Column'''. | |||
# Configure the '''''Value Extractor''''' property. | |||
#* In our case, we've referenced our '''Value Reader''' designed to return generic numeric values. | |||
{|cellpadding="10" cellspacing="5" | |||
|-style="background-color:#36b0a7; color:white" | |||
|style="font-size:14pt"|'''FYI'''||At ''bare minimum'' you must configure at least one '''Data Column's''' '''''Value Extractor''''' to perform row detection. | |||
However, multiple columns may be used to perform row detection by configuring their corresponding '''Data Columns''' '''''Value Extractor''''' properties. For more information on using multiple columns in row detection (as well as row detection in general) please visit the [[#Advanced Row Detection]] section of this article. | |||
|} | |||
|valign=top| | |||
[[File:2021-tabular-layout-without-label-sets-11.png]] | |||
| | |||
[[File: | |||
|} | |} | ||
So far, we have: | So far, we have: | ||
# Collected labels for the '''Data Column''' labels (and optionally the | # Collected labels for the '''Data Column''' labels (and optionally the header row label for the '''Data Table''') | ||
# Configured at least one '''Data Column''' with its '''''Value Extractor''''' configured. | # Configured at least one '''Data Column''' with its '''''Value Extractor''''' configured. | ||
Now, all we need to do is tell the '''Data Table''' object we want to use the ''Tabular Layout'' method. We do this by setting its '''''Extract Method''''' property to ''Tabular Layout''. | For fairly simple table structures, we now have the two things the '''''Tabular Layout''''' method needs to extract data. Now, all we need to do is tell the '''Data Table''' object we want to use the '''''Tabular Layout''''' method. We do this by setting its '''''Extract Method''''' property to ''Tabular Layout''. | ||
</tab> | </tab> | ||
<tab name="3. Set Extract Method to Tabular Layout and 4. Test" style="margin:20px"> | <tab name="3. Set Extract Method to Tabular Layout and 4. Test" style="margin:20px"> | ||
{|cellpadding=10 cellspacing=5 | {|cellpadding=10 cellspacing=5 | ||
|valign=top style="width:40%"| | |valign=top style="width:40%"| | ||
A '''Data Table's''' extraction method is set using the '''''Extract Method''''' property. To enable the ''Tabular Layout'' method, do the following. | === 3. Set Extract Method to Tabular Layout === | ||
A '''Data Table's''' extraction method is set using the '''''Extract Method''''' property. To enable the '''''Tabular Layout''''' method, do the following. | |||
# Select a '''Data Table''' object in your '''Data Model'''. | # Select a '''Data Table''' object in your '''Data Model'''. | ||
| Line 785: | Line 835: | ||
# Using the dropdown menu, select ''Tabular Layout'' | # Using the dropdown menu, select ''Tabular Layout'' | ||
| | | | ||
[[ File: | [[File:2021-tabular-layout-without-label-sets-12.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
=== 4. Test === | |||
Now, let's test out what we have and see what we get! | Now, let's test out what we have and see what we get! | ||
# For the selected document folder in the "Batch Viewer" window... | # For the selected document folder in the "Batch Viewer" window... | ||
# Press the "Test Extraction" button. | # Press the "Test Extraction" button. | ||
# The results show up in the "Data Element Preview" window. | # The results show up in the "Data Element Preview" window. | ||
#* Success! Our table's data is collected! | |||
|valign=top| | |||
[[File:2021-tabular-layout-without-label-sets-13.png]] | |||
|- | |||
|valign=top| | |||
<br> | |||
So, how was Grooper able to do this? For the ''Tabular Layout'' method, the '''Data Table''' is populated using primarily two pieces of information: column header locations established by the '''Data Columns'''' labels and rows locations detected by a '''Data Column's''' '''''Value Extractor'''''. | |||
* Remember, we collected labels for ''all'' '''Data Columns'''. We configured ''only'' the "Quantity" '''Data Column's''' '''''Value Extractor''''''. | |||
First, it's all about establishing column headers. | |||
# The '''Data Columns'''' labels established the column locations for each column. | |||
# Grooper then determines the ''width'' of these columns. | |||
#* If table lines are present, Grooper can detect those line locations via a '''Line Detection''' (or '''Line Removal''') '''IP Command'''. Grooper will "snap" the column's width to the detected line boundaries, expanding the cell's width (and height) to the boundaries around it. | |||
#** Table lines give human readers an indicator of where the data "lives" (or is contained). If it's in the box, it belongs to the column. If it's out of the box, it belongs to a different column. | |||
#* If table lines are ''not'' present (as is the case for this document), Grooper will average the distance between one header label and the next. | |||
|valign=top| | |||
[[File:2021-tabular-layout-without-label-sets-14.png]] | |||
|- | |||
|valign=top| | |||
<br> | |||
Second, it's all about detecting rows. Rows are detected using a '''Data Column's''' '''''Value Extractor'''''. | |||
* In our case, we configured the "Quantity" '''Data Column's''' '''''Value Extractor'''''. | |||
* FYI: When a '''Data Column's''' extractor is used to detect rows, it is considered "Primary Extraction". A '''Data Column's''' extractor can also be used for "Secondary Extraction", performed ''after'' rows are detected. For more on this, please visit the [[#Primary VS Secondary Extraction]] section of this article. | |||
# | # Rows are only detected below the detecting '''Data Column's''' header. | ||
# | # Grooper runs the detecting '''Data Column's''' '''''Value Extractor''''', looking for matching results aligned below the column header. | ||
# For each result returned, Grooper establishes one row instance. | |||
# | #* Since our extractor was designed to return decimal values, and Grooper found five decimal values below our column header, Grooper detected five rows. | ||
#* | |||
|valign=top| | |valign=top| | ||
[[File: | [[File:2021-tabular-layout-without-label-sets-15.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
<br> | |||
The ''Tabular Layout'' method now has the two pieces of information it needs to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like. | |||
# It has column instances for each '''Data Column'''. | |||
#* Again, established by each '''Data Column's''' label. | |||
# It has row instances for each detected row. | |||
#* Again, established by the detecting '''Data Column's''' '''''Value Extractor'''''. | |||
#** FYI: More than one '''Data Column''' can be used to detect rows. Please visit the [[#Advanced Row Detection]] section for more information. | |||
|valign=top| | |||
[[File:2021-tabular-layout-without-label-sets-16.png]] | |||
|- | |||
|valign=top| | |||
<br> | |||
With these column and row instances established, '''Grooper''' can form data instances for each cell of the table. | |||
# | #<li value=3> Each cell's data simply lays where the columns and rows intersect. | ||
# | #* For '''Data Columns''' ''with'' their '''''Value Extractors''''' configured, values are either collected using "Primary" or "Secondary Extraction". Please see the [[#Primary VS Secondary Extraction]] portion for more information. | ||
#* For '''Data Columns''' ''without'' their '''''Value Extractors''''' configured, values are collected by returning the OCR or native text data within the geometric boundaries of the cell. | |||
| | | | ||
[[File: | [[File:2021-tabular-layout-without-label-sets-17.png]] | ||
|} | |} | ||
</tab> | </tab> | ||
:[[#Tabular Layout Without Label Sets|Click here to return to the top]] | |||
</tabs> | </tabs> | ||
Revision as of 16:14, 16 May 2022
| WIP | This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.
This tag will be removed upon draft completion. |
Tabular Layout is one of Grooper's methods of extracting table data from documents available to Data Table objects (via its Extract Method property). This method uses column header values determined by the Data Columns Header Extractor results (or labels collected for the Data Columns when a Labeling Behavior is enabled) as well as Data Column Value Extractor results to model a table's structure and return its values.
The Tabular Layout is "Label Set aware". You can configure Tabular Layout with or without labels. This article will detail both methods. For more information on Label Sets, please visit the full Label Sets article.
About
Many tables label the columns so the reader knows what the data in that column corresponds to. How do you know the unit price for an item on an invoice? Typically, that item is in a table and one of the columns of that table is labeled "Unit Price" or something similar. Once you read the labels for each column (also called "column headers"), you the reader know where the table begins (below the column headers) and can identify the data in each row (by understanding what the column headers refer to).
This is also the basic idea behind the Tabular Layout Extraction Method. It too utilizes column header labels to "read" tables on documents, or at least as step number one in modeling the table's structure. Once Grooper knows where a column is, identified by the column's header label, Grooper can extract data from each cell in each row of that column.
The Tabular Layout method can establish column header locations in one of two ways:
- Using extractors
- Which are defined on the Data Columns' Header Extractor property (or alternatively on the Data Table's Header Row Extractor property)
- Using Label Sets
- When a Labeling Behavior is enabled, column header locations are defined by labels collected for the Data Columns (and optionally for the Data Table)
- Effectively, the labels take the place of the Header Extractor results (or alternatively the Header Row Extractor results)
Once the column header locations are established, the next thing Grooper needs to do is figure out where each row is. Tabular data is most often dynamic data. A table on one document might have two rows. The same table on the next might have twenty. How does Grooper know where each row is?
This is done by configuring at least one Data Column's Value Extractor property (However, more than one, even all, may be configured. Depending on how complicated the table is, you may need to configure extractors for multiple columns.)
Generally, there is at least one column in a table that is always present for every row in the table. If you can use an extractor to locate that data below its corresponding column header, that gives you a way of finding each row in the table. This allows Grooper to form a "row instance" for each row. Once the row instance is established, Grooper can then collect the various cell values for the various additional columns from the row instance.
If locating column headers and locating rows using column extractors was was all that was involved in Tabular Layout, that alone would make it a powerful tabular extraction method. What makes the Tabular Layout method even more powerful is its further configurability. Is every row in the table a single line or are the rows "multiline"? Do you need more fine-tuned data extraction from a cell's value or the row itself once the row instance is detected? Do you need to establish a table "footer" to limit the number of rows extracted? We will address these issues and more in the #Advanced Setup Considerations section of this article.
| FYI | If your familiar with the Header-Value table extraction method, you should see some similarities between it and the Tabular Layout method. Indeed both methods utilize column headers and Data Column Value Extractors to collect table data.
Tabular Layout should be seen as an improvement on Header-Value for the following reasons:
|
Basic Setup
Tabular Layout can be configured with or without the use of Label Sets. In either case, the basic setup is the same:
- Establish column headers for each Data Column.
- Detect row instances by assigning at least one Data Column's Value Extractor.
- Set the Data Table's Extract Method property to Tabular Layout.
- Test extraction and configure further as necessary.
With Label Sets or without, the setup is extremely similar. On top of that, there's nothing about using Label Sets that alters Tabular Layout's extraction logic. Grooper uses the same logic to model the table's structure and collect data for each cell. The biggest difference is how column headers are determined in step #1.
- Without Label Sets, column headers are established using extractors, defined using the Data Columns' Header Extractor property (or alternatively using the Data Table's Header Row Extractor property)
- With Label Sets, column headers are established using labels, defined when collecting labels for each Document Type. The Data Columns' labels effectively take the place of the Header Extractor property's results.
Tabular Layout Without Label Sets
The Tabular Layout Method
This tutorial will cover the basic configuration of the Tabular Layout method without Label Sets, using extractors to collect column headers instead. We will use invoices for our document set and collect the following data from their tables detailing line item information:
- Item Number - The vendor's id number for the item ordered for each row.
- Description - The description of each item ordered for each row.
- Quantity - The number of the item ordered for each row.
- Unit Price - The vendor's price for the item ordered for each row.
- Line Total - The total price for the number of items ordered (In other words, the quantity ordered multiplied by the unit price)
|
The basic steps will be as follows:
In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for all four Data Columns for each row on the document. However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the #Advanced Setup Considerations section below. |
1. Configure Header Extractors
As far as strict requirements go for the Tabular Layout method goes, you must at minimum establish column headers for each Data Column you wish to extract.
We'll start with the "Quantity" Data Column.
- FYI: If the invoice lists both a "quantity ordered" and a "quantity shipped" column, we will be collecting the quantity shipped.
|
|
|
|
You would then continue adding variations to the list until all variations of the "Quantity" column's header labels are extracted for every variation of the table.
|
|
Pro Tip: Stacked LabelsYou will often find "stacked labels" in tables. These are multi-word labels broken up across multiple lines in the table's header.
|
|
|
To enable Vertical Wrap:
|
|
Repeat Until All Data Columns Are ConfiguredYou will repeat the same process for each Data Column you want to collect.
|
For our document set, we used the following lists of header column labels:
|
"Item Number" ITEM NO ITEM # Item Number Part Number/Description |
"Description" DESCRIPTION Part Number/Description |
"Quantity" HRS / QTY Quantity Shipped Qty Shp. |
"Unit Price" RATE / PRICE UNIT PRICE |
"Line Total" SUBTOTAL TOTAL Extended Price Ext. Price Ext Price NET AMOUNT |
| FYI | You may have noticed Part Number/Description is present in both the "Item Number" and "Description" columns' header lists.
This can happen. Depending on a table's format, what would normally be divided up between two columns on other documents may be jammed into one. There are methods to account for this using Tabular Layout. We will address this in the #SECTION NAME TBD section of this article. |
2. Assign a Data Column's Value Extractor
This step is all about row detection.
So far all we've done is established header column positions on each document. But, that's not where the data is. The table's data is in the rows.
As it stands, Grooper doesn't know anything about the rows in the tables. It doesn't know the size of each row. It doesn't know what kind of data is supposed to be in the rows. Maybe most importantly, it doesn't know how many rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this.
|
To detect rows, we need at least one Data Column's Value Extractor property configured. For each result the extractor produces below the column's header, Grooper will create one row instance. The key thing to keep in mind is this data must be present on every row. You'll want to pick a column whos data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row. In our case, we will choose the "Quantity" Data Column. We always expect (for the time being anyway) there to be a quantity listed for the line item on the invoice.
|
|||
|
|
|||
|
|
So far, we have:
- Collected labels for the Data Column labels (and optionally the whole row of column labels for the Data Table)
- Configured at least one Data Column with its Value Extractor configured.
For fairly simple table structures, we now have the two things the Tabular Layout method needs to extract data. Now, all we need to do is tell the Data Table object we want to use the Tabular Layout method. We do this by setting its Extract Method property to Tabular Layout.
3. Set Extract Method to Tabular LayoutA Data Table's extraction method is set using the Extract Method property. To enable the Tabular Layout method, do the following.
|
|
4. TestNow, let's test out what we have and see what we get!
|
|
|
First, it's all about establishing column headers.
|
|
|
|
|
|
|
|
|
|
Alternative Configuration: Header Row Extractor
You may alternatively establish column headers for the entire row of header labels, using the Header Row Extractor property. Instead of configuring each Data Column's Header Extractor, you would configure an extractor to return the whole table's row of column headers and use named instances (either Named Groups or child extractors) to establish each Data Column's header.
There are two reasons using a Header Row Extractor can be beneficial:
- It can be a way to throw out false positive column matches.
- It can be a way to better take advantage of Fuzzy RegEx.
| ⚠ | Configuring the Header Row Extractor will override all Data Columns Header Extractors.
You should choose to either establish column headers using the Header Row Extractor or do so using each Data Column's Header Extractors. You may find it beneficial to configure Data Column Header Extractors as the "default" configuration and use Data Element Overrides to pick and choose which Document Types you want to use the Header Row Extractor instead. |
Craft the ExtractorTo configure the Header Row Extractor, you will need to craft an extractor (or multiple extractors for multiple table formats). We will choose to do that first by creating a few Value Reader and Data Types.
DESCRIPTION\t ITEM NO\t HRS / QTY\t PER\t RATE / PRICE\t SUBTOTAL
|
This is only step one. Next, we need some way of breaking up the result into each component column. How does Grooper know what part of the result is the label for the "Description" column or the "Quantity" column? It doesn't until you break up the result into named instances that match the names of your Data Columns in the Data Table. These named instances can either be:
- Named Groups
- Named Child Extractors
Assign Named Instances: Using Named GroupsWhen pattern matching a header row, you can do this with Named Groups.
|
|||
|
The following regex would accomplish this goal in our case. (?<Description>DESCRIPTION)\t (?<Item_Number>ITEM NO)\t (?<Quantity>HRS / QTY)\t PER\t (?<Unit_Price>RATE / PRICE)\t (?<Subtotal>SUBTOTAL)
|
|||
Assign Named Instances: Using Named Child ExtractorsYou may also create and use the named instances by naming a Data Type's child extractors to match the names of your Data Columns.
|
|||
|
|
|||
Assign the Header Row ExtractorNow that we have a couple examples of header row extractors, we can assign them using Tabular Layout's Header Row Extractor property.
|
|||
|
The extractor we referenced was very specifically designed with only one table format in mind. It works for invoices assigned the "Fairdeal" Document Type, but no others.
|
Why Bother?
There are two main reasons why Header Row Extractors can be beneficial:
- To throw out false positive column header matches
- To better match column headers with poor OCR using Fuzzy RegEx.
To Throw Out False PositivesThe first reason to use a Header Row Extractor is to help eliminate false positive column header matches.
|
|
|
|
|
|
A row of header labels tends to be more specific (and requires more specific extraction logic).
|
For Fuzzy RegEx
The other reason to use a Header Row Extractor has to do with imperfect OCR text data and Fuzzy RegEx. Fuzzy RegEx provides a way for regular expression patterns to match in Grooper when the text data doesn't strictly match the pattern. The difference between the regex pattern Grooper and the character string "Gro0per" is just off by a single character. An OCR engine misreading an "o" character for a zero is not uncommon by any means, but a standard regex pattern of Grooper will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.
Using Fuzzy RegEx instead of regular regex, Grooper will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) Grooper will return it as a match.
- FYI "similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. Grooper is more or less "confident" the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing (One could argue there is a difference between these two terms when Fuzzy Match Weightings come into play, but that's a whole different topic. And you may encounter Grooper users who use the terms "similarity" and "confidence" interchangeably regardless. Visit the Fuzzy RegEx article if you would like to learn more).
|
|
|
|
We can certainly get this label to match with Fuzzy RegEx, but only at a fairly low similarity.
|
|
|
An entire row of headers, on the other hand, has much more characters in it. The cost to swap a single character in the entire row of headers will be much less, and much more negligible.
|
Tabular Layout With Label Sets
The Tabular Layout Method
This tutorial will cover the basic configuration of the Tabular Layout method with Label Sets, using a Labeling Behavior to collect column headers. We will use invoices for our document set and collect the following data from their tables detailing line item information:
- Item Number - The vendor's id number for the item ordered for each row.
- Description - The description of each item ordered for each row.
- Quantity - The number of the item ordered for each row.
- Unit Price - The vendor's price for the item ordered for each row.
- Line Total - The total price for the number of items ordered (In other words, the quantity ordered multiplied by the unit price)
|
The basic steps will be as follows:
|
In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for all four Data Columns for each row on the document.
However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the #Advanced Setup Considerations section below.
1. Collect Column Labels
The following tutorial will presume you have general familiarity with collecting labels. See the Label Sets article for a full explanation of how to collect labels for Document Types in a Content Model.
As far as strict requirements go for the Tabular Layout method goes, you must at minimum establish column headers for each Data Column you wish to extract.
We'll start with the "Quantity" Data Column.
- FYI: If the invoice lists both a "quantity ordered" and a "quantity shipped" column, we will be collecting the quantity shipped.
|
|
Best Practice: Collect a Header Row Label for the Data Table
You may optionally collect a label for the entire row of column header labels (aka the "header row label"). This label is collected for the parent Data Table object's label.
|
|
|
Why Bother?There are two main reasons why Header Row Extractors can be beneficial:
| |
To Throw Out False PositivesThe first reason to collect a header row label is to help eliminate false positive column header matches.
|
|
For Fuzzy RegExThe other reason to collect a header row label has to do with imperfect OCR text data and Fuzzy RegEx. Fuzzy RegEx provides a way for regular expression patterns to match in Grooper when the text data doesn't strictly match the pattern. The difference between the regex pattern Using Fuzzy RegEx instead of regular regex, Grooper will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) Grooper will return it as a match.
| |
|
|
|
|
Not magic. Just math. The Data Table's column header row label is much much longer than a single Data Column's column header label. There are just more characters in
Utilizing a Data Table label allows you to hijack the whole row's similarity score when a single Data Column does not meet the similarity threshold.
|
|
So why is it considered best practice to capture a header row label for the Data Table? OCR errors are unpredictable.
The set of examples you worked with when architecting this solution may have been fairly clean with good OCR reads. Maybe it didn't seem like you needed a Data Table label at the time, but that may not always be the case. Capturing a Data Table label for the header row will act as a safety net to avoid unforeseen problems in the future.
2. Assign a Data Column's Value Extractor
This step is all about row detection.
So far all we've done is established header column positions on each document. But, that's not where the data is. The table's data is in the rows.
As it stands, Grooper doesn't know anything about the rows in the tables. It doesn't know the size of each row. It doesn't know what kind of data is supposed to be in the rows. Maybe most importantly, it doesn't know how many rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this.
|
To detect rows, we need at least one Data Column's Value Extractor property configured. For each result the extractor produces below the column's header, Grooper will create one row instance. The key thing to keep in mind is this data must be present on every row. You'll want to pick a column whos data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row. In our case, we will choose the "Quantity" Data Column. We always expect (for the time being anyway) there to be a quantity listed for the line item on the invoice.
|
|||
|
|
|||
|
|
So far, we have:
- Collected labels for the Data Column labels (and optionally the header row label for the Data Table)
- Configured at least one Data Column with its Value Extractor configured.
For fairly simple table structures, we now have the two things the Tabular Layout method needs to extract data. Now, all we need to do is tell the Data Table object we want to use the Tabular Layout method. We do this by setting its Extract Method property to Tabular Layout.
3. Set Extract Method to Tabular LayoutA Data Table's extraction method is set using the Extract Method property. To enable the Tabular Layout method, do the following.
|
|
4. TestNow, let's test out what we have and see what we get!
|
|
|
First, it's all about establishing column headers.
|
|
|
|
|
|
|
|
|
|
Advanced Setup Considerations
The Tabular Layout method is designed to extract tabular data even with the most basic setup described above.
However, sometimes "basic" just isn't enough. For more complicated table structures, the Tabular Layout method has a robust suite of configurable properties. Understanding these properties will allow you to better extract a wider variety of tabular data.
In this section, we will discus the following advanced setup features for Tabular Layout:
Multiline Rows
|
The table from the previous example had a single-line table structure. Each row occupied one line. Table extraction can get a little trickier when tables have a multiline row structure, especially if sometimes the table rows occupy a single line and sometimes multiple. For example, most the rows on the line items table for the "Arve" Document Type are single-line rows. However, for some rows, the "Description" column spans multiple lines.
|
|
|
This is what the Multiline Rows property is for. Enabling this property will allow you to target table structures like this whose rows extend beyond just a single line on the page.
|
|
|
The Multiline Rows functionality will even detect multiline rows if the lines start on one page and continue to another.
|
Detect Stacked Layout
Advanced Row Detection
Row Detection Using Multiple Columns
The Minimum Cell Count Property
Disabling Row Detection
|
Generally speaking, a "footer" is something on a document indicating the end of a portion of text. It could be the end of a page, end of a paragraph, end of a chapter, or even the end of something like a table. If you can find some sort of text label that signifies the end of the table, you can often avoid extraction errors where the Tabular Layout method overextends, capturing extra rows of "junk data".
We need a way to remove this row. We need some way of telling Grooper this row is invalid. We can do that very easily in this case with a Footer Label.
|
|
|
To do this, we will need to update the "Factura" Document Type's Label Set.
|
|
|
Primary VS Secondary Extraction
|
As we've seen before at least one Data Column must have its Value Extractor property configured. This is (part of) how the Tabular Layout method models the row structure of the table. Grooper needs some piece of information to tell where each row is (and how many rows to the table there are). In other words, it needs at least one Data Column's Value Extractor results to detect each row.
|
|
|
|
|
A Data Column's Value Extractor may also be used for data validation and/or cleansing purposes as well as finding the rows in a table.
|
|
|
If we assign this extractor to the "Line Total" Data Column's Value Extractor, upon extraction the Data Table will collect the extractor's result, ultimately cleansing the imperfect text data and returning the valid currency amount.
|
|
|
Now, with the "Line Total" Data Column utilizing an extractor, we will get a more accurate result.
|
Secondary Extract Modes: Auto VS Cell Extract VS Geometric Extract
Secondary Extract Modes: Row Extract
How and Why to Disable Row Detection (DELETE ME)
Continuing from the example in the previous tab, let's play "What if...?"
What if for rows with a zero dollar "Unit Price" amount and zero dollar "Line Total" amount, those cells were totally blank instead of "0.00"?
If that were the case, by configuring the "Line Total" Data Column's Value Extractor, we actually would end up throwing out the row entirely, preventing us from collecting data we want.
|
Remember, a Data Column's Value Extractor results are used to detect each row in the table. If there's no result, there's no row. Furthermore, by having two Data Columns' Value Extractors configured, both pieces of information are going to be required to find and return the row.
Oh oh! This means this is no longer a valid row according to the default use of the two Data Columns' Value Extractor results. |
|
Because the "Line Total" Data Column now has its Value Extractor configured, it must match a result in order to collect a table row. This is obviously not want we want. There's still data we want on that row, namely the "Quantity" and "Description" column data. |
|
|
Truly, we only want the "Quantity" Data Column to be used for row detection (We're presuming every row has a value for the "Quantity" column). We don't really want or need the "Line Items" Data Column's Value Extractor to detect the rows (The "Quantity" Data Column's Value Extractor is already doing that job). Instead, we want to perform secondary extraction for the "Line Items" column, once a row is already located (and a row instance is established for each row's data). This is what the Tabular Layout' method's Column Settings properties are for. This set of properties will allow you to choose if and how a Data Column's Value Extractor is used for row detection or secondary extraction after a row instance is formed.
|
|
|
From here, you select the Data Column you wish to configure. These settings will allow you to override the default settings for how a Data Column's Value Extractor is used for row detection and secondary extraction.
|
|
FYI |
The Secondary Extract property will control if secondary extraction is performed at all (Always or Never). The Secondary Extract Mode will control how secondary extraction is performed. The Data Column's Value Extractor can run against the cell in the row's text (CellExtract), it can run against the full row's text (RowExtract), or the extractor can be ignored entirely and the text falling within the geometric boundaries of the cell will be returned (Geometric). |
By manipulating the extraction logic of the Data Columns, we have more control over how those results are used for row detection or secondary extraction for data validation and cleansing. |
















































