2021:Tabular Layout (Table Extract Method)


WIP This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Tabular Layout is one of Grooper's methods of extracting table data from documents, available to Data Table objects via their Extract Method property. This method uses column header values determined by the Data Columns' Header Extractor results (or by labels collected for the Data Columns when a Labeling Behavior is enabled), as well as the Data Columns' Value Extractor results, to model a table's structure and return its values.

The Tabular Layout method is "Label Set aware". You can configure Tabular Layout with or without labels. This article will detail both methods. For more information on Label Sets, please visit the full Label Sets article.


About

Many tables label the columns so the reader knows what the data in that column corresponds to. How do you know the unit price for an item on an invoice? Typically, that item is in a table and one of the columns of that table is labeled "Unit Price" or something similar. Once you read the labels for each column (also called "column headers"), you, the reader, know where the table begins (below the column headers) and can identify the data in each row (by understanding what the column headers refer to).

This is also the basic idea behind the Tabular Layout Extraction Method. It too utilizes column header labels to "read" tables on documents, at least as the first step in modeling the table's structure. Once Grooper knows where a column is, identified by the column's header label, Grooper can extract data from each cell in each row of that column.

The Tabular Layout method can establish column header locations in one of two ways:

  1. Using extractors
    • Which are defined on the Data Columns' Header Extractor property (and optionally on the Data Table's Header Row Extractor property)
  2. Using Label Sets
    • When a Labeling Behavior is enabled, column header locations are defined by labels collected for the Data Columns (and optionally for the Data Table)
    • Effectively, the labels take the place of the Header Extractor results (and optionally the Header Row Extractor results)

Once the column header locations are established, the next thing Grooper needs to do is figure out where each row is. Tabular data is most often dynamic data. A table on one document might have two rows. The same table on the next might have twenty. How does Grooper know where each row is?

This is done by configuring at least one Data Column's Value Extractor property. (More than one, even all, may be configured. Depending on how complicated the table is, you may need to configure extractors for multiple columns.)

Generally, there is at least one column in a table that is always present for every row in the table. If you can use an extractor to locate that data below its corresponding column header, that gives you a way of finding each row in the table. This allows Grooper to form a "row instance" for each row. Once the row instance is established, Grooper can then collect the various cell values for the various additional columns from the row instance.
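To make that concrete, here is a minimal sketch (plain Python with made-up names, not anything from Grooper's API) of the idea: each Value Extractor hit found below the anchor column's header becomes a row instance spanning the width of the table.

```python
# Hypothetical sketch of Tabular Layout-style row detection. "Boxes" are
# (left, top, right, bottom) page coordinates; none of these names are Grooper APIs.

from dataclasses import dataclass

@dataclass
class Box:
    left: float
    top: float
    right: float
    bottom: float

def overlaps_horizontally(a: Box, b: Box) -> bool:
    """True if two boxes share horizontal space (i.e. one sits 'under' the other)."""
    return a.left < b.right and b.left < a.right

def detect_rows(header_boxes: dict[str, Box], value_hits: list[Box]) -> list[Box]:
    """Form one row instance per value-extractor hit found below its column header."""
    anchor = header_boxes["Quantity"]          # the column whose Value Extractor finds rows
    rows = []
    for hit in value_hits:
        if hit.top > anchor.bottom and overlaps_horizontally(hit, anchor):
            # The row instance spans the full table width at this hit's vertical position.
            table_left = min(b.left for b in header_boxes.values())
            table_right = max(b.right for b in header_boxes.values())
            rows.append(Box(table_left, hit.top, table_right, hit.bottom))
    return rows
```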

If locating column headers and locating rows using column extractors were all that was involved in Tabular Layout, that alone would make it a powerful tabular extraction method. What makes the Tabular Layout method even more powerful is its further configurability. Is every row in the table a single line, or are the rows "multiline"? Do you need more fine-tuned data extraction from a cell's value or the row itself once the row instance is detected? Do you need to establish a table "footer" to limit the number of rows extracted? We will address these issues and more in the #Advanced Setup Considerations section of this article.

FYI If you're familiar with the Header-Value table extraction method, you should see some similarities between it and the Tabular Layout method. Indeed, both methods utilize column headers and Data Column Value Extractors to collect table data.

Tabular Layout should be seen as an improvement on Header-Value for the following reasons:

  1. Tabular Layout is Label Set aware.
  2. Tabular Layout is typically less involved to set up.
  3. Tabular Layout has more configuration options, giving it a better capability to extract data from a large set of disparate table structures (Usually executed through Data Element Overrides).

Basic Setup

Tabular Layout can be configured with or without the use of Label Sets. In either case, the basic setup is the same:

  1. Establish column headers for each Data Column.
  2. Detect row instances by assigning at least one Data Column's Value Extractor.
  3. Set the Data Table's Extract Method property to Tabular Layout.
  4. Test extraction and configure further as necessary.

With Label Sets or without, the setup is extremely similar. On top of that, there's nothing about using Label Sets that alters Tabular Layout's extraction logic. Grooper uses the same logic to model the table's structure and collect data for each cell. The biggest difference is how column headers are determined in step #1.

  • Without Label Sets, column headers are established using extractors, defined using the Data Columns' Header Extractor property (and optionally using the Data Table's Header Row Extractor property)
  • With Label Sets, column headers are established using labels, defined when collecting labels for each Document Type. The Data Columns' labels effectively take the place of the Header Extractor property's results.

Tabular Layout Without Label Sets

The Tabular Layout Method

This tutorial will cover the basic configuration of the Tabular Layout method without Label Sets, using extractors to collect column headers instead. We will use invoices for our document set and collect the following data from their tables detailing line item information:

  • Item Number - The vendor's ID number for the item ordered for each row.
  • Description - The description of each item ordered for each row.
  • Quantity - The number of the item ordered for each row.
  • Unit Price - The vendor's price for the item ordered for each row.
  • Line Total - The total price for the number of items ordered (in other words, the quantity ordered multiplied by the unit price).

The basic steps will be as follows:

  1. Establish column headers by configuring the Header Extractor property of each Data Column in the Data Table.
    • You must configure header extractors for each Data Column whose data you want to collect.
    • Optionally, you may configure a Header Row Extractor set on the Data Table.
  2. Assign a Value Extractor for at least one Data Column.
    • For example, we may expect to find a quantity for each item shipped on an invoice, regardless of the vendor. There's always a column with a "Quantity" or "QTY" or "Shipped" or some similar header.
    • Since this data is also present on every row, this will provide the information necessary to find each row in the table.
    • We will discuss why you might configure the Value Extractor property on additional Data Columns in the #Advanced Setup Considerations section of this article.
  3. Set the Data Table object's Extract Method property to Tabular Layout.
    • And configure any Tabular Layout properties as needed. We will discuss many of these properties, and why and how to use them, in the #Advanced Setup Considerations section of this article.
  4. Test to ensure the table's data is collected.

In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for each Data Column for each row on the document.

However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the #Advanced Setup Considerations section below.

1. Configure Header Extractors

As far as strict requirements for the Tabular Layout method go, you must at minimum establish column headers for each Data Column you wish to extract.

We'll start with the "Quantity" Data Column.

  • FYI: If the invoice lists both a "quantity ordered" and a "quantity shipped" column, we will be collecting the quantity shipped.


  1. Select the Data Column.
  2. Select the Header Extractor property.
    • Here you will set an extractor to locate the column header on the document for the selected Data Column.
  3. Using the dropdown selector, select the Extractor Type you wish to configure to return the column header.
    • You can use whatever Extractor Type you want to get the job done. You may select Reference to reference a Data Type or Value Reader extractor object you've configured already. Or, you can select one of the other Extractor Types to configure extraction locally.
    • We're going to select List Match.


The List Match extractor is well suited for our purposes here. Ultimately, we will enter a list of various ways a "Quantity" column can be labeled.

  1. For example, this document labels quantities of each item ordered as "HRS / QTY"
  2. So, we've added HRS / QTY to the Local Entries list.
  3. Other documents use the label "Quantity" or "Shipped". So, we've added Quantity and Shipped to the list as well.

You would then continue adding variations to the list until all variations of the "Quantity" column's header labels are extracted for every variation of the table.

  • Or more generally, until a result for the column header is extracted using whatever Extractor Type you've chosen to configure.
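Conceptually, a list-based header extractor just checks whether any entry in the list can be found in the page text. A rough stand-in, assuming plain page text and ignoring positional data:

```python
import re

# Hypothetical illustration of list-based header matching (not Grooper's implementation):
# the header is considered found if any entry in the list appears in the page text.
quantity_headers = ["HRS / QTY", "Quantity", "Shipped"]

def find_header(page_text: str, entries: list[str]) -> re.Match | None:
    for entry in entries:
        match = re.search(re.escape(entry), page_text, re.IGNORECASE)
        if match:
            return match          # first entry found wins
    return None
```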

Pro Tip: Stacked Labels

You will often find "stacked labels" in tables. These are multi-word labels broken up across multiple lines in the table's header.

  1. For example, this document's "Quantity" column uses "Qty Shp." for its label.
    • This is a stacked label, with "Qty" on one line and "Shp." on another.
  2. We can add "Qty Ship." to our list of header labels.
  3. However, we will not get a result returned for the document.


We can easily resolve this by enabling the Vertical Wrap feature.

  • This feature is only available to the List Match extractor. This is one of the reasons why List Match is so useful for extracting column headers.

To enable Vertical Wrap:

  1. Switch to the "Properties" tab.
  2. Change the Vertical Wrap property to Enabled.
  3. With Vertical Wrap enabled, the extractor is able to match and return items in the list that wrap vertically on multiple lines.
    • In our case, our stacked label "Qty Shp." is now returned.
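As a toy illustration of why the stacked label fails a line-by-line search but matches once the wrapped lines are joined (the real Vertical Wrap feature works on positional OCR data, not plain strings):

```python
# Stacked label as OCR'd: one word per line in the header area.
header_area_lines = ["Qty", "Shp."]

line_by_line = "\n".join(header_area_lines)      # "Qty\nShp."
wrapped      = " ".join(header_area_lines)       # "Qty Shp."

assert "Qty Shp." not in line_by_line            # fails without vertical wrapping
assert "Qty Shp." in wrapped                     # matches once the lines are joined
```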

Repeat Until All Data Columns Are Configured

You will repeat the same process for each Data Column you want to collect.

  1. We want to collect data from all these columns.
  2. So, we've configured each one's Header Extractor property.

For our document set, we collected the following labels for the following columns:

"Item Number"

ITEM NO
ITEM #
Item Number
Part Number/Description

"Description"

DESCRIPTION
Part Number/Description
  • You may have noticed Part Number/Description is present in both the "Item Number" and "Description" columns' header lists.
  • This can happen. Depending on a table's format, what would normally be divided up between two columns on other documents may be jammed into one. There are methods to account for this using Tabular Layout. We will address this in the #SECTION NAME TBD section of this article.

"Quantity"

HRS / QTY
Quantity
Shipped
Qty Shp.

"Unit Price"

RATE / PRICE
UNIT PRICE

"Extended Price"

SUBTOTAL
TOTAL
Extended Price
Ext. Price
Ext Price
NET AMOUNT

Optional/Alternative Configuration: Header Row Extractor

You may optionally establish column headers for the entire row of header labels, using the Header Row Extractor property. Instead of configuring each Data Column's Header Extractor, you would configure an extractor to return the whole table's row of column headers and use named instances (either Named Groups or child extractors) to establish each Data Column's header.
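For example, a header row pattern with one named group per Data Column might look something like the sketch below. The group names and header variants are illustrative only. (Python's named-group syntax is (?P<name>...); .NET-style patterns, as used by Grooper, write (?<name>...).)

```python
import re

# Hypothetical header-row pattern: one named group per Data Column.
header_row_pattern = re.compile(
    r"(?P<ItemNumber>ITEM\s*(?:NO|#))\s+"
    r"(?P<Description>DESCRIPTION)\s+"
    r"(?P<Quantity>QUANTITY|SHIPPED)\s+"
    r"(?P<UnitPrice>UNIT\s+PRICE)\s+"
    r"(?P<ExtendedPrice>TOTAL|NET\s+AMOUNT)",
    re.IGNORECASE,
)

match = header_row_pattern.search("ITEM NO  DESCRIPTION  QUANTITY  UNIT PRICE  TOTAL")
if match:
    for column, header in match.groupdict().items():
        # Each named group establishes that column's header text and position.
        print(column, "->", header, "at", match.span(column))
```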


Configuring the Header Row Extractor will override all Data Columns' Header Extractors.

You should choose to either establish column headers using the Header Row Extractor or do so using each Data Column's Header Extractors.

You may find it beneficial to configure Data Column Header Extractors as the "default" configuration and use Data Element Overrides to pick and choose for which Document Types you want to use the Header Row Extractor instead.


There are two reasons using a Header Row Extractor can be beneficial:

  1. It can be a way to throw out false positive column matches.
  2. It can be a way to better take advantage of Fuzzy RegEx.







Tabular Layout With Label Sets

Label Sets and Tabular Layout

This tutorial will cover the basic configuration of the Tabular Layout method using collected Label Sets. We will use invoices for our document set and collect the following data from their tables detailing line item information:

  • Description - The description of each item ordered for each row.
  • Quantity - The number of the item ordered for each row.
  • Unit Price - The vendor's price for the item ordered for each row.
  • Line Total - The total price for the number of items ordered (in other words, the quantity ordered multiplied by the unit price).

The basic steps will be as follows:

  1. Collect labels.
    • At minimum you must collect a header label for each Data Column child in the Data Table.
    • We will also discuss the benefits of collecting a label for the full header row.
  2. Assign a Value Extractor for at least one Data Column.
    • We always expect to find a quantity for each line item in the invoice. There's always a "Quantity" column. This data is also present on every row. This will provide the information necessary to find each row in the table.
    • We will also discuss why you might configure the Value Extractor property on additional Data Columns.
  3. Set the Data Table object's Extract Method property to Tabular Layout.
  4. Test to ensure the table's data is collected.

In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for all four Data Columns for each row on the document.

However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the "Advanced Setup Considerations" section below.


Collect Labels

See the above how-to (Collect Label Sets) for a full explanation of how to collect labels for Document Types in a Content Model. The following tutorial will presume you have general familiarity with collecting labels.

As far as strict requirements for collecting labels for tabular data extraction go, you must at minimum collect a label for each Data Column you wish to extract.

For this "Stuff and Things" Document Type, one column header label has been collected for each of the four Data Column children of the "Line Items" Data Table.

  1. The label Quantity for the "Quantity" Data Column
  2. The label Description for the "Description" Data Column
  3. The label Unit Price for the "Unit Price" Data Column
  4. The label Total for the "Line Total" Data Column

You may optionally collect a label for the entire row of column header labels. This label is collected for the parent Data Table object's label.

  1. The label Quantity Item Serial Number Description Unit Price Total for the "Line Items" Data Table

It is generally considered best practice to capture a header row label for the Data Table. But if it's optional, why do it? What is the benefit of this label?

The answer has to do with imperfect OCR text data and Fuzzy RegEx. Fuzzy RegEx provides a way for regular expression patterns to match in Grooper when the text data doesn't strictly match the pattern. The regex pattern Grooper and the character string "Gro0per" are off by just a single character. An OCR engine misreading an "o" character as a zero is not uncommon by any means, but a standard regex pattern of Grooper will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.

Using Fuzzy RegEx instead of regular regex, Grooper will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) Grooper will return it as a match.

  • FYI "similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. Grooper is more or less confident the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing (One could argue there is a difference between these two terms when Fuzzy Match Weightings come into play, but that's a whole different topic. And you may encounter Grooper users who use the terms "similarity" and "confidence" interchangeably regardless. Visit the Fuzzy RegEx article if you would like to learn more).

So how does this apply to the Data Table's column header row label? The short answer is it provides a way to increase the accuracy of the Data Columns' column header labels by "boosting" the similarity of the label to imperfect OCR results.

  1. For example, examine the collected label for the "Description" Data Column.
    • Notice the label Description is highlighted red. The label doesn't match the text on the document.
  2. This is due to imperfect OCR results.
    • The label should read "Description" but OCR made some missteps and recognized that segment as "DescripUon".
    • The "ti" in "Description" were recognized as a capital "U". This means "Description" is two characters different from "Description" or roughly 82% similar. The Labeling Behavior's similarity threshold is set to 90% for this Content Model. 81% is less than 90%. So, the result is thrown out.
      • FYI, this threshold is configured when the Labeling Behavior is added using the Behaviors property of a Content Model. The Label Similarity property is set to 90% by default, but can be adjusted at any time by configuring the Labeling Behavior item in the Behaviors list.

As we will see, capturing the full row of column header labels will boost the similarity, allowing the label to match without altering the Labeling Behavior's fuzzy match settings.

First, notice what's happened when we lassoed the row of column header labels.

  1. Some of the labels are off. "oty." should read "Qty." and "DescripUon" should read "Description".
  2. It's because that's what's in the document's text. When you lasso a label, it's going to grab whatever OCR text data was generated from the Recognize activity (or native text for digital documents).
  3. And, our "Description" Data Field's label still isn't matching.
    • But keep your eye on the birdie.

  1. Notice what happens when we spell-correct the lassoed label, typing "Qty." instead of "oty." and "Description" instead of "DescripUon".
  2. Now the label matches. MAGIC!

Not magic. Just math.

The Data Table's column header row label is much, much longer than a single Data Column's column header label. There are just more characters in "Qty. Qty. Item Number Description Unit Price Extended Price\r\nOrd. Shp." than in "Description" (70 vs 11). Where the "Description" Data Column's label is roughly 82% similar to the text data (9 out of 11 characters), the "Line Item" Data Table's label, comprised of the whole row of column labels, is roughly 96% similar to the text data (67 out of 70 characters).

Utilizing a Data Table label allows you to hijack the whole row's similarity score when a single Data Column does not meet the similarity threshold. If the label can be matched as part of the larger whole, its confidence score goes up much higher than it would on its own. The Data Table's larger label of the full row of column labels gives extra context to the "Description" Data Column's label, providing more information about what is and is not an appropriate match.

So why is it considered best practice to capture a label for the Data Table? OCR errors are unpredictable. The set of examples you worked with when architecting this solution may have been fairly clean with good OCR reads. That may not always be the case. Capturing a Data Table label for the column label row will act as a safety net to avoid unforeseen problems in the future.
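If you want to see the arithmetic behind those percentages, here is a small sketch using a plain edit-distance similarity as a stand-in for Grooper's fuzzy scoring (the exact numbers Grooper reports may differ slightly):

```python
# Simplified stand-in for fuzzy similarity: (pattern length - edit distance) / pattern length.
# Not Grooper's actual scoring, but it reproduces the rough percentages discussed above.

def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def similarity(pattern: str, text: str) -> float:
    return (len(pattern) - edit_distance(pattern, text)) / len(pattern)

print(round(similarity("Description", "DescripUon"), 2))   # ~0.82 -- below a 90% threshold
row_label = "Qty. Qty. Item Number Description Unit Price Extended Price\r\nOrd. Shp."
row_ocr   = "oty. Qty. Item Number DescripUon Unit Price Extended Price\r\nOrd. Shp."
print(round(similarity(row_label, row_ocr), 2))             # ~0.96 -- clears the threshold
```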


Assign a Data Column's Value Extractor

Step 1 is done. We've collected labels for the "Line Item" Data Table and its Data Columns for each Document Type in this Content Model. Step 2 is configuring and assigning a Value Extractor for at least one Data Column.

Why is this necessary? Think about what we've done so far. We've collected labels for the Data Columns. Grooper now has a way to figure out where the columns are on the document. But what does it know about the rows?

Rows come under columns. We know that much. So, Grooper at least knows to look for rows underneath the collected Data Column labels. But that's about it. It doesn't know the size of each row. It doesn't know the spacing between the rows. Probably most importantly, it doesn't know how many rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this.

Indeed, if we were to test extraction with just labels collected, we would not get any result whatsoever.

  1. FYI you can test data extraction directly from the Labels UI using the "Test" button.
  2. This will create a new "Results" tab, showing you a preview of the results the Extract activity collects from the selected document folder, as defined by its Document Type's Data Model.
  3. As you can see, we get no extraction results for the "Line Item" Data Table.

This is why we need a Data Column's Value Extractor property configured, to give the Extract activity an awareness of the rows beneath the column labels.

The key thing to keep in mind is that this data must be present on every row. You'll want to pick a column whose data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row.

In our case, we will choose the "Quantity" Data Column. We always expect there to be a quantity listed for the line item on the invoice, even if that quantity is just "1".

  1. We will select the "Quantity" Data Column in the Node Tree.
  2. We will configure the Value Extractor to return the numerical quantity listed for every line item on every row of the table.
    • We will keep this fairly simple for demonstration purposes, using a Pattern Match extractor.

This is the pattern we will use for the "Quantity" Data Column's Value Extractor.

  1. The regex is a fairly simple pattern to match generic quantities.
    • It'll match one to three digits with an optional decimal followed by zero to four digits. And, that must be surrounded by a space character before and after.
  2. As you can see, we get two results below the "Quantity" label. We should then get two rows when this table extracts.
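A hypothetical Python stand-in for the pattern just described (the actual extractor used in the screenshots isn't reproduced here, so treat this as an approximation):

```python
import re

# One to three digits, an optional decimal portion of up to four digits,
# bounded by a space on each side -- a generic "quantity" pattern.
quantity_pattern = re.compile(r"(?<= )\d{1,3}(?:\.\d{0,4})?(?= )")

sample = " 2 Widget assembly 40,700.00 \n 15 Gadget bracket 1,200.00 "
print(quantity_pattern.findall(sample))   # ['2', '15']
```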

We get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data.

  1. Will this result present a problem? Will we get an extra row for its result?
    • No. That result is above the label collected for the Data Column. The Tabular Layout method presumes rows are below column labels. Any result above them will be ignored.
  2. What about results like these? Will this present a problem?
    • The short answer is no. This result is misaligned with the "Quantity" Data Column's header. It's too far to the right to be considered "under" it and will be ignored as a candidate to produce a row.
    • That said, when you are building your own Data Column extractors, do pay more attention to results below the column header row. They have the most potential to produce false positive results, producing erroneous rows.

For fairly simple table structures we now have the two things the Tabular Layout method needs to extract data.

So far, we have:

  1. Collected labels for the Data Columns (and optionally a label for the whole row of column headers for the Data Table).
  2. Configured at least one Data Column's Value Extractor.

Now, all we need to do is tell the Data Table object we want to use the Tabular Layout method. We do this by setting its Extract Method property to Tabular Layout.


Set Extract Method to Tabular Layout and Test

A Data Table's extraction method is set using the Extract Method property. To enable the Tabular Layout method, do the following.

  1. Select a Data Table object in your Data Model.
    • Here, we've selected the "Line Items" Data Table.
  2. Select the Extract Method property.
  3. Using the dropdown menu, select Tabular Layout

Now, let's test out what we have and see what we get!

  1. For the selected document folder in the "Batch Viewer" window...
  2. Press the "Test Extraction" button.
    • Side note: We've seen before that we can test extraction using the "Labels" tab of a Content Model or Document Type when Labeling Behavior is enabled. The only real difference is we're testing extraction for the specific Data Element selected in the Node Tree. In this case, the "Line Items" Data Table. The "Test" button in the "Labels" tab will test extraction for the entire Data Model and all its component child Data Elements. However, feel free to test extraction at either location. The end result is the same. We're testing to verify extraction results.
  3. The results show up in the "Data Element Preview" window.

For the Tabular Layout method, the Data Table is populated using primarily two pieces of information.

  1. The location and width of the Data Column header labels.
    • This determines the width of the cells for each column.
    • Side note: The width of the column cells is actually determined differently depending on if the table has lines. If the table has lines (as it does in this example) and those lines were previously detected via a Line Detection (or Line Removal) IP Command, the cell width will be expanded to the boundaries of the lines. Table lines give human readers an indicator of where the data "lives" (or is contained). If it's in the box, it belongs to the column. If it's out of the box, it belongs to a different column.
  2. The number of rows as determined by the Data Columns whose Value Extractor property is configured.
    • One row is established for each result the Value Extractor returns.

With these pieces of information, the Tabular Layout method can start to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like.

This allows Grooper to create data instances for each cell in the table.

  1. Once the Tabular Layout method establishes the boundaries of each cell, Grooper "knows" where the table data is located on the page.
  2. The text data (either OCR'd text or native digital text obtained from the Recognize activity) is extracted from each cell instance, populating the Data Table and collecting these results when the Extract activity runs.
    • This is for extremely basic configurations; there are more advanced configuration techniques to either adjust the size of the cell instances and/or extract data for each cell. Some of these will be discussed in the #Advanced Setup Considerations section below.
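As a conceptual sketch of that last point, "geometric" cell reading amounts to returning whatever recognized words fall inside a cell's boundaries. The (left, top, right, bottom) boxes and word list below are illustrative only:

```python
# Conceptual sketch of geometric cell reading: return the recognized words whose
# boxes fall within the cell's boundaries. Not Grooper's implementation.

def text_in_cell(cell, words):
    """cell: (left, top, right, bottom); words: list of (text, box) OCR tokens."""
    left, top, right, bottom = cell
    inside = [
        text for text, (wl, wt, wr, wb) in words
        if wl >= left and wr <= right and wt >= top and wb <= bottom
    ]
    return " ".join(inside)

row_words = [("Widget", (135, 700, 200, 715)), ("assembly", (205, 700, 290, 715)),
             ("2", (60, 700, 70, 715))]
print(text_in_cell((130, 695, 400, 720), row_words))   # "Widget assembly"
```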


Advanced Setup Considerations

Multiline Rows

The table from the previous example had a single-line table structure. Each row occupied one line. Table extraction can get a little trickier when tables have a multiline row structure, especially if sometimes the table rows occupy a single line and sometimes multiple.

For example, most of the rows on the line items table for the "Arve" Document Type are single-line rows. However, for some rows, the "Description" column spans multiple lines.

  1. We have selected the "Line Items" Data Table in our Node Tree.
  2. The Extract Method is set to Tabular Layout.
    • The default settings are configured. The labels were collected according to the previous tutorials.
  3. With just the default settings, Grooper presumes your tables have a single-line structure. This works fine until you come across rows like this. Grooper only extracts the first line in the "Description" column.
  4. The rest of the description on the second line in this case is not extracted.

This is what the Multiline Rows property is for. Enabling this property will allow you to target table structures like this whose rows extend beyond just a single line on the page.

  1. Here, we have enabled the Multiline Rows property by switching it from Disabled to Enabled.
  2. The Tabular Layout method now appropriately detects this row as occupying multiple lines on the document.
  3. The full line item description is now properly extracted by this Data Table.

The Multiline Rows functionality will even detect multiline rows if the lines start on one page and continue to another.

  1. Such is the case here where the row starts at the bottom of the first page.
  2. However, the "Description" column continues to the next page.
  3. All four rows are collected when the Data Table extracts the document.
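A toy sketch of the multiline idea, assuming each row simply extends from its anchor value down to the next anchor (the real Multiline Rows logic is more involved than this):

```python
# Conceptual sketch of multiline row spans: each row extends from its anchor value's
# vertical position down to the next anchor (or to the table footer for the last row).

def multiline_row_spans(anchor_tops: list[float], footer_top: float) -> list[tuple[float, float]]:
    tops = sorted(anchor_tops)
    spans = []
    for i, top in enumerate(tops):
        bottom = tops[i + 1] if i + 1 < len(tops) else footer_top
        spans.append((top, bottom))
    return spans

print(multiline_row_spans([700, 730, 790], footer_top=900))
# [(700, 730), (730, 790), (790, 900)]
```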

Detect Stacked Layout

Advanced Row Detection

Row Detection Using Multiple Columns

The Minimum Cell Count Property

Disabling Row Detection

Footer Labels

Generally speaking, a "footer" is something on a document indicating the end of a portion of text. It could be the end of a page, end of a paragraph, end of a chapter, or even the end of something like a table.

If you can find some sort of text label that signifies the end of the table, you can often avoid extraction errors where the Tabular Layout method overextends, capturing extra rows of "junk data".

  1. Such is the case for the "Factura" Document Types.
  2. The Data Table properly extracts the first three rows but also extracts an extra one erroneously.
  3. It extended the table way down to form a row on the bottom of the page.
    • All the technical requirements to form a row were satisfied. Remember, we set a Value Extractor for the "Quantity" Data Column in order to form the row instances of this table. The "91" in the address line "91 Vahlen Plaza" just happens to be below the label for the "Quantity" Data Column for this Document Type.

We need a way to remove this row. We need some way of telling Grooper this row is invalid.

We can do that very easily in this case with a Footer Label.

  1. We would not expect there to be a row after this text. Every valid row for this Document Type should be above the words "Due Date" and "Amount Due".
    • We will add a Footer Label to the "Line Items" Data Table's labels to force the Tabular Layout method to quit populating table rows after this segment of text.
    • It indicates a hard stopping point. Do not pass footer. Do not collect 200 extra rows.

To do this, we will need to update the "Factura" Document Type's Label Set.

  1. Navigate to the Content Model with Labeling Behavior enabled in the Node Tree.
  2. Switch to the "Labels" tab.
  3. In the "Batch Selector", select the Document Type whose Label Set you wish to edit.
    • In our case, the "Factura" Document Type.
    • Remember, the document folder you select must be classified. It must have the Document Type assigned to it (even if you do so manually using the "Set Type..." button).
  4. Select the Data Element whose labels you wish to edit.
    • In our case we want to add a Footer label to aid in the Tabular Layout extraction method's table extraction. We will need to select the "Line Items" Data Table.
  5. Select the Footer tab.
  6. Collect the label.

  1. When we test extraction for this Data Table now, we will get a different (and better) result.
  2. With just that Footer label added and no other configuration, the Tabular Layout method now extracts the table data correctly. The superfluous row is no longer returned.
  3. Any potential rows below the Footer label will not be collected for the Data Table.
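In effect, the footer acts as a vertical cutoff. A one-function sketch of that idea, reusing the (top, bottom) row spans from the earlier sketches:

```python
# Candidate rows that start at or below the footer's position are discarded.

def drop_rows_below_footer(row_spans, footer_top):
    return [span for span in row_spans if span[0] < footer_top]

print(drop_rows_below_footer([(700, 720), (725, 745), (1010, 1030)], footer_top=900))
# [(700, 720), (725, 745)]
```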

Capture Footer Row VS Display Total Row

Primary VS Secondary Extraction

As we've seen before, at least one Data Column must have its Value Extractor property configured. This is (part of) how the Tabular Layout method models the row structure of the table. Grooper needs some piece of information to tell where each row is (and how many rows there are in the table).

In other words, it needs at least one Data Column's Value Extractor results to detect each row.

  1. In our Content Model we've been working with, we configured the "Quantity" Data Column's Value Extractor.
  2. The Value Extractor is set to reference a Data Type extractor returning simple quantities.
  3. One row is produced for each result encountered below the label for the "Quantity" Data Column in the Document Type's Label Set.

  1. As we've seen before, we can select the parent Data Table object and test extraction.
  2. Sure enough, we return three rows for the line items table, one for each result the "Quantity" Data Column's Value Extractor returned.
  3. However, check out the results in the "Line Total" column.
    • These are not valid currencies. The result "40.700.00" should be "40,700.00" with a comma instead of a decimal.

A Data Column's Value Extractor may also be used for data validation and/or cleansing purposes as well as finding the rows in a table.

  1. Here, we have selected a Data Type extractor that returns currency values using fuzzy matching.
  2. Notice how this extractor returns this result.
  3. Rather than returning the text data "40.700.00", it returns "40,700.00" with a confidence score of 98%.
    • The fuzzy match swapped the decimal for a comma. Now we have a valid currency amount.
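As a toy illustration of that kind of single-character cleanup (Grooper's Fuzzy RegEx scoring is far more general than this brute-force sketch):

```python
import re

# If the OCR'd text doesn't match a strict currency pattern, try single-character
# substitutions and accept a candidate that does. Illustration only.

CURRENCY = re.compile(r"^\d{1,3}(?:,\d{3})*\.\d{2}$")

def cleanse(raw: str) -> str | None:
    if CURRENCY.match(raw):
        return raw
    for i, _ in enumerate(raw):
        for repl in (",", "."):
            candidate = raw[:i] + repl + raw[i + 1:]
            if CURRENCY.match(candidate):
                return candidate      # one edit away from a valid currency value
    return None

print(cleanse("40.700.00"))   # 40,700.00
```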

If we assign this extractor to the "Line Total" Data Column's Value Extractor, upon extraction the Data Table will collect the extractor's result, ultimately cleansing the imperfect text data and returning the valid currency amount.

  1. Here, we have selected the "Line Total" Data Column in the Node Tree.
  2. For the Value Extractor property, we have selected Reference.
  3. Using the Extractor sub-property, we have pointed to the Data Type we were looking at earlier, which uses fuzzy matching to return currency values.

Now, with the "Line Total" Data Column utilizing an extractor, we will get a more accurate result.

  1. Here, we've gone back up to the "Line Items" Data Table to test extraction.
  2. We're still getting three rows, one for each line of the table on the document.
  3. And we're getting accurate values for the "Line Total" column.
    • The bad text data "40.700.00" was cleansed to "40,700.00"
    • We used an extractor using fuzzy matching to return currency values. The decimal was swapped for a comma. (For more information on fuzzy matching, visit the Fuzzy RegEx article)

Secondary Extract Modes: Auto VS Cell Extract VS Geometric Extract

Secondary Extract Modes: Row Extract

How and Why to Disable Row Detection (DELETE ME)

Continuing from the example in the previous tab, let's play "What if...?"

What if for rows with a zero dollar "Unit Price" amount and zero dollar "Line Total" amount, those cells were totally blank instead of "0.00"?

If that were the case, by configuring the "Line Total" Data Column's Value Extractor, we actually would end up throwing out the row entirely, preventing us from collecting data we want.

Remember, a Data Column's Value Extractor results are used to detect each row in the table. If there's no result, there's no row.

Furthermore, by having two Data Columns' Value Extractors configured, both pieces of information are going to be required to find and return the row.

  1. Here, we have an extractible value for the "Quantity" Data Column's Value Extractor.
  2. However, we do not have one for the "Line Total" Data Column's Value Extractor.
    • There is just blank space. There is no longer a currency value for the extractor to match on this row.

Uh oh! This means this is no longer a valid row according to the default use of the two Data Columns' Value Extractor results.

  1. Upon testing extraction, the "Line Items" Data Table only collects two rows, instead of three.
  2. The entire second row of this table is thrown out.

Because the "Line Total" Data Column now has its Value Extractor configured, it must match a result in order to collect a table row.

This is obviously not what we want. There's still data we want on that row, namely the "Quantity" and "Description" column data.

Truly, we only want the "Quantity" Data Column to be used for row detection (we're presuming every row has a value for the "Quantity" column). We don't really want or need the "Line Total" Data Column's Value Extractor to detect the rows (the "Quantity" Data Column's Value Extractor is already doing that job). Instead, we want to perform secondary extraction for the "Line Total" column, once a row is already located (and a row instance is established for each row's data).

This is what the Tabular Layout method's Column Settings properties are for. This set of properties will allow you to choose if and how a Data Column's Value Extractor is used for row detection or secondary extraction after a row instance is formed.

  1. Expand the Tabular Layout sub-properties.
  2. Select the Column Settings property and press the ellipsis button at the end.
  3. This will bring up a collection editor to manipulate the Column Settings properties for each Data Column.

From here, you select the Data Column you wish to configure. These settings will allow you to override the default settings for how a Data Column's Value Extractor is used for row detection and secondary extraction.

  1. In our case, we're selecting the "Line Total" Data Column.
  2. We've changed the Row Detection property to Disabled.
    • This will prevent the "Line Total" Data Column's Value Extractor from being used for row detection. Our table rows will not be thrown out if it fails to produce a result. Instead, the extractor will be used for secondary extraction against the column's cell.
  3. Press "OK" when finished.

FYI

The Secondary Extract property will control if secondary extraction is performed at all (Always or Never).

The Secondary Extract Mode will control how secondary extraction is performed. The Data Column's Value Extractor can run against the cell in the row's text (CellExtract), it can run against the full row's text (RowExtract), or the extractor can be ignored entirely and the text falling within the geometric boundaries of the cell will be returned (Geometric).

  • The default of Auto will execute the CellExtract mode first, then Geometric if it fails.
  • In other words, it will attempt to run the Data Column's Value Extractor against the cell first. But, if that extractor fails, returning no result, whatever text data falls within the cell's boundary is returned.

  1. With these Column Settings updated, we will see a different (better) result.
  2. Upon testing extraction, the second row of the table is now properly collected.
  3. This is because the "Line Total" Data Column's Value Extractor is no longer used for row detection.
  4. But, it's still used for secondary extraction against the table cell, returning "40,700.00" instead of "40.700.00".

By manipulating the extraction logic of the Data Columns, we have more control over how those results are used for row detection or secondary extraction for data validation and cleansing.
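To close, here is a minimal sketch of the Auto decision described above, using hypothetical helper names rather than Grooper's API: run the column's Value Extractor against the cell text first, and fall back to the cell's geometric text if it returns nothing.

```python
import re

# Sketch of the Auto secondary-extract decision: CellExtract first, Geometric fallback.

def currency_extractor(text: str) -> str | None:
    match = re.search(r"\d{1,3}(?:,\d{3})*\.\d{2}", text)
    return match.group() if match else None

def secondary_extract_auto(cell_text: str, value_extractor) -> str:
    result = value_extractor(cell_text)                    # CellExtract attempt
    return result if result is not None else cell_text     # Geometric fallback

print(secondary_extract_auto("40,700.00", currency_extractor))   # extractor result
print(secondary_extract_auto("No charge", currency_extractor))   # falls back to cell text
```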