2.80:Table Extraction (Concept): Difference between revisions

Revision as of 10:35, 10 January 2020

Data in an Excel spreadsheet is an example of tabular data.

Tables are one of the most common ways data is organized on documents. Human beings have been writing information into tables before they started writing literature, even before paper was invented. There are examples of tables carved onto the walls of Egyptian temples! They are excellent structures for representing a lot of information with various characteristics in common in a relatively small space (or an Egyptian temple sized space). However, targeting the data inside them presents its own set of challenges. A table’s structure can range from simple and straightforward to more complex (even confounding). Different organizations may organize the same data differently, creating different tables for what, essentially, is the same data.

In Grooper, tabular data can be extracted using the Row Match, Header-Value, or Infer Grid table extraction methods.

What Is a Table?

Tables consists of rows and columns. Where those rows and columns intersect are cells. These are the individual units of the table containing individual pieces of data. Each row consists of the same number of columns (although some columns may be empty in a given row). A single column consists of the same type of information. For example, an "Order Date" column will always have dates in the cells below it. The rows themselves are usually (but not always) in some order as well, such as in order of ascending date.

This may seem obvious, but understanding how data is structured on the page informs how you will use Grooper to target it.

What is a Table in Grooper?

In Grooper, tables are represented as Data Table objects in a Data Model. Each column is represented as a Data Column object, created as a children of the Data Table. Rows and their individual cells are created and populated upon successful data extraction.

To the left is a Content Model in the Node Tree. It contains a Data Table with Data Columns, representing the table's structure.

How raw text data is targeted and extracted to populate each row in the table is determined by the Data Table object's "Extract Method" property. This can be done using either the Row Match, the Header-Value, or the Infer Grid method.

Once the Data Table and its Data Columns are configured (according to the Extract Method), Grooper populates the table rows and extracts data to each cell.

Use Cases

Tables are used by every organization in an innumerable number of business spaces in limitless ways. Examples of tabular data can be found in...

Even data which may not initially seem like a table can be represented as a table. For example, an email inbox is essentially one big table of information pertaining to messages sent to an email address. It has columns like "From", "Received Date", "Subject" and more for rows of messages.

In fact, the main benefit to putting data in a table is that you can easily encapsulate repeated instances of data with multiple similar characteristics. Every row is just a collection of related data sharing characteristics defined by each column. Even if that information isn't presented in a table-like structure, sometimes Grooper can use the same table extraction methods to target the data and format it into a table.

The document below is a list of different contract language for different types of clauses. The information is not presented in a table but it does have some similar qualities to tabular data. It has a repeatable sections of information each containing similar pieces of information, in this case at least the type of clause, the language of the clause, and the number of contracts this language appears.

With clever configuration of the Row Match table extract method, this information can be extracted into a table in Grooper, seen below.

Tables and Database Tables

Once Grooper has extracted all this information from tables on a document set, what do you do with it? A database table is the perfect location for extracted table data on a document. All information collected from a document's table can be exported to a SQL database (or any ODBC compliant database) from Grooper. Once a connection to a SQL database is established, you can even create a database table directly from Grooper using the Data Table and Data Columns in your Content Model. Then, it's just a matter of mapping the Data Columns from the Grooper Data Table to the SQL table columns. This creates a logical connection where all the extracted information in Column A from the Grooper Data Table gets put into the corresponding Column A of the SQL database table.

The original document	The Grooper extracted table	The data exported to a SQL database

Table Extract Methods Overview

There are three different methods to extract data from tables. They are set on the "Extract Method" property of the Data Table object. Each method has its own benefits and limitations. The three methods are as follows:

Row Match
Header-Value
Infer Grid

Row Match

The Row Match method uses regular expression pattern matching to find each row of the table. This can be done in a number of different ways, but the general idea is to use an extractor that matches the structure of each row. For this table, a pattern or series of patterns would be written to find a date, followed by five digits, followed by a name, followed by a one to two digit number, followed by a dollar amount.

Row Match was the first table extraction method available in Grooper. It is the simplest to configure and is often the first "go-to" method for Grooper designers. And, if it can't produce results, one of the other methods is used instead.

This method is useful for fairly simple tables without much variation across a document set. If you are processing a large number of the same report (or even very similar reports in some cases) over and over where the table structure (column positions and cell formatting) remains consistent, Row Match is usually the best way to target and extract tabular data. It's easy to set up (often just using a single regular expression pattern). There are fewer objects to create and configure. It's computationally efficient.

However, once the table's structure starts to change from document to document, a different approach may be needed. Different parties are going to structure tables however they want, which is well outside your control. Think of all the different ways an invoice can be structured. While the information you want is present in all the different tables, how that data is presented may not be consistent. Even just the column location changing can present problems for this method. A single row extractor may not be able to do the job (or a complicated row extractor with multiple formats will need to be used). For these situations, the Header-Value method may be easier to configure and produce better results.

Header-Value

The Header-Value method uses a combination of extractors to detect the layout of a table.

The Header-Value technique excels in cases where you are trying to extract similar information from a variety of table layouts in a document set. Some of the tables may have table lines present, some may not. Columns may be in different orders. Some tables may have more columns than others. The column descriptions may be different. The format of data inside the cells (and therefore the pattern needed to find it) may be different. If you're trying to tackle a problem of variety, the Header-Value method may be fore you. Potentially, this technique can even target future table layouts sight unseen (as long as its configuration matches up with the previously unseen table structure).

However, Header-Value can be overkill for simpler tables whose structure is very straightforward and unchanging. Row Match is still the preferred method for those cases.