Table Extraction

From Grooper Wiki
Jump to navigation Jump to search
Data in an Excel spreadsheet is an example of tabular data.

Table Extraction refers to Grooper's functionality to extract data from cells in tables. This is accomplished by configuring the Data Table Data Element in a Data Model.

Tables are one of the most common ways data is organized on documents. Human beings have been writing information into tables before they started writing literature, even before paper was invented. There are examples of tables carved onto the walls of Egyptian temples! They are excellent structures for representing a lot of information with various characteristics in common in a relatively small space (or an Egyptian temple sized space). However, targeting the data inside them presents its own set of challenges. A table’s structure can range from simple and straightforward to more complex (even confounding). Different organizations may organize the same data differently, creating different tables for what, essentially, is the same data.

In Grooper, tabular data can be extracted to Data Table objects using the Row Match, Header-Value, or Infer Grid table extraction methods.


What Is a Table?

Tables consists of rows and columns. Where those rows and columns intersect are cells. These are the individual units of the table containing individual pieces of data. Each row consists of the same number of columns (although some columns may be empty in a given row). A single column consists of the same type of information. For example, an "Order Date" column will always have dates in the cells below it. The rows themselves are usually (but not always) in some order as well, such as in order of ascending date.

Table row.png Table column.png Table cell.png

This may seem obvious, but understanding how data is structured on the page informs how you will use Grooper to target it.

What is a Table in Grooper?

Node tree table extraction.png

In Grooper, tables are represented as Data Table objects in a Data Model. Each column is represented as a Data Column object, created as a children of the Data Table. Rows and their individual cells are created and populated upon successful data extraction.

To the left is a Content Model in the Node Tree. It contains a Data Table with Data Columns, representing the table's structure.

How raw text data is targeted and extracted to populate each row in the table is determined by the Data Table object's Extract Method property. This can be done using either the Row Match, the Header-Value, or the Infer Grid method.

Once the Data Table and its Data Columns are configured (according to the Extract Method), Grooper populates the table rows and extracts data to each cell.
Grooper table example.png

Use Cases: Extract Tables From PDF or Images

Tables are used by every organization in an innumerable number of business spaces in limitless ways. Examples of tabular data can be found in...

table extraction extract table from invoice
Table-extraction-extract-table-from-eob-form.png

Table-extraction-extract-table-from-material-declaration-sheet.png

Table-extraction-extract-table-from-royalty-report.png

Table-extraction-extract-table-from-multidimensional-tables.png

Closing-disclosure-H25B-2.ee4c6ea57315.png

Table-extraction-extract-tables-from-government-forms.png

Spreadsheet with lines.png

Spreadsheet no lines Page 01.png

Even data which may not initially seem like a table can be represented as a table. For example, an email inbox is essentially one big table of information pertaining to messages sent to an email address. It has columns like "From", "Received Date", "Subject" and more for rows of messages.


Email table.png


In fact, the main benefit to putting data in a table is that you can easily encapsulate repeated instances of data with multiple similar characteristics. Every row is just a collection of related data sharing characteristics defined by each column. Even if that information isn't presented in a table-like structure, sometimes Grooper can use the same table extraction methods to target the data and format it into a table.

The document below is a list of different contract language for different types of clauses. The information is not presented in a table but it does have some similar qualities to tabular data. It has a repeatable sections of information each containing similar pieces of information, in this case at least the type of clause, the language of the clause, and the number of contracts this language appears.


Table-extraction-extract-table-from-pdf-page-1.png Table-extraction-extract-table-from-pdf-page-2.png Table-extraction-extract-table-from-pdf-page-3.png Table-extraction-extract-table-from-pdf-page-4.png


With clever configuration of the Row Match table extract method, this information can be extracted into a table in Grooper, seen below.


Contract clause grooper table.png

Export Table Data to Database Tables

Once Grooper has extracted all this information from tables on a document set, what do you do with it? A database table is the perfect location for extracted table data on a document. All information collected from a document's table can be exported to a SQL database (or any ODBC compliant database) from Grooper. Once a connection to a SQL database is established, you can even create a database table directly from Grooper using the Data Table and Data Columns in your Content Model. Then, it's just a matter of mapping the Data Columns from the Grooper Data Table to the SQL table columns. This creates a logical connection where all the extracted information in Column A from the Grooper Data Table gets put into the corresponding Column A of the SQL database table.

The original document The Grooper extracted table The data exported to a SQL database
Simpletable.png Table-extraction-extract-table-data.png Table-extraction-export-table-data-to-sql-database.png

Table Extract Methods Overview

There are three different methods to extract data from tables. They are set on the Extract Method property of the Data Table object. Each method identifies a table's structure and extracts each cell differently and has their own benefits and limitations. The three methods are as follows:

  1. Row Match
  2. Header-Value
  3. Infer Grid

Row Match

The Row Match method uses regular expression pattern matching to find each row of the table. It determines the table's structure based on the structure of a row in the table. This can be done in a number of different ways, but the general idea is to use an extractor that matches the structure of each row. For this table, a pattern or series of patterns would be written to find a date, followed by five digits, followed by a name, followed by a one to two digit number, followed by a dollar amount.

Row Match was the first table extraction method available in Grooper. It is the simplest to configure and is often the first "go-to" method for Grooper designers. And, if it can't produce results, one of the other methods is used instead.

This method is useful for fairly simple tables without much variation across a document set. If you are processing a large number of the same report (or even very similar reports in some cases) over and over where the table structure (column positions and cell formatting) remains consistent, Row Match is usually the best way to target and extract tabular data. It's easy to set up (often just using a single regular expression pattern). There are most often fewer objects to create and configure. This gives it the benefit of being computationally efficient as well.

However, once the table's structure starts to change from document to document, a different approach may be needed. Different parties are going to structure tables however they want, which is well outside your control. Think of all the different ways an invoice can be structured. While the information you want is present in all the different tables, how that data is presented may not be consistent. Even just the column location changing can present problems for this method. A single row extractor may not be able to do the job (or a complicated row extractor with multiple formats will need to be used). For these situations, the Header-Value method may be easier to configure and produce better results.

Header-Value

The Header-Value method uses a combination of extractors to detect the layout of a table. First, you will create a set of extractors to find each header label for the table ("Order Date" for example). Second, you will create a set of extractors to find the values in each column (The individual dates in whatever format they may appear, for example).

The Header-Value technique excels in cases where you are trying to extract similar information from a variety of table layouts in a document set. Some of the tables may have table lines present, some may not. Columns may be in different orders. Some tables may have more columns than others. The column header descriptions may be different. The format of data inside the cells (and therefore the pattern needed to find it) may be different. If you're trying to tackle a problem of variety, the Header-Value method may be fore you. Potentially, this technique can even target future table layouts sight unseen (as long as its configuration matches up with the previously unseen table structure).

However, Header-Value can be overkill for simpler tables whose structure is very straightforward and unchanging. Row Match is still the preferred method for those cases. Header-Value needs an extractor to find every column header and the column's value, leading it to be more time consuming and complicated to set up. In many cases, Row Match can extract table data using only a single extractor and, as such, is much more efficient. Furthermore, there are tables out there that do not have header labels. The Header-Value method can target these types of tables, but can be more difficult to set up, or better targeted by the Row Match method.

Infer Grid

The Infer Grid method uses header extractors to determine the cell structure of the table. It generates a grid corresponding to the table's structure based off column and/or row header labels. In other words, it infers a grid structure from the header positions, hence the name.

At least one set of either column labels or row labels must be used to infer the grid using an "X Axis Extractor" or a "Y Axis Extractor". Most often an "X Axis Extractor" is used to find where the column headers are on the table and the cell locations are determined by the table's lines (whose locations were obtained via a "Detect Lines" or "Remove Lines" IP Command). Grooper then draws a grid on the document and each column's cell is extracted from the data falling within the corresponding grid's cell. The grid can also be inferred using a a "Y Axis Extractor" looking for row header labels if column headers are not available.

Optionally, both an "X Axis Extractor" and a "Y Axis Extractor" can be used to infer the table structure using both row headers and column headers. This approach can infer a grid on tables that do not have lines drawn between rows and columns. Knowing a location of a row header and the location of a column header, Grooper can infer the cell location of where they intersect on the table. Once Grooper knows that location, it can extract the text falling inside that location even without boundary lines drawn around the cell.

Infer Grid excels at table extraction where table lines are present. It also excels at some specialized use cases. Sometimes, text falling in a table's cells use a specialized font not easily read by the main OCR profile used to grab a document's text data. In these cases, a secondary OCR profile can be run on the cells in a column to reprocess the raw text data, targeting that specialized font. Infer Grid also has functionality to easily read checkboxes in a table through OMR detection.