Table Extraction (Concept)

From Grooper Wiki

This article was migrated from an older version and has not been updated for the current version of Grooper.

This tag will be removed upon article review and update.

This article is about the current version of Grooper.

Note that some content may still need to be updated.

Data in an Excel spreadsheet is an example of tabular data.

"Table Extraction" refers to Grooper's ability to extract data from cells in tables on documents. This is accomplished by configuring the Data Table and its child Data Column elements in a Data Model.

Tables are one of the most common ways data is organized on documents. Human beings were writing information into tables before they started writing literature, even before paper was invented. There are examples of tables carved onto the walls of Egyptian temples! They are excellent structures for representing a lot of information with various characteristics in common in a relatively small space (or an Egyptian-temple-sized space). However, targeting the data inside them presents its own set of challenges. A table’s structure can range from simple and straightforward to complex (even confounding). Different organizations may organize the same data differently, creating different tables for what is essentially the same data.

In Grooper, tabular data can be extracted to Data Table objects using one of many table extraction methods.

What Is a Table?

Tables consist of rows and columns. Where those rows and columns intersect are cells. These are the individual units of the table, each containing an individual piece of data. Each row consists of the same number of columns (although some columns may be empty in a given row). A single column holds the same type of information in every row. For example, an "Order Date" column will always have dates in the cells below it. The rows themselves are usually (but not always) in some order as well, such as ascending date order.

This may seem obvious, but understanding how data is structured on the page informs how you will use Grooper to target it.
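
As a minimal sketch of the structure described above (nothing here is Grooper-specific), a table can be modeled in Python as a list of rows, where every row maps the same column names to cell values. The column names and sample values are invented for illustration.

```python
from datetime import date

# Column names shared by every row (hypothetical example data).
COLUMNS = ["Order Date", "Item", "Quantity"]

# Each row has one cell per column; a cell may be empty (None).
rows = [
    {"Order Date": date(2023, 1, 5), "Item": "Widget", "Quantity": 10},
    {"Order Date": date(2023, 1, 9), "Item": "Gasket", "Quantity": None},
    {"Order Date": date(2023, 2, 1), "Item": "Sprocket", "Quantity": 3},
]


def column(rows, name):
    """Return every cell in one column, top to bottom."""
    return [row[name] for row in rows]


def has_consistent_shape(rows, columns):
    """True when every row has exactly the expected columns, in order."""
    return all(list(row.keys()) == columns for row in rows)


order_dates = column(rows, "Order Date")
# Every cell in the "Order Date" column holds the same type of data.
all_dates = all(isinstance(cell, date) for cell in order_dates)
# The rows in this sample are in ascending date order.
is_sorted = order_dates == sorted(order_dates)
```

Thinking about a table this way (same columns in every row, one type of data per column) is what makes it possible to describe a whole table by describing a single row.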

What is a Table in Grooper?

In Grooper, tables are represented as Data Table objects in a Data Model. Each column is represented as a Data Column object, created as a child of the Data Table. Rows and their individual cells are created and populated upon successful data extraction.

To the left is a Content Model in the Node Tree. It contains a Data Table with Data Columns, representing the table's structure.

How raw text data is targeted and extracted to populate each row in the table is determined by the Data Table object's Extract Method property. This can be done using the Row Match, Header-Value, or Infer Grid method.

Once the Data Table and its Data Columns are configured (according to the Extract Method), Grooper populates the table rows and extracts data to each cell.

Use Cases: Extract Tables From PDF or Images

Tables are used by every organization, in countless business areas and in limitless ways. Examples of tabular data can be found in invoices, among many other documents.

Even data which may not initially seem like a table can be represented as a table. For example, an email inbox is essentially one big table of information pertaining to messages sent to an email address. It has columns like "From", "Received Date", "Subject" and more for rows of messages.



In fact, the main benefit to putting data in a table is that you can easily encapsulate repeated instances of data with multiple similar characteristics. Every row is just a collection of related data sharing characteristics defined by each column. Even if that information isn't presented in a table-like structure, sometimes Grooper can use the same table extraction methods to target the data and format it into a table.

The document below is a list of contract language for different types of clauses. The information is not presented in a table, but it does have some qualities in common with tabular data. It has repeatable sections, each containing similar pieces of information: in this case, at least the type of clause, the language of the clause, and the number of contracts in which this language appears.



With clever configuration of the Row Match table extract method, this information can be extracted into a table in Grooper, seen below.
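
The general idea of treating each repeated section of text as a "row" can be sketched in Python with a regular expression. This is only an illustration of the concept, not Grooper's Row Match implementation or its configuration; the sample text, field labels, and pattern are all hypothetical.

```python
import re

# Hypothetical text: repeated sections that are not laid out as a table.
text = """\
Clause Type: Indemnification
Language: Each party shall indemnify the other against third-party claims.
Contracts: 12

Clause Type: Termination
Language: Either party may terminate with 30 days written notice.
Contracts: 7
"""

# One pattern describes a whole "row"; each named group acts as a "column".
ROW_PATTERN = re.compile(
    r"Clause Type: (?P<clause_type>.+)\n"
    r"Language: (?P<language>.+)\n"
    r"Contracts: (?P<contract_count>\d+)"
)

# Every match of the pattern becomes one row of the resulting table.
table = [match.groupdict() for match in ROW_PATTERN.finditer(text)]
```

Each dictionary in `table` holds the same three keys, so the repeated sections end up in the same row-and-column shape as data read from a printed table.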


Export Table Data to Database Tables

Once Grooper has extracted all this information from tables on a document set, what do you do with it? A database table is the perfect location for table data extracted from a document. All information collected from a document's table can be exported to a SQL database (or any ODBC-compliant database) from Grooper. Once a connection to a SQL database is established, you can even create a database table directly from Grooper using the Data Table and Data Columns in your Content Model. Then, it's just a matter of mapping the Data Columns from the Grooper Data Table to the SQL table columns. This creates a logical connection where all the extracted information in Column A of the Grooper Data Table gets put into the corresponding Column A of the SQL database table.
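
The column-mapping idea can be sketched outside of Grooper with Python's built-in `sqlite3` module. This is an assumption-laden illustration, not Grooper's export mechanism: the in-memory SQLite database stands in for a real SQL or ODBC database, and the table name, column names, and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical extracted rows, keyed by the extraction-side column names.
extracted_rows = [
    {"Item": "Widget", "Quantity": 10, "Unit Price": 2.5},
    {"Item": "Gasket", "Quantity": 4, "Unit Price": 0.75},
]

# Map each extracted column to the database column that receives it.
column_map = {"Item": "item", "Quantity": "quantity", "Unit Price": "unit_price"}

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a real database
conn.execute(
    "CREATE TABLE line_items (item TEXT, quantity INTEGER, unit_price REAL)"
)

db_columns = ", ".join(column_map.values())
placeholders = ", ".join("?" for _ in column_map)
insert_sql = f"INSERT INTO line_items ({db_columns}) VALUES ({placeholders})"

# Pull values in mapping order so each one lands in its corresponding column.
conn.executemany(
    insert_sql,
    [[row[source] for source in column_map] for row in extracted_rows],
)
conn.commit()

exported = conn.execute(
    "SELECT item, quantity, unit_price FROM line_items ORDER BY rowid"
).fetchall()
```

Keeping the mapping in one dictionary means the insert statement and the value order are both derived from the same source, so a column cannot silently end up in the wrong place.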

The original document
The Grooper extracted table
The data exported to a SQL database

Table Extract Methods Overview

There are seven different methods to extract data from tables. They are set on the Extract Method property of the Data Table object. Each method identifies a table's structure and extracts each cell differently, and each has its own benefits and limitations. The seven methods are as follows:

  1. Row Match
  2. Grid Layout
  3. Tabular Layout
  4. Header-Value
    • Please note: This method remains in newer versions of Grooper only for backwards compatibility. You should use Tabular Layout instead for simpler setup and increased functionality. This method will be removed from Grooper in version 2023.1.
  5. Fluid Layout
    • Please note: This is a highly specialized method that requires Label Sets to function.
  6. Delimited Extract
    • Please note: This method is only designed to work on character delimited text files, such as CSV files.
  7. Fixed Width
    • Please note: This method is only designed to work on fixed-width text files.
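
The two text-file formats named in the last two methods can be illustrated with plain Python. This sketch only shows what character-delimited and fixed-width text look like once parsed into rows; it is not how Grooper implements Delimited Extract or Fixed Width, and the sample data and column widths are made up.

```python
import csv
import io

# Character-delimited text: a comma separates each cell.
delimited_text = "Item,Quantity\nWidget,10\nGasket,4\n"
delimited_rows = list(csv.DictReader(io.StringIO(delimited_text)))

# Fixed-width text: each column occupies a fixed span of characters.
fixed_width_text = "Widget    10\nGasket     4\n"
# (column name, start index, end index) for each column; hypothetical widths.
FIELDS = [("Item", 0, 8), ("Quantity", 8, 12)]

fixed_width_rows = [
    {name: line[start:end].strip() for name, start, end in FIELDS}
    for line in fixed_width_text.splitlines()
]
```

In the delimited case the separator character marks the cell boundaries, while in the fixed-width case the boundaries come from character positions that must be known in advance.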