2.80:Table Extraction (Concept): Difference between revisions

Revision as of 13:12, 9 January 2020

Data in an Excel spreadsheet is an example of tabular data.

Tables are one of the most common ways data is organized on documents. Human beings have been writing information into tables before they started writing literature, even before paper was invented. There are examples of tables carved onto the walls of Egyptian temples! They are excellent structures for representing a lot of information with various characteristics in common in a relatively small space (or an Egyptian temple sized space). However, targeting the data inside them presents its own set of challenges. A table’s structure can range from simple and straightforward to more complex (even confounding). Different organizations may organize the same data differently, creating different tables for what, essentially, is the same data.

In Grooper, tabular data can be extracted using the Row Match, Header-Value, or Infer Grid table extraction methods.

What Is a Table?

Tables consists of rows and columns. Where those rows and columns intersect are cells. These are the individual units of the table containing individual pieces of data. Each row consists of the same number of columns (although some columns may be empty in a given row). A single column consists of the same type of information. For example, an "Order Date" column will always have dates in the cells below it. The rows themselves are usually (but not always) in some order as well, such as in order of ascending date.

This may seem obvious, but understanding how data is structured on the page informs how you will use Grooper to target it.

What is a Table in Grooper?

In Grooper, tables are represented as Data Table objects in a Data Model. Each column is represented as a Data Column object, created as a children of the Data Table. Rows and their individual cells are created and populated upon successful data extraction.

To the left is a Content Model in the Node Tree. It contains a Data Table with Data Columns, representing the table's structure.

How raw text data is targeted and extracted to populate each row in the table is determined by the Data Table object's "Extract Method" property. This can be done using either the Row Match, the Header-Value, or the Infer Grid method.

Once the Data Table and its Data Columns are configured (according to the Extract Method), Grooper populates the table rows and extracts data to each cell.

Use Cases

Tables are used by every organization in an innumerable number of business spaces in limitless ways. Examples of tabular data can be found in...

Even data which may not initially seem like a table can be represented as a table. For example, an email inbox is essentially one big table of information pertaining to messages sent to an email address. It has columns like "From", "Received Date", "Subject" and more for rows of messages.

In fact, the main benefit to putting data in a table is that you can easily encapsulate repeated instances of data with multiple similar characteristics. Every row is just a collection of related data sharing characteristics defined by each column. Even if that information isn't presented in a table-like structure, sometimes Grooper can use the same table extraction methods to target the data and format it into a table.

The document below is a list of different contract language for different types of clauses. The information is not presented in a table but it does have some similar qualities to tabular data. It has a repeatable sections of information each containing similar pieces of information, in this case at least the type of clause, the language of the clause, and the number of contracts this language appears.

With clever configuration of the Row Match table extract method, this information can be extracted into a table in Grooper, seen below.

Tables and Database Tables

Once Grooper has extracted all this information from tables on a document set, what do you do with it? A database table is the perfect location for extracted table data on a document. All information collected from a document's table can be exported to a SQL database (or any ODBC compliant database) from Grooper. Once a connection to a SQL database is established, you can even create a database table directly from Grooper using the Data Table and Data Columns in your Content Model. Then, it's just a matter of mapping the Data Columns from the Grooper Data Table to the SQL table columns. This creates a logical connection where all the extracted information in Column A from the Grooper Data Table gets put into the corresponding Column A of the SQL database table.

need pics once i get my database export working

Table Extract Methods Overview

There are three different methods to extract data from tables. They are set on the "Extract Method" property of the Data Table object. Each method has its own benefits and limitations. The three methods are as follows:

Row Match
Header-Value
Infer Grid

Row Match

The Row Match method uses regular expression pattern matching to find each row of the table. This can be done in a number of different ways, but the general idea is to use an extractor that matches the structure of each row. For this table, a pattern or series of patterns would be written to find a date, followed by five digits, followed by a name, followed by a one to two digit number, followed by a dollar amount.

Row Match was the first table extraction method available in Grooper. It is the simplest to configure and is often the first "go-to" method for Grooper designers. And, if it can't produce results, one of the other methods is used instead.

This method is useful for fairly simple tables without much variation across a document set. Once the table's structure starts to change from document to document, a different approach may be needed. For example, even the column location changing can present problems for this method. Since each columns' pattern needs to match in the order they are returned by the row extractor, if one table has a column named "Date" in the first column and another has it as the third column, a single row extractor is not going to do the job (or a complicated row extractor with multiple formats will need to be used). For these situations, the Header-Value method may be easier to configure and produce better results.