2.90:Use Case: Transposing Rows and Columns with Row Match

From Grooper Wiki

So you have a set of documents. Most of them have a very straightforward table. It's easily extracted using the Row Match method. But every now and then, you get another document with a table presenting the same data, but in a wildly different table structure. So different, in fact, that the data is presented as rows instead of columns. One solution would be to create two different Data Models for two different Document Types with two different Data Tables using two different extraction techniques. However, there may be a more elegant solution, using what we call the "Pivot Table Method" and a "Pivot Row Extractor".

The Problem

This use case comes from the world of oil and gas. Oil and gas companies receive and produce production reports detailing volumes and sales of materials produced for a particular oil and gas well. They are essentially big tables of information. But there's a lot of different ways to make a table. The vast majority of the documents for this project looked something like this.

This has a very straightforward table structure. It is well suited for the Row Match method. You could fairly easily write a pattern to match the data on each row of this table.

However, sometimes, this project would encounter documents that look something like this one. It has a completely different table structure. In fact it uses a combination of field labels for the "Well" and "Field" etc values, and a table for the "Oil" "Water" and "Gas" data.

While these tables present data in different ways, they share a great deal of data points. For this explanation, we will focus on eight shared data points. These data points will become the Data Columns for our Data Table.

  • Well Name (also known as "Completion Name")
  • API Number
  • Oil Produced
  • Oil Sales
  • Gas Produced
  • Gas Sales
  • Water Produced
  • Producing Days

The problem is part of the data that is just contained on a single row on this document...

...is in multiple columns on a table in this one, as seen by the yellow and blue highlighted data.

Or, it's not even in the table at all, as seen by the green highlighted data.

So, the question is: How can we use the Row Match method to extract this data, using a single Row Extractor?

The Solution

The answer is to use what we call the "Pivot Table Method". The idea is to translate the columnar data in the second documents into rowed data like the first document. We will "pivot" from data organized in columns to data organized in rows. Or another way to think of it is we are going to transpose the X and Y axis of the table. We will also end up "pivoting" the data listed as fields in repeating sections to cells in the table's rows.

Let's get into Grooper and see how it's done! If you'd like a copy of this Batch and Content Model to inspect in your own Grooper environment (version 2.90), you can download the zip file below and import it.

  • !!!!insert media here Dylan!!!!!

The Documents: Oil and Gas Production Data Tables

For this use case, imagine the vast majority of our documents look like this one. These are mocked up documents coming from "Portmanteau Oil Company". It contains data pertaining to production volume and sales of oil and gas for a particular well site over a one month period.

This is a very basic, spreadsheet style table. We could write a single regex pattern to match the pattern of each row. This informs Grooper how to structure the table and extract data for each cell, using the Row Match table extraction method.

However, some of these reports come from a different company, "Scrableword Producers". They output their production data reports in a much different format. Unless you are the one controlling how a document is made, you may not have a lot of control over how information is structured from one company to the next.

This presents a problem of variability. There is no "one-size-fits-all" solution here. So the question is, what's the easiest, most efficient solution to extract both tables?

We will show you our solution using the "Pivot Table Method" described earlier, manipulating the data inside this table structure to conform it to the one we want.

The Content Model

To accomplish this task, we need a Content Model. Content Models define classification and extraction logic for a batch of documents.

This Content Model has two Document Types one for each mock vendor using a different table structure on their documents. The are as follows:

  • Portmanteau - For "Portmanteau Oil Company". This company uses the very basic table structure.
  • Scrableword - For "Scrableword Producers". This company uses the more complex table structure, presenting us our problem we have to solve.