2.90:Use Case: Transposing Rows and Columns with Row Match
So you have a set of documents. Most of them have a very straightforward table. It's easily extracted using the Row Match method. But every now and then, you get another document with a table presenting the same data, but in a wildly different table structure. So different, in fact, that the data is presented as rows instead of columns. One solution would be to create two different Data Models for two different Document Types with two different Data Tables using two different extraction techniques. However, there may be a more elegant solution, using what we call the "Pivot Table Method" and a "Pivot Row Extractor".
😎 | Special thanks to BIS team member Larry Bellipanni (with help from Bradley White) for contributing the idea for this article! |
The Problem
While these tables present data in different ways, they share a great deal of data points. For this explanation, we will focus on eight shared data points. These data points will become the Data Columns for our Data Table.
- Well Name (also known as "Completion Name")
- API Number
- Oil Produced
- Oil Sales
- Gas Produced
- Gas Sales
- Water Produced
- Producing Days
So, the question is: How can we use the Row Match method to extract this data, populating it into a single table structure, using a single Row Extractor, from two very different kinds of documents?
The Solution
The answer is to use what we call the "Pivot Table Method". The idea is to translate the columnar data in the second documents into rowed data like the first document. We will "pivot" from data organized in columns to data organized in rows. Or another way to think of it is we are going to transpose the X and Y axis of the table. We will also end up "pivoting" the data listed as fields in repeating sections to cells in the table's rows.
Let's get into Grooper and see how it's done! If you'd like a copy of this Batch and Content Model to inspect in your own Grooper environment (version 2.90), you can download the zip file below and import it.
The Documents: Oil and Gas Production Data Tables
For this use case, imagine the vast majority of our documents look like this one. These are mocked up documents coming from "Portmanteau Oil Company". It contains data pertaining to production volume and sales of oil and gas for a particular well site over a one month period. This is a very basic, spreadsheet style table. We could write a single regex pattern to match the pattern of each row. This informs Grooper how to structure the table and extract data for each cell, using the Row Match table extraction method. |
![]() |
However, some of these reports come from a different company, "Scrableword Producers". They output their production data reports in a much different format. Unless you are the one controlling how a document is made, you may not have a lot of control over how information is structured from one company to the next. This presents a problem of variability. There is no "one-size-fits-all" solution here. So the question is, what's the easiest, most efficient solution to extract both tables? We will show you our solution using the "Pivot Table Method" described earlier, manipulating the data inside this table structure to conform it to the one we want. |
![]() |
The Content Model
To accomplish this task, we need a Content Model. Content Models define classification and extraction logic for a batch of documents. |
|
This Content Model has two Document Types one for each mock vendor using a different table structure on their documents. The are as follows:
|
|
For this exercise, we are only focusing on extracting the table. To extract any kind of data, we need a Data Model to detail what we want to extract (as well as define the extractors returning the data from the documents). |
|
Since we are extracting a table, we will need a Data Table object in the Data Model. This object defines configuration settings for table extraction, most importantly the table extraction Method. For this exercise, we will use the Row Match method. |
|
A table without columns is, frankly, not a table. The Data Table must have one or more Data Column children objects. These define which of the table's columns we want to extract. We could extract all the columns' data, but for our use case, we don't want to. We will a single Data Table, named "Production", with eight Data Columns:
|
The Primary Row Extractor
We will focus on the easiest document first, the "Portmanteau" Doc Type. This is doubly useful because, in our use case, the majority of documents use this table structure.
|
|
(?<Well_Name>[^\r\n\t\f]+)\t
(?<API>\d{10})\t
(?<Oil_Prod>\d{1,6}\.\d{2})\t
(?<Oil_Sales>\d{1,6}\.\d{2})\t
\d{1,6}\.\d{2}\t
(?<Gas_Prod>\d{1,6}\.\d{2})\t
(?<Gas_Sales>\d{1,6}\.\d{2})\t
(?<Water_Prod>\d{1,6}\.\d{2})\t
\d{1,6}\.\d{2}\t
\d{1,6}\.\d{2}\t
(?<Days>\d{2})
|
|
If we were to set this extractor as our Data Table's Row Extractor, we would extract the document's table data with no problem. However, this is only the case for the "Portmanteau" documents. |
|
This extractor will not work at all for the "Scrableword" documents. It has a different table structure. The extractor does not properly match the "table". We need a completely different extractor for this table, pivoting the data into the table structure we have created for our "Portmanteau" documents with what we're calling a "Pivot Row Extractor" |
The Pivot Row Extractor
Instead of a table presenting the data we want in single rows, this document lays out the data in a repeating section. Some of the data are fields within these repeating sections. Some of the data is inside a table within these repeating section, but the data is organized in columns instead of rows. Overall, a completely different way of structuring this data on the page.
|
|
|
|
The "API" Data Format uses similar logic, just using a different Prefix Pattern and Value Pattern to match the appropriate data. |
|
Now for the fun part. How do we take a table organized by columns and flip it into a table organized by rows? That is what the three "Pivot Row Pattern" child Data Formats are for. We have three patterns (one for each child Data Format), each one picking up a row of the table in these repeating sections. In those rows contain elements of our Data Table. We will use Named Groups to call out where that data should go in our Data Column. Again, if the data instance's name matches the name of our Data Column, it doesn't matter how it's structured or where it is in the array of data we ultimately create for the parent Data Type. As long as the name matches, it will get placed in the appropriate column.
(?<Oil_Prod>\d{1,3}(,\d{3})?\.\d{2})\t
(?<Oil_Sales>\d{1,3}(,\d{3})?\.\d{2})\t
\d{1,3}\.\d{2,5}\t
\d{1,3}(,\d{3})?\.\d{2}\t
\d{1,3}(,\d{3})?\.\d{2}\t
(?<Days>\d{2})
|
|
The next "Pivot Row" is even easier. The second row of this table details information about water production. We only need a single piece of data from this row (and it happens to be in the first column of the table on the document).
|
|
Lastly, the "Pivot Row Pattern - Gas" Data Format captures the data in the "Gas" row, using two Named Groups, one for the "Gas Prod" and "Gas Sales" Data Columns. |
|
We're almost done. We just need to organize the results of our child Data Formats in a way the Data Table can use as a Row Extractor. To do that, we need to collate this data using the Collation property of the parent Data Type.
|
|
To do this we are going to use the Ordered Array Collation Provider in Flow Layout. A Flow collated Ordered Array, will look for results of its child extractors in the order they are listed from top to bottom and return them as a single result, as long as they appear in the text flow in that order. So you're reading the text data like you would a book, left-to-right and top-to-bottom. You see a result the "Well Name" data format extracts and keep reading through the text flow. You get a result from the next child extractor, the "API" Data Format. Good, that's the next extractor in the list. You keep reading along and get one for the "Pivot Row Pattern - Oil" extractor, then the "Pivot Row Pattern - Water" and last the "Pivot Row Pattern - Gas" extractor. As long as each subsequent child extractor result follows the one before it in the text flow of the document, you will get a single result containing all the data each extractor returns. If it's out of order, or one element is missing, no luck. But here, we get the result we're wanting.
|
|
If then set this Data Type as our Data Table's Row Extractor, we will successfully extract the data we want from this document, pivoting the sections and inverted table structure into the one we want. But we're not quite done yet. We have two extractors for our two documents. We want a single extractor to use for our Row Extractor. |
Put It All Together
How do you create one extractor out of two different extractors producing different results on different documents? Just make them both children of another Data Type. The new parent Data Type will receive both sets of values for each child extractor. |
||
This single extractor now extracts the desired data for two very different kinds of documents. |
||
We've pivoted the data inside the repeating sections of this document, into the table structure that works well for the majority of documents in our document set. |