2.90: Use Case: Transposing Rows and Columns with Row Match

From Grooper Wiki

So you have a set of documents. Most of them have a very straightforward table. It's easily extracted using the Row Match method. But every now and then, you get another document with a table presenting the same data, but in a wildly different table structure. So different, in fact, that the data is presented as rows instead of columns. One solution would be to create two different Data Models for two different Document Types with two different Data Tables using two different extraction techniques. However, there may be a more elegant solution, using what we call the "Pivot Table Method" and a "Pivot Row Extractor".

😎 Special thanks to BIS team member Larry Bellipanni (with help from Bradley White) for contributing the idea for this article!

The Problem

This use case comes from the world of oil and gas. Oil and gas companies receive and produce production reports detailing volumes and sales of materials produced for a particular oil and gas well. They are essentially big tables of information. But there's a lot of different ways to make a table. The vast majority of the documents for this project looked something like this.

This has a very straightforward table structure. It is well suited for the Row Match method. You could fairly easily write a pattern to match the data on each row of this table.

However, sometimes, this project would encounter documents that look something like this one. It has a completely different table structure. In fact, it uses a combination of field labels for values like "Well" and "Field", and a table for the "Oil", "Water", and "Gas" data.

While these tables present data in different ways, they share many of the same data points. For this explanation, we will focus on eight shared data points. These will become the Data Columns for our Data Table.

  • Well Name (also known as "Completion Name")
  • API Number
  • Oil Produced
  • Oil Sales
  • Gas Produced
  • Gas Sales
  • Water Produced
  • Producing Days

The problem is that data contained on a single row of this document...

...appears across multiple columns of a table on this one, as seen in the yellow and blue highlighted data.

Or it's not in the table at all, as seen in the green highlighted data.

So, the question is: How can we use the Row Match method to extract this data, populating it into a single table structure, using a single Row Extractor, from two very different kinds of documents?

The Solution

The answer is to use what we call the "Pivot Table Method". The idea is to translate the second document's columnar data into rowed data like the first document's. We will "pivot" from data organized in columns to data organized in rows. Another way to think of it: we are going to transpose the X and Y axes of the table. We will also end up "pivoting" the data listed as fields in repeating sections to cells in the table's rows.
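Outside of Grooper, the "pivot" can be sketched in a few lines of Python. This is purely conceptual (the material names, measure names, and values below are invented for illustration): data keyed by column becomes one row keyed the way the Data Columns are named.

```python
# Conceptual sketch of the "Pivot Table Method" (not Grooper code;
# sample values invented): transpose column-organized data into a row.
column_data = {
    "Oil":   {"Prod": "1204.50", "Sales": "1198.22"},
    "Gas":   {"Prod": "88.10",   "Sales": "85.00"},
    "Water": {"Prod": "14.75"},
}

# Transpose the X and Y axes: each (material, measure) pair becomes
# one cell in a single row, keyed like the target Data Columns.
row = {f"{material} {measure}": value
       for material, measures in column_data.items()
       for measure, value in measures.items()}

print(row)
# {'Oil Prod': '1204.50', 'Oil Sales': '1198.22', 'Gas Prod': '88.10',
#  'Gas Sales': '85.00', 'Water Prod': '14.75'}
```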

Let's get into Grooper and see how it's done! If you'd like a copy of this Batch and Content Model to inspect in your own Grooper environment (version 2.90), you can download the zip file below and import it.

The Documents: Oil and Gas Production Data Tables

For this use case, imagine the vast majority of our documents look like this one. These are mocked-up documents from "Portmanteau Oil Company". Each contains data pertaining to production volume and sales of oil and gas for a particular well site over a one month period.

This is a very basic, spreadsheet-style table. We could write a single regex pattern to match each row. This informs Grooper how to structure the table and extract data for each cell, using the Row Match table extraction method.

However, some of these reports come from a different company, "Scrableword Producers". They output their production data reports in a much different format. Unless you are the one controlling how a document is made, you may not have a lot of control over how information is structured from one company to the next.

This presents a problem of variability. There is no "one-size-fits-all" solution here. So the question is, what's the easiest, most efficient solution to extract both tables?

We will show you our solution using the "Pivot Table Method" described earlier, manipulating the data inside this table structure to conform it to the one we want.

The Content Model

To accomplish this task, we need a Content Model. Content Models define classification and extraction logic for a batch of documents.

This Content Model has two Document Types, one for each mock vendor, each using a different table structure on its documents. They are as follows:

  • Portmanteau - For "Portmanteau Oil Company". This company uses the very basic table structure on its production data reporting document.
  • Scrableword - For "Scrableword Producers". This company uses the more complex table structure, presenting the problem we have to solve.

For this exercise, we are only focusing on extracting the table. To extract any kind of data, we need a Data Model to detail what we want to extract (as well as define the extractors returning the data from the documents).

Since we are extracting a table, we will need a Data Table object in the Data Model. This object defines configuration settings for table extraction, most importantly the table extraction Method. For this exercise, we will use the Row Match method.

A table without columns is, frankly, not a table. The Data Table must have one or more child Data Column objects. These define which of the table's columns we want to extract. We could extract all the columns' data, but for our use case, we don't want to. We will use a single Data Table, named "Production", with eight Data Columns:

  • Well Name
  • API
  • Oil Prod
  • Oil Sales
  • Gas Prod
  • Gas Sales
  • Water Prod
  • Days

The Primary Row Extractor

We will focus on the easiest document first, the "Portmanteau" Doc Type. This is doubly useful because, in our use case, the majority of documents use this table structure.

  1. What we have here is a Data Type with a pattern that matches each row of the table. It is named "TBL-RM - Production [Portmanteau]".
    • This is the first thing the Row Match method needs to know in order to function. For straightforward tables like this, if a table's structure can be described as rows of information, and you can pattern match what that row looks like, you can model the structure of the entire table, whether it's one row, twenty, or a million.
  2. In our results list, you can see each row is returned as a full line. Since we have a result for each row, we've successfully modeled our table's structure.

  1. The pattern for this extractor is located in the child Data Format named "Row Pattern".
    • Note: This could have also just been an internal pattern on the parent Data Type using its Pattern property.
  2. The Value Pattern matches each cell's data, with Tab Marking enabled so the large gaps between cells are matched as tab characters (\t). The full pattern is as follows:
(?<Well_Name>[^\r\n\t\f]+)\t
(?<API>\d{10})\t
(?<Oil_Prod>\d{1,6}\.\d{2})\t
(?<Oil_Sales>\d{1,6}\.\d{2})\t
\d{1,6}\.\d{2}\t
(?<Gas_Prod>\d{1,6}\.\d{2})\t
(?<Gas_Sales>\d{1,6}\.\d{2})\t
(?<Water_Prod>\d{1,6}\.\d{2})\t
\d{1,6}\.\d{2}\t
\d{1,6}\.\d{2}\t
(?<Days>\d{2})
  1. This pattern also takes advantage of Named Groups. A Named Group creates a data instance for the portion of regex captured by the parenthesized group. For the Row Match method, if the data instance's name matches the name of a Data Column, it will populate that column with the data instance's text. For example, the Named Group "Days" matches the Data Column named "Days". The two digits picked up by the regex in that group, \d{2}, will populate the table cells for the "Days" column.
  2. This Data Format also uses a Prefix Pattern to only match rows that begin on a new line or at the beginning of the string (in this case the first line of the document) using the regex \n|^.
  3. It also uses a Suffix Pattern to only match rows that terminate at the end of a line or the end of the string (in this case the last line of the document) using the regex \r|$.
    • Note: The Value Pattern would still match without the Prefix and Suffix Patterns. However, it is generally regarded as best practice to include these patterns (or patterns like them) as positional anchors for tables. This gives more specificity to your row patterns and can eliminate false positives in many cases.
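The same row-modeling idea translates directly to ordinary regex. Below is a trimmed-down Python sketch (Python spells named groups as (?P<name>…) rather than .NET's (?<name>…); the sample row is invented, and line anchors stand in for the Prefix and Suffix Patterns):

```python
import re

# A trimmed-down version of the row pattern: ^ and $ stand in for the
# Prefix/Suffix Patterns, and named groups map to Data Column names.
row_pattern = re.compile(
    r"^(?P<Well_Name>[^\r\n\t\f]+)\t"
    r"(?P<API>\d{10})\t"
    r"(?P<Oil_Prod>\d{1,6}\.\d{2})\t"
    r"(?P<Oil_Sales>\d{1,6}\.\d{2})\t"
    r"(?P<Days>\d{2})$",
    re.MULTILINE,
)

text = "SMITH 1-H\t4212345678\t1204.50\t1198.22\t30"
rows = [m.groupdict() for m in row_pattern.finditer(text)]
print(rows[0])
# {'Well_Name': 'SMITH 1-H', 'API': '4212345678',
#  'Oil_Prod': '1204.50', 'Oil_Sales': '1198.22', 'Days': '30'}
```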

If we were to set this extractor as our Data Table's Row Extractor, we would extract the document's table data with no problem.

However, this is only the case for the "Portmanteau" documents.

This extractor will not work at all for the "Scrableword" documents. They have a different table structure, and the extractor does not properly match the "table".

We need a completely different extractor for this table, one that pivots the data into the table structure we created for our "Portmanteau" documents. We call it a "Pivot Row Extractor".

The Pivot Row Extractor

Instead of a table presenting the data we want in single rows, this document lays out the data in a repeating section. Some of the data are fields within these repeating sections. Some of the data is inside a table within these repeating sections, but the data is organized in columns instead of rows. Overall, it's a completely different way of structuring this data on the page.

  1. Rather than using a single pattern to model the structure of a table, we will use an Ordered Array to find the same data elements within these sections.
  2. This produces four results, each containing all the data we want from each section. We will pivot this information, using the data instances inside each result, to match the rows of our Data Table.
  3. Instead of Named Groups in a single regex line producing these data instances, multiple Data Formats (as well as Named Groups within some of them) will produce the data instances.

  1. The Data Format named "Well Name" will create a data instance that matches our Data Column named "Well Name".
    • This is analogous to how the Primary Row Extractor's Named Group "Well_Name" creates a data instance from a portion of the regex line to populate the "Well Name" column.
  2. The extraction technique here is to use the field's label "Well: " as a Prefix Pattern...
  3. ...and capture the field's value (the well's name) with a simple segment pattern, [^\r\n\t\f]+.
    • Note: You will need to enable Tab Marking in the Preprocessing Options of the "Properties" tab for this to work.
  4. We are left with four well names for four sections. This data will populate the "Well Name" column of our Data Table.
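The label-as-prefix technique maps directly onto a regex lookbehind. A minimal Python sketch (the sample text is invented; Grooper's Prefix Pattern behaves like the lookbehind here in that the label anchors the match without being captured):

```python
import re

# The label "Well: " anchors each match but is excluded from the
# captured value, just like a Prefix Pattern.
text = "Well: SMITH 1-H\nField: EXAMPLE\nWell: JONES 2-H\n"
well_names = re.findall(r"(?<=Well: )[^\r\n\t\f]+", text)
print(well_names)
# ['SMITH 1-H', 'JONES 2-H']
```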

The "API" Data Format uses similar logic, just using a different Prefix Pattern and Value Pattern to match the appropriate data.

Now for the fun part. How do we take a table organized by columns and flip it into a table organized by rows? That is what the three "Pivot Row Pattern" child Data Formats are for.

We have three patterns (one for each child Data Format), each one picking up a row of the table in these repeating sections. Those rows contain elements of our Data Table. We will use Named Groups to call out where that data should go in our Data Columns. Again, if the data instance's name matches the name of a Data Column, it doesn't matter how it's structured or where it is in the array of data we ultimately create for the parent Data Type. As long as the name matches, it will get placed in the appropriate column.

  1. Now, it's just a matter of writing a pattern to match each row. We will start with the first row detailing information about oil production, and the "Pivot Row Pattern - Oil" Data Format.
  2. We can home in on where the "Oil" row is on this document using a Prefix Pattern.
    • The regex oil \(bbls\):\s will match the row's label "Oil (bbls): " on the document.
  3. This row contains three pieces of information relevant for our table: oil production, oil sales, and producing days. We simply need to place the portions of the Value Pattern's regex matching this data into Named Groups that match our Data Column names: Oil Prod, Oil Sales, and Days. The Value Pattern seen here is as follows:
(?<Oil_Prod>\d{1,3}(,\d{3})?\.\d{2})\t
(?<Oil_Sales>\d{1,3}(,\d{3})?\.\d{2})\t
\d{1,3}\.\d{2,5}\t
\d{1,3}(,\d{3})?\.\d{2}\t
\d{1,3}(,\d{3})?\.\d{2}\t
(?<Days>\d{2})
  • Note: If your Data Column names contain spaces, you must replace each space with an underscore (_) in the Named Group's name, since regex group names cannot contain spaces.
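In plain regex terms, the named groups carve the three relevant cells out of the "Oil" row while the unneeded columns are matched but left uncaptured. A Python sketch (sample row invented; Python uses (?P<name>…) for named groups):

```python
import re

# Underscores stand in for the spaces in the "Oil Prod", "Oil Sales",
# and "Days" Data Column names, since group names cannot contain spaces.
oil_row = re.compile(
    r"(?P<Oil_Prod>\d{1,3}(?:,\d{3})?\.\d{2})\t"
    r"(?P<Oil_Sales>\d{1,3}(?:,\d{3})?\.\d{2})\t"
    r"\d{1,3}\.\d{2,5}\t"            # unit price: matched, not captured
    r"\d{1,3}(?:,\d{3})?\.\d{2}\t"   # unneeded column
    r"\d{1,3}(?:,\d{3})?\.\d{2}\t"   # unneeded column
    r"(?P<Days>\d{2})"
)

line = "1,204.50\t1,198.22\t55.125\t66,274.38\t1,240.11\t30"
cells = oil_row.match(line).groupdict()
print(cells)
# {'Oil_Prod': '1,204.50', 'Oil_Sales': '1,198.22', 'Days': '30'}
```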

The next "Pivot Row" is even easier. The second row of this table details information about water production. We only need a single piece of data from this row (and it happens to be in the first column of the table on the document).

  1. For the "Pivot Row Pattern - Water" Data Format
  2. We home in on where the "Water" row is using the Prefix Pattern water \(bbls\):\s
  3. And for our Value Pattern, we simply match the data in the first column with a Named Group matching the Data Column named "Water Prod".
    • Note: We could have also named the Data Format "Water Prod" instead of using a Named Group. Either way creates a named data instance that matches our Data Column!

Lastly, the "Pivot Row Pattern - Gas" Data Format captures the data in the "Gas" row, using two Named Groups, one each for the "Gas Prod" and "Gas Sales" Data Columns.

We're almost done. We just need to organize the results of our child Data Formats in a way the Data Table can use as a Row Extractor. To do that, we need to collate this data using the Collation property of the parent Data Type.

  1. If we leave this Data Type's Collation property set to Individual...
  2. You can see we get individual results for the various data elements we want to use to populate our table.
    • But we need to reorder them into results the Data Table's Row Extractor can use to create data instances out of each row, also known as "Row Instances". These Row Instances will then pull the data instances created by our Data Formats to populate the correspondingly named Data Columns.

To do this, we are going to use the Ordered Array Collation Provider in Flow Layout. A Flow-collated Ordered Array will look for results from its child extractors in the order they are listed, top to bottom, and return them as a single result, as long as they appear in the text flow in that order.

So you're reading the text data like you would a book, left to right and top to bottom. You see a result the "Well Name" Data Format extracts and keep reading through the text flow. You get a result from the next child extractor, the "API" Data Format. Good, that's the next extractor in the list. You keep reading along and get one for the "Pivot Row Pattern - Oil" extractor, then the "Pivot Row Pattern - Water", and last the "Pivot Row Pattern - Gas" extractor.

As long as each subsequent child extractor result follows the one before it in the text flow of the document, you will get a single result containing all the data each extractor returns. If it's out of order, or one element is missing, no luck. But here, we get the result we're wanting.

  1. With the Collation property set to Ordered Array...
  2. And the Flow Layout property set to Enabled...
  3. And the Separator Expression property set to .*, we return four results for the four sections of oil and gas production data, which we will return to our Data Table.
    • The Separator Expression and Maximum Character Distance properties are important aspects of how Flow Layout functions. You must tell Grooper what you expect to find between each element of the ordered array. By setting the expression to .*, we are essentially saying there can be any character (the . wildcard) of any length (the * quantifier) between each child extractor's result in the text flow.
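The ordered-array idea can be sketched by chaining child patterns with a permissive separator. This is a conceptual Python illustration only, not Grooper's implementation (the sample text and patterns are invented):

```python
import re

# Conceptual sketch of a Flow-layout Ordered Array: the child patterns
# must match in listed order, with a ".*" separator allowing anything
# in between. A lazy ".*?" keeps the separator from over-consuming.
children = [
    ("Well_Name", r"(?<=Well: )[^\r\n\t]+"),
    ("API",       r"\d{10}"),
    ("Oil_Prod",  r"(?<=Oil \(bbls\): )\d[\d,]*\.\d{2}"),
]
combined = re.compile(
    ".*?".join(f"(?P<{name}>{pat})" for name, pat in children),
    re.DOTALL,  # let the separator cross line breaks
)

text = "Well: SMITH 1-H\nAPI 4212345678\nOil (bbls): 1,204.50"
result = combined.search(text).groupdict()
print(result)
# {'Well_Name': 'SMITH 1-H', 'API': '4212345678', 'Oil_Prod': '1,204.50'}
```

All three elements appear in the text flow in the listed order, so the search returns one combined result; shuffle or drop an element and the search fails, mirroring the "out of order, or one element is missing, no luck" behavior described above.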

If we then set this Data Type as our Data Table's Row Extractor, we will successfully extract the data we want from this document, pivoting the sections and inverted table structure into the one we want.

But we're not quite done yet. We have two extractors for our two documents. We want a single extractor to use for our Row Extractor.

Put It All Together

How do you create one extractor out of two different extractors producing different results on different documents? Just make them both children of another Data Type. The new parent Data Type will return the results from both child extractors.
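Conceptually, the parent behaves like a function that pools both children's results; whichever child's pattern matches a given document supplies the rows. A Python sketch with invented mini-documents and deliberately simplified child patterns:

```python
import re

# Two child "extractors", one per vendor layout. The parent simply
# pools whatever each child returns for a given document's text.
portmanteau_row = re.compile(
    r"^(?P<Well_Name>[^\t]+)\t(?P<API>\d{10})$", re.MULTILINE)
scrableword_row = re.compile(
    r"Well: (?P<Well_Name>[^\r\n]+)[\s\S]*?API (?P<API>\d{10})")

def extract_rows(text):
    rows = [m.groupdict() for m in portmanteau_row.finditer(text)]
    rows += [m.groupdict() for m in scrableword_row.finditer(text)]
    return rows

doc_a = "SMITH 1-H\t4212345678"            # Portmanteau-style row
doc_b = "Well: JONES 2-H\nAPI 4298765432"  # Scrableword-style section
print(extract_rows(doc_a))  # one row, from the first child only
print(extract_rows(doc_b))  # one row, from the second child only
```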

This single extractor now extracts the desired data for two very different kinds of documents.

We've pivoted the data inside the repeating sections of this document into the table structure that works well for the majority of documents in our document set.