2.90:Use Case: Transposing Rows and Columns with Row Match


So you have a set of documents. Most of them have a very straightforward table. It's easily extracted using the Row Match method. But every now and then, you get another document with a table presenting the same data, but in a wildly different table structure. So different, in fact, that the data is presented as rows instead of columns. One solution would be to create two different Data Models for two different Document Types with two different Data Tables using two different extraction techniques. However, there may be a more elegant solution, using what we call the "Pivot Table Method" and a "Pivot Row Extractor".

The Problem

This use case comes from the world of oil and gas. Oil and gas companies receive and produce production reports detailing volumes and sales of materials produced for a particular oil and gas well. These reports are essentially big tables of information. But there are a lot of different ways to make a table. The vast majority of the documents for this project looked something like this.

This has a very straightforward table structure. It is well suited for the Row Match method. You could fairly easily write a pattern to match the data on each row of this table.

However, sometimes this project would encounter documents that look something like this one. It has a completely different table structure. In fact, it uses a combination of field labels for the "Well", "Field", etc. values and a table for the "Oil", "Water", and "Gas" data.

While these tables present data in different ways, they share many of the same data points. For this explanation, we will focus on eight shared data points. These data points will become the Data Columns for our Data Table.

  • Well Name (also known as "Completion Name")
  • API Number
  • Oil Produced
  • Oil Sales
  • Gas Produced
  • Gas Sales
  • Water Produced
  • Producing Days

The problem is that data contained on a single row of this document...

...is spread across multiple columns of a table in this one, as seen in the yellow and blue highlighted data.

Or it's not in the table at all, as seen in the green highlighted data.

So, the question is: How can we use the Row Match method to extract this data, using a single Row Extractor?

The Solution

The answer is to use what we call the "Pivot Table Method". The idea is to translate the columnar data in the second document into row-based data like the first document's. We will "pivot" from data organized in columns to data organized in rows. Another way to think of it: we are going to transpose the X and Y axes of the table. We will also end up "pivoting" the data listed as fields in repeating sections into cells in the table's rows.
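
If it helps to see the pivot outside of Grooper, here is a minimal conceptual sketch in plain Python (not Grooper code; the well values and dictionary layout are invented for illustration). Data that sits in a stacked, per-material layout on the page gets flattened into the single row our Data Table expects:

# Conceptual sketch only: flatten per-material values (stacked vertically on
# the page) into one row keyed by the Data Column names. Sample values invented.
section = {
    "Well Name": "EXAMPLE WELL #1",   # pulled from a field label
    "API": "1234567890",              # pulled from a field label
    "Oil":   {"Prod": "100.00", "Sales": "95.00", "Days": "30"},
    "Gas":   {"Prod": "250.00", "Sales": "240.00"},
    "Water": {"Prod": "12.00"},
}

row = {
    "Well Name":  section["Well Name"],
    "API":        section["API"],
    "Oil Prod":   section["Oil"]["Prod"],
    "Oil Sales":  section["Oil"]["Sales"],
    "Gas Prod":   section["Gas"]["Prod"],
    "Gas Sales":  section["Gas"]["Sales"],
    "Water Prod": section["Water"]["Prod"],
    "Days":       section["Oil"]["Days"],
}
print(row)  # one table row per repeating section

Each repeating section on the oddball documents will ultimately become one such row.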

Let's get into Grooper and see how it's done! If you'd like a copy of this Batch and Content Model to inspect in your own Grooper environment (version 2.90), you can download the zip file below and import it.


The Documents: Oil and Gas Production Data Tables

For this use case, imagine the vast majority of our documents look like this one. These are mocked-up documents from "Portmanteau Oil Company". Each contains data pertaining to production volume and sales of oil and gas for a particular well site over a one-month period.

This is a very basic, spreadsheet-style table. We can write a single regex pattern that matches each row. This informs Grooper how to structure the table and extract data for each cell, using the Row Match table extraction method.

However, some of these reports come from a different company, "Scrableword Producers". They output their production data reports in a very different format. Unless you are the one controlling how a document is made, you may not have much control over how information is structured from one company to the next.

This presents a problem of variability. There is no "one-size-fits-all" solution here. So the question is, what's the easiest, most efficient solution to extract both tables?

We will show you our solution using the "Pivot Table Method" described earlier, manipulating the data inside this table structure to conform it to the one we want.

The Content Model

To accomplish this task, we need a Content Model. Content Models define classification and extraction logic for a batch of documents.

This Content Model has two Document Types, one for each mock vendor, each using a different table structure on its documents. They are as follows:

  • Portmanteau - For "Portmanteau Oil Company". This company uses the very basic table structure on its production data reporting documents.
  • Scrableword - For "Scrableword Producers". This company uses the more complex table structure, presenting the problem we have to solve.

For this exercise, we are only focusing on extracting the table. To extract any kind of data, we need a Data Model to detail what we want to extract (as well as define the extractors returning the data from the documents).

Since we are extracting a table, we will need a Data Table object in the Data Model. This object defines configuration settings for table extraction, most importantly the table extraction Method. For this exercise, we will use the Row Match method.

A table without columns is, frankly, not a table. The Data Table must have one or more Data Column child objects. These define which of the table's columns we want to extract. We could extract all the columns' data, but for our use case, we don't want to. We will use a single Data Table, named "Production", with eight Data Columns:

  • Well Name
  • API
  • Oil Prod
  • Oil Sales
  • Gas Prod
  • Gas Sales
  • Water Prod
  • Days

The Primary Row Extractor

We will focus on the easiest document first, the "Portmanteau" Doc Type. This is doubly useful because, in our use case, the majority of documents use this table structure.

  1. What we have here is a Data Type with a pattern that matches each row of the table. It is named "TBL-RM - Production [Portmanteau]".
    • This is the first thing the Row Match method needs to know in order to function. For straightforward tables like this one, if a table's structure can be described as rows of information and you can pattern match what a row looks like, you can model the structure of the entire table, whether it has one row, twenty, or a million.
  2. In our results list, you can see each row is returned as a full line. Since we have a result for each row, we've successfully modeled our table's structure.

  1. The pattern for this extractor is located in the child Data Format named "Row Pattern".
    • Note: This could have also just been an internal pattern on the parent Data Type using its Pattern property.
  2. The Value Pattern essentially matches each cell's data, with Tab Marking enabled so the large gaps between cells are matched as tab characters (\t). The full pattern is as follows:
(?<Well_Name>[^\r\n\t\f]+)\t
(?<API>\d{10})\t
(?<Oil_Prod>\d{1,6}\.\d{2})\t
(?<Oil_Sales>\d{1,6}\.\d{2})\t
\d{1,6}\.\d{2}\t
(?<Gas_Prod>\d{1,6}\.\d{2})\t
(?<Gas_Sales>\d{1,6}\.\d{2})\t
(?<Water_Prod>\d{1,6}\.\d{2})\t
\d{1,6}\.\d{2}\t
\d{1,6}\.\d{2}\t
(?<Days>\d{2})
  1. This pattern also takes advantage of Named Groups. Named Groups create data instances for each portion of regex captured by the group enclosed in parentheses. For the Row Match method, if the data instance's name matches the name of a Data Column, it will populate that column with the data instance's text. For example, the Named Group "Days" matches the Data Column named "Days". The two digits picked up by the regex in that group, \d{2}, will populate the table cells for the "Days" column (a plain-regex sketch follows this list).
  2. This Data Format also uses a Prefix Pattern to only match rows that begin on a new line or at the beginning of the string (in this case the first line of the document) using the regex \n|^.
  3. It also uses a Suffix Pattern to only match rows that terminate at the end of a line or the end of the string (in this case the last line of the document) using the regex \r|$.
    • Note: The Value Pattern would still match without the Prefix and Suffix Patterns. However, it is generally regarded as best practice to include these patterns (or patterns like them) as positional anchors for tables. This gives more specificity to your row patterns and can eliminate false positives in many cases.
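
If you'd like to see the same row-matching logic outside of Grooper, here's a minimal sketch using Python's re module (the sample text is invented; Python spells Named Groups as (?P<name>…) where Grooper's .NET regex uses (?<name>…), and the MULTILINE anchors stand in for the Prefix and Suffix Patterns):

import re

# Conceptual sketch only, not Grooper. One row pattern with Named Groups models
# every row of a tab-delimited table; Grooper's Tab Marking would normally
# supply the \t characters between cells. Sample rows are invented.
sample_text = (
    "EXAMPLE WELL #1\t1234567890\t100.00\t95.00\t1.23\t250.00\t240.00\t12.00\t3.45\t6.78\t30\n"
    "EXAMPLE WELL #2\t0987654321\t200.00\t190.00\t1.23\t300.00\t290.00\t15.00\t3.45\t6.78\t28\n"
)

row_pattern = re.compile(
    r"^"                                   # stands in for the Prefix Pattern
    r"(?P<Well_Name>[^\r\n\t\f]+)\t"
    r"(?P<API>\d{10})\t"
    r"(?P<Oil_Prod>\d{1,6}\.\d{2})\t"
    r"(?P<Oil_Sales>\d{1,6}\.\d{2})\t"
    r"\d{1,6}\.\d{2}\t"                    # column we don't need
    r"(?P<Gas_Prod>\d{1,6}\.\d{2})\t"
    r"(?P<Gas_Sales>\d{1,6}\.\d{2})\t"
    r"(?P<Water_Prod>\d{1,6}\.\d{2})\t"
    r"\d{1,6}\.\d{2}\t"                    # column we don't need
    r"\d{1,6}\.\d{2}\t"                    # column we don't need
    r"(?P<Days>\d{2})"
    r"$",                                  # stands in for the Suffix Pattern
    re.MULTILINE,
)

# Each match is one table row; the Named Groups map onto the Data Columns
# (underscores stand in for the spaces in the column names).
for match in row_pattern.finditer(sample_text):
    print(match.groupdict())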

If we were to set this extractor as our Data Table's Row Extractor, we would extract the document's table data with no problem.

However, this is only the case for the "Portmanteau" documents.

This extractor will not work at all for the "Scrableword" documents. They have a different table structure, and the extractor does not properly match their "table".

We need a completely different extractor for this table, one that pivots the data into the table structure we have created for our "Portmanteau" documents. This is what we're calling a "Pivot Row Extractor".

The Pivot Row Extractor

Instead of a table presenting the data we want in single rows, this document lays out the data in repeating sections. Some of the data is presented as fields within these repeating sections. Some of the data is inside a table within these repeating sections, but that data is organized in columns instead of rows. Overall, it is a completely different way of structuring this data on the page.

  1. Rather than using a single pattern to model the structure of a table, we will use an Ordered Array to find the same data elements within these sections.
  2. What we produce is four results, each containing all the data we want from one section. We will pivot this information, using the data instances inside each result, to match the rows of our Data Table.
  3. Instead of Named Groups in a single regex line producing these data instances, they are produced by multiple Data Formats, as well as by Named Groups within some of those Data Formats (see the sketch after this list).
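
As a rough illustration (plain Python, invented sample text, not how Grooper implements this internally), you can think of each repeating section as one result that will eventually become one table row:

import re

# Conceptual sketch only: one result per repeating section, and each result
# will become one row of the "Production" table. Sample text is invented and
# heavily abbreviated.
document_text = (
    "Well: EXAMPLE WELL #1\nAPI: 1234567890\nOil (bbls):\t100.00\t95.00\t30\n"
    "Well: EXAMPLE WELL #2\nAPI: 0987654321\nOil (bbls):\t200.00\t190.00\t28\n"
)

# Split the document wherever a new "Well:" field label starts a section.
sections = [s for s in re.split(r"(?=Well: )", document_text) if s.strip()]

for number, section in enumerate(sections, start=1):
    # The sketches that follow show the field and pivot-row patterns that pull
    # the named data instances out of each section.
    print(f"--- Section {number} (one future table row) ---\n{section}")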

  1. The Data Format named "Well Name" will create a data instance that matches our Data Column named "Well Name".
    • This is analogous to how our Primary Row Extractor populates the "Well Name" column, using a Named Group to create a data instance from a portion of the regex line.
  2. The extraction technique here is to use the field's label "Well: " as a Prefix Pattern,
  3. and capture the field's value (the well's name) with a simple segment pattern [^\r\n\t\f]+ (see the sketch after this list).
    • Note: You will need to enable Tab Marking in the Preprocessing Options of the "Properties" tab for this to work.
  4. We are left with four well names for four sections. This data will populate the "Well Name" column of our Data Table.
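
Isolated from everything else, the prefix-and-value technique looks something like this in plain Python (sample text invented; a sketch of the idea, not Grooper's implementation):

import re

# Conceptual sketch only: the field label plays the role of the Prefix Pattern
# and the capture group plays the role of the Value Pattern.
sample_section = "Well: EXAMPLE WELL #1\nField: EXAMPLE FIELD\nAPI: 1234567890\n"

match = re.search(r"Well:\s*(?P<Well_Name>[^\r\n\t\f]+)", sample_section)
print(match.group("Well_Name"))   # -> EXAMPLE WELL #1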

The "API" Data Format uses similar logic, just using a different Prefix Pattern and Value Pattern to match the appropriate data.

Now for the fun part. How do we take a table organized by columns and flip it into a table organized by rows? That is what the three "Pivot Row Pattern" child Data Formats are for.

We have three patterns (one for each child Data Format), each one picking up a row of the table in these repeating sections. Those rows contain elements of our Data Table. We will use Named Groups to call out which Data Column that data should go in. Again, if the data instance's name matches the name of a Data Column, it doesn't matter how it's structured or where it is in the array of data we ultimately create for the parent Data Type. As long as the name matches, it will get placed in the appropriate column.

  1. Now, it's just a matter of writing a pattern to match each row. We will start with the "Oil" row and the "Pivot Row Pattern - Oil" Data Format.
  2. We can home in on where the "Oil" row is on this document using a Prefix Pattern.
    • The regex oil \(bbls\):\s will match the row's label "Oil (bbls): " on the document.
  3. This row contains three pieces of information relevant for our table: oil production, oil sales, and producing days. We simply need to place the portions of the Value Pattern's regex matching this data into Named Groups that match our Data Column names: Oil Prod, Oil Sales, and Days. The Value Pattern seen here is as follows:
(?<Oil_Prod>\d{1,3}(,\d{3})?\.\d{2})\t
(?<Oil_Sales>\d{1,3}(,\d{3})?\.\d{2})\t
\d{1,3}\.\d{2,5}\t
\d{1,3}(,\d{3})?\.\d{2}\t
\d{1,3}(,\d{3})?\.\d{2}\t
(?<Days>\d{2})
  • Note: If your Data Column names have spaces in them, you must replace the spaces with underscores (_) in the Named Group names, as we've done here with Oil_Prod and Oil_Sales. A plain-regex sketch of this pivot row follows below.
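
And here is the "Oil" pivot row pattern run against an invented sample line in plain Python (a sketch only: Python uses (?P<name>…) where .NET uses (?<name>…), and re.IGNORECASE stands in for Grooper's case-insensitive matching):

import re

# Conceptual sketch only: the row label acts as the Prefix Pattern, and the
# Named Groups are spelled with underscores so they line up with the
# "Oil Prod", "Oil Sales", and "Days" Data Columns. Sample values invented.
sample_row = "Oil (bbls):\t1,234.00\t1,200.00\t45.67\t890.12\t345.67\t30"

oil_pattern = re.compile(
    r"oil \(bbls\):\s"                         # prefix: the row's label
    r"(?P<Oil_Prod>\d{1,3}(,\d{3})?\.\d{2})\t"
    r"(?P<Oil_Sales>\d{1,3}(,\d{3})?\.\d{2})\t"
    r"\d{1,3}\.\d{2,5}\t"                      # column we don't need
    r"\d{1,3}(,\d{3})?\.\d{2}\t"               # column we don't need
    r"\d{1,3}(,\d{3})?\.\d{2}\t"               # column we don't need
    r"(?P<Days>\d{2})",
    re.IGNORECASE,
)

match = oil_pattern.search(sample_row)
print(match.groupdict())
# -> {'Oil_Prod': '1,234.00', 'Oil_Sales': '1,200.00', 'Days': '30'}

The other two "Pivot Row Pattern" Data Formats follow the same shape for the "Gas" and "Water" rows, each with its own row label and Named Groups.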