2.80:Row Match (Table Extract Method)

Row Match is one of three methods available to Data Table data elements to extract information from tables on a document set. It uses regular expression pattern matching to determine a table's structure based on the pattern of each row and extract cell data from each column.

About

The configuration of Row Match is relatively simple. As you can see in the image below, there is only one property to configure, the "Row Extractor". Just like any other extractor, it can be an Internal pattern or a Reference to an extractor in the Node Tree.

This extractor is made to find the general pattern of each row in a table. There are many ways to accomplish this. We will look at a simple example with a fairly simple solution. Examine the table below.

What we are trying to do is model the table's structure using regular expressions. Each row has a pattern to it from left to right: first a date, followed by a five-digit number, followed by a name of variable length, followed by a one- or two-digit number, followed by a variable dollar amount. This is the table's structure: a series of rows containing this information, in this order. We can use a regular expression to match that pattern of data. First, we might break it down further and look at patterns to target each column, one after the other. That could look something like the table below.

Regular expression	Match found as seen on image	Notes
`\d{1,2}/\d{1,2}/\d{4}`		This pattern is just one of many ways to find this specific date format. This pattern would need to be adjusted to find other formats such as "month-day-year".
`\d{5}`		A simple regex to only find a five digit number.
`[A-Z]+`		This pattern also matches words outside of the row, but that's ok. We will put this all together and only match the row shortly.
`\d{1,2}`		Same with this pattern.
`[$]\d{1,3}(,\d{3})?[.]\d{2}`		Again, one of many ways to match currency values.
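
To see why the per-column patterns are too loose on their own, here is a minimal Python sketch that runs each one against a small block of text. The sample text and the dictionary keys are made up for illustration, and Python's `re` module stands in for Grooper's pattern editor, so treat it as an approximation of the behavior rather than what Grooper does internally.

```python
import re

# Made-up sample text; "\t" stands in for the wide gaps between cells.
sample = (
    "DAILY SALES REPORT\r\n"
    "01/05/2021\t10482\tSMITH\t3\t$1,250.00\r\n"
    "11/17/2021\t20931\tJOHNSON\t12\t$87.50\r\n"
)

# The five column patterns from the table above (key names are illustrative).
column_patterns = {
    "date": r"\d{1,2}/\d{1,2}/\d{4}",
    "five_digit": r"\d{5}",
    "name": r"[A-Z]+",
    "small_number": r"\d{1,2}",
    "amount": r"[$]\d{1,3}(,\d{3})?[.]\d{2}",
}

# Collect the full text of every match for each pattern.
hits = {
    key: [m.group(0) for m in re.finditer(pattern, sample)]
    for key, pattern in column_patterns.items()
}

for key, found in hits.items():
    print(f"{key}: {len(found)} match(es)")
```

The date, five-digit, and amount patterns each find exactly one value per row in this sample, while the name pattern also picks up the words in the title line and the one-to-two-digit pattern matches pieces of the dates and amounts. That looseness is harmless once the patterns are combined into a single row pattern.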

Now that we can match every cell in the row, we just need to put all that information together and match the whole row. We can enable "tab marking" in a pattern to insert the "\t" control character between segments of text separated by a large amount of whitespace (larger than a normal space character). We can use this tab character to make a regular expression pattern seen below to match the entire row.

Regular expression for the entire row	`\d{1,2}/\d{1,2}/\d{4}\t\d{5}\t\w+\t\d{1,2}\t[$]\d{1,3}(,\d{3})?[.]\d{2}`
Each row matches the single expression
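
As a rough illustration, the row pattern can be tried outside of Grooper with Python's `re` module. The sample text below is invented, and the `\t` characters are typed in by hand to imitate what tab marking would insert, so this is only a sketch of the idea.

```python
import re

# The whole-row pattern from above, with "\t" between each cell.
ROW_PATTERN = re.compile(
    r"\d{1,2}/\d{1,2}/\d{4}\t\d{5}\t\w+\t\d{1,2}\t[$]\d{1,3}(,\d{3})?[.]\d{2}"
)

# Invented document text: a title, a header line, two data rows, a footer.
text = (
    "DAILY SALES REPORT\r\n"
    "Date\tOrder\tName\tQty\tTotal\r\n"
    "01/05/2021\t10482\tSMITH\t3\t$1,250.00\r\n"
    "11/17/2021\t20931\tJOHNSON\t12\t$87.50\r\n"
    "Page 1 of 1\r\n"
)

# Each match is one table row; the title, header, and footer are skipped.
rows = [m.group(0) for m in ROW_PATTERN.finditer(text)]

for row in rows:
    print(row.split("\t"))
```

Only the two data lines match, because only they contain all five cell patterns in order, separated by tabs.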

Once you are able to match each row, the next part is to extract the data in that row corresponding to each column. This can be done in a number of ways, but an important thing to point out is what you're trying to match. Once Grooper finds a row using the Row Extractor, it creates an "instance" of that extracted row. You can think of this as a sub-element within the whole document. You still use regular expression to match the information within the row, but instead of matching against the entire document, you are only matching against this sub-element (or "instance"). This can trip up users who are trying to match data against the entire document. Only the portion of the text data in the extracted row is contained in the instance.

A single row matched on the document.

The match seen in the whole document's text data.

The text data in the instance of that matched row.
This is the text data you will use to pattern match each column's value. Notice you no longer have the control characters `\r` and `\n` at the end of the line. Those characters are part of the whole document's text data, but not the instance's.
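
The difference between matching against the whole document and matching against a row instance can be sketched in Python. This is an analogy rather than Grooper's actual implementation: the "instance" here is simply the matched row string, and the sample text and variable names are made up.

```python
import re

# Invented document text: one stray line plus one table row.
document_text = (
    "Invoice 20931 issued 11/17/2021\r\n"
    "01/05/2021\t10482\tSMITH\t3\t$1,250.00\r\n"
)

row_pattern = re.compile(
    r"\d{1,2}/\d{1,2}/\d{4}\t\d{5}\t\w+\t\d{1,2}\t[$]\d{1,3}(,\d{3})?[.]\d{2}"
)
five_digit = re.compile(r"\d{5}")

# Against the whole document, the column pattern also hits the stray line.
whole_doc_hits = five_digit.findall(document_text)

# Each row match plays the role of an "instance": just that row's text.
instances = [m.group(0) for m in row_pattern.finditer(document_text)]

# Against an instance, the same column pattern only sees that row.
instance_hits = [five_digit.findall(instance) for instance in instances]

print(whole_doc_hits)   # two five-digit numbers, one from outside the table
print(instance_hits)    # one list per row, containing only that row's value
```

Note that the instance text contains neither `\r` nor `\n`, so a column pattern that anchors on those characters would find nothing inside it.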

Just like there are many different extraction techniques to match each row, there are multiple ways table cell data is extracted from these row instances. Refer to the "How To" sections of this article for more information on how to set up a Data Table using Row Match and configure Data Columns to extract table data from a document.

Version Differences

The Row Match method is the oldest and original method of extracting table data in Grooper. As such, not much has changed from previous versions. However, there are two other table extract methods available in Grooper now, Header-Value and Infer Grid. These methods provide functionality either not available to Row Match, or make configuring table extraction much simpler for certain use cases.

Use Cases

The Row Match method is perfectly suited for fairly simple table structures. Spreadsheets with or without table lines are great candidates for the Row Match method. It is often the first "go-to" method of many Grooper designers due to its simplicity to set up and configure.

If you are processing a large number of the same report (or even very similar reports in some cases) over and over where the table structure remains consistent, Row Match is usually the best way to target and extract tabular data. It's easy to set up (often just using a single regular expression pattern). There are most often fewer objects to create and configure than other methods. This gives it the benefit of being computationally efficient as well.

Row Match has the added benefit of not relying on labels for each column header to model the table's structure. While most tables do have column headers, some don't. Row Match can still produce results in those cases, whereas methods such as Header-Value depend on the header labels being present.

However, once the table's structure starts to change from document to document, a different approach may be needed. Different parties are going to structure tables however they want, which is well outside your control. Think of all the different ways an invoice can be structured. While the information you want is present in all the different tables, how that data is presented may not be consistent. Even just the column location changing can present problems for this method. A row extractor using a single pattern may not be able to do the job (or a complicated row extractor accounting for multiple row formats may need to be used). For these situations, the Header-Value method may be easier to configure and produce better results.

Optional data columns, where values may or may not be present in a cell can complicate things as well. Again, a simple row extractor using a single regex pattern may not do the trick. While a more complicated extractor may successfully extract the table's information, the Header-Value or Infer Grid methods may be simpler to set up and produce the same or even better results.

These are different Oil and Gas Production Reports from various sources. Each one organizes its information into tables in a different way. Row Match would work just fine for each individual document. However, while the same information exists on each document, there's enough variability in the table structures that Row Match may not be suited for processing the whole document set (Header-Value usually produces a better result).

How To

There are multiple ways to use Row Match to extract tabular data. The following tutorials will give you insight on how to set up and use Row Match in a few different ways. However, before configuring the Row Match method, we must create a Grooper Data Table with Data Columns.

Creating a Data Table in Grooper


Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example for creating a Data Table.

Navigate to a Data Model

Using the Node Tree on the left side of Grooper Design Studio, navigate to the Data Model you wish to add the Data Table to. Data Tables can be created as children of any Data Model at any hierarchy in a Content Model.

Add a Data Table

Right click the Data Model object, mouse over "Add" and select "Data Table".

The following window will appear. Name the table whatever you would like and press "OK" when finished.

This creates a new Data Table object in the Node Tree underneath the Data Model.

Add Data Columns

Right click the Data Table object, mouse over "Add" and select "Data Column".

This brings up the following window to name the Data Column. When finished, press "OK" to create the object.

This creates a new Data Column object in the Node Tree underneath the Data Table.

Repeat Until Finished

Add as many columns as necessary to complete the table. For our example, we have a single Data Table with five Data Columns, each one named for the corresponding column on the document.

Using Row Match with Column Extractors

The first part of extracting information using the Row Match method is to create a Row Extractor to determine the table's structure. Once that is done, we can target and extract the information from each column. But how do we actually get the information out of the columns? There are a few ways. One is to use the "Column Extractor" property on each Data Column. This guide will demonstrate how to create a Row Extractor and use Column Extractors to extract information from a table.


A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example. This is a production report filed with the Oklahoma Corporation Commission from an oil and gas company. The raw text data has already been extracted using OCR via the Recognize activity.

Add a Data Table

Create a Data Table with five Data Columns. The five columns for our example are "Operator Name", "Well Name", "Lease Number", "PC", and "Runs". Refer to the Creating a Data Table section above for more information on adding a Data Table to a Data Model.

Set the Extract Method

First, set the "Extract Method" property to "Row Match". (1) Select the Data Table object in the Node Tree, and (2) select the "Extract Method" property.

Using the dropdown list, select "Row Match".

Create the Row Extractor

The first part of configuring the Row Match method is creating a Row Extractor. The Row Extractor uses regular expression to determine the pattern of each row in the table. The extractor's "Type" can be "Internal" or "Reference". Choosing "Internal" will allow you to write a regular expression pattern straight from the property panel. "Reference" will allow you to point to an extractor built elsewhere in the Node Tree. Internal extractors are typically very simple patterns that don't need the extra properties available to Data Type extractors, such as collation, filtering or post-processing. Our case here is very simple. We will create the Row Extractor using an Internal extractor.

Expand the Row Match method's properties by double clicking "Extract Method". Press the ellipsis button next to "(empty pattern)". This will bring up Grooper's Pattern Editor to create a simple Internal extractor for the Row Match method.

Using the Pattern Editor, we will write a regular expression pattern to match each row of the table.

More specifically, we need to find something in one of these five columns to anchor off of. We need to target at least one piece of information we can use to pattern the whole row. Let's try to find the cell that has a fairly specific pattern to match. The "Operator Name" and "Well Name" columns seem extremely variable. The name of the well operator could be anything. It could be made of letters, numbers, special characters. Who knows? And the well name has a similar problem. "PC" looks like it's reliably two digits, but the "Lease Number" and "Runs" columns would also match a regex that is just two digits. The "Runs" column presents its own problem because not every row has a value in that column.

Information in the "Lease Number" column, however, has a very distinctive pattern. These leases always appear to be numbered as three digits followed by a dash, then six digits, another dash, one digit, one more dash, and finally four digits. This pattern can be captured easily by the following regex pattern: `\d{3}-\d{6}-\d-\d{4}`. Furthermore, this pattern will not match any of the other columns. (While it's possible an operator's name could be "123-123456-1-1234 Resources Company", it's extraordinarily unlikely.)

Now that we have a portion of each row in the table matching, we just need to capture the rest of the row. To do this, let's switch over to the "Text" tab. This is the raw text data our regex pattern is using to find a match. From here you can see the control characters `\r` and `\n` at the end of every line of text. Looking at our rows here, we can see each row is on a new line of text. So each row sits between the `\n` that ends the previous line and the `\r` that ends its own line. We can use `\n` and `\r` as anchors on either end of our pattern to expand the pattern and capture the entire row.

We can use a negated character set to capture everything to the left of the lease number and another one to capture everything on the right. Since each row starts with a new line character (`\n`), everything to the left of the lease number is not a new line character. Since every row ends with a carriage return character (`\r`), everything to the right of the lease number is not a carriage return character. So, we can match the entire row with the following regular expression: `[^\n]+\d{3}-\d{6}-\d-\d{4}[^\r]+`
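
The same pattern can be tried in Python as a quick check. The operator names, well names, and lease numbers below are made up, not taken from the actual report, and Python's `re` module is only standing in for Grooper's Pattern Editor.

```python
import re

# Lease number in the middle, "not a newline" to its left,
# "not a carriage return" to its right.
ROW = re.compile(r"[^\n]+\d{3}-\d{6}-\d-\d{4}[^\r]+")

# Invented text: a header line, two rows (the second has no "Runs"
# value), and a footer. Every line ends with \r\n.
text = (
    "OPERATOR NAME\tWELL NAME\tLEASE NUMBER\tPC\tRUNS\r\n"
    "ACME OIL CO\tSMITH 1-24\t017-012345-0-0000\t01\t152\r\n"
    "B & B ENERGY LLC\tJONES UNIT 2\t073-098765-1-0000\t01\r\n"
    "Page 1\r\n"
)

# The pattern has no capture groups, so findall returns whole matches.
rows = ROW.findall(text)

for row in rows:
    print(repr(row))
```

The header and footer lines are skipped because they contain no lease number, and the row with an empty "Runs" cell still matches because the trailing `[^\r]+` does not care how many cells follow the lease number.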

Press the "OK" button to finish editing the Row Extractor.

Now, we have our Row Extractor. With this single line of regular expression, we match the pattern of each of the 133 rows in this table. This is the part of Row Match extraction that gives us the table structure. Now we have the general idea of what the table looks like, row by row. But, we don't have any information populating our table in Grooper. Next, we need to extract that information and fill the Data Columns in our Data Table.

Set the Column Extractors

Each Data Column has a "Value Extractor" property that can be used to extract data from each row. This too can be either an "Internal" or "Reference" extractor.

Using Row Match with Named Groups


Using Row Match with Ordered Arrays
