Difference between revisions of "Row Match (Table Extract Method)"

From Grooper Wiki
Jump to navigation Jump to search
 
Line 883: Line 883:
 
{|cellpadding=10 cellspacing=5
 
{|cellpadding=10 cellspacing=5
 
|-
 
|-
|style="width:40%" valign=top|1. There are dates.
+
|style="width:40%" valign=top|
 +
1. There are dates.
 
|[[File:Row Mach EOB 9.png|border]]
 
|[[File:Row Mach EOB 9.png|border]]
 
|-
 
|-
||style="width:40%" valign=top|
+
|style="width:40%" valign=top|
 
2. There are numbers, one to three digits long.
 
2. There are numbers, one to three digits long.
 
|[[File:Row Match EOB 10.png|border]]
 
|[[File:Row Match EOB 10.png|border]]

Latest revision as of 10:07, 5 May 2021

The Row Match method's property panel


Row Match is one of three Table Extraction methods available to Data Table Data Elements to extract information from tables on a document set. It uses regular expression pattern matching to determine a tables structure based on the pattern of each row and extract cell data from each column.


About

The method you use to extract information from tables is set on Data Table objects added to a Data Model. This is done using the "Extract Method" property.


Method select context.png


Once you select an Extract Method (in this case "Row Match"), you can see all the configurable properties available to that particular method by expanding the "Extract Method" property.


Row match property context.png


The configuration of Row Match is relatively simple. As you can see in the image below, there is only one property to configure, the "Row Extractor". Just like any other extractor, it can be an Internal pattern or a Reference to an extractor in the Node Tree.

Row match properties internal.png Row match properties reference.png

This extractor is made to find the general pattern of each row in a table. There are many ways to accomplish this. We will look at a simple example with a fairly simple solution. Examine the table below.

Simpletable.png

What we are trying to do is model the table's structure using regular expression. Each row has a pattern to it from left to right. First a date, followed by a five digit number, followed by a name of variable length, followed by a one to two digit number followed by a variable dollar amount. This is the table's structure, a series of rows containing this information, in this order. We can use regular expression to match that pattern of data. First, we might break it down further and look at patterns to target each column, one after the other. That could look like something below.

Regular expression Match found as seen on image Notes
\d{1,2}/\d{1,2}/\d{4} Row match col1.png This pattern is just one of may ways to find this specific date format. This pattern would need to be adjusted to find other formats such as "month-day-year".
\d{5} Row match col2.png A simple regex to only find a five digit number.
[A-Z]+ Row match col3.png This pattern also matches words outside of the row, but that's ok. We will put this all together and only match the row shortly.
\d{1,2} Row match col4.png Same with this pattern.
[$]\d{1,3}(,\d{3})?[.]\d{2} Row match col5.png Again, one of many ways to match currency values.

Now that we can match every cell in the row, we just need to put all that information together and match the whole row. We can enable "tab marking" in a pattern to insert the "\t" control character between segments of text separated by a large amount of whitespace (larger than a normal space character). We can use this tab character to make a regular expression pattern seen below to match the entire row.

Regular expression for the entire row \d{1,2}/\d{1,2}/\d{4}\t\d{5}\t\w+\t\d{1,2}\t[$]\d{1,3}(,\d{3})?[.]\d{2}
Each row matches the single expression
Row match whole table matched.png

Once you are able to match each row, the next part is to extract the data in that row corresponding to each row. This can be done in a number of ways, but an important thing to point out is what you're trying to match. Once Grooper finds a row using the Row Extractor, it creates an "instance" of that extracted row. You can think of this as a sub-element within the whole document. You still use regular expression to match the information within the row, but instead of matching against the entire document, you are only matching against this sub-element (or "instance"). This can trip users up who are trying to match data against the entire document. Only the portion of the text data in the extracted row is contained in the instance.

A single row matched on the document.
Row match whole table.png
The match seen in the whole document's text data.
Row match whole text2.png
The text data in the instance of that matched row.
This is the text data you will use to pattern match each column's value. Notice you no longer have the control characters \r and \n at the end of the line. Those characters are part of the whole document's text data, but not the instance's

Just like there are many different extraction techniques to match each row, there are multiple ways table cell data is extracted from these row instances. Refer to the "How To" sections of this article for more information on how to set up a Data Table using Row Match and configure Data Columns to extract table data from a document.

Use Cases: General

A spreadsheet well suited for Row Match. It is very structured. There's a detectable pattern for each row. Even if the column labels were not present, Row Match could extract this table's information, likely with a single regular expression pattern.

The Row Match method is perfectly suited for fairly simple table structures. Spreadsheets with or without table lines are great candidates for the Row Match method. It is often the first "go-to" method of many Grooper designers due to its simplicity to set up and configure.

If you are processing a large number of the same report (or even very similar reports in some cases) over and over where the table structure remains consistent, Row Match is usually the best way to target and extract tabular data. It's easy to set up (often just using a single regular expression pattern). There are most often fewer objects to create and configure than other methods. This gives it the benefit of being computationally efficient as well.

Row Match has the added benefit of not relying on labels for each column header to model the table's structure. While most tables do have column headers, some don't. Row Match can produce results in those cases where other methods such as Header-Value rely on that information.

However, once the table's structure starts to change from document to document, a different approach may be needed. Different parties are going to structure tables however they want, which is well outside your control. Think of all the different ways an invoice can be structured. While the information you want is present in all the different tables, how that data is presented may not be consistent. Even just the column location changing can present problems for this method. A row extractor using a single pattern may not be able to do the job (or a complicated row extractor accounting for multiple row formats may need to be used). For these situations, the Header-Value method may be easier to configure and produce better results.

Optional data columns, where values may or may not be present in a cell can complicate things as well. Again, a simple row extractor using a single regex pattern may not do the trick. While a more complicated extractor may successfully extract the table's information, the Header-Value or Infer Grid methods may be simpler to set up and produce the same or even better results.


These are different Oil and Gas Production Reports from various sources. Each one organizes information differently into tables in different ways. Row Match would work just fine for each individual document. However, while the same information exists on each document, there's enough variability in the table structures that Row Match may not be suited for processing the whole document set (Header-Value usually produces a better result).
50939766 Page 1.png Spreadsheet with lines.png 50946163.png 50946164 Page 1.png


The Infer Grid method also has some advantages over Row Match. There are some specialized use cases, such as reading OMR checkboxes in tables and reprocessing table cells using a secondary OCR profile, where Infer Grid does things the other two methods simply can’t. Infer Grid also performs well when table line information can be saved to a page’s layout data.

How To

There are multiple ways to use Row Match to extract tablular data. The following tutorials will give you insight on how to set up and use Row Match in a few different ways. However, before configuring the Row Match method, we must create a Grooper Data Table with Data Columns

Creating a Data Table in Grooper

Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example for creating a Data Table.

Simpletable.png

Navigate to a Data Model

Using the Node Tree on the left side of Grooper Design Studio, navigate to the Data Model you wish to add the Data Table to. Data Tables can be created as children of any Data Model at any hierarchy in a Content Model.


Create a data table 1.png

Add a Data Table

Right click the Data Model object, mouse over "Add" and select "Data Table"


Create a data table 2.png


The following window will appear. Name the table whatever you would like and press "OK" when finished.


Create a data table 3.png


This creates a new Data Table object in the Node Tree underneath the Data Model.


Create a data table 4.png

Add Data Columns

Right click the Data Table object, mouse over "Add" and select "Data Column"


Create data table 5.png


This brings up the following window to name the Data Column. When finished, press "OK" to create the object.


Create data table 6.png


This creates a new Data Column object in the Node Tree underneath the Data Model.


Create data table 7.png

Repeat Until Finished

Add as many columns as necessary to complete the table. For our example, we have a single Data Table with five Data Columns, each one named for the corresponding column on the document.


Create data table 8.png

Using Row Match with Column Extractors

The first part of extracting information using the Row Match method is to create a Row Extractor to determine the table's structure. Once that is done, we can target and extract the information from each column. But how do we actually get the information out of the columns? There are a few ways. One is to use the "Value Extractor" property on each Data Column (These are often referred to as a "Column Extractors"). This guide will demonstrate how to create a Row Extractor and use Column Extractors to extract information from a table.

! Some of the tabs in this tutorial are longer than the others. Please scroll to the bottom of each step's tab before going to the step.

Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example. This is a production report filed with the Oklahoma Corporation Commission from an oil and gas company. The raw text data has already been extracted using OCR via the Recognize activity.


Row match cols 1.png

Add a Data Table

Create a Data Table with five Data Columns. The five columns for our example are "Operator Name", "Well Name", "Lease Number", "PC", and "Runs". Refer to the Creating a Data Table section above for more information on adding a Data Table to a Data Model.


Row match cols 2.png

Set the Extract Method

First, set the "Extract Method" property to "Row Match". (1) Select the Data Table object in the Node Tree, and (2) select the "Extract Method" property.


Row match cols 3.png


Using the dropdown list, select "Row Match".


Row match cols 4.png

Create the Row Extractor

The first part of configuring the Row Match method is creating a Row Extractor. The Row Extractor uses regular expression to determine the pattern of each row in the table. The extractor's "Type" can be "Internal" or "Reference". Choosing "Internal" will allow you to write a regular expression pattern straight from the property panel. "Reference" will allow you to point to an extractor built elsewhere in the Node Tree. Internal extractors are typically very simple patterns that don't need the extra properties available to Data Type extractors, such as collation, filtering or post-processing. Our case here is very simple. We will create the Row Extractor using an Internal extractor.

Expand the Row Match method's properties by double clicking "Extract Method". Press the ellipsis button next to "(empty pattern)". This will bring up Grooper's Pattern Editor to create a simple Internal extractor for the Row Match method.


Row match cols 5.png


Using the Pattern Editor, we will write a regular expression pattern to match each row of the table.


Row match cols 6.png


More specifically, we need to find something in one of these five columns to anchor off of. We need to target at least one piece of information we can use to pattern the whole row. Let's try and find the cell that has a fairly specific pattern to match. The "Operator Name" and "Well Name" seems extremely variable. The name of the well operator could be anything. It could be made of letters, numbers, special characters. Who knows? And the well name has a similar problem. "PC" looks like it's reliably two digits, but the "Lease Number" and "Runs" columns would also match a regex that is just two digits. The "Runs" column presents its own problem because not every row has a value in that column.

Information in the "Lease Number" column, however, has a very unique pattern to it. These leases always appear to be numbered as three digits followed by a dash, followed by six numbers, another dash, then one number, one more dash, and ends with four numbers. This pattern can be captured easily by the following regex pattern: \d{3}-\d{6}-\d-\d{4} Furthermore, this pattern will not match any of the other columns (While it's possible an operator's name could be "123-123456-1-1234 Resources Company", it's extraordinarily unlikely.)


Row match cols 7.png


Now that we have a portion of each row in the table matching, we just need to capture the rest of the row. To do this, let's switch over to the "Text" tab. This is the raw text data our regex pattern is using to find a match. From here you can see the control characters \r and \n at the end of every line of text. Looking at our rows here, we can see each row is on a new line of text. So, each row is between these two characters, \r\n. We can use \n and \r as anchors on either end of our pattern to expand the pattern and capture entire row.


Row match cols 8.png


We can use a negated character set to capture everything to the left of the lease number and another one to capture everything on the right. Since each row starts with a new line character (\n), everything to the left of the lease number is not a new line character. Since every row ends with a carriage return character (\r), everything to the right of the lease number is not a carriage return character. So, we can match the entire row with the following regular expression: [^\n]+\d{3}-\d{6}-\d-\d{4}[^\r]+


Row match cols 9.png


Press the "OK" button to finish editing the Row Extractor.


Row match cols 10.png


Now, we have our Row Extractor. With this single line of regular expression, we match the pattern of each of the 133 rows in this table. This is the part of Row Match extraction that gives us the table structure. Now we have the general idea of what the table looks like, row by row. But, we don't have any information populating our table in Grooper. Next, we need to extract that information and fill the Data Columns in our Data Table.


Row match cols 11.png

Extracting Table Data with Column Extractors

Each Data Column has a "Value Extractor" property that can be used to extract data from each row. This too can be either an "Internal" or "Reference" extractor.

FYI Version 2.8 added several new extraction methods available to Data Fields and Data Columns besides "Internal" and "Reference". "Internal" has been renamed "Text Pattern", but it functions exactly like an Internal pattern.

We will start with the "Operator Name" column. We will use an extractor to find the portion of the row containing the operator's name ("Samson Resources Company" for example). (1) Select the "Operator Name" Data Column in the Node Tree. (2) Select the "Value Extractor" property and choose (3) "Text Pattern" from the dropdown list.


Row match cols 12.png

Instances vs Full Text

Before we start editing a regex pattern to match the operator name on each row, we need to look at what text we are actually matching against. Once the Row Extractor runs, an instance for each row is created. So, for our table, we have 133 rows. 133 instances are created, one for each row. If we go back to the pattern we created for the Row Extractor, we can see what is contained in an instance for a single row. Select a result from the Results list and choose "Inspect Instance..."


Row match cols 13.png


This brings up the "Instance Viewer" window, seen below.


Row match cols 14.png


The important thing to keep in mind is the Value Extractor runs against each row's instance, not the entire document. Switch to the "Text View" tab. Notice the text here is only limited to the information extracted for that row. The control characters \r and \n do not appear, as they do with the full document's text data. This is important to keep in mind. Since those control characters do not exist in the instance, you will not be able to match them using regular expression. Those control characters are outside of the instance. Only the text data inside the instance will be matched by the Data Column's Value Extractor.


Row match cols 15.png

! The \n and \r control characters are not contained in this instance. So, they can not be used as anchors to find text at the beginning or end of the text. However, the regex characters ^ and $ can be used instead.
^ is a special character that will match the beginning of a string. Instead of using \n to match the beginning of a line in the full document's text, we can use ^ to match the beginning of the instance.
$ is a special character that will match the beginning of a string. Instead of using \r to match the end of a line in the full document's text, we can use $ to match the beginning of the instance.


Using Tabs to Indicate Column Breaks

Let's go back to the "Operator Name" Data Column. Expand the "Value Extractor" sub-properties by double clicking it. Select the "Pattern" property and press the ellipsis button at the end to bring up the Pattern Editor.


Row match cols 16.png


Now, we just need to write a regex pattern to match the operator name portion of the row instance. First things first, it will be helpful if we can use something in the text data to note breaks between columns. Generally, columns are separated either by table lines or large amount of whitespace (at least larger than a normal space between words). A large amount of whitespace can be represented in text by a tab character (\t). By default, Grooper leaves tab characters out of the text data. A space between words is just a space regardless of the size. However, we can insert these characters into the text data by enabling tab marking in the pattern's preprocessing options.

Tab characters can be used to represent column breaks in a table. This gives us something we can pattern match as an indicator where one column stops and another begins. With Tab Marking enabled...
Col breaks to tabs 1.png Col breaks to tabs 2.png
...column breaks in the table on the document turn into... ...tab characters in the text flow.

Switch to the "Properties" tab of the Pattern Editor.


Row match cols 17.png


Double click the "Preprocessing Options" property to expand its subproperties. Select "Tab Marking" and change it to "Enabled". This will insert tab characters (\t) where there is a larger than normal amount of whitespace between letters.


Row match cols 18.png


Switch back to the "Pattern Editor" tab. Now that we have tab characters available in the text, we can use them to help target text in individual columns. The tab character will be the indicator of where one column stops and another begins.

Pattern Matching the Operator Name

Keep in mind we're trying to write a pattern to find the operator names of each row. Let's think about what we know about the operator names. First, they're going to be highly variable. Company names do not have to follow strict rules. They can contain any number of words, numbers, or other characters in very unpredictable ways. Second, they're in the first column of the row. Third, regardless of what text comprises an operator's name, it's not going to be a control character, like a \t tab character.

Knowing this, we can use a negated character set of variable length matching everything not a tab character or in regex: [^\t]+

That takes care of matching the operator names themselves, but it also matches everything else between a tab. We need to narrow this down to just the operator name.


Row match cols 19.png


We could take a similar approach we did with our Row Extractor. Knowing the Operator Name is the first column, we know it is always at the start of a new line, which means there will always be a \n new line control character before the operator name. We could use a look ahead pattern to limit our results only to values that have that \n character directly before them.


Row match cols 20.png


However, we are not matching the full text like we see in the Pattern Editor. We are matching each extracted row's individual instance. What we are actually matching is what's seen below for every row.


Row match cols 21.png


! The \n and \r control characters are not contained in this instance. So, they can not be used as anchors to find text at the beginning or end of the text. However, the regex characters ^ and $ can be used instead.
^ is a special character that will match the beginning of a string. Instead of using \n to match the beginning of a line in the full document's text, we can use ^ to match the beginning of the instance.
$ is a special character that will match the beginning of a string. Instead of using \r to match the end of a line in the full document's text, we can use $ to match the beginning of the instance.

Since the \n character is not a part of the row's instance, using it in the look ahead will not produce a result. However, there are special characters in regular expression to mark the start or end of a string of text. The carrot ^ can be used to anchor of the beginning of a string in an instance. The dollar sign $ can be used to anchor of the end of a string in an instance.

It's easy to forget Column Extractors are matching against row instances and not the entire text. There are similar situations in Grooper that can trip users up where they think they are matching against full text but are actually matching against an instance. Because of this, many designers consider it best practice to use "\n|^" (meaning a match for a both a new line in the full text or the beginning of a string in an instance) in every case you would use just "\n".

Using "\n|^" as a look ahead pattern here will functionally match something at the beginning of a row's line, whether you're looking at the full text or a single row's instance.

TIP The regex pattern \n|^ will match the beginning of a new line or the beginning of an instance. When dealing with tables, get in the habit of using this pattern to match both. This way, the pattern will match when running whether you are attempting to match the full text or text inside an instance.


Row match cols 22.png


This pattern will now successful extract the operators' names from the Operator Name column. Also, note there are a lot of matches outside of the table. That's the beauty about instancing. Since the column extractor is only looking at the instances produced by the Row Extractor, everything else on the page is ignored. This allows use much simpler regular expression. The effort of narrowing down what we want to match has already been partly done by creating an instance of just the row of text.


Row match cols 23.png


Press the "OK" button to finish editing the pattern.


Row match cols 24.png


Press the "Save" button to finish configuring the Data Column.


Row match cols 25.png


Configure the Remaining Column Extractors

The remaining table data can be extracted using Value Extractors on the remaining four Data Columns. These will also use "Text Pattern" as their extracting method.

Well Name

The well name can be found taking a similar approach as the Operator Name. We will use the same [^\t]+ pattern with an expanded look ahead pattern to skip the operator name from the first column. To the \n|^ pattern will will add [^\t]+\t to expand it to include the beginning of the row, the first amount of text and the first tab character in the look ahead. Note, this pattern also uses tab characters. So, Tab Marking needs to be enabled in the Preprocessing Options.


Row match cols 26.png


Lease Number

The lease number was easily matched using a unique pattern in our Row Extractor. We can use the same pattern to match the individual lease numbers in the row instances, without the use of a look ahead pattern.


Row match cols 27.png


PC

There's a variety of ways to target information from these rows. The PC Data Column is for the product code, which can be either "01" or "02". Since we know the product code comes after the lease number in the table, we could use the lease number's pattern followed by a tab character as the look ahead pattern. Or, we could be even lazier. We can produce a match without even bothering to turn on Tab Marking. Without Tab Marking, the whitespace on either side of the product code is just a simple space. Nowhere else in the text of a row do we see the pattern of a space followed by "01" or "02" followed by a space. Knowing this, we can make a super simple pattern to find the product code 0[12] with the space control character \s as the look ahead and look behind pattern.


Row match cols 28.png


Side Note: Using Local Lexicons for Result Translation
This pattern works perfectly. It returns either "01" or "02" but do those numbers mean? Let's say "01" means the product is crude oil and "02" means the product is condensate oil. What if the information we want to extract (and specifically what we want to export) are the words "crude oil" and "condensate oil" and not the numbers? This gives us a great opportunity to do some quick translation of these terms in Grooper. Switch to the "Properties" tab, select the "Lookup Options" property and press the ellipsis button at the end of the property.


Row match cols 29.png


This brings up the "Lookup Options" property window. Expand the "Vocabulary" property and select "Local Entries" Press the ellipsis button at the end of the property.


Row match cols 30.png


This brings up a List Editor to write a custom Lexicon used only by the pattern. We can write out some quick translation here by writing out "01=Crude Oil" and "02=Condensate Oil". Press the "OK" button when finished.


Row match cols 31.png


Last, change the "Enable Translation" property from "False" to "True". This will actually perform the translation,changing the values from "01" and "02" to "Crude Oil" and "Condensate Oil". Press the "OK" button when finished.


Row match cols 32.png


Now, the original extracted value is translated according to what was set in the Local Entries. So, if the extractor returns "01", it gets changed to "Crude Oil" and "02" is changed to "Condensate Oil".


Row match cols 33.png


Runs

The Runs column is unique for two reasons. One, it's an optional column. Sometimes there is a value in these cells, sometimes not. Two, and more importantly, it's at the end of the row. Remember, we are matching the row instances and not the full text. Just like we couldn't use \n to anchor the first column's information to the beginning of a new line, we can't use \r to anchor the final column to the end of the line. However, since we are dealing with instances we can use the end of string character in regular expression, which is the dollar sign $

We can write a pattern to match a variety of currency values using the following regex: \d{1,3}(,\d{3}){0,2}[.]\d{2} with $ as the look behind pattern, therefore only matching a currency value at the end of the row instance. And, if it doesn't produce a match, that cell will simply be left blank.


Row match cols 34.png


FYI Just like some Grooper designers will get in the habit of using \n|^ to signify the beginning of a line or start of string, \r|$ can be used for the end of a line or end of string. Getting into this habit will match whether in an instance or using the full document text data.

Test the Extraction

Now, Grooper knows how to find the table's structure using the Row Extractor and knows how to extract each cell's values according to the Data Column's Value Extractors (or Column Extractors for short). Let's see how we did. Navigate back to the Data Table object and press the "Test Extraction" button.


Row match cols 35.png


This shows us what data will be extracted during the Extract activity. As you can see below, all the information from the table is successfully extracted, for all 133 rows.


Row match cols 36.png

Using Row Match with Named Groups

While the example above works, there is a much more elegant way to do this in Grooper. "Named groups" allow users to target portions of a regular expression pattern and call it out in different ways. Using named groups, we can extract the exact same data from the exact same document without configuring Column Extractors. Only a single regular expression pattern will be written, that for the Row Extractor.

! Some of the tabs in this tutorial are longer than the others. Please scroll to the bottom of each step's tab before going to the step.

Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

Just like the previous guide, we will use the table below as our example. This is a production report filed with the Oklahoma Corporation Commission from an oil and gas company. The raw text data has already been extracted using OCR via the Recognize activity.


Row match cols 1.png

Add a Data Table

Create a Data Table with five Data Columns. These are the same Data Columns used in the previous example. The five columns for our example are "Operator Name", "Well Name", "Lease Number", "PC", and "Runs". Refer to the Creating a Data Table section above for more information on adding a Data Table to a Data Model.


Row match cols 2.png

Set the Extract Method

First, set the "Extract Method" property to "Row Match". (1) Select the Data Table object in the Node Tree, and (2) select the "Extract Method" property.


Row match cols 3.png


Using the dropdown list, select "Row Match".


Row match cols 4.png

Create the Row Extractor

Like the previous example, the first part of configuring the Row Match method is creating a Row Extractor. The Row Extractor uses regular expression to determine the pattern of each row in the table. The extractor's "Type" can be "Internal" or "Reference". Choosing "Internal" will allow you to write a regular expression pattern straight from the property panel. "Reference" will allow you to point to an extractor built elsewhere in the Node Tree. Internal extractors are typically very simple patterns that don't need the extra properties available to Data Type extractors, such as collation, filtering or post-processing. Our case here is very simple. We will create the Row Extractor using an Internal extractor.

Expand the Row Match method's properties by double clicking "Extract Method". Press the ellipsis button next to "(empty pattern)". This will bring up Grooper's Pattern Editor to create an Internal extractor for the Row Match method.


Row match cols 5.png


This brings up the Pattern Editor window to write a regex pattern.


Row match named groups 1.png


The pattern we used in the previous example was as follows: [^\n]+\d{3}-\d{6}-\d-\d{4}[^\r]+. This did model the pattern of each row, and we were able to use the Data Columns' Value Extractors to return each column's information, but we had to configure each individual Data Column to do so. Named groups will allow us to use a single regular expression pattern, set on the Row Extractor, and use portions of the regex to fill in our columns' values automatically, without even touching the Data Columns' properties.

To do this, we first need to write a more specific regular expression pattern. The general idea is to write a pattern that matches the pattern of each column, one after the other, while also matching the entire row. The first thing we might want to do is enable Tab Marking. This will insert the \t tab character into the raw text data wherever there's a larger than normal amount of whitespace between characters (larger than a space character between words). This gives us something we can match in the text flow that functionally "means" a column break.


Tab characters can be used to represent column breaks in a table. This gives us something we can pattern match as an indicator where one column stops and another begins. With Tab Marking enabled...
Col breaks to tabs 1.png Col breaks to tabs 2.png
...column breaks in the table on the document turn into... ...tab characters in the text flow.

Switch to the "Properties" tab of the Pattern Editor.


Row match cols 17.png


Double click the "Preprocessing Options" property to expand its subproperties. Select "Tab Marking" and change it to "Enabled". This will insert tab characters (\t) where there is a larger than normal amount of whitespace between letters.


Row match cols 18.png


Switch back to the "Pattern Editor" tab. Now that we have tab characters available in the text, we can use them to help target text in individual columns. Essentially, what we will do is combine all the individual patterns we wrote for the Column Extractors in the previous example in one large regex. Lets start with the Operator Name (captured by [^\t]+ followed by \t indicating the column break) with a look ahead pattern forcing our pattern to only match something on a new line or the start of a string (\n|^)


Row match named groups 2.png


Next, we will match the Well Name column with [^\t]+\t.


Row match named groups 3.png


Then, comes the Lease Number captured by \d{3}-\d{6}-\d-\d{4}\t.


Row match named groups 4.png


Then, the PC (or Product Code) column. For document the following regex works \d{2}\t. However, keep in mind the Runs column may have a value in it and it may not depending on the row. Where the column is empty, there is a hyphen in the cell. However, it's more than possible that cell may just be left empty (or OCR may not have picked up that character). If that is the case, a tab character \t would not follow the two digits in the text flow. Instead, those two digits would be at the end of the line. So, the next character would be the carriage return character \r. Furthermore, a line of text at the end of the page does not terminate in a carriage return. The end of a page is indicated by a the form feed character \f. Last but not least, remember our Row Extractor is creating instances out of every row of text. The end of a line of text in an instance is indicated by the end of string character $

All of that to say, there are several different characters that could indicate we've come to the end of this column, including \t \r \f and $. We will account for all these possibilities by using the following regex ([\t\r\f]|$) instead of just \t. So it match any of the control characters in the character set ([\t\r\f]) or (|) the end of string character ($).

Putting it all together it comes to \d{2}([\t\r\f]|$)


Row match named groups 5.png


Last comes the Runs column. The Runs column is unique in that it is an optional column. Data may or may not be present in it. We can account for this using the ? character in regular expression. We will create a group around the regex capturing the dollar amount by putting it in parenthesis, and making it optional by including the ? character after it.

The regex to match the currency values is \d{1,3}(,\d{3})?[.]\d{2} and we make it optional by using (\d{1,3}(,\d{3})?[.]\d{2})?


Row match named groups 6.png


This single pattern is matching each row, accomplishing the primary goal of a Row Extractor. Now that we have a Row Extractor, how do we use named groups to fill in our column values?


Creating Named Groups for Fun and Profit

Next we are going to create five named groups in our regex pattern, one for each column. Each group will be named after the Data Column its corresponding regex pattern captures. The regex capturing the operator names will be named after the "Operator Name" Data Column. The regex capturing the well names and numbers will be named after the "Well Name" Data Column, and so on. The idea behind this method is to use the values returned by each group to populate values for each Data Column.

Our first column is "Operator Name". The first line of our Row Extractor captures the operator names with [\t]+ (Note this does not include the \t character we used to anchor values between column breaks. We just want the text that makes up an operator's name). This will be our first group. We've already created a group in the previous step when we put the regex to find the Runs values in parenthesis to make that part of our pattern optional with the ? character. The difference here is we're naming the group.

Named groups can be created and named using the following syntax: (?<Group_Name>regex pattern)

In between the ?< and > characters is where you will name the group. After the > is where the regex goes. And the whole thing is put between parenthesis.

We could type all this out, but Grooper has a shortcut. Highlight this portion of the regular expression with your mouse and right click it. At the bottom of the list of options is "Create Group". Click this option.


Row match named groups 7.png


This places the selected regex in a named group and puts the cursor in between the ?< and > characters where we will name the group.


Row match named groups 8.png


There are two key things to remember.

  1. We want the group to match the corresponding Data Column's name exactly.
  2. Spaces should be replaced with an underscore _


Row match named groups 9.png


FYI There is also a keyboard shortcut for creating a named group. It is Ctrl + G

Now we have a reference for part of the whole regular expression that we can use elsewhere.

Next, we need to create groups for the rest of our Data Columns. For the same thing we did for the Operator Name column, we will do for the remaining four Data Column. Keep in mind, we want just the values, not any control characters used to anchor those values in one column position or another.


Row match named groups 11.png


You may be asking yourself, why do the names need to match the Data Columns? What are we really doing here? We're making instances! We've made individual sub-instances of the whole pattern, one for every named group. Right click one the results and select "Inspect Instance..."


Row match named groups 12.png


For each of our five named groups, we now have five sub-instances (or four for rows that don't have a Runs value). Each instance contains information from its part of the regular expression pattern. Ultimately, the values contained in these instances will populate the individual cells of our Data Table.


Row match named groups 13.png


And for the real magic, since these instance names match our Data Column names, we don't need to configure any extraction on the Data Columns. By default, Data Columns will look for instances in a Row Extractor to populate values. If the Data Column finds an instance with the same name, it will return the instance's value!

By using named groups, and naming them after our Data Table's Data Columns, the portion of the row the group captures will populate the cell in the corresponding Data Column. Let's see that in action next.

Test the Extraction

Select the Data Table in the Node Tree and press the "Test Extraction" button.


Row match named groups 14.png


Success! Grooper successfully extracts all 133 table rows, using only the Row Extractor with named groups.


Row match named groups 15.png

Use Cases: Deep Dive

In many ways, the Row Match table extraction method is best suited for fairly simple table structures. It's reliance on pattern matching tends to lend it to successful extraction of basic tables whose information is reliably repeatable. Row Match can start to run into problems with tables that are missing values in cells.

However, with creative use of extractors to establish a row's structure, Row Match can actually solve a variety of issues the other extraction methods simply can't. The example below is a table from an Explanation of Benefits (EOB) form, that has a variety of optional columns. On the face of it, it looks like the Row Match method simply can't get the job done. However, as we will see, creatively using Key-Value Pair collation in combination with Exclusion extractors allows us to successfully extract the table's data.

Special thanks to BIS employee Jack Clark for this example.

You can find a fully configured version of this Content Model and a Batch with the example document by downloading the zip file linked below. It can be imported into any 2.80 Grooper environment.


Before you begin

This example makes heavy use of Key-Value Pair collated Data Types and Exclusion Extractors. We will assume you are familiar with this functionality.

We will also make a passing mention of the Header-Value table extraction method.

Identifying the pain points: Missing Cells

The primary problem for using the Row Match method is going to be all of the optional table columns. This will leave us with several empty cells. Potentially, any cell besides the date could be empty.

For this example, we will use the mocked up Explanation of Benefits (EOB) form seen here.

Row Match EOB 1.png

Note: We went to extreme lengths on the number of missing cells for this example. This may or may not accurately reflect how many missing cells would be missing in the "real world". However, this gets to the heart of the problem. What do you do when almost any column is optional?

Row Match EOB 2.png

Row Match extracts data from a table by identifying the pattern of each row. If every row looked like the first row, without any empty cells, this would be an easy solve. You could easily map out that table structure using a simple regular expression with named groups named according to the table columns in your Data Table.

Row Match EOB 3.png

This pattern sucessfully matches and extracts the first table row.

Value Pattern

(?<Date_of_Svc>\d{1,2}/\d{1,2}/\d{2})\t
(?<Num_of_Svcs>\d{1,2})\t
(?<Revenue_Procedure_Code>[a-z]\d{4}-\d{2}-[a-z]{2}-[a-z]{2})\t
(?<Payment_Code>\d{3})\t
(?<Provider_Charge>\d{1,3}(\d{3})?\.\d{2})\t
(?<Our_Allowance>\d{1,3}(\d{3})?\.\d{2})\t
(?<Non_Chargeable_Amount>\d{1,3}(\d{3})?\.\d{2})\t
(?<Non_Chg_Code>\d{2})\t
(?<Member_Liability_Amount>\d{1,3}(\d{3})?\.\d{2})\t
(?<Mem_Liab_Code>\d{2})\t
(?<Other_Amount>\d{1,3}(\d{3})?\.\d{2})\t
(?<Amount_Paid>\d{1,3}(\d{3})?\.\d{2})\t?
(?<Message_Code>[^\n]+\r\n[^\n]+)

Lookahead Pattern

\n

Lookbehind Pattern

\r

However, as seen in the image to the right, it only gets the first row, not the full table.

Row Mach EOB 4.png

One potential solution would be to write another format for the next row with the missing cell's pattern removed from the larger row pattern. We could try something like the configuration listed below. Essentially it is the same as the one above, just with the named group for the "Other Amount" column removed.

Value Pattern

(?<Date_of_Svc>\d{1,2}/\d{1,2}/\d{2})\t
(?<Num_of_Svcs>\d{1,2})\t
(?<Revenue_Procedure_Code>[a-z]\d{4}-\d{2}-[a-z]{2}-[a-z]{2})\t
(?<Payment_Code>\d{3})\t
(?<Provider_Charge>\d{1,3}(\d{3})?\.\d{2})\t
(?<Our_Allowance>\d{1,3}(\d{3})?\.\d{2})\t
(?<Non_Chargeable_Amount>\d{1,3}(\d{3})?\.\d{2})\t
(?<Non_Chg_Code>\d{2})\t
(?<Member_Liability_Amount>\d{1,3}(\d{3})?\.\d{2})\t
(?<Mem_Liab_Code>\d{2})\t
(?<Amount_Paid>\d{1,3}(\d{3})?\.\d{2})\t?
(?<Message_Code>[^\n]+\r\n[^\n]+)


However, note this matches both the row that doesn't have "Other Amount" value as well as a row that doesn't have an "Amount Paid" value.

Row Mach EOB 5.png

Ultimately, this is because both values are currency values, matched by the same portion of our regex. At this point the Row Match method gets "confused" in that it doesn't have the awareness you do that you want one currency amount in one column and one currency amount in another.

When it comes time to extract this table, the value gets placed in the wrong column.

So, regardless of what we do next, this simple single-lined regex approach to match these table rows is not going to cut it.

Row Mach EOB 6.png


Click here to return to the top.

Identifying the pain points: Unreliable headers

At second glance, this could be a good candidate for the Header-Value table extraction method. In brief, the Header-Value method uses the location of column labels to determine where data in the underlying rows are.

However, this table's headers are not properly formatted to align with the cells of the table. This gives us completely inaccurate results as we see to the right. Only the "Date of Svc" column extracts reliably. The other columns are either unpopulated or misreading the data inside each row below.

Row Mach EOB 7.png

You can visualize this a little better if you extend the boundaries of each header down to the end of the page. As you can see, the corresponding cell's location does not quite match up with the positions of those headers. Those header values can only "see" the portion of text contained within the left and right boundaries of its width.

So, in this case, since those header locations don't align properly to the cell locations in the rows below, the data is not pulled from the table correctly.

Row Mach EOB 8.png


The solution presented in the next tabs accounts for this issue when header positions and widths are not reliable representations of the table's structure, rendering Header-Value ineffective. Instead, we will configure a more creative Row Extractor to successfully extract this table's information using the Row Match method.

Click here to return to the top.

Getting ready

We will ultimately use Key-Value Pair collated Data Types to establish the row structure of this table. Knowing each row will have a date, we will use this as the anchor for each row. From this anchor's position, we will find each cell in the remaining columns.

In a very general sense, we will create a series of Key-Value Pairs using the date on each row as the "key" to locate the corresponding value in each successive column.

As seen in the pictures to the right, the date (or Key), outlined in blue, will be our positional anchor to find each column's value, outlined in green.

Row Match EOB 14.png

Row Match EOB 15.png

Row Match EOB 16.png

There are five basic data patterns to this table to consider when building our extraction logic.

1. There are dates.

Row Mach EOB 9.png

2. There are numbers, one to three digits long.

Row Match EOB 10.png

3. There are currency values.

Row Match EOB 11.png

4. There is the "Revenue/Procedure Code" column with a specific pattern to each code ([a-z][0-9]{4}-[0-9]{2}-[a-z]{2}-[a-z]{2} would be the regex).

Row Match EOB 12.png

5. Last, there is the "Message Code" column.

Row Match EOB 13.png

We've built three basic extractors for the date, currency, and digits (1 to 3 digits long). We will use these extractors several times throughout the course of this build.

Row Match EOB 17.png

Click here to return to the top.

Build the Key-Value Pairs for Each Column: Starting with "Num of Svcs"

We will end up with twelve (12) Key-Value Pair collated Data Types, one for each column besides the "Date of Svc" column.

1. Starting with the "Num of Svcs" column, we've created a Data Type with two children Data Types, one named "KEY" and one named "VALUE". We will eventually create Key-Value Pairs for all subsequent column values, each having a "KEY" and "VALUE" child Data Type.

Row Match EOB 18.png

2. Using the Referenced Extractors property, the "KEY" Data Type will reference the "Generic Date" extractor created in Step 1 to locate the date in each row. All subsequent Key-Value Pairs' "Key" extractor will also reference this "Generic Date" extractor.

Row Match EOB 19.png

3. The "VALUE" Data Type will reference the corresponding extractor that matches the data in the "Num of Svcs" collumn. In this case, these are single digits in this column. So, we will reference the "1 to 3 digits" extractor.

Row Match EOB 20.png

4. Configure the parent Data Type (named "KVP-H - Num of Svcs") Collation property settings.

  • Change this from Individual to Key-Value Pair
  • Change Horizontal Layout from Disabled to Enabled
Row Match EOB 21.png

This works great up until situations like the highlighted rows. The Data Type goes too far from the key, past the "Num of Svcs" column, and grabs data from another column. To avoid this issue, we will simply use the Maximum Distance property of the Horizontal Layout sub-properties.

Row Match EOB 22.png

5. Set the Maximum Distance property to a length that does not go past the numbers in the "Num of Svcs" column. Here, 0.5 inches works.

Using the Maximum Distance property on these Key-Value Pairs will ensure the extractor does not go beyond the column's location to find data matched by the Value extractor.

Row Match EOB 23.png

6. Go ahead and create a Key-Value Pair Data Type for the "Revenue/Procedure Code" column.

  • The "KEY" child extractor will again reference the "Generic Date" extractor
  • The "VALUE" child extractor will use a Data Format or Pattern to find this code. These codes have a fairly unique pattern, which will not be repeated in any of the other columns. Because no other column will be extracted with this pattern, we can keep the Maximum Distance property blank for this extractor. See below for the regex used.
    • [a-z][0-9]{4}-[0-9]{2}-[a-z]{2}-[a-z]{2}
Row Match EOB 24.png

Click here to return to the top.

Build the Key-Value Pairs for Each Column: Ignoring previous column data

Next focus on the "Payment Code" column.

1. Copy the Key-Value Pair from Step 2 and rename it "KVP-H - Payment Code". This column is roughly 2 inches from our key (the date in the first column). Set the Maximum Distance property under the Collation properties to 2 inches.

Stopping here, you can see we have some problems. The Key-Value Pair collation provider will essentially return the value returned by the Value extractor that is physically nearest to the Key extractor's result.

The value we want falls in the fourth column. However, the Value extractor is simply returning numbers one to three digits long. Columns two and three are closer to the key (the date in the first column) and happen to match the Value extractor's pattern. So, what do we do to remove these hits as valid results?

We used the Maximum Distance property to exclude results that come after a column. We need to exclude results that come before a column.

Row Match EOB 25.png

To do this we will take advantage of the Value extractor's Exclusion Extractor property and exclude the results we've already obtained from the previous two columns extractors (the "KVP-H - Num Svcs" and "KVP-H - Revenue Procedure Code" extractors).

2. First we will create a single extractor that references these two extractors.

  • We will create a Data Type named "EXCL - Num of Svcs, Revenue Procedure Code".
  • Using the Referenced Extractors property, we will add the "KVP-H - Num of Svcs" and "KVP-H - Revenue Procedure Code" extractors.
Row Match EOB 26.png

This will return the values we've already located in those two columns.

Row Match EOB 27.png

3. On the VALUE child extractor, reference this extractor on the Exclusion Extractor property. This will remove the results from the previous two columns as valid Value results.

Row Match EOB 28.png

Now, the parent Key-Value Pair extractor picks up the correct values for the "Payment Code" column, ignoring the preceding columns results before it.

Row Match EOB 29.png

You can continue this line of logic all the way down the line for each subsequent column, ignoring values in the columns before it by excluding the results from each column's extractor before it.

For example, at the end of it the extractor pulling the currency values in the "Amount Paid" column will need to ignore all the currency values from columns before it. From the "Provider Charge", "Our Allowance", "Non Chargeable Amount", "Member Liability Amount", and "Other Amount" columns.

The extractor pulling the currency values in the "Provider Charge" column itself will need to ignore the currency values from columns before it. From the "Our Allowance", "Non Chargeable Amount", "Member Liability Amount", and "Other Amount" columns.

The extractor pulling values from the "Our Allowance" column will need to ignore the "Non Chargeable Amount", "Member Liability Amount", and "Other Amount" columns. And so on down the line.

So, a series of exclusion extractors will need to be built alongside building the Key-Value Pair extractors to exclude results in these preceding columns.

Row Match EOB 30.png

At the end of this, you end up with a series of Key-Value Pairs for a given column. Each one excludes the columns results before it using Exclusion Extractors and the columns after setting the Minimum Distance between a Key and Value.

Even the final "Amount Paid" column will correctly return a value if the cell is populated and will not if the cell is empty.

Row Match EOB 31.png

Click here to return to the top.

Create the Row Extractor

All this has been done to craft a Row Extractor that will properly match this table's structure. We will use these Key-Value Pairs as the building blocks for each column in our Row Extractor.

1. We will add a Data Type and call it "TBL-RM - EOB Table"

2. Add one child extractor for each Data Column object in the Data Table in the Data Model.

! Grooper will automatically map corresponding child extractor results to the Data Column in the Data Table, if they share the same name. These names need to match exactly. For example, the "Date of Svc" Data Type will populate the "Date of Svc" column, as long as their names are the same.
Row Match EOB 32.png

3. For the "Date of Svc" Data Type...

4. Set the Referenced Extractors property to the "Generic Date" extractor we created way back in Step 1.

Row Match EOB 33.png

5. For the remaining Data Types (each one corresponding to the remaining Data Columns), use the Referenced Extractors property to reference the corresponding Key-Value Pair you created for that particular column.

Row Match EOB 34.png

6. Finally, on the Data Table in the Data Model, we set the Row Extractor property to reference the extractor named "TBL-RM - EOB Table". This extractor now accurately matches the table's structure and extracts the right values from each cell. And, importantly, leaves cells blank when they are blank on the document.

Row Match EOB 35.png

Click here to return to the top.

Grooper Help Documentation

Row Match

Version Differences

The Row Match method is the oldest and original method of extracting table data in Grooper. As such, not much has changed from previous versions. However, there are two other table extract methods available in Grooper now, Header-Value and Infer Grid. These methods provide functionality either not available to Row Match, or make configuring table extraction much simpler for certain use cases.

See Also