Row Match (Table Extract Method)

From Grooper Wiki
(Redirected from Row Match)

This article was migrated from an older version and has not been updated for the current version of Grooper.

This tag will be removed upon article review and update.

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025 20232.80
The Row Match method's property panel

The Row Match Table Extract Method uses regular expression pattern matching to determine a tables structure based on the pattern of each row and extract cell data from each column.


You may download and import the file(s) below into your own Grooper environment (version 2023). There is a Batch with the example document(s) discussed in this tutorial, as well as a Project configured according to its instructions.
Please upload the Project to your Grooper environment before uploading the Batch. This will allow the documents within the Batch to maintain their classification status.

About

The method you use to extract information from tables is set on Data Table objects added to a Data Model. This is done using the "Extract Method" property.



Once you select an Extract Method (in this case "Row Match"), you can see all the configurable properties available to that particular method by expanding the "Extract Method" property.



The configuration of Row Match is relatively simple. As you can see in the image below, there are several properties but we will concern ourselves with the main one, the "Row Extractor". Just like any other extractor, it can be an Internal pattern or a Reference to an extractor in the Node Tree.

This extractor is made to find the general pattern of each row in a table. There are many ways to accomplish this. We will look at a simple example with a fairly simple solution. Examine the table below.

What we are trying to do is model the table's structure using regular expression. Each row has a pattern to it from left to right. First a date, followed by a five digit number, followed by a name of variable length, followed by a one to two digit number followed by a variable dollar amount. This is the table's structure, a series of rows containing this information, in this order. We can use regular expression to match that pattern of data. First, we might break it down further and look at patterns to target each column, one after the other. That could look like something below.

Regular expression Match found as seen on image Notes
\d{1,2}/\d{1,2}/\d{4} This pattern is just one of may ways to find this specific date format. This pattern would need to be adjusted to find other formats such as "month-day-year".
\d{5} A simple regex to only find a five digit number.
[A-Z]+ This pattern also matches words outside of the row, but that's ok. We will put this all together and only match the row shortly.
\d{1,2} Same with this pattern.
[$]\d{1,3}(,\d{3})?[.]\d{2} Again, one of many ways to match currency values.

Now that we can match every cell in the row, we just need to put all that information together and match the whole row. We can enable "tab marking" in a pattern to insert the "\t" control character between segments of text separated by a large amount of whitespace (larger than a normal space character). We can use this tab character to make a regular expression pattern seen below to match the entire row.

Regular expression for the entire row \d{1,2}/\d{1,2}/\d{4}\t\d{5}\t\w+\t\d{1,2}\t[$]\d{1,3}(,\d{3})?[.]\d{2}
Each row matches the single expression

Once you are able to match each row, the next part is to extract the data in that row corresponding to each column. This can be done in a number of ways, but an important thing to point out is what you're trying to match. Once Grooper finds a row using the Row Extractor, it creates an "instance" of that extracted row. You can think of this as a sub-element within the whole document. You still use regular expression to match the information within the row, but instead of matching against the entire document, you are only matching against this sub-element (or "instance"). This can trip users up who are trying to match data against the entire document. Only the portion of the text data in the extracted row is contained in the instance.

A single row matched on the document.
The match seen in the whole document's text data.
The text data in the instance of that matched row.
This is the text data you will use to pattern match each column's value. Notice you no longer have the control characters \r and \n at the end of the line. Those characters are part of the whole document's text data, but not the instance's

Just like there are many different extraction techniques to match each row, there are multiple ways table cell data is extracted from these row instances. Refer to the "How To" sections of this article for more information on how to set up a Data Table using Row Match and configure Data Columns to extract table data from a document.

Use Cases: General

A spreadsheet well suited for Row Match. It is very structured. There's a detectable pattern for each row. Even if the column labels were not present, Row Match could extract this table's information, likely with a single regular expression pattern.

The Row Match method is perfectly suited for fairly simple table structures. Spreadsheets with or without table lines are great candidates for the Row Match method. It is often the first "go-to" method of many Grooper designers due to its simplicity to set up and configure.

If you are processing a large number of the same report (or even very similar reports in some cases) over and over where the table structure remains consistent, Row Match is usually the best way to target and extract tabular data. It's easy to set up (often just using a single regular expression pattern). There are most often fewer objects to create and configure than other methods. This gives it the benefit of being computationally efficient as well.

Row Match has the added benefit of not relying on labels for each column header to model the table's structure. While most tables do have column headers, some don't. Row Match can produce results in those cases where other methods such as Tabular Layout rely on that information.

However, once the table's structure starts to change from document to document, a different approach may be needed. Different parties are going to structure tables however they want, which is well outside your control. Think of all the different ways an invoice can be structured. While the information you want is present in all the different tables, how that data is presented may not be consistent. Even just the column location changing can present problems for this method. A row extractor using a single pattern may not be able to do the job (or a complicated row extractor accounting for multiple row formats may need to be used). For these situations, the Tabular Layout method may be easier to configure and produce better results.

Optional data columns, where values may or may not be present in a cell can complicate things as well. Again, a simple row extractor using a single regex pattern may not do the trick. While a more complicated extractor may successfully extract the table's information, the Tabular Layout or Infer Grid methods may be simpler to set up and produce the same or even better results.


These are different Oil and Gas Production Reports from various sources. Each one organizes information differently into tables in different ways. Row Match would work just fine for each individual document. However, while the same information exists on each document, there's enough variability in the table structures that Row Match may not be suited for processing the whole document set (Tabular Layout usually produces a better result).


The Grid Layout method also has some advantages over Row Match. There are some specialized use cases, such as reading OMR checkboxes in tables and reprocessing table cells using a secondary OCR profile, where Infer Grid does things the other two methods simply can’t. Infer Grid also performs well when table line information can be saved to a page’s layout data.

How To

There are multiple ways to use Row Match to extract tablular data. The following tutorials will give you insight on how to set up and use Row Match in a few different ways. However, before configuring the Row Match method, we must create a Grooper Data Table with Data Columns

Creating a Data Table in Grooper

Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example for creating a Data Table.

Navigate to a Data Model

Using the Node Tree on the left side of Grooper Design Studio, navigate to the Data Model you wish to add the Data Table to. Data Tables can be created as children of any Data Model at any hierarchy in a Content Model.

Add a Data Table

Right click the Data Model object, mouse over "Add" and select "Data Table"


The following window will appear. Name the table whatever you would like and press "Execute" when finished.


This creates a new Data Table object in the Node Tree underneath the Data Model.

Add Data Columns

Right click the Data Table object, mouse over "Add" and select "Data Column"


This brings up the following window to name the Data Column. When finished, press "Execute" to create the object.


This creates a new Data Column object in the Node Tree underneath the Data Model.

Repeat Until Finished

Add as many columns as necessary to complete the table. For our example, we have a single Data Table with five Data Columns, each one named for the corresponding column on the document.


Using Row Match with Column Extractors

The first part of extracting information using the Row Match method is to create a Row Extractor to determine the table's structure. Once that is done, we can target and extract the information from each column. But how do we actually get the information out of the columns? There are a few ways. One is to use the "Value Extractor" property on each Data Column (These are often referred to as a "Column Extractors"). This guide will demonstrate how to create a Row Extractor and use Column Extractors to extract information from a table.

Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example. This is a production report filed with the Oklahoma Corporation Commission from an oil and gas company. The raw text data has already been extracted using OCR via the Recognize activity.

Add a Data Table

Create a Data Table with five Data Columns. The five columns for our example are "Operator Name", "Well Name", "Lease Number", "PC", and "Runs". Refer to the Creating a Data Table section above for more information on adding a Data Table to a Data Model.

Set the Extract Method

First, set the "Extract Method" property to "Row Match". (1) Select the Data Table object in the Node Tree, and (2) select the "Extract Method" property.


Using the dropdown list, select "Row Match".


Expand sub-properties of "Extract Method", click the drop-down for "Row Extractor", and choose "Pattern Match" as your Extractor Type.

Create the Row Extractor

The first part of configuring the Row Match method is creating a Row Extractor. The Row Extractor uses regular expression to determine the pattern of each row in the table. The extractor's "Type" can be "Internal" or "Reference". Choosing "Internal" will allow you to write a regular expression pattern straight from the property panel. "Reference" will allow you to point to an extractor built elsewhere in the Node Tree. Internal extractors are typically very simple patterns that don't need the extra properties available to Data Type extractors, such as collation, filtering or post-processing. Our case here is very simple. We will create the Row Extractor using an Internal extractor.

Expand the Row Match method's properties by double clicking "Extract Method". Press the ellipsis button next to "(empty pattern)". This will bring up Grooper's Pattern Editor to create a simple Internal extractor for the Row Match method.


Using the Pattern Editor, we will write a regular expression pattern to match each row of the table.


More specifically, we need to find something in one of these five columns to anchor off of. We need to target at least one piece of information we can use to pattern the whole row. Let's try and find the cell that has a fairly specific pattern to match. The "Operator Name" and "Well Name" seems extremely variable. The name of the well operator could be anything. It could be made of letters, numbers, special characters. Who knows? And the well name has a similar problem. "PC" looks like it's reliably two digits, but the "Lease Number" and "Runs" columns would also match a regex that is just two digits. The "Runs" column presents its own problem because not every row has a value in that column.

Information in the "Lease Number" column, however, has a very unique pattern to it. These leases always appear to be numbered as three digits followed by a dash, followed by six numbers, another dash, then one number, one more dash, and ends with four numbers. This pattern can be captured easily by the following regex pattern: \d{3}-\d{6}-\d-\d{4} Furthermore, this pattern will not match any of the other columns (While it's possible an operator's name could be "123-123456-1-1234 Resources Company", it's extraordinarily unlikely.)


Now that we have a portion of each row in the table matching, we just need to capture the rest of the row. To do this, let's switch over to the "Text" tab. This is the raw text data our regex pattern is using to find a match. From here you can see the control characters \r and \n at the end of every line of text. Looking at our rows here, we can see each row is on a new line of text. So, each row is between these two characters, \r\n. We can use \n and \r as anchors on either end of our pattern to expand the pattern and capture entire row.


We can use a negated character set to capture everything to the left of the lease number and another one to capture everything on the right. Since each row starts with a new line character (\n), everything to the left of the lease number is not a new line character. Since every row ends with a carriage return character (\r), everything to the right of the lease number is not a carriage return character. So, we can match the entire row with the following regular expression: [^\n]+\d{3}-\d{6}-\d-\d{4}[^\r]+


Press the "OK" button to finish editing the Row Extractor.


Now, we have our Row Extractor. With this single line of regular expression, we match the pattern of each of the 133 rows in this table. This is the part of Row Match extraction that gives us the table structure. Now we have the general idea of what the table looks like, row by row. But, we don't have any information populating our table in Grooper. Next, we need to extract that information and fill the Data Columns in our Data Table.

Extracting Table Data with Column Extractors

Each Data Column has a "Value Extractor" property that can be used to extract data from each row. This too can be either an "Internal" or "Reference" extractor.

We will start with the "Operator Name" column. We will use an extractor to find the portion of the row containing the operator's name ("Samson Resources Company" for example). (1) Select the "Operator Name" Data Column in the Node Tree. (2) Select the "Value Extractor" property and choose (3) "Text Pattern" from the dropdown list.

Instances vs Full Text

Before we start editing a regex pattern to match the operator name on each row, we need to look at what text we are actually matching against. Once the Row Extractor runs, an instance for each row is created. So, for our table, we have 133 rows. 133 instances are created, one for each row. If we go back to the pattern we created for the Row Extractor, we can see what is contained in an instance for a single row. Select a result from the Results list and choose "Inspect Instance..."


This brings up the "Instance Viewer" window, seen below.


The important thing to keep in mind is the Value Extractor runs against each row's instance, not the entire document. Switch to the "Text View" tab. Notice the text here is only limited to the information extracted for that row. The control characters \r and \n do not appear, as they do with the full document's text data. This is important to keep in mind. Since those control characters do not exist in the instance, you will not be able to match them using regular expression. Those control characters are outside of the instance. Only the text data inside the instance will be matched by the Data Column's Value Extractor.

The \n and \r control characters are not contained in this instance. So, they can not be used as anchors to find text at the beginning or end of the text. However, the regex characters ^ and $ can be used instead.

^ is a special character that will match the beginning of a string. Instead of using \n to match the beginning of a line in the full document's text, we can use ^ to match the beginning of the instance.

$ is a special character that will match the beginning of a string. Instead of using \r to match the end of a line in the full document's text, we can use $ to match the beginning of the instance.

Using Tabs to Indicate Column Breaks

Let's go back to the "Operator Name" Data Column. Expand the "Value Extractor" sub-properties by double clicking it. Select the "Pattern" property and press the ellipsis button at the end to bring up the Pattern Editor.



Now, we just need to write a regex pattern to match the operator name portion of the row instance. First things first, it will be helpful if we can use something in the text data to note breaks between columns. Generally, columns are separated either by table lines or large amount of whitespace (at least larger than a normal space between words). A large amount of whitespace can be represented in text by a tab character (\t). By default, Grooper leaves tab characters out of the text data. A space between words is just a space regardless of the size. However, we can insert these characters into the text data by enabling tab marking in the pattern's preprocessing options.

Tab characters can be used to represent column breaks in a table. This gives us something we can pattern match as an indicator where one column stops and another begins. With Tab Marking enabled...
...column breaks in the table on the document turn into... ...tab characters in the text flow.

<

Switch to the "Properties" tab of the Pattern Editor.


Double click the "Preprocessing Options" property to expand its subproperties. Select "Tab Marking" and change it to "Enabled". This will insert tab characters (\t) where there is a larger than normal amount of whitespace between letters.


Switch back to the "Pattern Editor" tab. Now that we have tab characters available in the text, we can use them to help target text in individual columns. The tab character will be the indicator of where one column stops and another begins.

Pattern Matching the Operator Name

Keep in mind we're trying to write a pattern to find the operator names of each row. Let's think about what we know about the operator names. First, they're going to be highly variable. Company names do not have to follow strict rules. They can contain any number of words, numbers, or other characters in very unpredictable ways. Second, they're in the first column of the row. Third, regardless of what text comprises an operator's name, it's not going to be a control character, like a \t tab character.

Knowing this, we can use a negated character set of variable length matching everything not a tab character or in regex: [^\t]+

That takes care of matching the operator names themselves, but it also matches everything else between a tab. We need to narrow this down to just the operator name.


We could take a similar approach we did with our Row Extractor. Knowing the Operator Name is the first column, we know it is always at the start of a new line, which means there will always be a \n new line control character before the operator name. We could use a look ahead pattern to limit our results only to values that have that \n character directly before them.


However, we are not matching the full text like we see in the Pattern Editor. We are matching each extracted row's individual instance. What we are actually matching is what's seen below for every row.


The \n and \r control characters are not contained in this instance. So, they can not be used as anchors to find text at the beginning or end of the text. However, the regex characters ^ and $ can be used instead.

^ is a special character that will match the beginning of a string. Instead of using \n to match the beginning of a line in the full document's text, we can use ^ to match the beginning of the instance.

$ is a special character that will match the beginning of a string. Instead of using \r to match the end of a line in the full document's text, we can use $ to match the beginning of the instance.

Since the \n character is not a part of the row's instance, using it in the look ahead will not produce a result. However, there are special characters in regular expression to mark the start or end of a string of text. The carrot ^ can be used to anchor of the beginning of a string in an instance. The dollar sign $ can be used to anchor of the end of a string in an instance.

It's easy to forget Column Extractors are matching against row instances and not the entire text. There are similar situations in Grooper that can trip users up where they think they are matching against full text but are actually matching against an instance. Because of this, many designers consider it best practice to use "\n|^" (meaning a match for a both a new line in the full text or the beginning of a string in an instance) in every case you would use just "\n".

Using "\n|^" as a look ahead pattern here will functionally match something at the beginning of a row's line, whether you're looking at the full text or a single row's instance.

FYI

The regex pattern \n|^ will match the beginning of a new line or the beginning of an instance. When dealing with tables, get in the habit of using this pattern to match both. This way, the pattern will match when running whether you are attempting to match the full text or text inside an instance.


This pattern will now successful extract the operators' names from the Operator Name column. Also, note there are a lot of matches outside of the table. That's the beauty about instancing. Since the column extractor is only looking at the instances produced by the Row Extractor, everything else on the page is ignored. This allows use much simpler regular expression. The effort of narrowing down what we want to match has already been partly done by creating an instance of just the row of text.


Press the "OK" button to finish editing the pattern.


Press the "Save" button to finish configuring the Data Column.

Configure the Remaining Column Extractors

The remaining table data can be extracted using Value Extractors on the remaining four Data Columns. These will also use "Text Pattern" as their extracting method.

Well Name

The well name can be found taking a similar approach as the Operator Name. We will use the same [^\t]+ pattern with an expanded look ahead pattern to skip the operator name from the first column. To the \n|^ pattern will will add [^\t]+\t to expand it to include the beginning of the row, the first amount of text and the first tab character in the look ahead. Note, this pattern also uses tab characters. So, Tab Marking needs to be enabled in the Preprocessing Options.

Lease Number

The lease number was easily matched using a unique pattern in our Row Extractor. We can use the same pattern to match the individual lease numbers in the row instances, without the use of a look ahead pattern.

PC

There's a variety of ways to target information from these rows. The PC Data Column is for the product code, which can be either "01" or "02". Since we know the product code comes after the lease number in the table, we could use the lease number's pattern followed by a tab character as the look ahead pattern. Or, we could be even lazier. We can produce a match without even bothering to turn on Tab Marking. Without Tab Marking, the whitespace on either side of the product code is just a simple space. Nowhere else in the text of a row do we see the pattern of a space followed by "01" or "02" followed by a space. Knowing this, we can make a super simple pattern to find the product code 0[12] with the space control character \s as the look ahead and look behind pattern.


Side Note: Using Local Lexicons for Result Translation
This pattern works perfectly. It returns either "01" or "02" but do those numbers mean? Let's say "01" means the product is crude oil and "02" means the product is condensate oil. What if the information we want to extract (and specifically what we want to export) are the words "crude oil" and "condensate oil" and not the numbers? This gives us a great opportunity to do some quick translation of these terms in Grooper. Switch to the "Properties" tab, select the "Lookup Options" property and press the ellipsis button at the end of the property.


This brings up the "Lookup Options" property window. Expand the "Vocabulary" property and select "Local Entries" Press the ellipsis button at the end of the property.


This brings up a List Editor to write a custom Lexicon used only by the pattern. We can write out some quick translation here by writing out "01=Crude Oil" and "02=Condensate Oil". Press the "OK" button when finished.


Last, change the "Enable Translation" property from "False" to "True". This will actually perform the translation,changing the values from "01" and "02" to "Crude Oil" and "Condensate Oil". Press the "OK" button when finished.


Now, the original extracted value is translated according to what was set in the Local Entries. So, if the extractor returns "01", it gets changed to "Crude Oil" and "02" is changed to "Condensate Oil".


Runs

The Runs column is unique for two reasons. One, it's an optional column. Sometimes there is a value in these cells, sometimes not. Two, and more importantly, it's at the end of the row. Remember, we are matching the row instances and not the full text. Just like we couldn't use \n to anchor the first column's information to the beginning of a new line, we can't use \r to anchor the final column to the end of the line. However, since we are dealing with instances we can use the end of string character in regular expression, which is the dollar sign $

We can write a pattern to match a variety of currency values using the following regex: \d{1,3}(,\d{3}){0,2}[.]\d{2} with $ as the look behind pattern, therefore only matching a currency value at the end of the row instance. And, if it doesn't produce a match, that cell will simply be left blank.


FYI

Just like some Grooper designers will get in the habit of using \n|^ to signify the beginning of a line or start of string, \r|$ can be used for the end of a line or end of string. Getting into this habit will match whether in an instance or using the full document text data.

Test the Extraction

Now, Grooper knows how to find the table's structure using the Row Extractor and knows how to extract each cell's values according to the Data Column's Value Extractors (or Column Extractors for short). Let's see how we did. Navigate back to the Data Table object and press the "Test Extraction" button.


This shows us what data will be extracted during the Extract activity. As you can see below, all the information from the table is successfully extracted, for all 133 rows.

Using Row Match with Named Groups

While the example above works, there is a much more elegant way to do this in Grooper. "Named groups" allow users to target portions of a regular expression pattern and call it out in different ways. Using named groups, we can extract the exact same data from the exact same document without configuring Column Extractors. Only a single regular expression pattern will be written, that for the Row Extractor.

Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

Just like the previous guide, we will use the table below as our example. This is a production report filed with the Oklahoma Corporation Commission from an oil and gas company. The raw text data has already been extracted using OCR via the Recognize activity.

Add a Data Table

Create a Data Table with five Data Columns. These are the same Data Columns used in the previous example. The five columns for our example are "Operator Name", "Well Name", "Lease Number", "PC", and "Runs". Refer to the Creating a Data Table section above for more information on adding a Data Table to a Data Model.

Set the Extract Method

First, set the "Extract Method" property to "Row Match". (1) Select the Data Table object in the Node Tree, and (2) select the "Extract Method" property.


Using the dropdown list, select "Row Match".


Expand the sub-properties of "Extract Method", click the drop-down for "Row Extractor", and set to the "Pattern Match" Extractor Type.

Create the Row Extractor

Like the previous example, the first part of configuring the Row Match method is creating a Row Extractor. The Row Extractor uses regular expression to determine the pattern of each row in the table. The extractor's "Type" can be "Internal" or "Reference". Choosing "Internal" will allow you to write a regular expression pattern straight from the property panel. "Reference" will allow you to point to an extractor built elsewhere in the Node Tree. Internal extractors are typically very simple patterns that don't need the extra properties available to Data Type extractors, such as collation, filtering or post-processing. Our case here is very simple. We will create the Row Extractor using an Internal extractor.

Expand the Row Match method's properties by double clicking "Extract Method". Press the ellipsis button next to "(empty pattern)". This will bring up Grooper's Pattern Editor to create an Internal extractor for the Row Match method.


This brings up the Pattern Editor window to write a regex pattern.


The pattern we used in the previous example was as follows: [^\n]+\d{3}-\d{6}-\d-\d{4}[^\r]+. This did model the pattern of each row, and we were able to use the Data Columns' Value Extractors to return each column's information, but we had to configure each individual Data Column to do so. Named groups will allow us to use a single regular expression pattern, set on the Row Extractor, and use portions of the regex to fill in our columns' values automatically, without even touching the Data Columns' properties.

To do this, we first need to write a more specific regular expression pattern. The general idea is to write a pattern that matches the pattern of each column, one after the other, while also matching the entire row. The first thing we might want to do is enable Tab Marking. This will insert the \t tab character into the raw text data wherever there's a larger than normal amount of whitespace between characters (larger than a space character between words). This gives us something we can match in the text flow that functionally "means" a column break.


Tab characters can be used to represent column breaks in a table. This gives us something we can pattern match as an indicator where one column stops and another begins. With Tab Marking enabled...
...column breaks in the table on the document turn into... ...tab characters in the text flow.

Switch to the "Properties" tab of the Pattern Editor.


Double click the "Preprocessing Options" property to expand its subproperties. Select "Tab Marking" and change it to "Enabled". This will insert tab characters (\t) where there is a larger than normal amount of whitespace between letters.


Switch back to the "Pattern Editor" tab. Now that we have tab characters available in the text, we can use them to help target text in individual columns. Essentially, what we will do is combine all the individual patterns we wrote for the Column Extractors in the previous example in one large regex. Lets start with the Operator Name (captured by [^\t]+ followed by \t indicating the column break) with a look ahead pattern forcing our pattern to only match something on a new line or the start of a string (\n|^)


Next, we will match the Well Name column with [^\t]+\t.


Then, comes the Lease Number captured by \d{3}-\d{6}-\d-\d{4}\t.


Then, the PC (or Product Code) column. For document the following regex works \d{2}\t. However, keep in mind the Runs column may have a value in it and it may not depending on the row. Where the column is empty, there is a hyphen in the cell. However, it's more than possible that cell may just be left empty (or OCR may not have picked up that character). If that is the case, a tab character \t would not follow the two digits in the text flow. Instead, those two digits would be at the end of the line. So, the next character would be the carriage return character \r. Furthermore, a line of text at the end of the page does not terminate in a carriage return. The end of a page is indicated by a the form feed character \f. Last but not least, remember our Row Extractor is creating instances out of every row of text. The end of a line of text in an instance is indicated by the end of string character $

All of that to say, there are several different characters that could indicate we've come to the end of this column, including \t \r \f and $. We will account for all these possibilities by using the following regex ([\t\r\f]|$) instead of just \t. So it match any of the control characters in the character set ([\t\r\f]) or (|) the end of string character ($).

Putting it all together it comes to \d{2}([\t\r\f]|$)


Last comes the Runs column. The Runs column is unique in that it is an optional column. Data may or may not be present in it. We can account for this using the ? character in regular expression. We will create a group around the regex capturing the dollar amount by putting it in parenthesis, and making it optional by including the ? character after it.

The regex to match the currency values is \d{1,3}(,\d{3})?[.]\d{2} and we make it optional by using (\d{1,3}(,\d{3})?[.]\d{2})?


This single pattern is matching each row, accomplishing the primary goal of a Row Extractor. Now that we have a Row Extractor, how do we use named groups to fill in our column values?

Creating Named Groups for Fun and Profit

Next we are going to create five named groups in our regex pattern, one for each column. Each group will be named after the Data Column its corresponding regex pattern captures. The regex capturing the operator names will be named after the "Operator Name" Data Column. The regex capturing the well names and numbers will be named after the "Well Name" Data Column, and so on. The idea behind this method is to use the values returned by each group to populate values for each Data Column.

Our first column is "Operator Name". The first line of our Row Extractor captures the operator names with [\t]+ (Note this does not include the \t character we used to anchor values between column breaks. We just want the text that makes up an operator's name). This will be our first group. We've already created a group in the previous step when we put the regex to find the Runs values in parenthesis to make that part of our pattern optional with the ? character. The difference here is we're naming the group.

Named groups can be created and named using the following syntax: (?<Group_Name>regex pattern)

In between the ?< and > characters is where you will name the group. After the > is where the regex goes. And the whole thing is put between parenthesis.

We could type all this out, but Grooper has a shortcut. Highlight this portion of the regular expression with your mouse and right click it. At the bottom of the list of options is "Create Group". Click this option.


This places the selected regex in a named group and puts the cursor in between the ?< and > characters where we will name the group.


There are two key things to remember.

  1. We want the group to match the corresponding Data Column's name exactly.
  2. Spaces should be replaced with an underscore _


FYI

There is also a keyboard shortcut for creating a named group. It is Ctrl + G

Now we have a reference for part of the whole regular expression that we can use elsewhere.

Next, we need to create groups for the rest of our Data Columns. For the same thing we did for the Operator Name column, we will do for the remaining four Data Column. Keep in mind, we want just the values, not any control characters used to anchor those values in one column position or another.


You may be asking yourself, why do the names need to match the Data Columns? What are we really doing here? We're making instances! We've made individual sub-instances of the whole pattern, one for every named group. Right click one the results and select "Inspect Instance..."


For each of our five named groups, we now have five sub-instances (or four for rows that don't have a Runs value). Each instance contains information from its part of the regular expression pattern. Ultimately, the values contained in these instances will populate the individual cells of our Data Table.


And for the real magic, since these instance names match our Data Column names, we don't need to configure any extraction on the Data Columns. By default, Data Columns will look for instances in a Row Extractor to populate values. If the Data Column finds an instance with the same name, it will return the instance's value!

By using named groups, and naming them after our Data Table's Data Columns, the portion of the row the group captures will populate the cell in the corresponding Data Column. Let's see that in action next.

Test the Extraction

Select the Data Table in the Node Tree and press the "Test Extraction" button.


Success! Grooper successfully extracts all 133 table rows, using only the Row Extractor with named groups.

Version Differences

The Row Match method is the oldest and original method of extracting table data in Grooper. As such, not much has changed from previous versions. However, there are two other table extract methods available in Grooper now, Tabular Layout and Grid Layout. These methods provide functionality either not available to Row Match, or make configuring table extraction much simpler for certain use cases.

See Also