2.80:Row Match (Table Extract Method): Difference between revisions

Revision as of 15:47, 20 January 2020

Row Match is one of three methods available to Data Table data elements to extract information from tables on a document set. It uses regular expression pattern matching to determine a tables structure based on the pattern of each row and extract cell data from each column.

About

The configuration of Row Match is relatively simple. As you can see in the image below, there is only one property to configure, the "Row Extractor". Just like any other extractor, it can be an Internal pattern or a Reference to an extractor in the Node Tree.

This extractor is made to find the general pattern of each row in a table. There are many ways to accomplish this. We will look at a simple example with a fairly simple solution. Examine the table below.

What we are trying to do is model the table's structure using regular expression. Each row has a pattern to it from left to right. First a date, followed by a five digit number, followed by a name of variable length, followed by a one to two digit number followed by a variable dollar amount. This is the table's structure, a series of rows containing this information, in this order. We can use regular expression to match that pattern of data. First, we might break it down further and look at patterns to target each column, one after the other. That could look like something below.

Regular expression	Match found as seen on image	Notes
`\d{1,2}/\d{1,2}/\d{4}`		This pattern is just one of may ways to find this specific date format. This pattern would need to be adjusted to find other formats such as "month-day-year".
`\d{5}`		A simple regex to only find a five digit number.
`[A-Z]+`		This pattern also matches words outside of the row, but that's ok. We will put this all together and only match the row shortly.
`\d{1,2}`		Same with this pattern.
`[$]\d{1,3}(,\d{3})?[.]\d{2}`		Again, one of many ways to match currency values.

Now that we can match every cell in the row, we just need to put all that information together and match the whole row. We can enable "tab marking" in a pattern to insert the "\t" control character between segments of text separated by a large amount of whitespace (larger than a normal space character). We can use this tab character to make a regular expression pattern seen below to match the entire row.

Regular expression for the entire row	`\d{1,2}/\d{1,2}/\d{4}\t\d{5}\t\w+\t\d{1,2}\t[$]\d{1,3}(,\d{3})?[.]\d{2}`
Each row matches the single expression

Once you are able to match each row, the next part is to extract the data in that row corresponding to each row. This can be done in a number of ways, but an important thing to point out is what you're trying to match. Once Grooper finds a row using the Row Extractor, it creates an "instance" of that extracted row. You can think of this as a sub-element within the whole document. You still use regular expression to match the information within the row, but instead of matching against the entire document, you are only matching against this sub-element (or "instance"). This can trip users up who are trying to match data against the entire document. Only the portion of the text data in the extracted row is contained in the instance.

A single row matched on the document.

The match seen in the whole document's text data.

The text data in the instance of that matched row.
This is the text data you will use to pattern match each column's value. Notice you no longer have the control characters `\r` and `\n` at the end of the line. Those characters are part of the whole document's text data, but not the instance's

Just like there are many different extraction techniques to match each row, there are multiple ways table cell data is extracted from these row instances. Refer to the "How To" sections of this article for more information on how to set up a Data Table using Row Match and configure Data Columns to extract table data from a document.

Version Differences

The Row Match method is the oldest and original method of extracting table data in Grooper. As such, not much has changed from previous versions. However, there are two other table extract methods available in Grooper now, Header-Value and Infer Grid. These methods provide functionality either not available to Row Match, or make configuring table extraction much simpler for certain use cases.

Use Cases

The Row Match method is perfectly suited for fairly simple table structures. Spreadsheets with or without table lines are great candidates for the Row Match method. It is often the first "go-to" method of many Grooper designers due to its simplicity to set up and configure.

If you are processing a large number of the same report (or even very similar reports in some cases) over and over where the table structure remains consistent, Row Match is usually the best way to target and extract tabular data. It's easy to set up (often just using a single regular expression pattern). There are most often fewer objects to create and configure than other methods. This gives it the benefit of being computationally efficient as well.

Row Match has the added benefit of not relying on labels for each column header to model the table's structure. While most tables do have column headers, some don't. Row Match can produce results in those cases where other methods such as Header-Value rely on that information.

However, once the table's structure starts to change from document to document, a different approach may be needed. Different parties are going to structure tables however they want, which is well outside your control. Think of all the different ways an invoice can be structured. While the information you want is present in all the different tables, how that data is presented may not be consistent. Even just the column location changing can present problems for this method. A row extractor using a single pattern may not be able to do the job (or a complicated row extractor accounting for multiple row formats may need to be used). For these situations, the Header-Value method may be easier to configure and produce better results.

Optional data columns, where values may or may not be present in a cell can complicate things as well. Again, a simple row extractor using a single regex pattern may not do the trick. While a more complicated extractor may successfully extract the table's information, the Header-Value or Infer Grid methods may be simpler to set up and produce the same or even better results.

These are different Oil and Gas Production Reports from various sources. Each one organizes information differently into tables in different ways. Row Match would work just fine for each individual document. However, while the same information exists on each document, there's enough variability in the table structures that Row Match may not be suited for processing the whole document set (Header-Value usually produces a better result).

How To

There are multiple ways to use Row Match to extract tablular data. The following tutorials will give you insight on how to set up and use Row Match in a few different ways. However, before configuring the Row Match method, we must create a Grooper Data Table with Data Columns

Creating a Data Table in Grooper

PrereqsStep 1Step 2Step 3Step 4

Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example for creating a Data Table.

Navigate to a Data Model

Using the Node Tree on the left side of Grooper Design Studio, navigate to the Data Model you wish to add the Data Table to. Data Tables can be created as children of any Data Model at any hierarchy in a Content Model.

Add a Data Table

Right click the Data Model object, mouse over "Add" and select "Data Table"

The following window will appear. Name the table whatever you would like and press "OK" when finished.

This creates a new Data Table object in the Node Tree underneath the Data Model.

Add Data Columns

Right click the Data Table object, mouse over "Add" and select "Data Column"

This brings up the following window to name the Data Column. When finished, press "OK" to create the object.

This creates a new Data Column object in the Node Tree underneath the Data Model.

Repeat Until Finished

Add as many columns as necessary to complete the table. For our example, we have a single Data Table with five Data Columns, each one named for the corresponding column on the document.

Using Row Match with Column Extractors

The first part of extracting information using the Row Match method is to create a Row Extractor to determine the table's structure. Once that is done, we can target and extract the information from each column. But how do we actually get the information out of the columns? There are a few ways. One is to use the "Column Extractor" property on each Data Column. This guide will demonstrate how to create a Row Extractor and use Column Extractors to extract information from a table.

PrereqsStep 1Step 2Step 3Step 4

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example. This is a production report filed with the Oklahoma Corporation Commission from an oil and gas company. The raw text data has already been extracted using OCR via the Recognize activity.

Add a Data Table

Create a Data Table with five Data Columns. The five columns for our example are "Operator Name", "Well Name", "Lease Number", "PC", and "Runs". Refer to the Creating a Data Table section above for more information on adding a Data Table to a Data Model.

Set the Extract Method

First, set the "Extract Method" property to "Row Match". (1) Select the Data Table object in the Node Tree, and (2) select the "Extract Method" property.

Using the dropdown list, select "Row Match".

Create the Row Extractor

The first part of configuring the Row Match method is creating a Row Extractor. The Row Extractor uses regular expression to determine the pattern of each row in the table. The extractor's "Type" can be "Internal" or "Reference". Choosing "Internal" will allow you to write a regular expression pattern straight from the property panel. "Reference" will allow you to point to an extractor built elsewhere in the Node Tree. Internal extractors are typically very simple patterns that don't need the extra properties available to Data Type extractors, such as collation, filtering or post-processing. Our case here is very simple. We will create the Row Extractor using an Internal extractor.

Expand the Row Match method's properties by double clicking "Extract Method". Press the ellipsis button next to "(empty pattern)". This will bring up Grooper's Pattern Editor to create a simple Internal extractor for the Row Match method.

Using the Pattern Editor, we will write a regular expression pattern to match each row of the table.

More specifically, we need to find something in one of these five columns to anchor off of. We need to target at least one piece of information we can use to pattern the whole row. Let's try and find the cell that has a fairly specific pattern to match. The "Operator Name" and "Well Name" seems extremely variable. The name of the well operator could be anything. It could be made of letters, numbers, special characters. Who knows? And the well name has a similar problem. "PC" looks like it's reliably two digits, but the "Lease Number" and "Runs" columns would also match a regex that is just two digits. The "Runs" column presents its own problem because not every row has a value in that column.

Information in the "Lease Number" column, however, has a very unique pattern to it. These leases always appear to be numbered as three digits followed by a dash, followed by six numbers, another dash, then one number, one more dash, and ends with four numbers. This pattern can be captured easily by the following regex pattern: \d{3}-\d{6}-\d-\d{4} Furthermore, this pattern will not match any of the other columns (While it's possible an operator's name could be "123-123456-1-1234 Resources Company", it's extraordinarily unlikely.)

Now that we have a portion of each row in the table matching, we just need to capture the rest of the row. To do this, let's switch over to the "Text" tab. This is the raw text data our regex pattern is using to find a match. From here you can see the control characters \r and \n at the end of every line of text. Looking at our rows here, we can see each row is on a new line of text. So, each row is between these two characters, \r\n. We can use \n and \r as anchors on either end of our pattern to expand the pattern and capture entire row.

We can use a negated character set to capture everything to the left of the lease number and another one to capture everything on the right. Since each row starts with a new line character (\n), everything to the left of the lease number is not a new line character. Since every row ends with a carriage return character (\r), everything to the right of the lease number is not a carriage return character. So, we can match the entire row with the following regular expression: [^\n]+\d{3}-\d{6}-\d-\d{4}[^\r]+

Press the "OK" button to finish editing the Row Extractor.

Now, we have our Row Extractor. With this single line of regular expression, we match the pattern of each of the 133 rows in this table. This is the part of Row Match extraction that gives us the table structure. Now we have the general idea of what the table looks like, row by row. But, we don't have any information populating our table in Grooper. Next, we need to extract that information and fill the Data Columns in our Data Table.

Extracting Table Data with Column Extractors

Each Data Column has a "Value Extractor" property that can be used to extract data from each row. This too can be either an "Internal" or "Reference" extractor.

FYI

Version 2.9 added several new extraction methods available to Data Fields and Data Columns besides "Internal" and "Reference". "Internal" has been renamed "Text Pattern", but it functions exactly like an Internal pattern.

We will start with the "Operator Name" column. We will use an extractor to find the portion of the row containing the operator's name ("Samson Resources Company" for example). (1) Select the "Operator Name" Data Column in the Node Tree. (2) Select the "Value Extractor" property and choose (3) "Text Pattern" from the dropdown list.

Instances vs Full Text

Before we start editing a regex pattern to match the operator name on each row, we need to look at what text we are actually matching against. Once the Row Extractor runs, an instance for each row is created. So, for our table, we have 133 rows. 133 instances are created, one for each row. If we go back to the pattern we created for the Row Extractor, we can see what is contained in an instance for a single row. Select a result from the Results list and choose "Inspect Instance..."

This brings up the "Instance Viewer" window, seen below.

The important thing to keep in mind is the Value Extractor runs against each row's instance, not the entire document. Switch to the "Text View" tab. Notice the text here is only limited to the information extracted for that row. The control characters \r and \n do not appear, as they do with the full document's text data. This is important to keep in mind. Since those control characters do not exist in the instance, you will not be able to match them using regular expression. Those control characters are outside of the instance. Only the text data inside the instance will be matched by the Data Column's Value Extractor.

Using Tabs to Indicate Column Breaks

Let's go back to the "Operator Name" Data Column. Expand the "Value Extractor" sub-properties by double clicking it. Select the "Pattern" property and press the ellipsis button at the end to bring up the Pattern Editor.

Now, we just need to write a regex pattern to match the operator name portion of the row instance. First things first, it will be helpful if we can use something in the text data to note breaks between columns. Generally, columns are separated either by table lines or large amount of whitespace (at least larger than a normal space between words). A large amount of whitespace can be represented in text by a tab character (\t). By default, Grooper leaves tab characters out of the text data. A space between words is just a space regardless of the size. However, we can insert these characters into the text data by enabling tab marking in the pattern's preprocessing options.

Switch to the "Properties" tab of the Pattern Editor.

Double click the "Preprocessing Options" property to expand its subproperties. Select "Tab Marking" and change it to "Enabled". This will insert tab characters (\t) where there is a larger than normal amount of whitespace between letters.

Switch back to the "Pattern Editor" tab. Now that we have tab characters available in the text, we can use them to help target text in individual columns. The tab character will be the indicator of where one column stops and another begins.

Pattern Matching the Operator Name

Keep in mind we're trying to write a pattern to find the operator names of each row. Let's think about what we know about the operator names. First, they're going to be highly variable. Company names do not have to follow strict rules. They can contain any number of words, numbers, or other characters in very unpredictable ways. Second, they're in the first column of the row. Third, regardless of what text comprises an operator's name, it's not going to be a control character, like a \t tab character.

Knowing this, we can use a negated character set of variable length matching everything not a tab character or in regex: [^\t]+

That takes care of matching the operator names themselves, but it also matches everything else between a tab. We need to narrow this down to just the operator name.

We could take a similar approach we did with our Row Extractor. Knowing the Operator Name is the first column, we know it is always at the start of a new line, which means there will always be a \n new line control character before the operator name. We could use a look ahead pattern to limit our results only to values that have that \n character directly before them.

However, we are not matching the full text like we see here. We are matching each extracted row's individual instance. What we are actually matching is what's seen below for every row.

Since the \n character is not a part of the row's instance, using it in the look ahead will not produce a result. However, there are special characters in regular expression to mark the start or end of a string of text. The carrot ^ can be used to anchor of the beginning of a string. The dollar sign $ can be used to anchor of the end of a string.

It's easy to forget column extractors are matching against row instances and not the entire text. There are similar situations in Grooper that can trip users up where they think they are matching against full text but are actually matching against an instance. Because of this, many designers consider it best practice to use "\n|^", meaning a match for a new line command or the beginning of a string, in every case you would use just "\n". Using "\n|^" as a look ahead pattern will functionally match something that begins at a new line, whether or not the text is matched as full text or within an instance (Technically, in an instance, it matches if it's at the beginning of the string or the beginning of the instance).

This pattern will now successful extract the operators' names from the Operator Name column. Also, note there are a lot of matches outside of the table. That's the beauty about instancing. Since the column extractor is only looking at the instances produced by the Row Extractor, everything else on the page is ignored. This allows use much simpler regular expression. The effort of narrowing down what we want to match has already been partly done by creating an instance of just the row of text.

Press the "OK" button to finish editing the pattern.

Press the "Save" button to finish configuring the Data Column.

Using Row Match with Named Groups

Prereqs

Using Row Match with Ordered Arrays

Prereqs

@@ Line 315: / Line 315: @@
 ==== Pattern Matching the Operator Name ====
-Keep in mind we're trying to write a pattern to find the operator names of each row.  Let's think about what we know about the operator names.  First, they're going to be highly variable.  Company names do not have to follow strict rules.  They can contain any number of words, numbers, or other characters in very unpredictable ways.  Second, they're in the first column of the row.  Third, regardless of what text comprises an operator's name, it's not going to be a control character, like a <code>\t</tab> tab character.
+Keep in mind we're trying to write a pattern to find the operator names of each row.  Let's think about what we know about the operator names.  First, they're going to be highly variable.  Company names do not have to follow strict rules.  They can contain any number of words, numbers, or other characters in very unpredictable ways.  Second, they're in the first column of the row.  Third, regardless of what text comprises an operator's name, it's not going to be a control character, like a <code>\t</code> tab character.
-Knowing this, we can use a negated character set of variable length matching everything not a tab character or in regex: <code>[^\t]+<\code>
+Knowing this, we can use a negated character set of variable length matching everything not a tab character or in regex: <code>[^\t]+</code>
 That takes care of matching the operator names themselves, but it also matches everything else between a tab.  We need to narrow this down to just the operator name.
@@ Line 337: / Line 337: @@
-Since the <code>\n</code> character is not a part of the row's instance, using it in the look ahead will not produce a result.  However, there are special characters in regular expression to mark the start or end of a string of text.  The carrot <code>^</code> can be used to anchor of the beginning of a string.  The dollar sign <code>$</cdoe> can be used to anchor of the end of a string.
+Since the <code>\n</code> character is not a part of the row's instance, using it in the look ahead will not produce a result.  However, there are special characters in regular expression to mark the start or end of a string of text.  The carrot <code>^</code> can be used to anchor of the beginning of a string.  The dollar sign <code>$</code> can be used to anchor of the end of a string.
 It's easy to forget column extractors are matching against row instances and not the entire text.  There are similar situations in Grooper that can trip users up where they think they are matching against full text but are actually matching against an instance.  Because of this, many designers consider it best practice to use "<code>\n|^</code>", meaning a match for a new line command or the beginning of a string, in every case you would use just "<code>\n</code>".  Using "<code>\n|^</code>" as a look ahead pattern will functionally match something that begins at a new line, whether or not the text is matched as full text or within an instance (Technically, in an instance, it matches if it's at the beginning of the string or the beginning of the instance).