2.80:Row Match (Table Extract Method)

Row Match is one of three methods available to Data Table data elements to extract information from tables on a document set. It uses regular expression pattern matching to determine a table's structure based on the pattern of each row and extract cell data from each column.

About

The configuration of Row Match is relatively simple. As you can see in the image below, there is only one property to configure, the "Row Extractor". Just like any other extractor, it can be an Internal pattern or a Reference to an extractor in the Node Tree.

This extractor is made to find the general pattern of each row in a table. There are many ways to accomplish this. We will look at a simple example with a fairly simple solution. Examine the table below.

What we are trying to do is model the table's structure using a regular expression. Each row has a pattern to it from left to right: first a date, followed by a five-digit number, followed by a name of variable length, followed by a one- to two-digit number, followed by a variable dollar amount. This is the table's structure: a series of rows containing this information, in this order. We can use a regular expression to match that pattern of data. First, we might break it down further and look at patterns to target each column, one after the other. That could look something like the patterns below.

Regular expression	Match found as seen on image	Notes
`\d{1,2}/\d{1,2}/\d{4}`		This pattern is just one of many ways to find this specific date format. This pattern would need to be adjusted to find other formats such as "month-day-year".
`\d{5}`		A simple regex to only find a five digit number.
`[A-Z]+`		This pattern also matches words outside of the row, but that's ok. We will put this all together and only match the row shortly.
`\d{1,2}`		Same with this pattern.
`[$]\d{1,3}(,\d{3})?[.]\d{2}`		Again, one of many ways to match currency values.
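
The column patterns above can be tried outside of Grooper to see how they behave on their own. The following is a minimal sketch using Python's `re` module as a stand-in for Grooper's regular expression engine. The sample row and the column names are invented for illustration and do not come from the pictured table.

```python
import re

# Hypothetical sample row; the values are invented for illustration.
sample_row = "10/12/2019 48213 SMITH 12 $1,250.00"

# Column patterns copied from the table above. The dictionary keys are
# illustrative labels, not Grooper Data Column names.
column_patterns = {
    "date": r"\d{1,2}/\d{1,2}/\d{4}",
    "id": r"\d{5}",
    "name": r"[A-Z]+",
    "quantity": r"\d{1,2}",
    "amount": r"[$]\d{1,3}(,\d{3})?[.]\d{2}",
}

# Collect every match of each pattern across the whole sample row.
matches = {
    label: [m.group(0) for m in re.finditer(pattern, sample_row)]
    for label, pattern in column_patterns.items()
}

for label, found in matches.items():
    print(f"{label}: {found}")
```

In this sketch the `quantity` pattern also matches pieces of the date, the five-digit number and the dollar amount. That over-matching is why the individual patterns are combined into a single row pattern rather than used alone.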

Now that we can match every cell in the row, we just need to put all that information together and match the whole row. We can enable "tab marking" in a pattern to insert the "\t" control character between segments of text separated by a large amount of whitespace (larger than a normal space character). We can use this tab character to make a regular expression pattern seen below to match the entire row.

Regular expression for the entire row	`\d{1,2}/\d{1,2}/\d{4}\t\d{5}\t\w+\t\d{1,2}\t[$]\d{1,3}(,\d{3})?[.]\d{2}`
Each row matches the single expression
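
The row pattern can be sketched in Python as follows. The `mark_tabs` helper is a hypothetical approximation of tab marking (it simply turns runs of two or more spaces into one tab) and is not how Grooper implements the feature. The sample text is also invented for illustration.

```python
import re

# Row pattern copied from the article.
ROW_PATTERN = r"\d{1,2}/\d{1,2}/\d{4}\t\d{5}\t\w+\t\d{1,2}\t[$]\d{1,3}(,\d{3})?[.]\d{2}"


def mark_tabs(text: str) -> str:
    """Hypothetical stand-in for tab marking: 2+ spaces become one tab."""
    return re.sub(r" {2,}", "\t", text)


# Invented sample text: a header line, two data rows and a non-table line.
raw_text = (
    "Date        ID      Name     Qty   Amount\r\n"
    "10/12/2019  48213   SMITH    12    $1,250.00\r\n"
    "1/3/2020    50077   JONES    4     $85.50\r\n"
    "Report generated for internal use\r\n"
)

document_text = mark_tabs(raw_text)

# Each match of the row pattern is one table row.
rows = [m.group(0) for m in re.finditer(ROW_PATTERN, document_text)]

for row in rows:
    print(repr(row))
```

Only the two data rows match. The header line and the trailing sentence do not fit the date, number, name, number, amount sequence, so they are left out.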

Once you are able to match each row, the next part is to extract the data in that row corresponding to each column. This can be done in a number of ways, but an important thing to point out is what you're trying to match. Once Grooper finds a row using the Row Extractor, it creates an "instance" of that extracted row. You can think of this as a sub-element within the whole document. You still use regular expression to match the information within the row, but instead of matching against the entire document, you are only matching against this sub-element (or "instance"). This can trip up users who are trying to match data against the entire document. Only the portion of the text data in the extracted row is contained in the instance.

A single row matched on the document.

The match seen in the whole document's text data.

The text data in the instance of that matched row.
This is the text data you will use to pattern match each column's value. Notice you no longer have the control characters `\r` and `\n` at the end of the line. Those characters are part of the whole document's text data, but not the instance's.
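
The difference between matching against the whole document and matching against a row instance can be illustrated with a short Python sketch. Here the text of a regex match plays the role of the instance. This is an analogy only, not Grooper's internal representation, and the sample text is invented.

```python
import re

# Row pattern copied from the article.
ROW_PATTERN = r"\d{1,2}/\d{1,2}/\d{4}\t\d{5}\t\w+\t\d{1,2}\t[$]\d{1,3}(,\d{3})?[.]\d{2}"

# Invented document text: one non-table line followed by one table row.
document_text = (
    "Invoice 12345 issued 10/01/2019\r\n"
    "10/12/2019\t48213\tSMITH\t12\t$1,250.00\r\n"
)

# The matched text stands in for the row "instance".
row_match = re.search(ROW_PATTERN, document_text)
instance_text = row_match.group(0)

# The same column pattern gives different results depending on its input.
ids_in_document = re.findall(r"\d{5}", document_text)
ids_in_instance = re.findall(r"\d{5}", instance_text)

print(repr(instance_text))
print(ids_in_document)
print(ids_in_instance)
```

Against the whole document, the five-digit pattern also picks up the invoice number on the first line. Against the instance, it only sees the row's own value, and the trailing `\r\n` is not part of the text being searched.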

Just like there are many different extraction techniques to match each row, there are multiple ways table cell data is extracted from these row instances. Refer to the "How To" sections of this article for more information on how to set up a Data Table using Row Match and configure Data Columns to extract table data from a document.

Version Differences

The Row Match method is the oldest and original method of extracting table data in Grooper. As such, not much has changed from previous versions. However, there are two other table extract methods available in Grooper now, Header-Value and Infer Grid. These methods either provide functionality not available to Row Match or make configuring table extraction much simpler for certain use cases.

Use Cases

The Row Match method is well suited for fairly simple table structures. Spreadsheets with or without table lines are great candidates for the Row Match method. It is often the first "go-to" method of many Grooper designers because it is simple to set up and configure.

If you are processing a large number of the same report (or even very similar reports in some cases) over and over where the table structure remains consistent, Row Match is usually the best way to target and extract tabular data. It's easy to set up (often just using a single regular expression pattern). There are most often fewer objects to create and configure than other methods. This gives it the benefit of being computationally efficient as well.

Row Match has the added benefit of not relying on labels for each column header to model the table's structure. While most tables do have column headers, some don't. Row Match can still produce results in those cases, whereas other methods such as Header-Value rely on that information.

However, once the table's structure starts to change from document to document, a different approach may be needed. Different parties are going to structure tables however they want, which is well outside your control. Think of all the different ways an invoice can be structured. While the information you want is present in all the different tables, how that data is presented may not be consistent. Even just the column location changing can present problems for this method. A row extractor using a single pattern may not be able to do the job (or a complicated row extractor accounting for multiple row formats may need to be used). For these situations, the Header-Value method may be easier to configure and produce better results.

Optional data columns, where values may or may not be present in a cell, can complicate things as well. Again, a simple row extractor using a single regex pattern may not do the trick. While a more complicated extractor may successfully extract the table's information, the Header-Value or Infer Grid methods may be simpler to set up and produce the same or even better results.
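
As a rough illustration of what a "more complicated" row pattern might look like, the Python sketch below makes the one- to two-digit column optional. It assumes that an empty cell leaves a single tab between its neighbors after tab marking. That is an assumption made for the example and may not match how a given document is actually marked. The sample rows and the `qty` group name are invented.

```python
import re

# Variation on the article's row pattern: the quantity cell and its
# trailing tab are wrapped in an optional group (an illustrative choice).
OPTIONAL_QTY_ROW = (
    r"\d{1,2}/\d{1,2}/\d{4}\t\d{5}\t\w+\t"
    r"(?:(?P<qty>\d{1,2})\t)?"
    r"[$]\d{1,3}(,\d{3})?[.]\d{2}"
)

# Invented rows: the second one has no quantity value.
rows = [
    "10/12/2019\t48213\tSMITH\t12\t$1,250.00",
    "1/3/2020\t50077\tJONES\t$85.50",
]

# The named group is None when the optional cell is absent.
quantities = [re.fullmatch(OPTIONAL_QTY_ROW, row).group("qty") for row in rows]

print(quantities)
```

Each additional optional column adds another optional group, and the pattern becomes harder to read and maintain as they accumulate. That is the trade-off against the Header-Value or Infer Grid methods described above.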

These are different Oil and Gas Production Reports from various sources. Each one organizes its information into tables differently. Row Match would work just fine for each individual document. However, while the same information exists on each document, there's enough variability in the table structures that Row Match may not be suited for processing the whole document set (Header-Value produces a better result).