2.80:Row Match (Table Extract Method): Difference between revisions

Revision as of 11:41, 17 January 2020

Row Match is one of three methods available to Data Table data elements to extract information from tables on a document set. It uses regular expression pattern matching to determine a table's structure based on the pattern of each row and extract cell data from each column.

About

The configuration of Row Match is relatively simple. As you can see in the image below, there is only one property to configure, the "Row Extractor". Just like any other extractor, it can be an Internal pattern or a Reference to an extractor in the Node Tree.

This extractor is made to find the general pattern of each row in a table. There are many ways to accomplish this. We will look at a simple example with a fairly simple solution. Examine the table below.

Each row has a pattern we can match using a regular expression: first a date, followed by a five-digit number, followed by a name of variable length, followed by a one- to two-digit number, followed by a variable dollar amount. To start, we might look at patterns to target each column, one after the other. That might look something like the patterns below.

Regular expression	Match found as seen on image	Notes
`\d{1,2}/\d{1,2}/\d{4}`		This pattern is just one of many ways to find this specific date format. This pattern would need to be adjusted to find other formats such as "month-day-year".
`\d{5}`		A simple regex to only find a five-digit number.
`[A-Z]+`		This pattern also matches words outside of the row, but that's OK. We will put this all together and only match the row shortly.
`\d{1,2}`		The same is true of this pattern.
`[$]\d{1,3}(,\d{3})?[.]\d{2}`		Again, one of many ways to match currency values.
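
As a rough way to try the column patterns above outside of Grooper, the sketch below runs each one through Python's `re` module against a small sample. The sample text and the column labels (`date`, `id`, and so on) are invented for illustration; they are not taken from the table image, and Python's regex engine is only a stand-in for Grooper's pattern matching.

```python
import re

# Hypothetical sample text: a heading line plus one table row.
# The values are invented for illustration, not read from the article's image.
sample = "ORDERS\n1/15/2020    10234    SMITH    3    $1,250.00"

# The five column patterns from the table above, keyed by made-up labels.
column_patterns = {
    "date": r"\d{1,2}/\d{1,2}/\d{4}",
    "id": r"\d{5}",
    "name": r"[A-Z]+",
    "quantity": r"\d{1,2}",
    "amount": r"[$]\d{1,3}(,\d{3})?[.]\d{2}",
}


def find_all(pattern, text):
    """Return the full text of every match of pattern in text."""
    return [m.group(0) for m in re.finditer(pattern, text)]


matches = {label: find_all(p, sample) for label, p in column_patterns.items()}

for label, found in matches.items():
    print(label, found)
```

In this sample the `name` pattern also picks up the heading word `ORDERS`, and the `quantity` pattern matches fragments of the date, the five-digit number, and the amount. That over-matching is expected when each column pattern is run on its own.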

Now that we can match every cell in the row, we just need to put all that information together and match the whole row. We can enable "tab marking" in a pattern to insert the "\t" control character between segments of text separated by a large amount of whitespace (larger than a normal space character). We can use this tab character to build the regular expression pattern shown below, which matches the entire row.

Regular expression	`\d{1,2}/\d{1,2}/\d{4}\t\d{5}\t\w+\t\d{1,2}\t[$]\d{1,3}(,\d{3})?[.]\d{2}`
Each row matches the single expression
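
The whole-row pattern can be tried the same way in Python. This is a minimal sketch under stated assumptions: `mark_tabs` is a hypothetical helper that imitates tab marking by turning any run of two or more spaces into a single tab (Grooper's actual whitespace threshold is not stated here), and the page text is invented sample data.

```python
import re

# The whole-row pattern from the article, unchanged.
ROW_PATTERN = re.compile(
    r"\d{1,2}/\d{1,2}/\d{4}\t\d{5}\t\w+\t\d{1,2}\t[$]\d{1,3}(,\d{3})?[.]\d{2}"
)


def mark_tabs(text, min_spaces=2):
    """Hypothetical stand-in for tab marking.

    Replaces each run of at least min_spaces spaces with one tab character.
    The threshold of two spaces is an assumption made for this sketch.
    """
    return re.sub(r" {%d,}" % min_spaces, "\t", text)


# Invented sample page: a header line, two data rows, and a total line.
page = (
    "Date         ID       Name     Qty    Amount\n"
    "1/15/2020    10234    SMITH    3      $1,250.00\n"
    "12/3/2019    20871    JONES    12     $87.50\n"
    "Total                                 $1,337.50\n"
)

marked = mark_tabs(page)
rows = [m.group(0) for m in ROW_PATTERN.finditer(marked)]

for row in rows:
    print(repr(row))
```

Only the two data rows satisfy the full pattern. The header and total lines are skipped because they do not contain the date, number, name, number, amount sequence separated by tabs.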

Once you can match each row, the next step is to extract the data in that row corresponding to each column. This can be done in a number of ways, but it is important to be clear about what text you are matching against. Once Grooper finds a row using the Row Extractor, it creates an "instance" of that extracted row. You can think of this as a sub-element within the whole document. You still use regular expressions to match the information within the row, but instead of matching against the entire document, you are only matching against this sub-element (or "instance"). This can trip up users who expect to match against the entire document. The instance contains only the portion of the text data in the extracted row.

A single row matched on the document.

The match seen in the whole document's text data.

The text data in the instance of that matched row.
This is the text data you will use to pattern match each column's value.
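
The difference between matching against the whole document and matching against a row instance can be illustrated with a small Python sketch. It only models the idea described above: each row match is treated as its own piece of text, and a column pattern is then run against that piece alone. The document text is invented, and nothing here reflects how Grooper stores instances internally.

```python
import re

# Invented, already tab-marked document text for illustration.
document_text = (
    "INVOICE REGISTER\n"
    "1/15/2020\t10234\tSMITH\t3\t$1,250.00\n"
    "12/3/2019\t20871\tJONES\t12\t$87.50\n"
)

# The whole-row pattern from the article.
ROW_PATTERN = re.compile(
    r"\d{1,2}/\d{1,2}/\d{4}\t\d{5}\t\w+\t\d{1,2}\t[$]\d{1,3}(,\d{3})?[.]\d{2}"
)

# Each row match plays the role of one "instance": just that row's text.
instances = [m.group(0) for m in ROW_PATTERN.finditer(document_text)]

# The name column pattern from the article.
NAME_PATTERN = re.compile(r"[A-Z]+")

# Against the whole document, the pattern also hits words outside the rows.
names_in_document = NAME_PATTERN.findall(document_text)

# Against each instance, only that row's own name is available to match.
names_per_instance = [NAME_PATTERN.findall(text) for text in instances]

print(names_in_document)
print(names_per_instance)
```

Run against the full text, the name pattern returns the heading words as well as the names. Run against each instance, it returns a single name per row, which is why a loose column pattern is acceptable once the row itself has been isolated.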

Version Differences

Row Match is the original method of extracting table data in Grooper. As such, not much has changed from previous versions. However, two other table extract methods are now available in Grooper: Header-Value and Infer Grid. These methods either provide functionality not available in Row Match or make configuring table extraction much simpler for certain use cases.

Use Cases

The Row Match method is perfectly suited for fairly simple table structures.

@@ Line 13: / Line 13: @@
 |}
-This extractor is made to find the general pattern of each row in a table.  Examine the table below.
+This extractor is made to find the general pattern of each row in a table.  There are many ways to accomplish this.  We will look at a simple example with a fairly simple solution.  Examine the table below.
 [[File:Simpletable.png|center|500px]]
@@ Line 20: / Line 20: @@
 {|style="margin:auto" cellpadding="10" cellspacing="5"
-|'''Regular expression'''||'''Match found on image'''||'''Notes'''
+|'''Regular expression'''||'''Match found as seen on image'''||'''Notes'''
 |-
-|style="width:15%"|<code>\d{1,2}/\d{1,2}/(\d{2}|\d{4})</code>||[[file:row match col1.png|475px]]||This pattern is just one of may ways to find this specific date format.  This pattern would need to be adjusted to find other formats such as "month-day-year".
+|style="width:15%"|<code>\d{1,2}/\d{1,2}/\d{4}</code>||[[file:row match col1.png|475px]]||This pattern is just one of may ways to find this specific date format.  This pattern would need to be adjusted to find other formats such as "month-day-year".
 |-
 |<code>\d{5}</code>||[[file:row match col2.png|475px]]||A simple regex to only find a five digit number.
@@ Line 35: / Line 35: @@
 Now that we can match every cell in the row, we just need to put all that information together and match the whole row.  We can enable "tab marking" in a pattern to insert the "\t" control character between segments of text separated by a large amount of whitespace (larger than a normal space character).  We can use this tab character to make a regular expression pattern seen below to match the entire row.
+{|style="margin:auto" cellpadding="10" cellspacing="5"
+|-
+|'''Regular expression'''|||<code>\d{1,2}/\d{1,2}/\d{4}\t\d{5}\t\w+\t\d{1,2}\t[$]\d{1,3}(,\d{3})?[.]\d{2}</code>
+|-
+|'''Each row matches the single expression'''||[[File:Row match whole table matched.png|center|475px]]
+|}
+Once you are able to match each row, the next part is to extract the data in that row corresponding to each row.  This can be done in a number of ways, but an important thing to point out is what you're trying to match.  Once Grooper finds a row using the Row Extractor, it creates an "instance" of that extracted row.  You can think of this as a sub-element within the whole document.  You still use regular expression to match the information within the row, but instead of matching against the entire document, you are only matching against this sub-element (or "instance").  This can trip users up who are trying to match data against the entire document.  Only the portion of the text data in the extracted row is contained in the instance.
+{|cellpadding="10" cellspacing="5"
+|-valign="top"
+|
+{|
+|+A single row matched on the document.
+|[[file:row match whole table.png]]
+|}
+|
+{|
+|+The match seen in the whole document's text data.
+|[[file:row match whole text2.png|border]]
+|}
+|
+{|
+|+The text data in the instance of that matched row.
+|[[File:Row match instance text.png|frame|This is the text data you will use to pattern match each column's value.]]
+|}
+|}
 == Version Differences ==