2023:Pattern Match (Value Extractor)

Revision as of 17:04, 26 January 2023

WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Pattern Match is an Extractor Type found in Grooper. This extractor primarily uses regular expressions (regex) for general data extraction.

About

Pattern Match is one of the most commonly used extractors for general data. As its name suggests, it extracts text from a document that matches the regex pattern entered as the Value Pattern.

This extractor is useful when you want to extract text data matching a particular pattern across a document, such as dates or social security numbers. For example, the format MM/DD/YYYY can be matched with the regex pattern: \d{2}/\d{2}/\d{4}.

For more information on regex, click the following link: RegexOne

How To

Pattern Match can be configured on both Data Type and Value Reader objects.

Configuring by Object Type


Configuring on a Value Reader

Create or select your Value Reader. Note the three tabs, "Value Reader", "Tester", and "Advanced". Select the "Value Reader" tab. Select the drop-down icon to the far right of the Extractor property.
On the drop-down menu, select Pattern Match.
Click the "Tester" tab. In the Value Pattern box, enter the regex pattern for the text you wish to extract. Matched data will be highlighted in green and show up in the "Values" panel beneath the Document Viewer.

Configuring on a Data Type

Selecting Pattern Match on a Data Type takes a few more steps.

Create or select your Data Type.
Select the drop-down icon to the far right of Local Extractor.

Select Pattern Match from the dropdown menu.
Select the ellipsis to the far right of the Local Extractor.

This will bring up the Extractor Editor window.

Enter a pattern for the text you would like to extract.
Once you've entered your pattern and are satisfied with the results, click "OK".
Just like with the Value Reader, matched data will be highlighted in green and appear in the "Values" panel beneath the Document Viewer.

Extracting Data


Dates

Take note of the format of the date(s) on the document.
- This document has a date formatted MM/DD/YYYY.
Enter the regex pattern to extract the date.
For this document, the regex pattern will be:
- \d{2}/\d{2}/\d{4}
- For the MM/DD/YY format, the pattern will instead be \d{2}/\d{2}/\d{2}. For MM-DD-YY/MM-DD-YYYY formats, simply substitute the forward slashes (/) with hyphens (-):
- \d{2}-\d{2}-\d{2}; \d{2}-\d{2}-\d{4}
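
The date patterns above can be tried outside Grooper as a quick sanity check. The following is a minimal sketch that uses Python's `re` module as a stand-in for Grooper's regex engine (an assumption; the two engines may differ in details). The sample text is hypothetical and is not the document shown in the screenshots.

```python
import re

# Hypothetical sample text, invented for illustration only.
sample = "Invoice date: 01/26/2023. Due 02-25-2023. Shipped 01/27/23."

# MM/DD/YYYY
slash_yyyy = re.findall(r"\d{2}/\d{2}/\d{4}", sample)

# MM-DD-YYYY (forward slashes swapped for hyphens)
hyphen_yyyy = re.findall(r"\d{2}-\d{2}-\d{4}", sample)

# MM/DD/YY: note this also matches the first eight characters
# of any MM/DD/YYYY date in the same text.
slash_yy = re.findall(r"\d{2}/\d{2}/\d{2}", sample)

print(slash_yyyy)
print(hyphen_yyyy)
print(slash_yy)
```

In this sample, the two-digit-year pattern also picks up `01/26/20` from the four-digit date, so it is worth checking which date formats actually occur on a document before choosing a pattern.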

Currency

For this example, the pattern provided will match all currency data listed.

Note the amount listed, as well as any cent amounts and dollar signs.
If dollar signs are given, escape them by preceding them with a backslash: `\$`.
- `$` by itself represents the end of a string in regex.
When writing the pattern for currencies, look for the largest and smallest amounts provided, as these determine the digit ranges the pattern must cover.
To extract the first three data instances, enter the following pattern:
- `\$\d{1,3}\.\d{2}`
To extract the remaining three data instances, you will need to account for the thousands, ten-thousands, and hundred-thousands places. The pattern for those digits is the same as it was for the ones/tens/hundreds places: `\d{1,3}`
Your regex pattern should look like this:
- `\$\d{1,3},\d{1,3}\.\d{2}`
- Note that while the last three pieces of data are matched, the first three are no longer being picked up.
Thus, to return all data, add parentheses around `\d{1,3},` and follow them with a question mark: `\$(\d{1,3},)?\d{1,3}\.\d{2}`
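
The three currency patterns can be compared side by side. This is a minimal sketch using Python's `re` module as a stand-in for Grooper's regex engine (an assumption). The six amounts are hypothetical placeholders for the values in the screenshots, and `matches` is a helper introduced here for illustration.

```python
import re

# Hypothetical amounts: three below 1,000 and three with a thousands group.
sample = "$5.00 $45.50 $125.75 $1,250.00 $12,500.25 $125,000.99"


def matches(pattern: str, text: str) -> list[str]:
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in re.finditer(pattern, text)]


small_only = matches(r"\$\d{1,3}\.\d{2}", sample)            # no thousands group
large_only = matches(r"\$\d{1,3},\d{1,3}\.\d{2}", sample)    # thousands group required
combined = matches(r"\$(\d{1,3},)?\d{1,3}\.\d{2}", sample)   # thousands group optional

print(small_only)
print(large_only)
print(combined)
```

The helper uses `re.finditer` and `group(0)` because Python's `re.findall` returns only the captured group when a pattern contains parentheses, which would hide the full matched amount.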

Social Security Numbers (SSN)/Employer Identification Numbers (EIN)

Note the format of the SSN/EIN.
- SSNs are formatted ###-##-####; EINs are formatted ##-#######.
Enter the pattern that will match the data you wish to extract.
SSNs will be \d{3}[-]\d{2}[-]\d{4}; EINs will be \d{2}[-]\d{7}.
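
As a quick check of the two patterns, here is a minimal sketch using Python's `re` module in place of Grooper's regex engine (an assumption). The numbers in the sample text are made up for illustration.

```python
import re

# Hypothetical sample values; not real identifiers.
sample = "SSN: 123-45-6789  EIN: 12-3456789"

ssn = re.findall(r"\d{3}[-]\d{2}[-]\d{4}", sample)
ein = re.findall(r"\d{2}[-]\d{7}", sample)

print(ssn)
print(ein)
```

Because the digit counts on each side of the hyphens differ, neither pattern matches the other identifier in this sample.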

Prefix and Suffix Patterns

Prefix and Suffix Patterns act as anchors to which you can tether the data you wish to extract. As one would expect, a Prefix Pattern matches what comes before your chosen matched data, while a Suffix Pattern matches what comes after. To accept any one of several characters (\n or \t, for example), enclose them in square brackets ([]).

For example, let's say that you want to extract data on its own line, like the title of a section. While you can enter just the title, you might get false positives if the word(s) that make up the title are used anywhere else on the document. Thus, your prefix and suffix patterns will be:

Prefix Pattern: [\n\t]

Suffix Pattern: [\r\t]
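
The effect of these anchors can be approximated outside Grooper. The sketch below uses Python's `re` module and emulates the Prefix and Suffix Patterns with a lookbehind and a lookahead. That mapping is an assumption made for illustration, not a description of how Grooper evaluates these properties. The sample text and the title "Summary" are hypothetical.

```python
import re

# Hypothetical text: "Summary" appears once as a title on its own line
# (ending in \r\n) and once in the middle of a sentence.
text = "Intro\nSummary\r\nThis Summary word appears mid-sentence.\n"

value_pattern = r"Summary"
prefix_pattern = r"[\n\t]"
suffix_pattern = r"[\r\t]"

# Without anchors, both occurrences match.
unanchored = [m.start() for m in re.finditer(value_pattern, text)]

# Assumed emulation: the prefix must appear immediately before the value
# and the suffix immediately after, but neither is part of the result.
anchored_regex = f"(?<={prefix_pattern}){value_pattern}(?={suffix_pattern})"
anchored = [m.start() for m in re.finditer(anchored_regex, text)]

print(len(unanchored))
print(anchored)
```

Only the occurrence on its own line survives the anchored version, which is the false-positive filtering described above.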
