2023:Pattern Match (Value Extractor): Difference between revisions

Revision as of 16:02, 1 February 2023

WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Pattern Match is an Extractor Type found in Grooper. This extractor uses regular expressions (regex) for general data extraction.

About

Pattern Match is one of the most commonly used extractors. As its name suggests, it extracts data from a document by matching a regex pattern entered into the Value Pattern property.

This extractor is useful when you want to extract text data matching a particular pattern across a document, such as dates or social security numbers. For example, the format MM/DD/YYYY can be matched with the regex pattern: \d{2}/\d{2}/\d{4}.
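As a rough illustration outside of Grooper, the same date pattern can be tried in Python's `re` module. The sample text is made up for this sketch, and Python's regex engine is only a stand-in for Grooper's own.

```python
import re

# Hypothetical sample text; the pattern is the MM/DD/YYYY pattern from above.
sample = "Invoice dated 02/01/2023, due 03/03/2023."
date_pattern = re.compile(r"\d{2}/\d{2}/\d{4}")

# findall returns every non-overlapping match in reading order.
dates = date_pattern.findall(sample)
print(dates)
```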

For more information on regex, see RegexOne.

How To

Pattern Match can be configured on both Data Type and Value Reader objects.

Configuring by Object Type


Configuring on a Value Reader

Create or select your Value Reader. Note the three tabs: "Value Reader", "Tester", and "Advanced". Select the "Value Reader" tab. Select the drop-down icon to the far right of the Extractor property.
On the drop-down menu, select Pattern Match.
Click the "Tester" tab. In the Value Pattern box, enter the regex pattern for the text you wish to extract. Matched data will be highlighted in green and show up in the "Values" panel beneath the Document Viewer.

Configuring on a Data Type

Selecting Pattern Match on a Data Type takes a few more steps.

Create or select your Data Type. Select the drop-down icon to the far right of Local Extractor.
Select Pattern Match from the dropdown menu.
Select the ellipsis to the far right of the Local Extractor.
This will bring up the Extractor Editor window. Enter a pattern for the text you would like to extract. Just like with the Value Reader, matched data will be highlighted in green and appear in the "Values" panel beneath the Document Viewer. Once you've entered your pattern and are satisfied with the results, click "OK".

Regex Examples for Pattern Match


Dates

Take note of the format of the date(s) on the document. The document here has dates in both the MM/DD/YYYY and MM/DD/YY format. Thus, we will write a regex pattern that will extract both dates.

First, enter
- \d{2}/\d{2}/\d{4}
Notice that only the first date was returned.

Now try:
- \d{2}/\d{2}/\d{2}
This pattern picks up both dates, but the last two digits of the four-digit year in the first date aren't returned. So, this regex pattern won't work either.

So, how are we going to return both dates completely? Keep in mind that you can dictate a range of values within the curly braces. Hence:
- \d{2}/\d{2}/\d{2,4}
  - \d{2,4} tells Grooper to look for anywhere from two to four digits for the year. Since YY and YYYY fall within the range set, the regex pattern will extract them.
Notice that both dates are now being returned in full.
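The three attempts above can be compared side by side in a short Python sketch. The sample sentence is hypothetical (one MM/DD/YYYY date and one MM/DD/YY date), and Python's `re` module stands in for Grooper's regex engine.

```python
import re

# Hypothetical text with one four-digit-year date and one two-digit-year date.
sample = "Issued 01/15/2023, revised 02/20/23."

patterns = [
    r"\d{2}/\d{2}/\d{4}",    # four-digit years only
    r"\d{2}/\d{2}/\d{2}",    # stops after two year digits
    r"\d{2}/\d{2}/\d{2,4}",  # two to four year digits
]

# Map each pattern to the list of matches it returns.
results = {p: re.findall(p, sample) for p in patterns}
for pattern, found in results.items():
    print(pattern, found)
```

Only the last pattern returns both dates in full, because `{2,4}` is greedy and takes as many year digits as are available, up to four.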

Currency

When writing patterns for currency, first note how the amounts are written: hundreds or thousands of dollars, cent amounts, and dollar signs. If dollar signs are present, precede them with a backslash, \$, because an unescaped dollar sign is an anchor that matches the end of a string or line in regex. Also look for both the largest and smallest amounts provided, as these determine the ranges for the digit quantifiers.

For this example, the pattern will match all currency data listed.

To extract the first three data instances, enter the following pattern:
- \$\d{1,3}\.\d{2}

To extract the three remaining instances, look at the way they're written: anywhere from one (1) to three (3) digits after the dollar sign, a comma followed by three (3) digits, and then the cent amount.

Thus, your regex pattern should look like this:
- \$\d{1,3},\d{3}\.\d{2}
- Note that while the last three pieces of data are matched, the first three are no longer being picked up.

Thus, to return all data, add parentheses around ,\d{3} and follow them with a question mark, which makes the group optional:
- \$\d{1,3}(,\d{3})?\.\d{2}
  - By enclosing ,\d{3} in parentheses, you've created a Capture Group. For more information on Capture Groups, see RegexOne.
- If you're unsure of how large your currency amounts will be, you can replace the question mark ? with a star * character, which lets the group repeat zero or more times.
  - \$\d{1,3}(,\d{3})*\.\d{2}
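The four currency patterns can be checked against a made-up list of amounts in Python. This is a sketch under assumptions: the sample values are hypothetical, `full_matches` is a helper named only for this example, and Python's `re` module stands in for Grooper's engine. The helper uses `finditer` because `re.findall` returns only the group's contents once a pattern contains a capture group.

```python
import re

# Hypothetical amounts: three without commas, three with one comma, one with two.
sample = "$5.00 $45.99 $123.45 $1,234.56 $12,345.67 $123,456.78 $1,234,567.89"

def full_matches(pattern, text):
    """Return the complete text of each match, ignoring capture groups."""
    return [m.group(0) for m in re.finditer(pattern, text)]

small = full_matches(r"\$\d{1,3}\.\d{2}", sample)              # no comma allowed
large = full_matches(r"\$\d{1,3},\d{3}\.\d{2}", sample)        # exactly one comma group
optional = full_matches(r"\$\d{1,3}(,\d{3})?\.\d{2}", sample)  # zero or one comma group
repeated = full_matches(r"\$\d{1,3}(,\d{3})*\.\d{2}", sample)  # any number of comma groups

for name, found in [("small", small), ("large", large),
                    ("optional", optional), ("repeated", repeated)]:
    print(name, found)
```

In this sample, only the star version picks up the amount with two comma groups.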

Social Security Numbers (SSN)/Employer Identification Numbers (EIN)

SSNs and EINs are simple. As usual, note the format of the number used. An SSN is structured ###-##-####, and an EIN is ##-#######. Simply enter the pattern of the data you wish to extract.

SSN: `\d{3}-\d{2}-\d{4}`
EIN: `\d{2}-\d{7}`
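A quick Python check of both patterns, using invented numbers and Python's `re` module in place of Grooper's engine:

```python
import re

# Hypothetical sample text with one SSN-style and one EIN-style number.
sample = "Employee SSN: 123-45-6789. Employer EIN: 12-3456789."

ssn_pattern = r"\d{3}-\d{2}-\d{4}"
ein_pattern = r"\d{2}-\d{7}"

ssns = re.findall(ssn_pattern, sample)
eins = re.findall(ein_pattern, sample)
print(ssns, eins)

# Without boundaries, the SSN pattern also matches inside a longer digit run.
embedded = re.findall(ssn_pattern, "9123-45-67890")
print(embedded)
```

The last line shows why these patterns can return false positives when the number is embedded in a longer run of digits; anchoring the match to its surroundings is one way to guard against that.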


Prefix and Suffix Patterns

Prefix and Suffix Patterns act as anchors to which you can tether the data you wish to extract. As one would expect, a Prefix Pattern matches what comes before the text matched by your regex pattern, while a Suffix Pattern matches what comes after it.

For example, let's say that you want to extract data on its own line, like the title of a section. While you can enter just the title, you might get false positives if the word(s) that make up the title appear anywhere else on the document. Thus, your Prefix and Suffix Patterns will be:

Prefix Pattern: [\n\t]

Suffix Pattern: [\r\t]
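The effect of these anchors can be approximated in Python with a lookbehind and a lookahead. This is only a sketch under assumptions: it does not show how Grooper evaluates Prefix and Suffix Patterns internally, the sample text and the title "Summary" are made up, and the text is assumed to use `\r\n` line endings.

```python
import re

# Hypothetical document text with \r\n line endings; "Summary" appears once
# inside a sentence and once on its own line as a section title.
sample = "The Summary is discussed below.\r\nSummary\r\nFirst line of the section."

# Unanchored: matches the word wherever it appears.
unanchored = re.findall(r"Summary", sample)

# Approximation of Prefix [\n\t] and Suffix [\r\t]: the match must be
# preceded by a newline or tab and followed by a carriage return or tab.
anchored = re.findall(r"(?<=[\n\t])Summary(?=[\r\t])", sample)

print(len(unanchored), len(anchored))
```

The unanchored pattern returns both occurrences, while the anchored version returns only the title on its own line.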
