2023:Pattern Match (Value Extractor)

From Grooper Wiki
Revision as of 16:44, 20 January 2023 by Dsmith
WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Pattern Match is an Extractor Type found in Grooper. This extractor primarily uses regular expressions (regex) for general data extraction.

About

Pattern Match is one of the most commonly used extractors for general data. As its name suggests, it extracts text from a document that matches a regex pattern entered as the Value Pattern.

This extractor is useful when you want to extract text data matching a particular pattern across a document, such as dates or social security numbers. For example, the format MM/DD/YYYY can be matched with the regex pattern: \d{2}/\d{2}/\d{4}.

For more information on regex, see RegexOne.

How To

Pattern Match can be configured on both Data Type and Value Reader objects.

Configuring by Object Type

Configuring on a Value Reader

Upon creating your Value Reader, you will see three tabs: "Value Reader", "Tester", and "Advanced". To use Pattern Match, select the "Value Reader" tab, click the icon at the far right of the Extractor property, and choose Pattern Match from the drop-down menu. Then click the "Tester" tab and, in the "Value Pattern" box, enter the literal text or the regex pattern for the text you wish to extract.

Configuring on a Data Type

Selecting Pattern Match on a Data Type takes a few more steps.

  1. Create your Data Type.
  2. Select the ellipsis icon to the far right of Local Extractor.
  3. Select Pattern Match from the drop-down menu.
  4. Select the ellipsis to the far right of Local Extractor again.

This will bring up the Extractor Editor window.

  1. Enter a pattern for the text you would like to extract.
  2. Once you've entered your pattern and are satisfied with the results, click "OK".

Extracting Data

Dates

With the extractor set up, the pattern you enter depends on the data you want. Dates, for example, come in a variety of formats, but for the sake of simplicity, let's say that you have a date written in the format MM/DD/YYYY. The expression best suited to extracting that date is \d{2}/\d{2}/\d{4}. For MM/DD/YY, use \d{2}/\d{2}/\d{2}. (TIP: \d and [0-9] both match digits. \d is preferable for dates because it keeps the regex shorter.) For dates written as MM-DD-YY or MM-DD-YYYY, substitute hyphens (-) for the forward slashes (/).
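The date patterns above can be tried outside Grooper. The following is a minimal sketch that uses Python's re module as a stand-in for Grooper's regex engine (an assumption; the two engines may differ in details), and the sample sentence is invented for illustration.

```python
import re

# Patterns taken from the text; the hyphen variant swaps "/" for "-".
SLASH_YYYY = r"\d{2}/\d{2}/\d{4}"
SLASH_YY = r"\d{2}/\d{2}/\d{2}"
HYPHEN_YYYY = r"\d{2}-\d{2}-\d{4}"

# Hypothetical sample text, not from a real document.
sample = "Issued 01/20/2023, due 02-19-2023, received 03/05/23."

four_digit_year = re.findall(SLASH_YYYY, sample)
hyphenated = re.findall(HYPHEN_YYYY, sample)
two_digit_year = re.findall(SLASH_YY, sample)

print(four_digit_year)  # ['01/20/2023']
print(hyphenated)       # ['02-19-2023']
print(two_digit_year)   # ['01/20/20', '03/05/23']
```

Note that the two-digit-year pattern also matches the first eight characters of 01/20/2023, so when both formats appear in the same document the MM/DD/YY results may need to be checked for partial hits.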

Currency

Capturing currency can appear trickier than it actually is. For a single currency value, such as an invoice total of $153.25, the regex pattern to use would be: [$]\d{3}[.]\d{2} (TIP: The dollar sign ($) and period (.) are regex metacharacters. $ anchors a match to the end of a string or line, and . is a wildcard that matches any single character. To match a literal dollar sign or period, either enclose the symbol in square brackets ([$], [.]) or precede it with a backslash (\$, \.).)
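To see why the dollar sign and period need escaping, here is a small sketch in Python's re module (used as a stand-in for Grooper's regex engine, which is an assumption). The sample text and variable names are hypothetical.

```python
import re

# Hypothetical sample text containing one real amount and one look-alike.
text = "Invoice total: $153.25 (ref 153x25)"

bracketed = re.findall(r"[$]\d{3}[.]\d{2}", text)      # literals via brackets
backslashed = re.findall(r"\$\d{3}\.\d{2}", text)      # literals via backslash
wildcard_dot = re.findall(r"\d{3}.\d{2}", text)        # "." matches any character
end_anchor = re.findall(r"$\d{3}[.]\d{2}", text)       # "$" means end of string

print(bracketed)     # ['$153.25']
print(backslashed)   # ['$153.25']
print(wildcard_dot)  # ['153.25', '153x25']
print(end_anchor)    # []
```

The bracketed and backslashed forms behave identically here; the unescaped period picks up the look-alike reference, and the unescaped dollar sign can never be followed by digits, so it returns nothing.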

What about multiple pieces of currency data? What if you have a table of totals whose amounts range from $9.99 all the way to $999,999.99? The aforementioned regex pattern is still usable—it just requires a little tweaking.

Suppose the total amounts are as follows: $9.99, $99.99, $999.99, $9,999.99, $99,999.99, $999,999.99. Taken as is, [$]\d{3}[.]\d{2} would only match $999.99. To match every amount listed, the proper pattern is [$](\d{1,3},)?\d{1,3}[.]\d{2}. \d{1,3} matches one to three digits, so the final \d{1,3} covers the dollar digits immediately before the decimal point: anything from below ten dollars to nearly a thousand dollars. As for (\d{1,3},)?, it's important to note the parentheses and question mark. The parentheses group the leading \d{1,3} and the comma (,) into a Capture Group, so that the question mark applies to the whole group rather than to the comma alone (TIP: For more information on Capture Groups, click here:). The question mark (?) is a quantifier that makes the group optional: it tells Grooper that there will be zero or one instances of a thousands group, that is, digits in the thousands, ten-thousands, or hundred-thousands place followed by a comma. Since every total contains either zero or one such group, all total amounts are returned.

The question mark is the most important part of [$](\d{1,3},)?\d{1,3}[.]\d{2}. Without it, the thousands group would be required, and the only amounts returned would be $9,999.99, $99,999.99, and $999,999.99.
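The effect of the optional group can be checked against the six amounts listed above. This is a sketch in Python's re module, assumed here to behave like Grooper's regex engine for these simple patterns; the helper name full_matches is hypothetical.

```python
import re

amounts = ["$9.99", "$99.99", "$999.99", "$9,999.99", "$99,999.99", "$999,999.99"]

FIXED = r"[$]\d{3}[.]\d{2}"                        # exactly three dollar digits
OPTIONAL_GROUP = r"[$](\d{1,3},)?\d{1,3}[.]\d{2}"  # thousands group is optional
REQUIRED_GROUP = r"[$](\d{1,3},)\d{1,3}[.]\d{2}"   # same pattern without the "?"


def full_matches(pattern, values):
    """Return the values that the pattern matches from start to end."""
    return [v for v in values if re.fullmatch(pattern, v)]


fixed_hits = full_matches(FIXED, amounts)
optional_hits = full_matches(OPTIONAL_GROUP, amounts)
required_hits = full_matches(REQUIRED_GROUP, amounts)

print(fixed_hits)     # ['$999.99']
print(optional_hits)  # all six amounts
print(required_hits)  # ['$9,999.99', '$99,999.99', '$999,999.99']
```

re.fullmatch is used instead of re.findall because, in Python, findall returns the contents of the capture group rather than the whole match when a pattern contains one group.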

Social Security Numbers (SSN)

Social Security Numbers are similar to dates. Technically, they are simpler, as SSNs don't have as many variations (if any) as dates do. SSNs consist of nine digits, formatted ###-##-####. Thus, the regex used in Pattern Match will be:

\d{3}[-]\d{2}[-]\d{4}
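A quick check of the SSN pattern, again sketched with Python's re module as an assumed stand-in for Grooper's regex engine. The sample strings are invented, and the \b word boundaries in the second pattern are an optional tightening that is not part of the pattern above.

```python
import re

SSN = r"\d{3}[-]\d{2}[-]\d{4}"          # pattern from the text
SSN_BOUNDED = r"\b" + SSN + r"\b"       # assumption: word boundaries added

# Hypothetical sample text.
text = "Employee SSN: 123-45-6789; phone: 555-867-5309."
longer_number = "Account 9123-45-67890"

found = re.findall(SSN, text)
inside_longer = re.findall(SSN, longer_number)
bounded = re.findall(SSN_BOUNDED, longer_number)

print(found)          # ['123-45-6789']
print(inside_longer)  # ['123-45-6789']
print(bounded)        # []
```

The phone number is skipped because its middle group has three digits, but the plain pattern still finds an SSN-shaped run inside a longer number, which is the case the boundary variant is meant to exclude.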

Prefix and Suffix Patterns

Prefix and Suffix Patterns act as anchors to which you can tether the data you wish to extract. As one would expect, a Prefix Pattern matches what comes before your chosen matched data, and a Suffix Pattern matches what comes after. To accept any one of several characters (\n or \t, for example), enclose them in square brackets ([]).

For example, let's say that you want to extract data on its own line, like the title of a section. While you can enter just the title, you might get false positives if the word(s) that make up the title are used anywhere else on the document. Thus, your prefix and suffix patterns will be:

Prefix Pattern: [\n\t]

Suffix Pattern: [\r\t]
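The anchoring idea can be approximated outside Grooper with regex lookarounds. The sketch below assumes that a Prefix or Suffix Pattern must match next to the value but is not included in the returned text; that reading, the lookaround emulation, and the sample document are all assumptions rather than a description of Grooper's internals.

```python
import re

# Hypothetical document: "Summary" appears mid-sentence and as a title line.
document = "Intro mentions Summary in passing.\r\nSummary\r\nBody text follows.\r\n"

VALUE = r"Summary"
PREFIX = r"[\n\t]"   # newline or tab immediately before the value
SUFFIX = r"[\r\t]"   # carriage return or tab immediately after the value

# Lookbehind and lookahead require the neighbors without capturing them.
ANCHORED = rf"(?<={PREFIX}){VALUE}(?={SUFFIX})"

unanchored_hits = re.findall(VALUE, document)
anchored_hits = re.findall(ANCHORED, document)

print(unanchored_hits)  # ['Summary', 'Summary']
print(anchored_hits)    # ['Summary']
```

Only the occurrence that sits on its own line survives the anchored pattern, which is the false-positive reduction the prefix and suffix are meant to provide.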

See Also: