2023:Pattern Match (Value Extractor): Difference between revisions
Revision as of 14:11, 2 February 2023
WIP
This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly. This tag will be removed upon draft completion.
Pattern Match is an Extractor Type found in Grooper. This extractor uses regular expressions (regex) for general data extraction.
About
Pattern Match is one of the most commonly used extractors. As its name suggests, it extracts data from a document by matching a regex pattern entered into the Value Pattern.
This extractor is useful when you want to extract text data matching a particular pattern across a document, such as dates or social security numbers. For example, the format MM/DD/YYYY can be matched with the regex pattern: \d{2}/\d{2}/\d{4}.
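To see how the MM/DD/YYYY pattern behaves, here is a minimal sketch that runs it outside of Grooper. Python's `re` module is used only as a stand-in for experimentation; Grooper's own regex engine may differ in details, and the sample text is made up for illustration.

```python
import re

# The MM/DD/YYYY pattern quoted in the article text.
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")

# Hypothetical sample text, invented for this illustration.
sample = "Invoice date: 03/14/2022. Payment due: 04/13/2022."

# findall returns every non-overlapping match in document order.
matches = DATE_PATTERN.findall(sample)
print(matches)  # ['03/14/2022', '04/13/2022']
```

Note that the pattern only checks the shape of the text (two digits, slash, two digits, slash, four digits); it does not verify that the result is a valid calendar date.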
For more information on regex, click the following link: RegexOne
How To
Pattern Match can be configured on both Data Type and Value Reader objects.
Configuring by Object Type
Configuring on a Value Reader
Configuring on a Data Type
Selecting Pattern Match on a Data Type takes a few more steps.
This will bring up the Extractor Editor window.
The Pattern Match extractor can be used on many object types. Any object that has an extractor property can be configured with a Pattern Match.
The configuration process on other objects is the same as for the Value Reader and Data Type objects. Simply select Pattern Match as your extractor type.
Examples where you can use a Pattern Match include:
- A Data Type's Value Extractor property
- A Document Type's Positive Extractor property
- The Labeled Value extractor's Label Extractor property
- The Pattern-Based Separation Provider's Value Extractor property
Regex Examples for Pattern Match
Social Security Numbers (SSN)/Employer Identification Numbers (EIN)
SSNs and EINs are simple to match. First, note which type of number appears on the document: an SSN is structured ###-##-####, and an EIN is ##-#######. Then enter a pattern that matches the structure of the data you wish to extract.
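The two structures translate directly into regex by replacing each # with a digit token. The sketch below tests both patterns in Python as a stand-in for Grooper's regex engine. The word boundaries (`\b`) are an addition not taken from the article; they keep the patterns from matching inside longer digit runs. The sample values are fictitious.

```python
import re

# ###-##-#### for an SSN, ##-####### for an EIN.
# The \b word boundaries are an assumption added for this sketch.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EIN_PATTERN = re.compile(r"\b\d{2}-\d{7}\b")

# Hypothetical sample text with made-up numbers.
sample = "Employee SSN: 123-45-6789  Employer EIN: 12-3456789"

ssns = SSN_PATTERN.findall(sample)
eins = EIN_PATTERN.findall(sample)
print(ssns)  # ['123-45-6789']
print(eins)  # ['12-3456789']
```

Because the hyphens sit in different positions, neither pattern matches the other type of number.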
Dates
Take note of the format of the date(s) on the document. The document here has dates in both the MM/DD/YYYY and MM/DD/YY format. Thus, we will write a regex pattern that will extract both dates.
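One way to cover both formats in a single pattern is to let the year be either four digits or two. The following is a sketch in Python, not necessarily the pattern used in the original walkthrough. The alternation, the word boundaries, and the sample text are assumptions made for illustration.

```python
import re

# Year is four digits or two digits. The four-digit branch is listed first,
# and the trailing \b rejects three-digit years such as 01/01/123.
DATE_PATTERN = re.compile(r"\b\d{2}/\d{2}/(?:\d{4}|\d{2})\b")

# Hypothetical sample text mixing both date formats.
sample = "Issued 01/15/2023, due 02/01/23"

dates = DATE_PATTERN.findall(sample)
print(dates)  # ['01/15/2023', '02/01/23']
```

A looser alternative such as `\d{2}/\d{2}/\d{2,4}` is shorter, but it would also accept a three-digit year.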
Currency
When extracting currency values, first note the range of amounts listed (hundreds of dollars, thousands of dollars) and whether cent amounts and dollar signs are present. If dollar signs are present, escape them with a backslash, \$, because an unescaped dollar sign is a regex anchor that represents the end of a string. Also look for both the largest and smallest amounts on the document, as these determine how many digits the pattern must allow.
For this example, you will build a pattern that will match all currency data listed.
To extract the three remaining instances, look at how they are written: one (1) to three (3) digits after the dollar sign, three (3) digits after the comma, and a cent amount.
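The description above maps onto regex piece by piece. The sketch below, written in Python as a stand-in for Grooper's regex engine, shows a pattern for amounts with a thousands separator and a combined pattern that makes the comma group optional. The combined pattern and the sample amounts are assumptions for illustration, not the values from the original document.

```python
import re

# \$       literal dollar sign (escaped)
# \d{1,3}  one to three digits
# ,\d{3}   a comma followed by three digits
# \.\d{2}  a literal period and a two-digit cent amount
THOUSANDS_PATTERN = re.compile(r"\$\d{1,3},\d{3}\.\d{2}")

# Assumed generalization: the comma group may appear zero or more times.
CURRENCY_PATTERN = re.compile(r"\$\d{1,3}(?:,\d{3})*\.\d{2}")

# Hypothetical sample text with made-up amounts.
sample = "Subtotal $950.00, tax $78.38, total $1,028.38, limit $25,000.00"

thousands = THOUSANDS_PATTERN.findall(sample)
amounts = CURRENCY_PATTERN.findall(sample)
# An unescaped $ anchors to the end of the string, so this finds nothing.
unescaped = re.findall(r"$\d{1,3}", sample)

print(thousands)  # ['$1,028.38', '$25,000.00']
print(amounts)    # ['$950.00', '$78.38', '$1,028.38', '$25,000.00']
print(unescaped)  # []
```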
Prefix and Suffix Patterns
Prefix and Suffix Patterns act as anchors to which you can tether the data you wish to extract. A Prefix Pattern matches what comes immediately before the text matched by your regex pattern; a Suffix Pattern matches what comes immediately after.
For example, let's say that you want to extract data on its own line, like the title of a section. While you can enter just the title, you might get false positives if the word(s) that make up the title appear anywhere else on the document. Thus, your Prefix and Suffix Patterns will be:
Prefix Pattern: [\n\t]
Suffix Pattern: [\r\t]
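The effect of these two patterns can be approximated outside Grooper with a lookbehind and a lookahead, which require the surrounding characters without including them in the match. This is only an emulation in Python; how Grooper evaluates Prefix and Suffix Patterns internally is not described here. The title "Summary" and the sample text, which assumes Windows-style \r\n line endings, are hypothetical.

```python
import re

TITLE = "Summary"  # hypothetical section title

# Lookbehind stands in for the Prefix Pattern [\n\t],
# lookahead stands in for the Suffix Pattern [\r\t].
ANCHORED = re.compile(r"(?<=[\n\t])" + re.escape(TITLE) + r"(?=[\r\t])")

# Hypothetical text: the title appears once inline and once on its own line.
sample = "This Summary mention is inline.\r\nSummary\r\nBody text follows.\r\n"

unanchored_hits = [m.start() for m in re.finditer(re.escape(TITLE), sample)]
anchored_hits = [m.start() for m in ANCHORED.finditer(sample)]

print(len(unanchored_hits))  # 2: the inline mention is a false positive
print(len(anchored_hits))    # 1: only the title on its own line
```

Only the occurrence preceded by a line feed and followed by a carriage return survives, which is the false-positive filtering the Prefix and Suffix Patterns are meant to provide.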
















