2023:Pattern Match (Value Extractor): Difference between revisions

From Grooper Wiki
m Dgreenwood moved page 2023:Pattern Match (Extractor Type) to 2023:Pattern Match (Value Extractor) without leaving a redirect
 
(53 intermediate revisions by 3 users not shown)
Line 1: Line 1:
{|cellpadding=10 cellspacing=5 style="margin:12px"
{{AutoVersion}}
|-style="background-color:#ed2330; color:white"
 
|style="font-size:14pt"|'''WIP'''
<blockquote>{{#lst:Glossary|Pattern Match}}</blockquote>
 
{|class="download-box"
|
[[File:Asset 22@4x.png]]
|
|
This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.
You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains one or more '''Batches''' of sample documents. The second contains one or more '''Projects''' with resources used in examples throughout this article.
 
* [[Media:2023_Wiki_Pattern-Match_Batch.zip]]
This tag will be removed upon draft completion.
* [[Media:2023_Wiki_Pattern-Match_Project.zip]]
|}
|}
<blockquote>'''''Pattern Match''''' is an '''''Extractor Type''''' found in Grooper. This extractor primarily uses regular expression (regex) for general data extraction.</blockquote>


== About ==
== About ==
'''''Pattern Match''''' is one of the most commonly used extractors for general data. As per its name, it extracts data from a document matching a regex pattern entered into the '''''Value Pattern'''''.
'''''Pattern Match''''' is one of the most commonly used extractors. As per its name, it extracts data from a document matching a regex pattern entered into the '''''Value Pattern'''''.


This extractor is useful when you want to extract text data matching a particular pattern across a document, such as dates or social security numbers. For example, the format MM/DD/YYYY can be matched with the regex pattern: <code>\d{2}/\d{2}/\d{4}</code>.
This extractor is useful when you want to extract text data matching a particular pattern across a document, such as dates or social security numbers. For example, the format MM/DD/YYYY can be matched with the regex pattern: <code>\d{2}/\d{2}/\d{4}</code>.
Line 19: Line 21:
== How To ==
== How To ==


'''''Pattern Match''''' can be configured on both ''Data Type'' and ''Value Reader'' objects.
'''''Pattern Match''''' can be configured on both '''Data Type''' and '''Value Reader''' objects.


=== Configuring by Object Type ===
=== Configuring by Object Type ===
Line 28: Line 30:
|valign=top style="width:40%"|
|valign=top style="width:40%"|
# Create or select your '''Value Reader'''.  
# Create or select your '''Value Reader'''.  
#* Note the three tabs, "Value Reader", "Tester", and "Advanced".  
#* Note the three tabs: "Value Reader", "Tester", and "Advanced".  
# Select the "Value Reader" tab.  
# Select the "Value Reader" tab.  
# Select the drop-down icon to the far right of the '''''Extractor''''' property.
# Select the drop-down icon to the far right of the '''''Extractor''''' property.
Line 37: Line 39:
#<li value=4> On the drop-down menu, select '''''Pattern Match'''''.  
#<li value=4> On the drop-down menu, select '''''Pattern Match'''''.  
|
|
[[File:2023 Pattern Match Value Reader Step 4(1) Copy.png]]
[[File:2023 Pattern Match Value Reader Steps 4 Copy.png]]
|-
|-
|valign=top style="width:40%"|
|valign=top style="width:40%"|
#<li value=5> Click the "Tester" tab.
#<li value=5> Click the "Tester" tab.
# In the '''''Value Pattern''''' box, enter the regex pattern for the text you wish to extract.
# In the '''''Value Pattern''''' box, enter the regex pattern for the text you wish to extract.
# Matched data will be highlighted in green and show up in the "Values" panel beneath the Document Viewer.
|
|
[[File:2023 Pattern Match Value Reader Steps 5-6 Copy.png]]
[[File:2023 Pattern Match Value Reader Parts 7 Copy 2.png]]
|-
|-
|valign=top style="width:40%"|
|valign=top style ="width:40%"|
#<li value=7> Matched data will be highlighted in green and show up in the "Values" panel beneath the Document Viewer.
#<li value=8> Save changes.
|
|
[[File:2023 Pattern Match Value Reader Parts 7 Copy 2.png]]
[[File:2023 Pattern Match Value Reader Step 8 Copy.png]]
|-
|
|
|}
|}
Line 68: Line 72:
|valign=top style="width:40%"|
|valign=top style="width:40%"|
#<li value=3> Select ''Pattern Match'' from the dropdown menu.   
#<li value=3> Select ''Pattern Match'' from the dropdown menu.   
# Select the ellipses to the far right of the '''''Local Extractor'''''.
|
|
[[File:2023 Pattern Match Data Type Steps 3 and 4 Copy.png]]
[[File:2023 Pattern Match Data Type Step 3 Copy.png]]
|-
|valign=top style="width:40%"|
#<li value=4> Select the ellipses to the far right of the '''''Local Extractor'''''.
|
[[File:2023 Pattern Match Data Type Step 4 (1) Copy.png]]
|-
|-
|valign=top style="width:40%"|
|valign=top style="width:40%"|
This will bring up the Extractor Editor window.
This will bring up the Extractor Editor window.
#<li value=5>  Enter a pattern for the text you would like to extract.
#<li value=5>  Enter a pattern for the text you would like to extract.
#Just like with the '''Value Reader''', matched data will be highlighted in green and appear in the "Values" panel beneath the Document Viewer.
# Once you've entered your pattern, and are satisfied with the results, click "OK".
# Once you've entered your pattern, and are satisfied with the results, click "OK".
|
|
[[File:2023 Pattern Match Data Type Steps 5 and 6 Copy.png]]
[[File:2023 Pattern Match Data Type Steps 5 through 7 Copy.png]]
|-
|-
|valign=top style ="width:40%"|
|valign=top style ="width:40%"|
#<li value=7> Just like with the '''Value Reader''', matched data will be highlighted in green and appear in the "Values" panel beneath the Document Viewer.
#<li value=8> Save changes.
|
|
[[File:2023 Pattern Match Data Type Step 7 Copy.png]]
[[File: 2023 Pattern Match Data Type Step 8 Copy.png]]
|-
|
|
</tab>
</tab>
<tab name="Configuring on Other Object Types" style="margin:20px">
The '''''Pattern Match''''' extractor can be used on a multitude of object types. Any object that has an extractor property can be configured with a '''''Pattern Match'''''.
The configuration process on other objects is identical to both the '''Value Reader''' and '''Data Type''' objects. Simply select a '''''Pattern Match''''' as your extractor.
Examples where you can use a '''''Pattern Match''''' include:
*A '''Data Type''''s '''''Value Extractor''''' property
*A '''Document Type''''s '''''Positive Extractor''''' property
*The '''''Labeled Value''''' extractor's '''''Label Extractor''''' and '''''Value Extractor''''' properties
*The '''''Pattern-Based Separation Provider''''''s '''''Value Extractor''''' property
</tab>
[[#Configuring by Object Type|Click here to return to the top of the section]]
</tabs>
</tabs>


=== Regex Examples for Pattern Match ===
=== Regex Examples for Pattern Match ===
<tabs style="margin:20px">
<tabs style="margin:20px">
<tab name="Social Security Numbers/Employer Identification Numbers" style="margin:20px">
==== Social Security Numbers (SSN)/Employer Identification Numbers (EIN) ====
SSNs and EINs are simple. As usual, note the type of number used. An SSN is structured ###-##-####, and an EIN is ##-#######. Simply enter the pattern of the data you wish to extract.
{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
# SSN:
#* <code>\d{3}-\d{2}-\d{4}</code>
|
[[File:2023 Pattern Match SSN EIN Match Step 1 Copy.png]]
|-
|valign=top style="width:40%"|
#<li value=2>EINs will be:
#*<code>\d{2}-\d{7}</code>
|
[[File:2023 Pattern Match SSN EIN Match Step 2 Copy.png]]
|}
</tab>
<tab name="Dates" style="margin:20px">
<tab name="Dates" style="margin:20px">
==== Dates ====
==== Dates ====
Line 94: Line 137:
|valign=top style="width:40%"|
|valign=top style="width:40%"|
Take note of the format of the date(s) on the document. The document here has dates in both the MM/DD/YYYY and MM/DD/YY format. Thus, we will write a regex pattern that will extract both dates.
Take note of the format of the date(s) on the document. The document here has dates in both the MM/DD/YYYY and MM/DD/YY format. Thus, we will write a regex pattern that will extract both dates.
# Enter the regex pattern to extract the date.
# First, enter  
# First, enter  
#*<code>\d{2}/\d{2}/\d{4}  </code>
#*<code>\d{2}/\d{2}/\d{4}  </code>
# Notice that only the first date was returned.
# Notice that only the first date was returned.
|
|
[[File:2023 Pattern Match Date Match Steps 2 and 3 Copy.png]]
[[File:2023 Pattern Match Date Match Steps 1 and 2 Two Arrows (1) Copy.png]]
|-
|-
|valign=top style="width:40%"|
|valign=top style="width:40%"|
#<li value=4> Now try:
#<li value=3> Now try:
#*<code>\d{2}/\d{2}/\d{2}</code>
#*<code>\d{2}/\d{2}/\d{2}</code>
# This pattern partially matches both dates, but the last two digits of the year in the first date aren't returned. So, this regex pattern won't work either.
# This pattern partially matches both dates, but the last two digits of the year in the first date aren't returned. So, this regex pattern won't work either.
# So, how are we going to return both dates completely? Keep in mind that you can dictate a range of values within the curly braces. Hence:
|
[[File:2023 Pattern Match Date Match Steps 3 and 4 Two Arrows Copy.png]]
|-
|valign=top style="width:40%"|
#<li value=5> So, how are we going to return both dates completely? Keep in mind that you can dictate a range of values within the curly braces. Hence:
#*<code>\d{2}/\d{2}/\d{2,4}</code>
#*<code>\d{2}/\d{2}/\d{2,4}</code>
#** <code>\d{2,4}</code> tells Grooper to look for anywhere from two to four digits for the year. Since YY and YYYY fall within the range set, the regex pattern will extract them.
#** <code>\d{2,4}</code> tells Grooper to look for anywhere from two to four digits for the year. Since YY and YYYY fall within the range set, the regex pattern will extract them.
# Notice that both dates are now being returned in full.
# Notice that both dates are now being returned in full.
|
|
[[File:2023 Pattern Match Extracting Data Dates Screenshot Copy.png]]
[[File:2023 Pattern Match Date Match Steps 5 and 6 Two Arrows Copy.png]]
|
|
|}
|}
Line 118: Line 164:
<tab name = "Currency" style="margin:20px">
<tab name = "Currency" style="margin:20px">
==== Currency ====
==== Currency ====
For this example, the pattern provided will match all currency data listed.
One of the most important things about currency values is to note the amounts listed—hundreds of dollars, thousands of dollars, as well as cent amounts and dollar signs. If dollar signs are provided, precede them with a backslash, <code>\$</code>, as the dollar sign by itself represents the end of a string in regex. Also, when writing pattern matches for currency, look for both the largest and smallest amounts provided, as this will determine the range for the placeholders.
 
For this example, you will build a pattern that will match all currency data listed.
{|cellpadding=10 cellspacing=5
{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
|valign=top style="width:40%"|
# Note the amount listed, as well as any cent amounts and dollar signs being given.
# If dollar signs are given, code them out by preceding them with a backslash: <code>\$</code>.
#* $ by itself represents the end of a string in regex.
# When writing the pattern for currencies, look for the largest and smallest amounts provided, as this will determine the range for the placeholders.
# To extract the first three data instances, enter the following pattern:  
# To extract the first three data instances, enter the following pattern:  
#*<code>[$]\d{1,3}\.\d{2}</code>
#*<code>\$\d{1,3}\.\d{2}</code>
#**Notice that the dollar sign has been escaped by the backslash, as it is part of the text data.
|
|
[[File:2023 Pattern Match Currency Step 4(1) Copy.png]]
[[File:2023 Pattern Match Currency Step 1HiLite Copy.png]]
|-
|-
|valign=top style="width:40%"|
|valign=top style="width:40%"|
#<li value=5> To extract the remaining three data instances, you will need to account for the thousands, ten-thousands, and hundred-thousands places; thus, the pattern will be the same as it was for the ones, tens, and hundreds places: <code>\d{1,3}</code>
To extract the three remaining instances, look at the way they're written. Anywhere from one (1) to three (3) digits after the dollar sign, three (3) digits after the comma, and cent amounts provided.
# Your regex pattern should look like this:  
#<li value=2>Thus, your regex pattern should look like this:  
#*<code>\$\d{1,3},\d{1,3}\.\d{2}</code>
#*<code>\$\d{1,3},\d{3}\.\d{2}</code>
#* Note that while the last three pieces of data are matched, the first three are no longer being picked up.
#* Note that while the last three pieces of data are matched, the first three are no longer being picked up.
|
|
[[File:2023 Pattern Match Currency Match Step 6(1) Copy.png]]
[[File:2023 Pattern Match Currency Match Step 2 HiLite Copy.png]]
|-
|-
|valign=top style ="width:40%"|
|valign=top style ="width:40%"|
#<li value=7> Thus, to return all data, add parentheses around <code>\d{1,3},</code> and follow with a question mark:  
#<li value=3> Thus, to return all data, add parentheses around <code>,\d{3}</code> and follow with a question mark:  
#*<code>\$\d{1,3}(,\d{3})?\.\d{2}</code>
#*<code>\$\d{1,3}(,\d{3})?\.\d{2}</code>
#**By encasing <code>,\d{3}</code> in parentheses, you've created a Capture Group. For more information on Capture Groups, click here: [https://regexone.com/lesson/capturing_groups RegexOne]
#**The question mark is a quantifier that makes whatever immediately precedes it optional: it matches zero or one instance. In this case, it matches the <code>,\d{3}</code> Capture Group if it is present and skips it if it is not.
#*If you're unsure of how large your currency amounts will be, you can substitute the question mark <code>?</code> with a star <code>*</code> character.
#*If you're unsure of how large your currency amounts will be, you can substitute the question mark <code>?</code> with a star <code>*</code> character.
#**<code>\$\d{1,3}(,\d{3})*\.\d{2}</code>
#**<code>\$\d{1,3}(,\d{3})*\.\d{2}</code>
#**The star is another quantifier, designed to capture zero to many instances of preceding data.
#*** For more information on quantifiers, click here: [https://regexone.com/lesson/capturing_groups RegexOne]
|
|
[[File:2023 Pattern Match Currency Match Step 7 Copy.png]]
[[File:2023 Pattern Match Currency Match Step 3 HiLite Copy.png]]
|}
|-
</tab>
 
<tab name="Social Security Numbers/Employer Identification Numbers" style="margin:20px">
==== Social Security Numbers (SSN)/Employer Identification Numbers (EIN) ====
{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
# Note the format of the SSN/EIN
## SSNs will be ###-##-####, EINs are ##-#######
# Enter the pattern that will match the data you wish to extract.
# SSNs will  be:
#* <code>\d{3}-\d{2}-\d{4}</code>
#EINs will be:
#*<code>\d{2}-\d{7}</code>
|
[[File:2023 Pattern Match Extracting Data SSN EIN Copy.png]]
|
|
|}
|}
</tab>
</tab>
[[#Regex Examples for Pattern Match|Click here to return to the top of the section]]
</tabs>
</tabs>


Line 171: Line 206:
For example, let's say that you want to extract data on its own line, like the title of a section. While you can enter just the title, you might get false positives if the word(s) that make up the title appear anywhere else on the document. Thus, your '''''Prefix''''' and '''''Suffix Patterns''''' will be:
For example, let's say that you want to extract data on its own line, like the title of a section. While you can enter just the title, you might get false positives if the word(s) that make up the title appear anywhere else on the document. Thus, your '''''Prefix''''' and '''''Suffix Patterns''''' will be:


'''''Prefix Pattern''''':<code>[\n\t]</code>
'''''Prefix Pattern''''':<code>[\n\t]|^</code>


'''''Suffix Pattern''''':<code>[\r\t]</code>
'''''Suffix Pattern''''':<code>[\r\t]|$</code>
* The <code>^</code> character matches the beginning of a string of text.
* The <code>$</code> character matches the end of a string of text.


== See Also: ==
== See Also: ==
* [[Value Reader]]
* [[Value Reader]]

Latest revision as of 15:59, 27 August 2025

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Pattern Match is a Value Extractor that extracts values from a document that match a specified regular expression, providing data collection following a known format or pattern.

You may download the ZIP(s) below and upload it into your own Grooper environment (version 2023). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

Pattern Match is one of the most commonly used extractors. As per its name, it extracts data from a document matching a regex pattern entered into the Value Pattern.

This extractor is useful when you want to extract text data matching a particular pattern across a document, such as dates or social security numbers. For example, the format MM/DD/YYYY can be matched with the regex pattern: \d{2}/\d{2}/\d{4}.
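
To see what this pattern does outside of Grooper, the following is a minimal sketch using Python's built-in `re` module as a stand-in for Grooper's regex engine (an assumption; the engines may differ in edge cases). The sample text is hypothetical and is not taken from the article's sample Batch.

```python
import re

# Value Pattern from the article: two digits, slash, two digits, slash, four digits.
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")

# Hypothetical document text, used only for illustration.
sample = "Invoice dated 03/15/2023, due 04/14/2023."

# findall returns every non-overlapping match, similar in spirit to the
# list of values an extractor returns for a document.
matches = DATE_PATTERN.findall(sample)
print(matches)
```

Each returned string corresponds to one value that would appear in a results list.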

For more information on regex, click the following link: RegexOne

How To

Pattern Match can be configured on both Data Type and Value Reader objects.

Configuring by Object Type

Configuring on a Value Reader

  1. Create or select your Value Reader.
    • Note the three tabs: "Value Reader", "Tester", and "Advanced".
  2. Select the "Value Reader" tab.
  3. Select the drop-down icon to the far right of the Extractor property.

  4. On the drop-down menu, select Pattern Match.

  5. Click the "Tester" tab.
  6. In the Value Pattern box, enter the regex pattern for the text you wish to extract.
  7. Matched data will be highlighted in green and show up in the "Values" panel beneath the Document Viewer.

  8. Save changes.

Configuring on a Data Type

The Data Type is a little more involved when picking out Pattern Match.

  1. Create or select your Data Type.
  2. Select the drop-down icon to the far right of Local Extractor.

  3. Select Pattern Match from the dropdown menu.

  4. Select the ellipsis to the far right of the Local Extractor.

This will bring up the Extractor Editor window.

  5. Enter a pattern for the text you would like to extract.
  6. Just like with the Value Reader, matched data will be highlighted in green and appear in the "Values" panel beneath the Document Viewer.
  7. Once you've entered your pattern and are satisfied with the results, click "OK".

  8. Save changes.

The Pattern Match extractor can be used on a multitude of object types. Any object that has an extractor property can be configured with a Pattern Match.

The configuration process on other objects is identical to both the Value Reader and Data Type objects. Simply select a Pattern Match as your extractor.


Examples where you can use a Pattern Match include:

  • A Data Type's Value Extractor property
  • A Document Type's Positive Extractor property
  • The Labeled Value extractor's Label Extractor and Value Extractor properties
  • The Pattern-Based Separation Provider's Value Extractor property


Regex Examples for Pattern Match

Social Security Numbers (SSN)/Employer Identification Numbers (EIN)

SSNs and EINs are simple. As usual, note the type of number used. An SSN is structured ###-##-####, and an EIN is ##-#######. Simply enter the pattern of the data you wish to extract.

  1. SSN:
    • \d{3}-\d{2}-\d{4}

  2. EIN:
    • \d{2}-\d{7}
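
The two patterns can be checked against a line of text with a short script. This is a sketch only: Python's `re` module stands in for Grooper's regex engine, and the sample numbers are made up for illustration.

```python
import re

# Patterns from the article.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")  # ###-##-####
EIN_PATTERN = re.compile(r"\d{2}-\d{7}")        # ##-#######

# Hypothetical text; these are not real identifiers.
sample = "SSN: 123-45-6789  EIN: 12-3456789"

ssn_matches = SSN_PATTERN.findall(sample)
ein_matches = EIN_PATTERN.findall(sample)

print("SSN:", ssn_matches)
print("EIN:", ein_matches)
```

Because the hyphens sit in different positions, neither pattern picks up the other number in this sample.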

Dates

Take note of the format of the date(s) on the document. The document here has dates in both the MM/DD/YYYY and MM/DD/YY format. Thus, we will write a regex pattern that will extract both dates.

  1. First, enter
    • \d{2}/\d{2}/\d{4}
  2. Notice that only the first date was returned.

  3. Now try:
    • \d{2}/\d{2}/\d{2}
  4. This pattern partially matches both dates, but the last two digits of the year in the first date aren't returned. So, this regex pattern won't work either.

  5. So, how are we going to return both dates completely? Keep in mind that you can dictate a range of values within the curly braces. Hence:
    • \d{2}/\d{2}/\d{2,4}
      • \d{2,4} tells Grooper to look for anywhere from two to four digits for the year. Since YY and YYYY fall within the range set, the regex pattern will extract them.
  6. Notice that both dates are now being returned in full.
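
The three attempts above can be compared side by side. The sketch below uses Python's `re` module in place of Grooper's regex engine (an assumption), with a hypothetical sentence containing one MM/DD/YYYY date and one MM/DD/YY date.

```python
import re

# Hypothetical text with one four-digit-year date and one two-digit-year date.
sample = "Signed 01/02/2023 and renewed 03/04/24."

four_digit_year = re.findall(r"\d{2}/\d{2}/\d{4}", sample)    # only the YYYY date
two_digit_year = re.findall(r"\d{2}/\d{2}/\d{2}", sample)     # truncates the YYYY date
ranged_year = re.findall(r"\d{2}/\d{2}/\d{2,4}", sample)      # both dates in full

print(four_digit_year)
print(two_digit_year)
print(ranged_year)
```

The `{2,4}` quantifier is greedy, so it takes four digits when they are available and falls back to two otherwise.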

Currency

One of the most important things about currency values is to note the amounts listed—hundreds of dollars, thousands of dollars, as well as cent amounts and dollar signs. If dollar signs are provided, precede them with a backslash, \$, as the dollar sign by itself represents the end of a string in regex. Also, when writing pattern matches for currency, look for both the largest and smallest amounts provided, as this will determine the range for the placeholders.

For this example, you will build a pattern that will match all currency data listed.

  1. To extract the first three data instances, enter the following pattern:
    • \$\d{1,3}\.\d{2}
      • Notice that the dollar sign has been escaped by the backslash, as it is part of the text data.

To extract the three remaining instances, look at the way they're written. Each has one (1) to three (3) digits after the dollar sign, three (3) digits after the comma, and a cent amount.

  2. Thus, your regex pattern should look like this:
    • \$\d{1,3},\d{3}\.\d{2}
    • Note that while the last three pieces of data are matched, the first three are no longer being picked up.

  3. Thus, to return all data, add parentheses around ,\d{3} and follow with a question mark:
    • \$\d{1,3}(,\d{3})?\.\d{2}
      • By encasing ,\d{3} in parentheses, you've created a Capture Group. For more information on Capture Groups, click here: RegexOne
      • The question mark is a quantifier that makes whatever immediately precedes it optional: it matches zero or one instance. In this case, it matches the ,\d{3} Capture Group if it is present and skips it if it is not.
    • If you're unsure of how large your currency amounts will be, you can substitute the question mark ? with a star * character.
      • \$\d{1,3}(,\d{3})*\.\d{2}
      • The star is another quantifier, designed to capture zero to many instances of preceding data.
        • For more information on quantifiers, click here: RegexOne
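
The progression of currency patterns can be sketched as follows. Python's `re` module is used here as a stand-in for Grooper's regex engine, and the amounts are hypothetical. The helper `full_matches` is an illustrative name, not part of Grooper.

```python
import re

def full_matches(pattern, text):
    """Return the full text of each match.

    re.findall would return only the capture group's contents once the
    pattern contains parentheses, so finditer with group(0) is used instead.
    """
    return [m.group(0) for m in re.finditer(pattern, text)]

# Hypothetical amounts: three without commas, three with one comma, one with two.
sample = "$5.00 $45.99 $123.45 $1,234.56 $12,345.67 $123,456.78 $1,234,567.89"

no_comma = full_matches(r"\$\d{1,3}\.\d{2}", sample)                  # small amounts only
one_comma = full_matches(r"\$\d{1,3},\d{3}\.\d{2}", sample)           # thousands only
optional_group = full_matches(r"\$\d{1,3}(,\d{3})?\.\d{2}", sample)   # zero or one comma group
repeated_group = full_matches(r"\$\d{1,3}(,\d{3})*\.\d{2}", sample)   # any number of comma groups

for label, result in [("no comma", no_comma), ("one comma", one_comma),
                      ("? group", optional_group), ("* group", repeated_group)]:
    print(label, result)
```

In this sample, only the starred pattern picks up the amount with two comma groups, which is why it is the safer choice when the largest amount is unknown.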


Prefix and Suffix Patterns

Prefix and Suffix Patterns act as anchors to which you can tether the data you wish to extract. As one would expect, a Prefix Pattern matches what comes before the text matched by your regex pattern, and a Suffix Pattern matches what comes after.

For example, let's say that you want to extract data on its own line, like the title of a section. While you can enter just the title, you might get false positives if the word(s) that make up the title appear anywhere else on the document. Thus, your Prefix and Suffix Patterns will be:

Prefix Pattern: [\n\t]|^

Suffix Pattern: [\r\t]|$

  • The ^ character matches the beginning of a string of text.
  • The $ character matches the end of a string of text.
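
The effect of these anchors can be approximated outside of Grooper by joining the three patterns into a single expression. This is a rough sketch under stated assumptions: Python's `re` module stands in for Grooper's regex engine, the function `match_with_anchors` is a hypothetical helper, and how Grooper itself combines the Prefix, Value, and Suffix Patterns internally is not described in this article.

```python
import re

def match_with_anchors(value_pattern, text,
                       prefix=r"[\n\t]|^", suffix=r"[\r\t]|$"):
    """Return values matched by value_pattern only when the prefix pattern
    matches immediately before them and the suffix pattern immediately after."""
    combined = re.compile(
        f"(?:{prefix})(?P<value>{value_pattern})(?:{suffix})"
    )
    return [m.group("value") for m in combined.finditer(text)]

# Hypothetical text: "Summary" appears once as a title on its own line
# and once in the middle of a sentence. Lines end with \r\n.
sample = "Summary\r\nThis Summary covers the period.\r\nDetails\r\n"

unanchored = re.findall(r"Summary", sample)
anchored = match_with_anchors(r"Summary", sample)

print(unanchored)
print(anchored)
```

Without the anchors the word is returned twice. With them, only the occurrence that sits alone on its line is returned, which is the false-positive filtering described above.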

See Also: