2023:Pattern Match (Value Extractor): Difference between revisions

From Grooper Wiki

Revision as of 11:59, 10 May 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Pattern Match is a Value Extractor that extracts values from a document that match a specified regular expression, providing data collection following a known format or pattern.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

Glossary

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift, and manipulate results further.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

Document Viewer: The Grooper Document Viewer is the portal to your documents. It is the UI that allows you to see a Batch Folder's (or a Batch Page's) image, text content, and more.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Extractor Type:

Labeled Value: Labeled Value is a Value Extractor that identifies and extracts a value next to a label. This is one of the most commonly used extractors to extract data from structured documents (such as a standardized form) and static values on semi-structured documents (such as the header details on an invoice).

Pattern Match: Pattern Match is a Value Extractor that extracts values from a document that match a specified regular expression, providing data collection following a known format or pattern.

Pattern-Based Separation: Pattern-Based Separation is a Separation Provider that creates a new document folder every time a value returned by a defined pattern is encountered on a page.

Pattern-Based: Pattern-Based is a Collation Provider option for pin Data Type extractors. Pattern-Based uses regular expressions to sequence returned results into a final result set.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects, such as Content Models, Batch Processes, and profile objects, are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Separation: Separation is the process of taking an unorganized Batch of loose Batch Pages and organizing them into documents represented by Batch Folders in Grooper. This is done so Grooper can later assign a Document Type to each document folder in a process known as "classification".

Value Reader: Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions).

  • Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.

About

Pattern Match is one of the most commonly used extractors. As its name suggests, it extracts data from a document that matches the regex pattern entered as the Value Pattern.

This extractor is useful when you want to extract text data matching a particular pattern across a document, such as dates or social security numbers. For example, the format MM/DD/YYYY can be matched with the regex pattern: \d{2}/\d{2}/\d{4}.

For more information on regex, see RegexOne.
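As a rough illustration outside of Grooper, the date pattern above can be tried in Python. This is only a sketch: Python's `re` module stands in for Grooper's regex engine (the two may differ in details), and the sample text is invented for the example.

```python
import re

# Invented sample text; it is not taken from the Grooper sample Batch.
text = "Invoice date: 03/15/2023. Due date: 04/14/2023. Ref: 12/345."

# The pattern from the article: two digits, slash, two digits, slash, four digits.
value_pattern = re.compile(r"\d{2}/\d{2}/\d{4}")

# Collect every match along with the character offset where it starts.
matches = [(m.group(), m.start()) for m in value_pattern.finditer(text)]

for value, start in matches:
    print(f"{value!r} found at character {start}")
```

The reference number `12/345` is skipped because it lacks the second slash and the four-digit year the pattern requires.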

How To

Pattern Match can be configured on both Data Type and Value Reader objects.

Configuring by Object Type

Configuring on a Value Reader

  1. Create or select your Value Reader.
    • Note the three tabs: "Value Reader", "Tester", and "Advanced".
  2. Select the "Value Reader" tab.
  3. Select the drop-down icon to the far right of the Extractor property.
  4. On the drop-down menu, select Pattern Match.
  5. Click the "Tester" tab.
  6. In the Value Pattern box, enter the regex pattern for the text you wish to extract.
  7. Matched data will be highlighted in green and show up in the "Values" panel beneath the Document Viewer.
  8. Save changes.

Configuring on a Data Type

Configuring Pattern Match on a Data Type takes a few more steps.

  1. Create or select your Data Type.
  2. Select the drop-down icon to the far right of Local Extractor.
  3. Select Pattern Match from the drop-down menu.
  4. Select the ellipsis button to the far right of Local Extractor. This will bring up the Extractor Editor window.
  5. Enter a pattern for the text you would like to extract.
  6. Just like with the Value Reader, matched data will be highlighted in green and appear in the "Values" panel beneath the Document Viewer.
  7. Once you've entered your pattern and are satisfied with the results, click "OK".
  8. Save changes.

The Pattern Match extractor can be used on a multitude of object types. Any object that has an extractor property can be configured with a Pattern Match.

The configuration process on other objects is the same as for Value Reader and Data Type objects. Simply select Pattern Match as your extractor type.


Examples where you can use a Pattern Match include:

  • A Data Type's Value Extractor property
  • A Document Type's Positive Extractor property
  • The Labeled Value extractor's Label Extractor and Value Extractor properties
  • The Pattern-Based Separation Provider's Value Extractor property


Regex Examples for Pattern Match

Social Security Numbers (SSN)/Employer Identification Numbers (EIN)

SSNs and EINs are simple. As usual, note the format of the number. An SSN is structured ###-##-####, and an EIN is ##-#######. Simply enter the pattern of the data you wish to extract.

  1. SSN:
    • \d{3}-\d{2}-\d{4}
  2. EIN:
    • \d{2}-\d{7}
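The two patterns can be checked against each other with a short Python sketch. The sample text and numbers are made up for illustration, and Python's `re` module is only a stand-in for Grooper's tester.

```python
import re

# Made-up sample values for illustration only.
text = "Employee SSN: 123-45-6789  Employer EIN: 12-3456789"

ssn_pattern = re.compile(r"\d{3}-\d{2}-\d{4}")  # ###-##-####
ein_pattern = re.compile(r"\d{2}-\d{7}")        # ##-#######

ssns = ssn_pattern.findall(text)
eins = ein_pattern.findall(text)

print("SSNs:", ssns)
print("EINs:", eins)
```

Because the hyphens fall in different positions, neither pattern picks up the other kind of number in this sample.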

Dates

Take note of the format of the date(s) on the document. The document here has dates in both the MM/DD/YYYY and MM/DD/YY formats. Thus, we will write a regex pattern that extracts both dates.

  1. First, enter:
    • \d{2}/\d{2}/\d{4}
  2. Notice that only the first date was returned.
  3. Now try:
    • \d{2}/\d{2}/\d{2}
  4. This matches both dates, but the last two digits of the first date's four-digit year aren't returned. So, this regex pattern won't work either.
  5. So, how are we going to return both dates completely? Keep in mind that you can dictate a range of values within the curly braces. Hence:
    • \d{2}/\d{2}/\d{2,4}
      • \d{2,4} tells Grooper to look for anywhere from two to four digits for the year. Since YY and YYYY fall within the range set, the regex pattern will extract them. Note that this range also allows a three-digit year.
  6. Notice that both dates are now being returned in full.
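The three attempts above can be reproduced with a small Python sketch. The sample sentence is invented, with one MM/DD/YYYY date and one MM/DD/YY date, and Python's `re` module is assumed to behave like Grooper's engine for these simple patterns.

```python
import re

# Invented sample text: one four-digit-year date and one two-digit-year date.
text = "Issued 01/15/2023, revised 02/20/24."

# Attempt 1: requires exactly four year digits, so the YY date is missed.
four_digit_year = re.findall(r"\d{2}/\d{2}/\d{4}", text)

# Attempt 2: exactly two year digits, so the YYYY date is cut short.
two_digit_year = re.findall(r"\d{2}/\d{2}/\d{2}", text)

# Attempt 3: two to four year digits, so both dates are returned in full.
ranged_year = re.findall(r"\d{2}/\d{2}/\d{2,4}", text)

print(four_digit_year)
print(two_digit_year)
print(ranged_year)
```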

Currency

One of the most important things about currency values is to note the amounts listed: hundreds of dollars, thousands of dollars, as well as cent amounts and dollar signs. If dollar signs are provided, precede them with a backslash, \$, as the dollar sign by itself represents the end of a string in regex. Also, when writing pattern matches for currency, look for both the largest and smallest amounts provided, as this will determine the digit ranges used in the pattern.

For this example, you will build a pattern that will match all currency data listed.

  1. To extract the first three data instances, enter the following pattern:
    • \$\d{1,3}\.\d{2}
      • Notice that the dollar sign has been escaped by the backslash, as it is part of the text data.

To extract the three remaining instances, look at the way they're written. Each has anywhere from one (1) to three (3) digits after the dollar sign, a comma, three (3) digits after the comma, and a cent amount.

  2. Thus, your regex pattern should look like this:
    • \$\d{1,3},\d{3}\.\d{2}
    • Note that while the last three pieces of data are matched, the first three are no longer being picked up.

  3. Thus, to return all data, add parentheses around ,\d{3} and follow with a question mark:
    • \$\d{1,3}(,\d{3})?\.\d{2}
      • By encasing ,\d{3} in parentheses, you've created a Capture Group. For more information on Capture Groups, see RegexOne.
      • The question mark is a quantifier that makes whatever immediately precedes it optional, matching zero or one instance. In this case, it matches zero or one instance of the ,\d{3} Capture Group, so amounts with or without a thousands separator are returned.
    • If you're unsure of how large your currency amounts will be, you can substitute the question mark ? with a star * character.
      • \$\d{1,3}(,\d{3})*\.\d{2}
      • The star is another quantifier, designed to capture zero to many instances of preceding data.
        • For more information on quantifiers, see RegexOne.
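The progression of currency patterns can be sketched in Python as follows. The amounts are invented, the helper `all_matches` is a hypothetical name introduced for this example, and Python's `re` module is assumed to treat these patterns the same way Grooper does.

```python
import re

def all_matches(pattern, text):
    """Return the full text of every match (not just the capture group)."""
    return [m.group(0) for m in re.finditer(pattern, text)]

# Invented sample: three amounts under $1,000 and three with a thousands separator.
text = "$5.00 $42.50 $999.99 $1,250.00 $48,300.75 $125,000.00"

small_only = all_matches(r"\$\d{1,3}\.\d{2}", text)            # no comma allowed
thousands_only = all_matches(r"\$\d{1,3},\d{3}\.\d{2}", text)  # comma required
optional_group = all_matches(r"\$\d{1,3}(,\d{3})?\.\d{2}", text)  # comma group optional

# An amount in the millions needs the comma group to repeat, so ? fails and * succeeds.
big = "$1,234,567.89"
big_with_question = all_matches(r"\$\d{1,3}(,\d{3})?\.\d{2}", big)
big_with_star = all_matches(r"\$\d{1,3}(,\d{3})*\.\d{2}", big)

print(small_only)
print(thousands_only)
print(optional_group)
print(big_with_question, big_with_star)
```

`re.finditer` with `group(0)` is used instead of `re.findall` because `findall` returns only the capture group's text when a pattern contains one.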


Prefix and Suffix Patterns

Prefix and Suffix Patterns act as anchors to which you can tether the data you wish to extract. As one would expect, a Prefix Pattern matches what comes before the text matched by your regex pattern, while a Suffix Pattern matches what comes after.

For example, let's say that you want to extract data on its own line, like the title of a section. While you can enter just the title, you might get false positives if the word(s) that make up the title appear anywhere else on the document. Thus, your Prefix and Suffix Patterns will be:

Prefix Pattern: [\n\t]|^

Suffix Pattern: [\r\t]|$

  • The ^ character matches the beginning of a string of text.
  • The $ character matches the end of a string of text.
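The effect of these anchors can be approximated in Python. This sketch assumes that the Prefix Pattern behaves like a look-behind and the Suffix Pattern like a look-ahead (required around the value but not returned with it), which is an assumption about Grooper rather than something stated here. The section title "Summary" and the sample text are invented. Python only allows fixed-width look-behinds, so the prefix alternative `^` is written outside the look-behind.

```python
import re

# Invented sample with Windows-style line endings (\r\n).
# "Summary" appears alone on a line twice and once in the middle of a sentence.
text = "Summary\r\nThis Summary covers costs.\r\nSummary\r\n"

# Without anchors, every occurrence is returned, including the false positive.
unanchored = re.findall(r"Summary", text)

# Assumed equivalent of Prefix [\n\t]|^ and Suffix [\r\t]|$:
# the value must follow a newline/tab or the start of the text,
# and must be followed by a carriage return/tab or the end of the text.
anchored_pattern = re.compile(r"(?:(?<=[\n\t])|^)Summary(?=[\r\t]|$)")
anchored_starts = [m.start() for m in anchored_pattern.finditer(text)]

print(len(unanchored), "unanchored matches")
print(len(anchored_starts), "anchored matches at offsets", anchored_starts)
```

Only the two occurrences that sit alone on a line survive; the one inside the sentence is preceded and followed by a space, so it fails both anchors.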
