2023.1:Pattern-Based (Collation Provider)

From Grooper Wiki
Revision as of 09:24, 30 April 2024 by Rpatton (talk | contribs) (final // via Wikitext Extension for VSCode)

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Pattern-Based is a Collation Provider option for Data Type extractors. It uses regular expressions to sequence returned results into a final result set.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

Glossary

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

Extractor Type:

List Match: List Match is a Value Extractor designed to return values matching one or more items in a defined list. By default, the List Match extractor does not use or require regular expressions, but it can be configured to use regular expression syntax.

Pattern Match: Pattern Match is a Value Extractor that extracts values from a document that match a specified regular expression, collecting data that follows a known format or pattern.

Value Reader: Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions.)

  • Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.

About

Pattern-Based Collation is a collation method for Data Type extractors that allows you to write a "wrapper" expression that can reference other extractors' results as variables.

Think of it as putting multiple extractors inside one RegEx pattern. When a Data Type that is set to Pattern-Based Collation has at least one child (or referenced) extractor, you can reference that extractor as a variable by preceding its name with an "@" in the pattern. (Typing "@" also brings up the IntelliSense prompt, which lists any child extractors that can be referenced.)

Pattern-Based Collation is well-suited to unstructured "natural language" documents. Since extractors are included as inline variables, you can define a more complex context (such as a sentence) surrounding the data you wish to extract.

Consider the following example:

Let's say we wanted to collect the highlighted text:

"entered into this ___ day of _____________ _____"

Using Pattern-Based Collation with the appropriate child or referenced extractors, you could write one single "wrapper" pattern like:

entered into this @Day day of @Month @Year
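As a rough analogy only (this is not how Grooper implements collation internally), the wrapper can be pictured as a regular expression in which each `@Name` is swapped for that extractor's own pattern. The Python sketch below assumes three placeholder patterns for `Day`, `Month`, and `Year`; the `expand_wrapper` helper and the sample sentence are hypothetical.

```python
import re

# Hypothetical stand-ins for the child extractors. The keys mirror the
# @Day, @Month and @Year references used in the wrapper pattern.
EXTRACTORS = {
    "Day": r"\d{1,2}th",
    "Month": (
        "January|February|March|April|May|June|July|"
        "August|September|October|November|December"
    ),
    "Year": r"\d{4}",
}


def expand_wrapper(wrapper: str, extractors: dict[str, str]) -> str:
    """Replace each @Name with that extractor's pattern in a non-capturing group."""

    def substitute(match: re.Match) -> str:
        return "(?:" + extractors[match.group(1)] + ")"

    return re.sub(r"@(\w+)", substitute, wrapper)


wrapper = "entered into this @Day day of @Month @Year"
compiled = re.compile(expand_wrapper(wrapper, EXTRACTORS))

text = "This lease is entered into this 6th day of November 2016 by the parties."
result = compiled.search(text)
print(result.group(0) if result else None)
# prints: entered into this 6th day of November 2016
```

The non-capturing groups keep an alternation such as the month list from leaking into the surrounding literal text of the wrapper.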

Pattern-Based Collation is especially useful in contexts where the expressions for the referenced extractors are subject to change. Using the above example, say we were working on a collection of documents that contained 10 unique Document Types that all presented the date in a different verbal format, but always in a way that it contained the day, month, and year. So we build ten different "wrapper" extractors (one for each Document Type), and set them to Pattern-Based Collation. Each one has "Day," "Month," and "Year" selected under "referenced extractors." This way, our ten different contexts (our "wrappers") all rely on the same handful of extractors to pull the same data elements.
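The reuse described above can be sketched in plain Python, again as an analogy rather than Grooper's actual mechanism. Two hypothetical wrappers (standing in for two of the ten Document Types) share one set of assumed `Day`, `Month`, and `Year` patterns, so a change to a shared pattern would apply to every wrapper. The document type names, wrapper wording, and sample sentences are all invented for illustration.

```python
import re

# Shared, hypothetical patterns standing in for the referenced extractors.
SHARED = {
    "Day": r"\d{1,2}(?:st|nd|rd|th)?",
    "Month": (
        "January|February|March|April|May|June|July|"
        "August|September|October|November|December"
    ),
    "Year": r"\d{4}",
}

# One hypothetical wrapper per Document Type; only the surrounding wording differs.
WRAPPERS = {
    "Lease": "entered into this @Day day of @Month @Year",
    "Deed": "dated @Month @Day, @Year",
}


def build(wrapper: str) -> re.Pattern:
    """Turn each @Name into a named group holding the shared pattern."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return f"(?P<{name}>{SHARED[name]})"

    return re.compile(re.sub(r"@(\w+)", substitute, wrapper))


def extract_date(doc_type: str, text: str):
    """Return (day, month, year) using the wrapper for the given document type."""
    match = build(WRAPPERS[doc_type]).search(text)
    if match is None:
        return None
    return match.group("Day"), match.group("Month"), match.group("Year")


print(extract_date("Lease", "This lease is entered into this 6th day of November 2016."))
# prints: ('6th', 'November', '2016')
print(extract_date("Deed", "This deed, dated March 3, 2019, conveys the property."))
# prints: ('3', 'March', '2019')
```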

How To

In this example, using Pattern-Based Collation, we are going to extract the phrase "entered into this X day of Y Z" where "X" is the day, "Y" is the month, and "Z" is the year.

Creating the Parent and Child Objects

  1. Make a Data Type with child objects that extract different parts of the text segment you wish to return.
    • In this case we have three child objects that extract the Day, Month, and Year.
  2. Alternatively, you can reference other extractors in your project rather than having child objects. Just use the Referenced Extractors property to do so.


  1. The first child object in our example is extracting the day in our pattern.
  2. The Value Reader has been set to Pattern Match and the pattern \d{1,2}th has been entered to collect "Xth" where X is a 1- or 2-digit number.
  3. On the page this Value Reader is returning "6th".


  1. The second child object is set to a List Match collecting the month.


  1. The last child object is set to a Pattern Match to collect 4 digit numbers, so it should capture the year.
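To make the three child extractors concrete, here is a small Python approximation of what each one returns on its own, before any collation. The day and year patterns come from the steps above; the month list, the helper names `pattern_match` and `list_match`, and the sample sentence are assumptions made for this sketch and are not Grooper APIs.

```python
import re

text = "This lease is entered into this 6th day of November 2016 by the parties."

# Assumed contents of the month list; the article does not show the actual list.
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]


def pattern_match(pattern: str, text: str) -> list[str]:
    """Approximate a Pattern Match extractor: return every regex match."""
    return re.findall(pattern, text)


def list_match(items: list[str], text: str) -> list[str]:
    """Approximate a List Match extractor: return whole-word hits for any list item."""
    alternation = "|".join(re.escape(item) for item in items)
    return re.findall(rf"\b(?:{alternation})\b", text)


day = pattern_match(r"\d{1,2}th", text)
month = list_match(MONTHS, text)
year = pattern_match(r"\d{4}", text)

print(day)    # prints: ['6th']
print(month)  # prints: ['November']
print(year)   # prints: ['2016']
```

Note that `\d{1,2}th` only covers ordinals ending in "th", so days such as "1st" or "22nd" would need a broader pattern.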


Setting the Pattern-Based Collation Property

  1. Click on the parent Data Type.
  2. Click on the hamburger icon to the right of the Collation property.
  3. Select Pattern-Based from the drop-down.


Entering in the Value Pattern

  1. Open up the Collation property and then click the ellipsis icon to the right of the Value Pattern property.


  1. Start writing your pattern in the "Value Pattern" window. When you get to the place where you need to use one of your child extractors, type in the @ symbol.
  2. An IntelliSense drop-down will appear listing the extractors within the scope of the Data Type. Select the desired extractor from the drop-down or finish typing its name.


  1. Finish writing your pattern, adding each child or referenced extractor using the @ symbol.
  2. Click "OK" in the top right corner of the window to save.


  1. Now the text segment "entered into this 6th day of November 2016" is being returned.
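One way to see why the wrapper matters is to compare the raw child results with the collated result on a page that contains stray numbers. The Python sketch below is only an approximation built with ordinary regular expressions; the sample text and the month list are invented, and Grooper's own collation logic may differ in detail.

```python
import re

# Invented sample text with extra ordinals, 4-digit numbers and month names.
text = (
    "Suite 1200, 15th Floor. This lease is entered into this 6th day of "
    "November 2016 and renews each November."
)

DAY = r"\d{1,2}th"
MONTH = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
YEAR = r"\d{4}"

# Raw results: each child pattern run on its own picks up unrelated hits.
raw = {
    "Day": re.findall(DAY, text),
    "Month": re.findall(MONTH, text),
    "Year": re.findall(YEAR, text),
}
print(raw)
# prints: {'Day': ['15th', '6th'], 'Month': ['November', 'November'], 'Year': ['1200', '2016']}

# Collated result: the children only count where they appear in the wrapper's context.
collated = re.findall(
    rf"entered into this (?:{DAY}) day of (?:{MONTH}) (?:{YEAR})", text
)
print(collated)
# prints: ['entered into this 6th day of November 2016']
```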