2023.1:Pattern-Based (Collation Provider)

From Grooper Wiki
Revision as of 09:03, 30 April 2024 by Rpatton (talk | contribs) (glossary // via Wikitext Extension for VSCode)

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Pattern-Based is a Collation Provider option for Data Type extractors. Pattern-Based uses regular expressions to sequence returned results into a final result set.

Glossary

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

Extractor Type:

About

Pattern-Based Collation is a collation method for Data Type extractors that allows you to write a "wrapper" expression that can reference other extractors' results as variables.

Think of it as putting multiple extractors inside one RegEx pattern. When a Data Type that is set to Pattern-Based Collation has at least one child (or referenced) extractor, you can reference that extractor as a variable by preceding its name with an "@" in the pattern. (Typing "@" also brings up the IntelliSense prompt, which lists any child extractors that can be referenced.)

Pattern-Based Collation is well-suited to unstructured "natural language" documents. Since extractors are included as inline variables, you can define a more complex context (such as a sentence) surrounding the data you wish to extract.

Consider the following example:

Let's say we wanted to collect the highlighted text:

"entered into this ___ day of _____________ _____"

Using Pattern-Based Collation with the appropriate child or referenced extractors, you could write one single "wrapper" pattern like:

entered into this @Day day of @Month @Year
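
To make the "extractors as variables" idea concrete, here is a minimal sketch in plain Python. It is only an analogy, not Grooper's API or its internal implementation: the `EXTRACTORS` dictionary, the `expand_wrapper` helper, and the sample sentence are all hypothetical, and the Day, Month, and Year patterns are assumed stand-ins for whatever the real child extractors contain.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hypothetical stand-ins for the child extractors (assumed patterns,
# not taken from a real Grooper project).
EXTRACTORS = {
    "Day": r"\d{1,2}(?:st|nd|rd|th)",
    "Month": "(?:" + "|".join(MONTHS) + ")",
    "Year": r"\d{4}",
}

def expand_wrapper(wrapper: str, extractors: dict[str, str]) -> str:
    """Replace each @Name token with a named group holding that extractor's pattern."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return f"(?P<{name}>{extractors[name]})"
    return re.sub(r"@(\w+)", substitute, wrapper)

wrapper = "entered into this @Day day of @Month @Year"
regex = re.compile(expand_wrapper(wrapper, EXTRACTORS))

# Made-up sample sentence for illustration.
text = "This agreement is entered into this 3rd day of March 2021 by the parties."
m = regex.search(text)
print(m.group(0))     # the whole wrapped phrase
print(m.groupdict())  # the pieces contributed by each "extractor"
```

The wrapper stays readable because the detail of what counts as a day, month, or year lives in the referenced patterns rather than in the sentence-level expression.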

Pattern-Based Collation is especially useful when the expressions for the referenced extractors are subject to change. Using the above example, say we were working on a collection of documents containing ten unique Document Types that each present the date in a different verbal format, but always with the day, month, and year. We would build ten different "wrapper" extractors (one for each Document Type) and set each to Pattern-Based Collation. Each one has "Day," "Month," and "Year" selected under "Referenced Extractors." This way, our ten different contexts (our "wrappers") all rely on the same handful of extractors to pull the same data elements, and a change to one of those extractors carries through to every wrapper.
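
The reuse described above can be sketched in plain Python as several wrappers that draw on one shared set of patterns. This is an illustration under assumptions, not Grooper configuration: the Document Type names, the wrapper phrasings, the `SHARED` patterns, and the helper functions are all hypothetical.

```python
import re

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Shared "extractors" (assumed patterns, defined once and reused by every wrapper).
SHARED = {
    "Day": r"\d{1,2}(?:st|nd|rd|th)?",
    "Month": "(?:" + "|".join(MONTHS) + ")",
    "Year": r"\d{4}",
}

# One wrapper per Document Type; names and phrasings are made up for illustration.
WRAPPERS = {
    "Lease": "entered into this @Day day of @Month @Year",
    "Deed": "dated @Month @Day, @Year",
    "Affidavit": "on the @Day of @Month, @Year",
}

def compile_wrapper(wrapper, extractors):
    """Expand @Name tokens into named groups and compile the result."""
    expanded = re.sub(
        r"@(\w+)",
        lambda m: f"(?P<{m.group(1)}>{extractors[m.group(1)]})",
        wrapper,
    )
    return re.compile(expanded)

def extract_date(doc_type, text):
    """Return the day, month, and year found by the wrapper for doc_type, or None."""
    m = compile_wrapper(WRAPPERS[doc_type], SHARED).search(text)
    return m.groupdict() if m else None

print(extract_date("Lease", "entered into this 6th day of November 2016"))
print(extract_date("Deed", "This deed is dated November 6, 2016."))
```

If the month list or the day pattern needs to change, only `SHARED` is edited; every wrapper picks up the change the next time it is compiled.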

How To

In this example, using Pattern-Based Collation, we are going to extract the phrase "entered into this X day of Y Z" where "X" is the day, "Y" is the month, and "Z" is the year.

Creating the Parent and Child Objects

  1. Make a Data Type with child objects that extract different parts of the text segment you wish to return.
    • In this case we have three child objects that extract the Day, Month, and Year.
  2. Alternatively, you can reference other extractors in your project rather than having child objects. Just use the Referenced Extractors property to do so.


  1. The first child object in our example is extracting the day in our pattern.
  2. The Value Reader has been set to a pattern match and the pattern \d{1,2}th has been entered to collect "Xth" where X is a 1 or 2 digit number.
  3. On the page this Value Reader is returning "6th".


  1. The second child object is set to a List Match collecting the month.


  1. The last child object is set to a pattern match to collect 4 digit numbers, so it should capture the year.
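
As a rough analogy for what the three child objects return on their own, the sketch below runs equivalent checks in plain Python. The day pattern `\d{1,2}th` and the idea of a four-digit year pattern come from the steps above; the month list contents and the sample page text are assumptions, since the article does not show them, and none of this is Grooper's own API.

```python
import re

# Assumed page text, built around the phrase used in this article.
page_text = "This lease is entered into this 6th day of November 2016."

# Day: the pattern from the steps above ("Xth" where X is 1 or 2 digits).
day_pattern = re.compile(r"\d{1,2}th")

# Month: a simple list lookup standing in for a List Match (list contents assumed).
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Year: any 4-digit number.
year_pattern = re.compile(r"\d{4}")

days = day_pattern.findall(page_text)
months = [name for name in MONTHS if name in page_text]
years = year_pattern.findall(page_text)

print(days, months, years)
```

Each piece is returned separately at this stage; nothing yet ties the three results together into one phrase.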


Setting the Pattern-Based Collation Property

  1. Click on the parent Data Type.
  2. Click on the hamburger icon to the right of the Collation property.
  3. Select Pattern-Based from the drop-down list.


Entering in the Value Pattern

  1. Open up the Collation property and then click the ellipsis icon to the right of the Value Pattern property.


  1. Start writing your pattern in the "Value Pattern" window. When you get to the place where you need to use one of your child extractors, type in the @ symbol.
  2. An IntelliSense drop-down list will appear showing the extractors within the scope of the Data Type. Select the desired extractor from the list or finish typing its name.


  1. Finish writing your pattern, adding each child or referenced extractor using the @ symbol.
  2. Click "OK" in the top right corner of the window to save.


  1. Now the text segment "entered into this 6th day of November 2016" is being returned.
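
One way to see why the surrounding context matters is to compare what the pieces match alone with what the combined pattern matches. The following is a plain Python regex analogy, not Grooper behavior verified against the product; the page text is invented so that it contains extra numbers and a second date.

```python
import re

# Invented page text with distracting numbers and a second date.
page_text = (
    "Lease No. 4471. This agreement is entered into this 6th day of "
    "November 2016, and expires on the 30th of June 2021."
)

day = r"\d{1,2}th"
month = r"(?:January|February|March|April|May|June|July|August|September|October|November|December)"
year = r"\d{4}"

# On their own, the pieces match every candidate on the page.
lone_days = re.findall(day, page_text)
lone_years = re.findall(year, page_text)

# Wrapped in the sentence context, only the intended phrase is matched.
wrapped = re.compile(rf"entered into this {day} day of {month} {year}")
results = [m.group(0) for m in wrapped.finditer(page_text)]

print(lone_days)
print(lone_years)
print(results)
```

The standalone year pattern also picks up the lease number and the expiry year, while the wrapped pattern matches only the phrase that follows "entered into this".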