2023.1:Pattern-Based (Collation Provider)

From Grooper Wiki
{{AutoVersion}}

<blockquote>{{#lst:Glossary|Pattern-Based}}</blockquote>

{|class="download-box"
|
[[File:Asset 22@4x.png]]
|
You may download the ZIP files below and upload them into your own Grooper environment (version 2023.1). The first contains one or more '''Batches''' of sample documents. The second contains one or more '''Projects''' with resources used in examples throughout this article.
* [[Media:2023.1 Wiki Pattern-Based-(Collation-Provider) Batch.zip]]
* [[Media:2023.1 Wiki Pattern-Based-(Collation-Provider) Project.zip]]
|}


== About ==

''Pattern-Based'' '''''Collation''''' is a collation method for '''Data Type''' extractors that allows you to write a "wrapper" expression that can reference other extractors' results as variables.

Think of it as putting multiple extractors inside one RegEx pattern. When a '''Data Type''' set to ''Pattern-Based'' '''''Collation''''' has at least one child (or referenced) extractor, you can reference that extractor as a variable by preceding its name with an "@" in the pattern. (Typing "@" also brings up the intellisense prompt, which lists the extractors that can be referenced.)

''Pattern-Based'' '''''Collation''''' is well-suited to unstructured "natural language" documents. Because extractors are included as inline variables, you can define a more complex context (such as a sentence) surrounding the data you wish to extract.

Consider the following example:

[[File:2023.1 Pattern-Based-(Collation-Provider) 01 About 01.png]]


Let's say we wanted to collect the highlighted text:

"entered into this ___ day of _____________ _____"
   
   
Using ''Pattern-Based'' '''''Collation''''' with the appropriate child or referenced extractors, you could write one single "wrapper" pattern like:
   
   
<code>entered into this @Day day of @Month @Year</code>
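
One way to picture what the wrapper pattern does is to treat each "@" variable as a placeholder that gets swapped for the referenced extractor's own pattern. The Python sketch below is only an illustration of that idea, not Grooper's actual implementation. The <code>EXTRACTORS</code> dictionary, the <code>expand</code> function, the month list, and the sample text are all hypothetical stand-ins.

```python
import re

# Hypothetical stand-ins for the Day, Month, and Year extractors.
EXTRACTORS = {
    "Day": r"\d{1,2}th",
    "Month": (
        "January|February|March|April|May|June|July|"
        "August|September|October|November|December"
    ),
    "Year": r"\d{4}",
}


def expand(wrapper: str) -> str:
    """Swap each @Name token for that extractor's pattern, wrapped in a named group."""
    return re.sub(
        r"@(\w+)",
        lambda token: f"(?P<{token.group(1)}>{EXTRACTORS[token.group(1)]})",
        wrapper,
    )


wrapper = "entered into this @Day day of @Month @Year"
# Hypothetical sample text for illustration only.
text = "This lease is entered into this 6th day of November 2016 by the parties."

match = re.search(expand(wrapper), text)
result = match.group(0) if match else None
print(result)
```

In this sketch the literal words of the wrapper supply the context, while each variable contributes only the piece of the pattern it is responsible for.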
   
   
''Pattern-Based'' '''''Collation''''' is especially useful when the expressions for the referenced extractors are subject to change. Using the above example, say we were working on a collection of documents containing 10 unique '''Document Types''' that all present the date in a different verbal format, but always in a way that contains the day, month, and year. We could build ten different "wrapper" extractors (one for each '''Document Type''') and set them to ''Pattern-Based'' '''''Collation'''''. Each one has "Day," "Month," and "Year" selected under '''''Referenced Extractors'''''. This way, our ten different contexts (our "wrappers") all rely on the same handful of extractors to pull the same data elements, so a change to one of those extractors can be made in one centralized location rather than in all ten contexts.

== How To ==

In this example, using ''Pattern-Based'' '''''Collation''''', we are going to extract the phrase "entered into this X day of Y Z" where "X" is the day, "Y" is the month, and "Z" is the year.

<big>'''Creating the Parent and Child Objects'''</big>

# Make a '''Data Type''' with child objects that extract different parts of the text segment you wish to return.
#* In this case we have three child objects that extract the Day, Month, and Year.
# Alternatively, you can reference other extractors in your project rather than having child objects. Just use the '''''Referenced Extractors''''' property to do so.
 
[[File:2023.1 Pattern-Based-(Collation-Provider) 02 How-To 01.png]]
 
 
#<li value=3> The first child object in our example is extracting the day in our pattern.
# The '''Value Reader''' has been set to ''Pattern Match'' and the pattern <code>\d{1,2}th</code> has been entered to collect "Xth" where X is a 1- or 2-digit number.
# On the page this '''Value Reader''' is returning "6th".
 
[[File:2023.1 Pattern-Based-(Collation-Provider) 02 How-To 02.png]]
 
 
#<li value=6> The second child object is set to a ''List Match'' collecting the month.
 
[[File:2023.1 Pattern-Based-(Collation-Provider) 02 How-To 03.png]]
 
 
#<li value=7> The last child object is set to a ''Pattern Match'' to collect 4 digit numbers, so it should capture the year.
 
[[File:2023.1 Pattern-Based-(Collation-Provider) 02 How-To 04.png]]
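
To see what each child extractor contributes on its own, the three patterns described above can be approximated with plain regular expressions. This is a rough Python sketch rather than anything Grooper runs. The page text and the month list are assumptions made for illustration, and "4500" is included deliberately as a decoy.

```python
import re

# Hypothetical page text; "4500" is a decoy 4-digit number.
PAGE_TEXT = "Unit 4500: this agreement is entered into this 6th day of November 2016."

# Assumed contents of the month list used by the second child extractor.
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]

# First child: "Xth" where X is a 1- or 2-digit number.
day_hits = re.findall(r"\d{1,2}th", PAGE_TEXT)

# Second child: any entry from the month list.
month_hits = re.findall("|".join(MONTHS), PAGE_TEXT)

# Third child: any 4-digit number, whether or not it is a year.
year_hits = re.findall(r"\d{4}", PAGE_TEXT)

print(day_hits, month_hits, year_hits)
```

Note that the 4-digit pattern also picks up the decoy number. On their own the child extractors are intentionally loose, and it is the surrounding wrapper pattern that narrows the results to the date phrase.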
 
 
<big>'''Setting the Pattern-Based Collation Property'''</big>
 
# Click on the parent '''Data Type'''.
# Click on the hamburger icon to the right of the '''''Collation''''' property.
# Select ''Pattern-Based'' from the drop-down list.
 
[[File:2023.1 Pattern-Based-(Collation-Provider) 02 How-To 05.png]]
 
 
<big>'''Entering in the Value Pattern'''</big>
 
# Open up the '''''Collation''''' property and then click the ellipsis icon to the right of the '''''Value Pattern''''' property.
 
[[File:2023.1 Pattern-Based-(Collation-Provider) 02 How-To 06.png]]
 
 
#<li value=2> Start writing your pattern in the "Value Pattern" window. When you get to the place where you need to use one of your child extractors, type in the @ symbol.
# An intellisense drop-down list will appear showing the extractors within the scope of the '''Data Type'''. Select the desired extractor from the list or finish typing its name.
 
[[File:2023.1 Pattern-Based-(Collation-Provider) 02 How-To 07.png]]
 
 
#<li value=4> Finish writing your pattern, adding each child or referenced extractor using the @ symbol.
# Click "OK" in the top right corner of the window to save.
 
[[File:2023.1 Pattern-Based-(Collation-Provider) 02 How-To 08.png]]
 
 
#<li value=6> Now the text segment "entered into this 6th day of November 2016" is being returned.
 
[[File:2023.1 Pattern-Based-(Collation-Provider) 02 How-To 09.png]]
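
The finished Value Pattern can be approximated outside of Grooper by writing out by hand what the three variables stand for. The Python sketch below assumes the child patterns described in the steps above and a month list spelled out in full. The page text is hypothetical and contains extra numbers that the child extractors alone would also match.

```python
import re

# Hand-expanded approximation of: entered into this @Day day of @Month @Year
VALUE_PATTERN = re.compile(
    r"entered into this (?P<Day>\d{1,2}th) day of "
    r"(?P<Month>January|February|March|April|May|June|July|"
    r"August|September|October|November|December) "
    r"(?P<Year>\d{4})"
)

# Hypothetical page text with a decoy number and a decoy ordinal.
PAGE_TEXT = (
    "Unit 4500: this agreement is entered into this 6th day of "
    "November 2016 and renews on the 12th."
)

# Every full match of the wrapper, plus the pieces captured by each variable.
results = [m.group(0) for m in VALUE_PATTERN.finditer(PAGE_TEXT)]
parts = VALUE_PATTERN.search(PAGE_TEXT).groupdict()

print(results)
print(parts)
```

Only the phrase that fits the whole wrapper is returned, even though "4500" and "12th" would each satisfy one of the child patterns in isolation.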

Latest revision as of 10:04, 22 November 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Pattern-Based is a Collation Provider option for Data Type extractors. Pattern-Based uses regular expressions to sequence returned results into a final result set.
