Pattern-Based (Collation Provider)

From Grooper Wiki
WIP This article is a work-in-progress and may abruptly stop in the middle of a section.

Pattern-Based collation uses a regular expression to select a sequence of child or referenced extractor results (and return the text in between them).

About

  • Allows you to use extractor results as inline variables within a “wrapper” expression.
  • Child extractors are called as @-variables, with intellisense support.
  • Useful when an array isn't specific enough, when lookarounds get tricky, or when a fuzzy workflow needs infinite quantifiers.
  • An alternative to Split collation that allows you to include the start and end keys.

Pattern-Based Collation is a collation method for Data Type extractors that allows you to write a "wrapper" expression that can reference other extractors' results as variables.

Think of it as putting multiple extractors inside one RegEx pattern. When a node set to Pattern-Based Collation has at least one referenced (or child) extractor, you can reference that extractor as a variable by preceding its name with an "@" in the pattern. (Typing "@" also brings up the intellisense prompt, which lists any child extractors that can be referenced.)

Combining Regular RegEx and Fuzzy RegEx approaches: When working with OCR results, a user can set a referenced or child extractor to Fuzzy RegEx mode and wrap it in an expression that uses an infinite quantifier (infinite quantifiers are disallowed in Fuzzy RegEx mode itself).

Pattern-Based Collation is well-suited to natural language documents: since extractors are included as inline variables, users can define a more complex context (such as a sentence) surrounding the data they wish to extract. Consider the phrase:

"On this day, __________, the _____ day of _________, the year ______..."

You could create four extractors, each with its own lookaheads and lookbehinds. Alternatively, using Pattern-Based Collation, you could write a single "wrapper" pattern like:

"On this day\, @Weekday\, the @Day day of @Month\, the year @Year\.\.\."
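
The idea of inline extractor variables can be sketched outside Grooper in plain Python. This is only an illustration of the concept, not Grooper's implementation: the expand_wrapper helper, the CHILD_PATTERNS dictionary, and the child patterns themselves are assumptions made up for this example.

```python
import re

# Hypothetical child extractor patterns (illustrative only, not Grooper's).
CHILD_PATTERNS = {
    "Weekday": r"(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day",
    "Day": r"\d{1,2}(?:st|nd|rd|th)",
    "Month": r"(?:January|February|March|April|May|June|July"
             r"|August|September|October|November|December)",
    "Year": r"\d{4}",
}


def expand_wrapper(wrapper, children):
    """Replace each @Name in the wrapper with a named group
    containing that child's pattern (hypothetical helper)."""
    def substitute(match):
        name = match.group(1)
        return "(?P<{}>{})".format(name, children[name])
    return re.sub(r"@(\w+)", substitute, wrapper)


# The wrapper supplies the sentence context; the children supply the data.
WRAPPER = r"On this day, @Weekday, the @Day day of @Month, the year @Year"
pattern = re.compile(expand_wrapper(WRAPPER, CHILD_PATTERNS))

text = "On this day, Tuesday, the 4th day of March, the year 2014..."
result = pattern.search(text).groupdict()
print(result)
```

Each child pattern stays short because the surrounding sentence lives in the wrapper, so none of the children needs its own lookahead or lookbehind.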

Pattern-Based Collation is especially useful in contexts where the expressions for the referenced extractors are subject to change or regular updates, or are still in development. Using the above example, say we were working on a collection of documents that contained 10 unique Document Types that all presented the date in a different verbal format, but always in a way that it contained the weekday, day, month, and year. So we build ten different "wrapper" extractors, and set them to Pattern-Based Collation. Each one has "Weekday," "Day," "Month," and "Year" selected under "referenced extractors." This way, our ten different contexts (our "wrappers") all rely on the same handful of extractors to pull the same data elements. So when the Global Timelord Society decides to change Earth over to a 9-day week or to add a 13th month, you can update the extractors in one centralized location rather than in all ten contexts.
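
The benefit of a centralized update can be sketched in plain Python as well. Again, this is a rough illustration under assumptions, not how Grooper works internally: the CHILDREN dictionary, the two wrappers, the helper functions, and the made-up 13th month name are all hypothetical.

```python
import re

# Hypothetical shared child patterns: the one place that needs updating.
CHILDREN = {
    "Day": r"\d{1,2}(?:st|nd|rd|th)?",
    "Month": r"(?:January|February|March|April|May|June|July"
             r"|August|September|October|November|December)",
    "Year": r"\d{4}",
}

# Two hypothetical wrappers standing in for the different contexts.
WRAPPERS = [
    r"the @Day day of @Month, the year @Year",
    r"dated @Month @Day, @Year",
]


def compile_wrappers(wrappers, children):
    """Expand every @Name into a named group using the shared children."""
    compiled = []
    for wrapper in wrappers:
        expanded = re.sub(
            r"@(\w+)",
            lambda m: "(?P<{}>{})".format(m.group(1), children[m.group(1)]),
            wrapper,
        )
        compiled.append(re.compile(expanded))
    return compiled


def extract(text, patterns):
    """Return the named groups of the first wrapper that matches, else None."""
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            return match.groupdict()
    return None


text = "dated Undecimber 3, 2014"  # made-up 13th month

before = extract(text, compile_wrappers(WRAPPERS, CHILDREN))

# One edit to the shared Month pattern; no wrapper is touched.
CHILDREN["Month"] = CHILDREN["Month"][:-1] + "|Undecimber)"
after = extract(text, compile_wrappers(WRAPPERS, CHILDREN))

print(before)
print(after)
```

Because every wrapper is expanded from the same dictionary, the single change to the Month pattern reaches all contexts the next time they are compiled.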