2023.1:Pattern-Based (Collation Provider)

Revision as of 11:49, 28 August 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

Pattern-Based is a Collation Provider option for Data Type extractors. Pattern-Based uses regular expressions to sequence returned results into a final result set.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with the resources used in examples throughout this article.

About

Pattern-Based Collation is a collation method for Data Type extractors that allows you to write a "wrapper" expression that can reference other extractors' results as variables.

Think of it as putting multiple extractors inside one RegEx pattern. When a Data Type set to Pattern-Based Collation has at least one child (or referenced) extractor, you can reference that extractor as a variable by preceding its name with an "@" in the pattern. (Typing "@" also brings up the IntelliSense prompt, which lists any child extractors that can be referenced.)

Pattern-Based Collation is well-suited to unstructured "natural language" documents. Since extractors are included as inline variables, you can define a more complex context (such as a sentence) surrounding the data you wish to extract.

Consider the following example:

Let's say we wanted to collect the highlighted text:

"entered into this ___ day of _____________ _____"

Using Pattern-Based Collation with the appropriate child or referenced extractors, you could write one single "wrapper" pattern like:

entered into this @Day day of @Month @Year
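As a rough analogy only (this is not how Grooper implements it), the wrapper can be pictured as a template whose @ references are swapped for the child extractors' own patterns. The Python sketch below assumes three hypothetical child patterns and a made-up sample sentence; CHILD_PATTERNS and expand are illustrative names, not Grooper features.

```python
import re

# Hypothetical child extractor patterns. The keys match the @ references
# used in the wrapper; the patterns themselves are assumptions.
CHILD_PATTERNS = {
    "Day": r"\d{1,2}th",
    "Month": (
        r"January|February|March|April|May|June|July|"
        r"August|September|October|November|December"
    ),
    "Year": r"\d{4}",
}


def expand(value_pattern):
    """Swap each @Name for a named group holding that child's pattern."""
    return re.sub(
        r"@(\w+)",
        lambda m: f"(?P<{m.group(1)}>{CHILD_PATTERNS[m.group(1)]})",
        value_pattern,
    )


wrapper = expand("entered into this @Day day of @Month @Year")
sample = "This Agreement is entered into this 6th day of November 2016 by the parties."
match = re.search(wrapper, sample)

print(match.group(0))     # entered into this 6th day of November 2016
print(match.groupdict())  # {'Day': '6th', 'Month': 'November', 'Year': '2016'}
```

The whole phrase comes back as one result, while the named groups show which part each child contributed.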

Pattern-Based Collation is especially useful when the expressions for the referenced extractors are subject to change. Using the above example, say we are working on a collection of documents containing 10 unique Document Types that each present the date in a different verbal format, but always with a day, month, and year. We build ten different "wrapper" extractors (one for each Document Type) and set them to Pattern-Based Collation. Each one has "Day," "Month," and "Year" selected under Referenced Extractors. This way, our ten different contexts (our "wrappers") all rely on the same handful of extractors to pull the same data elements, so a change to one of those extractors only has to be made in one place.
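The benefit of sharing extractors across wrappers can be sketched in Python. This is an analogy under stated assumptions, not Grooper's mechanism: the two Document Type names ("Lease" and "Deed"), their wrapper patterns, and the sample sentences are all invented for illustration.

```python
import re

# One shared set of hypothetical "referenced extractors".
SHARED = {
    "Day": r"\d{1,2}th",
    "Month": (
        r"January|February|March|April|May|June|July|"
        r"August|September|October|November|December"
    ),
    "Year": r"\d{4}",
}

# One wrapper per (invented) Document Type, each with its own wording.
WRAPPERS = {
    "Lease": "entered into this @Day day of @Month @Year",
    "Deed": "dated @Month @Day, @Year",
}


def build(value_pattern):
    """Compile a wrapper, replacing each @Name with the shared pattern."""
    expanded = re.sub(
        r"@(\w+)",
        lambda m: f"(?P<{m.group(1)}>{SHARED[m.group(1)]})",
        value_pattern,
    )
    return re.compile(expanded)


samples = {
    "Lease": "This lease is entered into this 6th day of November 2016.",
    "Deed": "This deed, dated November 6th, 2016, conveys the property.",
}

results = {
    doc_type: build(WRAPPERS[doc_type]).search(text).groupdict()
    for doc_type, text in samples.items()
}
print(results["Lease"])  # {'Day': '6th', 'Month': 'November', 'Year': '2016'}
print(results["Deed"])   # {'Month': 'November', 'Day': '6th', 'Year': '2016'}
```

Editing an entry in SHARED (for example, widening the Day pattern) changes what every wrapper accepts without touching the wrappers themselves.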

How To

In this example, using Pattern-Based Collation, we are going to extract the phrase "entered into this X day of Y Z" where "X" is the day, "Y" is the month, and "Z" is the year.

Creating the Parent and Child Objects

Make a Data Type with child objects that extract different parts of the text segment you wish to return.
- In this case we have three child objects that extract the Day, Month, and Year.
Alternatively, you can reference other extractors in your project rather than creating child objects. Use the Referenced Extractors property to do so.

The first child object in our example is extracting the day in our pattern.
The Value Reader has been set to Pattern Match, and the pattern \d{1,2}th has been entered to collect "Xth", where X is a one- or two-digit number.
On the page this Value Reader is returning "6th".
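To see what the day pattern does and does not accept, it can be tried outside Grooper. The sketch below uses Python's re module as a stand-in (Grooper's own regex engine may differ in other respects, though not for a pattern this simple), and the sample strings are invented.

```python
import re

# The day pattern from the walkthrough above.
DAY = re.compile(r"\d{1,2}th")

samples = ["6th", "11th", "1st", "22nd", "3rd"]
matched = [s for s in samples if DAY.fullmatch(s)]

print(matched)  # ['6th', '11th']
```

Ordinals ending in "st", "nd", or "rd" are not matched. If the documents contain dates such as "1st" or "22nd", a broader pattern like \d{1,2}(?:st|nd|rd|th) would be one way to cover them.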

The second child object is set to a List Match collecting the month.

The last child object is set to Pattern Match to collect four-digit numbers, so it should capture the year.
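The month and year extractors can be approximated the same way. In this Python sketch the month list and the \d{4} year pattern are assumptions (the article does not show either), and the sample text is invented to include a second four-digit number.

```python
import re

# Assumed contents of the month list.
MONTHS = [
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
]
month_re = re.compile("|".join(re.escape(m) for m in MONTHS))

# Assumed year pattern: any run of four digits.
year_re = re.compile(r"\d{4}")

text = "Suite 1200, entered into this 6th day of November 2016"
months = month_re.findall(text)
years = year_re.findall(text)

print(months)  # ['November']
print(years)   # ['1200', '2016']
```

On its own, the year pattern also returns "1200". This is where the surrounding wrapper pattern earns its keep: only a four-digit number that follows a month in the expected phrase ends up in the final result.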

Setting the Pattern-Based Collation Property

Click on the parent Data Type.
Click on the hamburger icon to the right of the Collation property.
Select Pattern-Based from the drop-down.

Entering in the Value Pattern

Open up the Collation property and then click the ellipsis icon to the right of the Value Pattern property.

Start writing your pattern in the "Value Pattern" window. When you get to the place where you need to use one of your child extractors, type the @ symbol.
An IntelliSense drop-down will appear listing the extractors within the scope of the Data Type. Select the desired extractor from the drop-down or finish typing its name.

Finish writing your pattern, adding each child or referenced extractor using the @ symbol.
Click "OK" in the top right corner of the window to save.

Now the text segment "entered into this 6th day of November 2016" is being returned.
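One way to picture what "sequencing returned results" means here: each child extractor runs on its own first, and the Value Pattern then accepts only those hits that line up, in order, with the text written between the @ references. The Python sketch below is a simplified model built on that assumption, not Grooper's actual algorithm; CHILDREN, child_hits, and collate are hypothetical names, and the sample sentence is invented.

```python
import re

# Hypothetical stand-ins for the three child extractors in this walkthrough.
CHILDREN = {
    "Day": r"\d{1,2}th",
    "Month": (
        r"January|February|March|April|May|June|July|"
        r"August|September|October|November|December"
    ),
    "Year": r"\d{4}",
}


def child_hits(text):
    """Run every child on its own; map (name, start) to possible end offsets."""
    hits = {}
    for name, pattern in CHILDREN.items():
        for m in re.finditer(pattern, text):
            hits.setdefault((name, m.start()), []).append(m.end())
    return hits


def collate(value_pattern, text):
    """Return segments where wrapper text and child hits line up in order."""
    # Even indexes hold regex text, odd indexes hold child names.
    parts = re.split(r"@(\w+)", value_pattern)
    hits = child_hits(text)

    def walk(i, pos):
        if i == len(parts):
            return pos
        if i % 2 == 0:  # regex text between references
            m = re.compile(parts[i]).match(text, pos)
            return walk(i + 1, m.end()) if m else None
        for end in hits.get((parts[i], pos), []):  # a child hit must start here
            result = walk(i + 1, end)
            if result is not None:
                return result
        return None

    results = []
    for start in range(len(text)):
        end = walk(0, start)
        if end is not None:
            results.append(text[start:end])
    return results


sample = "This Agreement is entered into this 6th day of November 2016 by the parties."
print(collate("entered into this @Day day of @Month @Year", sample))
# ['entered into this 6th day of November 2016']
```

A child hit that does not sit in the right place relative to the wrapper text (a stray four-digit number elsewhere on the page, for instance) never makes it into the result.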

Glossary

Batch Folder: The Batch Folder is an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or Batch Page nodes as children.

Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Behavior: A "Behavior" is one of several features applied to a Content Type (such as a Document Type). Behaviors affect how certain Activities and Commands are executed, based on how a document (Batch Folder) is classified. Documents behave differently according to their Document Type. This includes how they are exported (how Export behaves), if and how they are added to a document search index (how the various indexing commands behave), and if and how Label Sets are used (how Classify and Extract behave in the presence of Label Sets).

Each Behavior is enabled by adding it to a Content Type. They are configured in the Behaviors editor.
Behaviors extend to descendant Content Types if the descendant Content Type has no Behavior configuration of its own.
- For example, all Document Types will inherit their parent Content Model's Behaviors.
- However, if a Document Type has its own Behavior configuration, it will be used instead.

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Content Category: A Content Category is a container for other Content Category or Document Type nodes in a Content Model. Content Categories are often used simply as organizational buckets for Content Models with large numbers of Document Types. However, Content Categories are also necessary to create branches in a Content Model's classification taxonomy, allowing for more complex Data Element inheritance and Behavior inheritance.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Content Type: Content Types are a class of node types used to classify Batch Folders. They represent categories of documents (Content Models and Content Categories) or distinct types of documents (Document Types). Content Types serve an important role in defining Data Elements and Behaviors that apply to a document.

Data Element: Data Elements are a class of node types used to collect data from a document. These include: Data Models, Data Sections, Data Fields, Data Tables, and Data Columns.

Data Extractor: Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift, and manipulate results further.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

- They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
- The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
- The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Extractor Type:

List Match: List Match is a Value Extractor designed to return values matching one or more items in a defined list. By default, the List Match extractor does not use or require regular expressions, but it can be configured to use regular expression syntax.

Pattern Match: Pattern Match is a Value Extractor that extracts values from a document that match a specified regular expression, providing data collection following a known format or pattern.

Pattern-Based: Pattern-Based is a Collation Provider option for Data Type extractors. Pattern-Based uses regular expressions to sequence returned results into a final result set.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects, such as Content Models, Batch Processes, and profile objects, are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Reference: Reference is a Value Extractor used to reference an Extractor Node. This allows users to create re-usable extractors and use the more complex Data Type and Field Class extractors throughout Grooper.

Value Reader: Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions.)

Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.