2023.1:AND (Collation Provider)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


AND is a Collation Provider option for Data Type extractors. AND returns results only when each of its referenced or child extractors gets at least one hit, thus acting as a logical “AND” operator across multiple extractors.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains a Project with resources used in examples throughout this article. The second contains one or more Batches of sample documents.

Glossary

AND: AND is a Collation Provider option for Data Type extractors. AND returns results only when each of its referenced or child extractors gets at least one hit, thus acting as a logical “AND” operator across multiple extractors.

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as Content Models, Batch Processes, and profile objects are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Data Extraction: Data Extraction involves identifying and capturing specific information from documents (represented by Batch Folders in Grooper). Extraction is performed by configurable Data Extractors, which transform unstructured or semi-structured data into a structured, usable format for processing and analysis.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Node Tree: The Node Tree is the hierarchical list of Grooper node objects found in the left panel in the Design Page. It is the basis for navigation and creation in the Design Page.

Value Reader: Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions).

  • Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

About

The AND Collation Provider returns a result if and only if each of its child extractors returns a result. When every child extractor returns a result, those individual results are passed up to the parent Data Type, where the AND Collation Provider returns them as one combined value.
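The behavior described above can be modeled outside Grooper with a short Python sketch. This is only a conceptual illustration, not Grooper's implementation or API: the names `pattern_extractor` and `and_collate` are hypothetical, each toy "extractor" is a regular expression, and the combined value is represented as a plain list of hits.

```python
import re
from typing import Callable, List, Optional

# Hypothetical model: an "extractor" is any callable that takes document
# text and returns a list of hits. This is not a Grooper type.
Extractor = Callable[[str], List[str]]


def pattern_extractor(pattern: str) -> Extractor:
    """Build a toy extractor that returns every regex match in the text."""
    compiled = re.compile(pattern)
    return lambda text: [m.group(0) for m in compiled.finditer(text)]


def and_collate(text: str, extractors: List[Extractor]) -> Optional[List[str]]:
    """Return the combined hits only if every extractor produced a hit."""
    results = [extract(text) for extract in extractors]
    if not all(results):  # an empty list means that extractor found nothing
        return None
    # Every extractor hit, so pass all hits up as one combined result.
    return [hit for hits in results for hit in hits]


invoice_words = pattern_extractor(r"Invoice")
date_values = pattern_extractor(r"\d{2}/\d{2}/\d{4}")

print(and_collate("Invoice dated 01/15/2023", [invoice_words, date_values]))
# ['Invoice', '01/15/2023']
print(and_collate("Invoice with no date", [invoice_words, date_values]))
# None
```

In the second call, one extractor finds nothing, so the whole collation returns no result even though the other extractor matched.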



The AND Collation Provider can also be configured to return a result when only a certain number of child extractors return a result. This is detailed in the Minimum Hits section below. For example, when configured accordingly, the AND Collation Provider will still return results if only two out of three child extractors return a result.

Setting Up the Collation Provider

Setting up for Data Extraction

To begin, you must set the Collation Provider on the Data Type.

  1. To do so, select a Data Type in the Node Tree.
  2. Under General properties, select Collation and select the hamburger icon at the far-right of the property to expand the drop-down menu.
  3. Select AND.



Once that's done, be sure to add child extractors. These can be Data Types or Value Readers, as long as they return a result that can be passed up to the parent Data Type.

Minimum Hits

Be aware of a property called "Minimum Hits", which can be found by expanding the AND Collation Provider. By default, this property is set to zero, which means every extractor must produce a hit in order to get a positive result. If Minimum Hits is changed to, say, 2, then at least two of the extractors must produce a hit in order to get a result, however many extractors there are. Be cautious, as changing the Minimum Hits property could skew results.
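The effect of Minimum Hits can be sketched in Python as follows. This is a conceptual model under stated assumptions, not Grooper code: the function name `and_collate_min_hits` is hypothetical, a value of zero is assumed to mean "every extractor must hit", and any other value is assumed to be the number of extractors that must produce at least one hit.

```python
from typing import List, Optional


def and_collate_min_hits(
    results: List[List[str]], minimum_hits: int = 0
) -> Optional[List[str]]:
    """Model AND collation with a Minimum Hits threshold.

    `results` holds one list of hits per extractor.
    Assumption: minimum_hits == 0 means every extractor must hit.
    """
    extractors_with_hits = sum(1 for hits in results if hits)
    required = len(results) if minimum_hits == 0 else minimum_hits
    if extractors_with_hits < required:
        return None
    # Enough extractors hit, so return all hits as one combined result.
    return [hit for hits in results for hit in hits]


# Three extractors, but only two of them found anything.
two_of_three = [["Invoice"], ["01/15/2023"], []]

print(and_collate_min_hits(two_of_three))
# None
print(and_collate_min_hits(two_of_three, minimum_hits=2))
# ['Invoice', '01/15/2023']
```

The second call shows why the setting needs care: lowering the threshold lets a result through even though one extractor found nothing.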


Setting up for Classification

How does Classification play into the AND Collation Provider? Since the AND Collation Provider relies on positive hits to extract data, you can make use of it through the Positive Extractor property on a Document Type. Simply reference the Data Type configured with the AND Collation Provider as your Positive Extractor.
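As a rough illustration of the idea, the Python sketch below assigns a document type only when all of that type's patterns are found in the text. It is a simplified model, not how Grooper performs classification: the `classify` function, the sample patterns, and the "first matching type wins" rule are all assumptions made for the example.

```python
import re
from typing import Dict, List, Optional


def all_patterns_hit(text: str, patterns: List[str]) -> bool:
    """AND-style check: every pattern must match somewhere in the text."""
    return all(re.search(pattern, text) for pattern in patterns)


def classify(text: str, positive_patterns: Dict[str, List[str]]) -> Optional[str]:
    """Return the first document type whose patterns all hit, else None."""
    for document_type, patterns in positive_patterns.items():
        if all_patterns_hit(text, patterns):
            return document_type
    return None


# Hypothetical document types, each with an AND-style set of patterns.
document_types = {
    "Invoice": [r"Invoice", r"Amount Due"],
    "Purchase Order": [r"Purchase Order", r"Ship To"],
}

print(classify("Purchase Order 1001, Ship To: Warehouse 4", document_types))
# Purchase Order
print(classify("Invoice 2002 with no totals", document_types))
# None
```

Requiring every pattern to hit makes the positive match stricter than a single keyword would, which is the appeal of using AND for this purpose.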



What Does This Mean for Classification?

So, how exactly can an AND Collation Provider assist with Classification? Simply put, it is a tool that can be referenced on a Positive Extractor to help Grooper identify certain Document Types.