Object Nomenclature (Concept): Difference between revisions

From Grooper Wiki
 
(4 intermediate revisions by the same user not shown)
Line 16: Line 16:
</div>
</div>
== Batch Objects ==
== Batch Objects ==
In '''Grooper''', "Batch Objects" represent the hierarchical structure of documents being processed.  These objects consist of:
{{#lst:Batch Object|Batch Objects}}
: {{BatchIcon}} '''[[Batch]]es'''
: {{BatchFolderIcon}} '''[[Batch Folder]]s'''
: {{BatchPageIcon}} '''[[Batch Page]]s'''
Each serves a distinct function within this hierarchy but are also fundamentally related to each other.
 
The relationship between these objects is hierarchical in nature.
* The '''Batch''' object is the top level
* Each '''Batch''' contains: '''Batch Folders''' and '''Batch Pages'''
** '''Batch Folders''' may contain further '''Batch Folders''' to represent subfolders or grouped documents.
** Or, '''Batch Folders''' will contain '''Batch Pages''' to represent individual pages of documents.
 
This structured approach allows '''Grooper''' to efficiently manage and process documents at various levels of granularity — from a full batch down to individual pages.
<div style="padding-left: 1.5em">
=== Related Objects ===
<div style="padding-left: 1.5em">
==== Batch ====
{{#lst:Glossary|Batch}}
 
==== Batch Folder ====
{{#lst:Glossary|Batch Folder}}
 
==== Batch Page ====
{{#lst:Glossary|Batch Page}}
 
They are created in one of two ways:
* Physical pages can be acquired in '''Grooper''' by scanning them via the '''[[Desktop Scanning in Grooper|Grooper Desktop]]''' application.
* Digital documents are acquired in '''Grooper''' as whole objects and represented as '''Batch Folders'''. Applying the [[Split Pages]] activity on a '''Batch Folder''' that represents a digital document will expose '''Batch Page''' objects as direct children.
'''Batch Pages''' allow '''Grooper''' to process and store information at the page level, which is essential for operations that include [[Image Processing]] and recognition of text (see [[Recognize]]). They enable the system to manage and process each page independently. This is critical for workflows that require detailed page-specific actions or for '''Batches''' composed of documents with different processing requirements per page.
</div>
</div>


== Organization Nodes ==
== Organization Nodes ==
Line 85: Line 55:


== Data Elements ==
== Data Elements ==
{{#lst:Category:Data Element|Data Element Objects}}
{{#lst:Data Element|Data Element Objects}}


== Extractor nodes ==
== Extractor Nodes ==
{{#lst:Data Extractor (Concept)|Extractor Objects}}
{{#lst:Extractor Node|Extractor Nodes}}


== Connection nodes==
== Connection nodes==

Latest revision as of 16:34, 6 August 2025

A Grooper environment consists of many interrelated objects.

The Grooper Wiki's Object Nomenclature defines how Grooper users categorize and refer to different types of Node Objects in a Grooper Repository. Knowing what objects can be added to the Grooper Node Tree and how they are related is a critical part of understanding Grooper itself.

About

Understanding Grooper involves recognizing how different objects serve similar functions and therefore be grouped together based on their shared functionalities. Disparate objects often perform analogous tasks, albeit with differing characteristics or representations.

By discerning commonalities in functionality across diverse objects, users can streamline their approach to data processing and analysis within Grooper. Rather than treating each object in isolation, users can categorize them based on their functional similarities, thus simplifying management and enhancing efficiency.

This approach fosters a more holistic understanding of Grooper, empowering users to devise more effective strategies for data extraction, classification, and interpretation. By recognizing the underlying functional relationships between objects, users can optimize workflows, improve accuracy, and derive deeper insights from their data.

High Level Overview

This article is meant to be a high level overview of many objects in Grooper and how they're related.

  • Primarily, this article is focused on "Node Types" but will include other Grooper objects when appropriate.
  • If you need more specific information on a particular object, please click the hyperlink for that specific object (as listed in the category's "Related Objects" section) to be taken to an article giving more information on that object.

Batch Objects

Types of Batch Objects

There are three primary types of Batch Objects in Grooper:

  • inventory_2 Batch: The root object representing the entire batch of documents.
  • folder Batch Folder: Used to organize documents and subfolders within a Batch.
    • Batch Folders represent a "document" when either (1) they have child Batch Pages (representing pages of a document) or (2) they have a file attached to the Batch Folder (typically this occurs when files are imported from an Import Job).
  • contract Batch Page: Represents an individual page of content, such as a scanned image or split page of an imported file.

Each type inherits common functionality from the Batch Object base, while also providing specialized properties and commands for their specific roles.

Hierarchical organization

Batch Objects are organized in a tree structure:

  • The root of the tree is the Batch, which contains one or more Batch Folders and/or Batch Pages.
  • Each Batch Folder can contain additional Batch Folders or Batch Pages. Batch Folders typically represent documents but can be used as folders in the Batch as well.
  • Batch Pages represent individual pages of content, such as scanned images or imported files.

This hierarchy allows Grooper to manage complex document sets, supporting nested folders and multi-page documents within a single Batch.

Example Batch hierarchy

inventory_2 Batch
folder Batch Folder
contract Batch Page
contract Batch Page
folder Batch Folder
contract Batch Page
contract Batch Page

Participation in Batch processing

Batch Objects participate in all stages of Batch processing, including:

  • Classification
  • Data extraction
  • Export
  • Review and exception handling

Related Objects

Batch

inventory_2 Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called settings Batch Processes. Documents and their pages are represented in Batches by a hierarchy of folder Batch Folders and contract Batch Pages.

Batches are foundational to Grooper's document processing. Production Batches are assigned Batch Processes on creation which control all aspects of a document processing workflow.

How Batches are created

Batches are created in one of three typical ways:

  • For scanned documents: from the Batches or Tasks Page
    Physical pages are acquired in Grooper by scanning them via a Scan Viewer in a Review step. First, a Batch is created with the "Create New Batch" button in the Batches Page or Tasks Page
  • For imported documents: from Import Jobs
    Digital documents are acquired in Grooper from Import Jobs. Import Jobs are either summited by an Import Watcher service or manually from the Imports Page. Batches are created according to the Import Job's Import Provider settings.
  • For test Batches: from the Design Page
    Test Batches are created manually by Design Page users in the "Test" branch of the "Batches" folder. Right click any folder in the Test branch to add a new test Batch.

Batch Folder

The folder Batch Folder is an organizational unit within a inventory_2 Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or contract Batch Page nodes as children.

  • Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Batch Folders are critical to how Grooper represents documents. They are critical to document classification, data extraction and export operations. Documents are processed by executing Grooper Activities and Commands executed at the "document level" in a Batch Process (meaning the Batch Folder level that form "documents" and not subfolders in a Batch). Batch Folders also store information at the folder level, including files attached to Batch Folders created on import.

How Batch Folders are created

Batch Folders are created in one of three typical ways:

  • When pages are separated
    Loose pages are organized into documents by the Separate activity (or Separation Profiles at scan time in a Scan Viewer). When separation occurs, a Separation Provider identifies Batch Pages that qualify as the first page of a document. Then, Batch Folders are created for each identified document and each span of Batch Pages are placed in each Batch Folder.
  • When files are imported
    " Digital documents are acquired in Grooper from Import Jobs. Import Jobs are either summited by an Import Watcher service or manually from the Imports Page. For each imported file, a Batch Folder is created and the file is attached to it.
  • When files are dragged into a test Batch
    When testing configurations from the Design page, digital files can be quickly added to a test Batch by simply dragging it from your computer to a "Test Source" panel. You will find a Test Source panel in any "Tester" tab in the Design page.

Batch Page

contract Batch Page nodes represent individual pages within a inventory_2 Batch. Batch Pages are created in one of two ways: (1) When images are scanned into a Batch using the Scan Viewer. (2) Or, when split from a PDF or TIFF file using the Split Pages activity.

  • Batch Pages are frequently referred to simply as "pages".

Batch Pages allow Grooper to process and store information at the page level, which is essential for operations that include Image Processing and text recognition (see Recognize). They enable the system to manage and process each page independently. This is critical for workflows that require page-specific actions and to take fullest advantage of Grooper's parallel processing capabilities.

How Batch Pages are created

Batch Pages are created in one of two typical ways:

  • They're scanned from paper pages
    Physical pages are acquired in Grooper by scanning them via the Scan Viewer.
  • They're "split" out of digital files
    Digital documents are acquired in Grooper from Import Jobs. For each imported file, a Batch Folder is created and the file is attached to it. Applying the Split Pages activity on a Batch Folder will create individual Batch Pages for each page in the file attached to the Batch Folder (must be a valid file type: PDF, TIFF or other supported image type).


Organization Nodes

"Organization nodes" refer to nodes in the Grooper node tree used to organize other nodes. These nodes include:

package_2 Projects
folder_open Folder nodes in the node tree
folder_data Local Resources Folders

Organization nodes are used to store resources at one level of the node tree or another. They may be very generic organizational tools (in the case of Folder nodes) or more specialized (in the case of Projects and Local Resources Folders).

  • "Folder" refers to both:
    • The main six folder nodes in the node tree (Batches, Projects, Processes, Queues, File Stores and Machines)
    • Any folder node added to the node tree to organize other nodes.
  • Project nodes are the primary containers for design components in Grooper.
    • They store node resources used to process document content (like a Content Model).
    • They do not store document content itself. Projects do not store Batches, Batch Folders, and Batch Pages (Those are stored in the "Batches" branch of the node tree).
  • Local Resources Folder nodes are specialized folders added to Content Types (e.g. Content Models and Document Types). They house node resources utilized for their parent Content Type and its decedents (like Data Rules and Data Types).

Related Objects

Folder

folder_open Folder objects refer to the various kinds of organizational folders in the node tree. Please do not confuse "Folder" with "Batch Folder". These are two different things. A Batch Folder is an integral part of the Batch hierarchy and used to represent a "document" in Grooper. A Folder is just a folder.

Local Resources Folder

folder_data A Local Resources Folder is a container object that can only be added to Content Types (e.g. Content Models and Document Types). It is a specialized folder that houses resources used for that Content Type and its Data Model such as Data Rules and Data Types.

Project

package_2 Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as stacks Content Models, settings Batch Processes, profile objects are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Content Types

Types of Content Types

In Grooper, the "Content Type" nodes consist of:

stacks Content Model ...
collections_bookmark Content Category and ...
description Document Type nodes.

These nodes create a classification taxonomy in Grooper. They define how documents are classified, what data to collect from a document, how different kinds of documents are related, and even how certain activities like Export should behave based on how a document is classified.

Content Types work together in Grooper to enable sophisticated document processing workflows. With different types of documents properly classified, they can have their data extracted and are handled according to the rules and behaviors defined by the Document Types within a Content Model.

The relationship between these Content Types is established through a hierarchical inheritance system. Content Categories and Document Types are building blocks within a Content Model seen as the "tree". Content Categories act as the "branches". Document Types are the "leaves" of the hierarchy.

Content Types and document classification

Documents are classified by having a Content Type (usually a Document Type) assigned either by the Classify activity, manually by a user, or other mechanisms in Grooper.

The Content Model plays a special role in defining the "Classify Method" used to classify documents. Classify Methods define the logic for

Content Types and data extraction

"Data Elements" represent information written on the document and contain instructions on how to collect it.

Data Elements can be defined for each Content Type by adding a Data Model. Data Elements (including Data Fields, Data Sections and Data Tables) are added these Data Models. Data Elements are inherited down the "tree" of the Content Type hierarchy.

  • Data Elements defined at the Content Model level are applied to all Content Types within the Content Model and will apply to the whole "tree".
  • Data Elements defined at the Content Category level are applied to all Content Types that exist within that specific "branch".
  • Data Elements defined on a Document Type will apply to that specific "leaf".


  • This is why documents must be "classified" in order to have their data extracted. It is the Content Type that determines which Data Model is used to collect data when the Extract activity runs.

Content Types and "Behaviors"

"Behaviors" are a set of different configurations that affect certain Activities and other areas of Grooper based on how a document is classified. They include:

  • Import Behaviors - Defining how documents and metadata are imported from CMIS Repositories based on their classification.
  • Export Behaviors - Defining how documents and data are exported based on their classification.
  • Labeling Behaviors - Defining how Label Sets are used for documents based on their classification.
  • PDF Data Mapping - Defining several PDF generation capabilities for documents based on their classification.
  • Indexing Behavior - Defining how documents are added to a Grooper search index based on their classification.

Behaviors also respect the Content Type hierarchy.

  • Behaviors defined at the Content Model level are applied to all Content Types within the Content Model, unless a child Content Type has its own Behavior configured. Content Category and Document Type Behavior configurations will override the Content Model configuration.
  • Behaviors defined at the Content Category level are applied to all Content Types within that branch, unless a child Content Type has its own Behavior configured. Child Content Category and Document Type Behavior configurations will override a parent Content Category configuration.
  • Behaviors defined at the Document Type level are applied to that Document Type only. Document Type Behavior configurations will override all parent Content Category and/or Content Model configurations.


Related Node Types

Content Model

stacks Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the collections_bookmark Content Categories and description Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Content Category

collections_bookmark A Content Category is a container for other Content Category or description Document Type nodes in a stacks Content Model. Content Categories are often used simply as organizational buckets for Content Models with large numbers of Document Types. However, Content Categories are also necessary to create branches in a Content Model's classification taxonomy, allowing for more complex Data Element inheritance and Behavior inheritance.

Document Type

description Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a stacks Content Model or a collections_bookmark Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the folder Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's data_table Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

What about Form Types and Page Types?

Technically speaking, Form Types and Page Types are also Content Types, but they aren't typically used in the same way. Form Types and Page Types are created automatically when training example documents for classification. They hold the feature weighting data for documents.

  • Form Types
    • When a Document Type is trained for classification, the training samples are created as Form Types.
    • Form Types are generated automatically when training documents for Lexical classification (and less commonly for Visual classification).
  • Page Types
    • The Page Types are the individual pages of a Form Type. All training weightings are stored on the Page Types for each page of the training document.
    • Page Types are generated automatically when training documents for Lexical classification (and less commonly for Visual classification).


Data Elements

Types of Data Elements

The "Data Element" nodes in Grooper consist of:

data_table Data Model ...
variables Data Field ...
insert_page_break Data Section ...
table Data Table and ...
view_column Data Column nodes .

Each of these nodes has its own function within Grooper's data extraction architecture but are also intimately related to each other.

The relationship between these Data Elements is hierarchical and modular.

  • The Data Model acts as the overall blueprint for data extraction.
  • Data Sections structure the document into logical parts. Data Sections can also serve as simple organizational objects within a Data Model to bucket similar "Data Elements" together.
  • Data Tables are incorporated into the model to handle tabular data. Each Data Table comprises Data Columns which specify the format and rules for columnar data extraction.
  • Finally, Data Fields are the fundamental units of data of any kind representing individual pieces of non-repeated data within a document. The exception to this is when Data Fields are contained within a "multi instance" Data Section that occurs repeatedly within a document.

Related Node Types

Data Model

data_table Data Models are leveraged during the Extract activity to collect data from documents (folder Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) define data extraction logic and settings for how data is reviewed in a Data Viewer.

Data Field

variables Data Fields represent a single value targeted for data extraction on a document. Data Fields are created as child nodes of a data_table Data Model and/or insert_page_break Data Sections.

  • Data Fields are frequently referred to simply as "fields".

Data Section

A insert_page_break Data Section is a container for Data Elements in a data_table Data Model. variables They can contain Data Fields, table Data Tables, and even Data Sections as child nodes and add hierarchy to a Data Model. They serve two main purposes:

  1. They can simply act as organizational buckets for Data Elements in larger Data Models.
  2. By configuring its "Extract Method", a Data Section can subdivide larger and more complex documents into smaller parts to assist in extraction.
    • "Single Instance" sections define a division (or "record") that appears only once on a document.
    • "Multi-Instance" sections define collection of repeating divisions (or "records").

Data Table

A table Data Table is a Data Element specialized in extracting tabular data from documents (i.e. data formatted in rows and columns).

  • The Data Table itself defines the "Table Extract Method". This is configured to determine the logic used to locate and return the table's rows.
  • The table's columns are defined by adding view_column Data Column nodes to the Data Table (as its children).

Data Column

view_column Data Columns represent columns in a table extracted from a document. They are added as child nodes of a table Data Table. They define the type of data each column holds along with its data extraction properties.

  • Data Columns are frequently referred to simply as "columns".
  • In the context of reviewing data in a Data Viewer, a single Data Column instance in a single Data Table row, is most frequently called a "cell".


Extractor Nodes

Types of Extractor Nodes

There are three types of Extractor Nodes in Grooper:

quick_reference_all Value Reader
pin Data Type
input Field Class
  • Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with nowhere near the amount of set up.

All three of these node types perform a similar function. They return data from documents. However, they differ in their configuration and utility.

Extractor Nodes are tools to extract/return document data. But they don't do anything by themselves. They are used by extractor properties on other nodes in Grooper.

  • Example: When export_notes runs on a document, Data Elements (such as variables Data Fields) are ultimately used collect document data.
    It is a Data Field's "Value Extractor" property that does this. You may configure this property with an Extractor Node to do so.
  • Example: When executed in a insert_page_break Separate step, the Pattern-Based Separation provider's is ultimately what identifies patterns to separate Batch Pages into Batch Folders.
    It is its "Value Extractor" property that does this. However, you may configure this property with an Extractor Node to do so.
  • Example: When unknown_document Classify runs on a document, a description Document Type's "Positive Extractor" property will be used to assign a Batch Folder the Document Type if it returns a value.
    You may configure the Positive Extractor with an Extractor Node to do so.
  • And so on and so on for any extractor property for any node in Grooper.


To that end, Extractor Nodes serve three purposes:

  1. To be re-usable units of extraction
  2. To collate data
  3. To leverage machine learning algorithms to target data in the flow of text
    • Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with nowhere near the amount of set up.

Re-usability

Extractor nodes are meant to be referenced either by other extractor nodes or, importantly, by Data Elements such as Data Fields in a Data Model.

For example, an individual Data Field can be configured on its own to collect a date value, such as the "Received Date" on an invoice. However, what if another Data Field is collecting a different date format, like the "Due Date" on the same invoice? In this case you would create one extractor node, like a Value Reader, to collect any and all date formats. You could then have each Data Field reference that single Value Reader and further configure each individual Data Field to differentiate their specific date value.

Data collation

Another example would be configuring a Data Type to target entire rows of information within a table of data. Several Value Reader nodes could be made as children of the Data Type, each targeting a specific value within the table row. The parent Data Type would then collate the results of its child Value Reader nodes into one result. A Data Table would then reference the Data Type to collect the appropriate rows of information.

Machine learning

Many documents contain important pieces of information buried within the flow of text, like a legal document. These types of documents and the data they contain require an entirely different approach to extracting data than a highly structured document like an invoice. For these situations you can use a "trainable" extractor known as a Field Class to leverage machine learning algorithms to target important information.

  • Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with nowhere near the amount of set up.

Extractor Nodes vs Value Extractors

Extractor nodes should not be confused with "Value Extractors". There are many places in Grooper where extraction logic can be applied for one purpose or another. In these cases a Value Extractor is chosen to define the logic required to return a desired value.

In fact, the Extractor Nodes themselves will leverage specific Value Extractors to define their logic.

  • Example: "Value Readers" are configured using a single property "Extractor". This property specifies a single Value Extractor which determines how data is extracted. Value Readers are essentially an encapsulation of a single Value Extractor configuration that can be reused by multiple other extraction elements and properties, such as Data Fields and Data Types.
  • Example: "Data Types" have several properties that can be configured with Value Extractors, including its "Local Extractor", "Input Filter", and "Exclusion Extractor" properties.
  • Example" "Field Classes" cannot function without its "Value Extractor" and "Feature Extractor" properties configured, both of which specify a Value Extractor.

However, Extractor Nodes are used when you need to reference them for their designated strengths:

  • re-usability
  • collation
  • machine learning
    • Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with nowhere near the amount of set up.

Related Node Types

Value Reader

quick_reference_all Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions).

  • Value Readers are can be used on their own or in conjunction with pin Data Types for more complex data extraction and collation.

Data Type

pin Data Types are nodes used to extract text data from a document. Data Types have more capabilities than quick_reference_all Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

  • For example, if you're extracting a date that could appear in multiple formats within a document, you'd use various extractor nodes (each capturing a different format) as children of a Data Type.

The Data Type also defines how to collate results from one or more extractors into a referenceable output. The simplest type of collation (Individual) would just return all individual extractors' results as a list of results.

Data Types are also used for recognizing complex 2D data structures, like address blocks or table rows. Different collation methods would be used in these cases to combine results in different ways.

Field Class

input Field Classes are NLP (natural language processing) based extractor nodes. They find values based on some natural language context near that value. Values are positively or negatively associated with text-based "features" nearby by training the extractor. During extraction, the extractor collects values based on these training weightings.

  • Field Classes are most useful when attempting to find values within the flow of natural language.
  • Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler "extractor nodes" like quick_reference_all Value Readers or pin Data Types.
  • Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with nowhere near the amount of set up.


Connection nodes

In Grooper, "connection nodes" play a vital role in integrating external data sources and repositories. They consist of:

cloud CMIS Connection ...
settings_system_daydream CMIS Repository and ...
database Data Connection nodes.

Each of these node types serve a unique purpose while also being related through their collaborative use in connecting and managing data across various platforms and databases.

These connections nodes are related in their collective ability to bridge Grooper with external data sources and content repositories.

  • CMIS Connections' serve as the gateway to multiple content management systems.
  • CMIS Repositories' uses the connection established by CMIS Connections to organize and manage document access for those systems.
  • Data Connections link Grooper to databases, allowing it to export data to databases, perform data lookups and synchronize with external structured data sources.

Together these connection nodes enable Grooper to extend its data processing capabilities beyond its local domain and integrate seamlessly with external systems for end-to-end document and data management.

Related Node Types

CMIS Connection

cloud CMIS Connections provide a standardized way of connecting to various content management systems (CMS). CMIS Connections allow Grooper to communicate with multiple external storage platforms, enabling access to documents and document metadata that reside outside of Grooper's immediate environment.

  • For those that support the CMIS standard, the CMIS Connection connects to the CMS using the CMIS standard.
  • For those that do not, the CMIS Connection normalizes connection and transfer protocol as if they were a CMIS platform.

CMIS Repository

settings_system_daydream CMIS Repository nodes provide document access in external storage platforms through a cloud CMIS Connection. With a CMIS Repository, users can manage and interact with those documents within Grooper. They are used primarily for import using Import Descendants and Import Query Results and for export using CMIS Export.

  • CMIS Repositories are create as a child node of a CMIS Connection using the "Import Repository" command.

Data Connection

database Data Connections connect Grooper to Microsoft SQL and supported ODBC databases. Once configured, Data Connections can be used to export data extracted from a document to a database, perform database lookups to validate data Grooper collects and other actions related to database management systems (DBMS).

  • Grooper supports MS SQL Server connectivity with the "SQL Server" connection method.
  • Grooper supports Oracle, PostgreSQL, Db2, and MySQL connectivity with the "ODBC" connection method.

Profile Objects

"Profile Objects" in Grooper serve as pre-configured settings templates used across various stages of document processing, such as scanning, image cleanup, and document separation. These objects, which include:

perm_media IP Profile ...
gallery_thumbnail IP Group ...
image IP Step ...
library_books OCR Profile ...
scanner Scanner Profile and ...
insert_page_break Separation Profile ...

... have their own individual functions but are also related by defining structured approaches to handling documents within Grooper.

By creating distinct profiles for each aspect of the document processing pipeline, Grooper allows for customization and optimization of each step. This standardizes settings across similar document types or processing requirements, which can contribute to consistency and efficiency in processing tasks. These "Profile Objects" collectively establish a comprehensive, repeatable, and optimized workflow for processing documents from the point of capture to the point of data extraction.

Related Objects

IP Profile

perm_media IP Profiles are a step-by-step list of image processing operations (IP Commands). They are used for several image processing related operations, but primarily for:

  1. Permanently enhancing an image during the Image Processing activity (usually to get rid of defects in a scanned image, such as skewing or borders).
  2. Cleaning up an image in-memory during the Recognize activity without altering the image to improve OCR accuracy.
  3. Computer vision operations that collect layout data (table line locations, OMR checkboxes, barcode value and more) utilized in data extraction.

IP Group

gallery_thumbnail IP Groups are containers of image IP Steps and/or IP Groups that can be added to perm_media IP Profiles. IP Groups add hierarchy to IP Profiles. They serve two primary purposes:

  1. They can be used simply to organize IP Steps for IP Profiles with large numbers of steps.
  2. They are often used with "Should Execute Expressions" and "Next Step Expressions" to conditionality execute a sequence of IP Steps.

IP Step

image IP Steps are the basic units of an perm_media IP Profile. They define a single image processing operation, called an IP Command in Grooper.

OCR Profile

library_books OCR Profiles store configuration settings for optical character recognition (OCR). They are used by the Recognize activity to convert images of text on contract Batch Pages into machine-encoded text. OCR Profiles are highly configurable, allowing fine-grained control over how OCR occurs, how pre-OCR image cleanup occurs, and how Grooper's OCR Synthesis occurs. All this works to the end goal of highly accurate OCR text data, which is used to classify documents, extract data and more.

Scanner Profile

scanner Scanner Profiles store configuration settings for operating a document scanner. Scanner Profiles provide users operating the Scan Viewer in the Review activity a quick way to select pre-saved scanner configurations.

Separation Profile

insert_page_break Separation Profiles store settings that determine how contract Batch Pages are separated into folder Batch Folders. Separation Profiles can be referenced in two ways:

  • In a Review activity's Scan Viewer settings to control how pages are separated in real time during scanning.
  • In a Separate activity as an alternative to configuring separation settings locally.

Queue Objects

"Queue Objects" in Grooper are structures designed to manage and distribute tasks within the document processing workflow. There are two main types of queues:

memory Processing Queue and ...
person_play Review Queue ...

... each with a distinct function but inherently interconnected as they both coordinate the flow of work through Grooper.

The relationship between Processing Queues and Review Queues lies in their roles in managing the workflow and task distribution in Grooper. Both facilitate the progression of document processing from automatic operations to those requiring human intervention.

  • Processing Queues handle the automation side of the operation, ensuring that machine tasks are efficiently allocated across the available resources.
  • Review Queues oversee the user-driven aspects of the workflow, particularly quality control and verification processes that require manual input.

Together, these queues ensure a smooth transition between automated and manual stages of document processing and help maintain order and efficiency within the system.

Related Objects

Processing Queue

memory Processing Queues help automate "machine performed tasks" (Those are Code Activity tasks performed by computer Machines and their Activity Processing services). Processing Queues are assigned to Batch Process Steps to distribute tasks, control the maximum processing rate, and set the "concurrency mode" (specifying if and how parallelism can occur across one or more servers).

  • Processing Queues are used to dedicate Activity Processing services with a capped number of processing threads to resource intensive activities, such as Recognize. That way, these compute hungry tasks won't gobble up all available system resources.
  • Processing Queues are also used to manage activities, such as Render, who can only have one activity instance running per machine (This is done by changing the queue's Concurrency Mode from "Maximum" to "Per Machine").
  • Processing Queues are also used to throttle Export tasks in scenarios where the export destination can only accept one document at a time.

Review Queue

person_play Review Queues help organize and filter human-performed Review activity tasks. User groups are assigned to each Review Queue, which is then set either on a settings Batch Process or a Review step. Based on a user's membership in Review Queues, this will affect how inventory_2 Batches are distributed in the Batches page and how Review tasks are distributed in the Tasks page.

Process Objects

"Process Objects" in Grooper, which include...

settings Batch Process and ...
edit_document Batch Process Step ...

... are closely related in managing and executing a sequence of steps designed to process a collection of documents known as a Batch

Note: The icon for a Batch Process Step will change depending on how you add the object to a Batch Process. If you use the "Add" object-command it will give the Batch Process Step the icon used above. If you use the "Add Activity" object command, it will give the Batch Process Step an icon according the the activity chosen.
Below is an example of a Batch Process with several child Batch Process Steps that were added using the "Add Activity" object-command:
settings Batch Process
description Split Pages
format_letter_spacing_wide Recognize
insert_page_break Separate
unknown_document Classify
export_notes Extract
person_search Review
output Export
inventory_2 Dispose Batch

A Batch Process consists of a series of Batch Process Steps meant to be executed in a particular sequence for a batch of documents. Before a Batch Process can be used in production, it must be "published". Publishing a Batch Process will create a read-only copy in the "Processes" folder of the node tree, making it accessible for production purposes.

In essence, a Batch Process defines the overall workflow for processing documents. It relies on Batch Process Steps to perform each action required during the process. Each Batch Process Step represents a discrete operation, or "activity", within the broader scope of the Batch Process. Batches Processes and Batch Process Steps work together to ensure that documents are handled in a consistent and controlled manner.

Related Objects

Batch Process

settings Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a inventory_2 Batch. Each step is comprised of a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute edit_document Batch Process Steps which are added as children nodes.
  • A Batch Process is often referred to as simply a "process".

Batch Process Step

edit_document Batch Process Steps are specific actions within a settings Batch Process sequence. Each Batch Process Step performs an "Activity" specific to some document processing task. These Activities will either be a "Code Activity" or "Review" activities. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Process Steps are frequently referred to as simply "steps".
  • Because a single Batch Process Step executes a single Activity configuration, they are often referred to by their referenced Activity as well. For example, a "Recognize step".

Architecture Objects

In Grooper, "Architecture Objects" organize and oversee the infrastructure and framework of the Grooper repository. A "Grooper Repository" is a tree structure of nodes representing both configuration and content objects. These objects include the...

database Root ...
hard_drive File Store and ...
computer Machine objects ...

... each with distinct roles but also working in conjunction to manage resources and information flow within the repository.

The relationship among these "Architecture Objects" is foundational to the operation and scalability of Grooper's document processing capabilities.

  • The Root object provides a base structure.
  • The Filestore offers a storage utility for files and content.
  • The Machine objects represent the hardware resources for performing processing tasks.

Together, they comprise the essential components that underpin the function and manageability of the Grooper ecosystem.

Related Objects

Root

The Grooper database Root node is the topmost element of the Grooper Repository. All other nodes in a Grooper Repository are its children/descendants. The Grooper Root also stores several settings that apply to the Grooper Repository, including the license serial number or license service URL and Repository Options.

File Store

hard_drive File Store nodes are a key part of Grooper's "database and file store" architecture. They define a storage location where file content associated with Grooper nodes are saved. This allows processing tasks to create, store and manipulate content related to documents, images, and other "files".

  • Not every node in Grooper will have files associated with it, but if it does, those files are stored in the Windows folder location defined by the File Store node.

Machine

computer Machine nodes represent servers that have connected to the Grooper Repository. They are essential for distributing task processing loads across multiple servers. Grooper creates Machine nodes automatically whenever a server makes a new connection to a Grooper Repository's database. Once added, Machine nodes can be used to view server information and to manage Grooper Service instances.

Miscellaneous Objects

The following ojbects are related only in that they don't fit neatly into the groups defined above in this article.

(un)Related Objects

AI Analyst

An psychology AI Analyst object in Grooper defines a role meant to harness artificial intelligence capabilities, particularly from OpenAI. AI Analyst objects assist with chat sessions and other interactive tasks. This object is set up to act as an AI-driven analyst, which can be configured with specific models such as "gpt-4-1106-preview" to serve as the "brain" of the analyst, performing complex AI functions.

Key properties of an AI Analyst include:

  • Model: Defines the OpenAI model that powers the AI Analyst, impacting its cognitive capabilities.
  • Enable Code Interpreter: Specifies whether to allow the AI Analyst to write and run Python code in a sandboxed environment. This enables the AI Analyst to process diverse data or generate output like data files and graphs.
  • Instructions: Detailed instructions provided to guide the responses and behavior of the AI Analyst during interaction.
  • Knowledge: Sets the scope of knowledge available to the AI Analyst to inform its understanding and responses.
  • Predefined Messages: A list of predetermined messages that the AI Analyst can use during chat sessions.

The AI Analyst object must have connectivity established with OpenAI. This is done by configuring the Options property on the Root of the Grooper repository. The AI Analyst object can be applied to facilitate AI interactions in chat sessions, offering dynamic responses based on the conversational context, the provided knowledge base, and specific instructions set up for the analyst's behavior .

Control Sheet

document_scanner Control Sheets in Grooper are special pages used to control various aspects of the document scanning process. Control Sheets can serve multiple functions such as:

  • separating and classifying documents
  • changing image settings dynamically
  • create a new folder with specific Content Types
  • trigger other actions that affect how documents are handled as they pass through the scanning equipment

Control sheets are pre-printed with barcodes or other markers that Grooper recognizes and uses to perform specific actions based on the presence of the sheet. For instance, when a control sheet instructs the creation of a new folder it can influence the hierarchy within a batch. This enables the management and organization of documents without manual intervention during the Scan activity.

Overall, Control Sheets are an intelligent way to guide the scanning workflow. Control Sheets can ensure that batches of documents are organized and processed according to predefined rules, thereby automating the structuring of scanned content into logical units within Grooper.

Data Rule

flowsheet Data Rules are used to normalize or otherwise prepare data collected in a data_table Data Model for downstream processes. Data Rules define data manipulation logic for data extracted from documents (folder Batch Folders) to ensure data conforms to expected formats or meets certain standards.

  • Each Data Rule executes a "Data Action" which do things like computing a field's value, parse a field into other fields, perform lookups, and more.
  • Data Actions can be conditionally executed based on a Data Rule's "Trigger" expression.
  • A hierarchy of Data Rules can be created to execute multiple Data Actions and perform complex data transformation tasks.
  • Data Rules can be applied by:
    • The Apply Rules activity (must be done after data is collected by the Extract activity)
    • The Extract activity (will run after the Data Model extraction)
    • The Convert Data activity when converting document to another Document Type
    • They can be applied manually in a Data Viewer with the "Run Rule" command.

The execution of a Data Rule takes place during the Apply Rules activity. Data Rules can be applied at different scopes such as each individual type of "Data Element". The rule can be set to execute conditionally based on a Trigger expression. If the Trigger evaluates to true, the Data Rule's True Action is applied, and if false, its False Action is executed. Data Rules can recursively apply logic to the hierarchy of data within a document instance, enabling complex data transformation and normalization operations that reflect the structure of the extracted data.

Overall, Data Rules in Grooper simplify extractors by separating the data normalization logic from the extraction logic, allowing for flexible and powerful post-extraction data processing .

Lexicon

dictionary Lexicons are dictionaries used throughout Grooper to store lists of words, phrases, weightings for Fuzzy RegEx, and more. Users can add entries to a Lexicon, Lexicons can import entries from other Lexicons by referencing them, and entries can be dynamically imported from a database using a database Data Connection. Lexicons are commonly used to aid in data extraction, with the "List Match" and "Word Match" extractors utilizing them most commonly.

Object Library

extension Object Library nodes are .NET libraries that contain code files for customizing the Grooper's functionality. These libraries are used for a range of customization and integration tasks, allowing users to extend Grooper's capabilities.

Examples include:
  • Adding custom Activities that execute within Batch Processes
  • Creating custom commands available during the Review activity and in the Design page.
  • Defining custom methods that can be called from code expressions on Data Field and Batch Process Step objects.
  • Creating custom Connection Types for CMIS Connections for import/export operations from/to CMS systems.
  • Establish custom Grooper Services that perform automated background tasks at regular intervals

Resource File

Resource Files are nodes you can add to a package_2 Project and store any kind of file. Each Resource File stores one file. While you can use Resource Files to store any kind of file in a Project, there are several areas in Grooper that can reference Resource Files to one end or another, including XML schema files used for Grooper's XML Schema Integration.