Object Nomenclature (Concept)

From Grooper Wiki

Revision as of 15:50, 26 April 2024

A Grooper environment consists of many interrelated objects.

The Grooper Wiki's Object Nomenclature defines how Grooper users categorize and refer to different types of Node Objects in a Grooper Repository. Knowing what objects can be added to the Grooper Node Tree and how they are related is a critical part of understanding Grooper itself.

About

In Grooper, understanding the objects within the platform involves recognizing how various elements can serve similar functions and therefore be grouped together based on their shared functionalities. This concept stems from the recognition that disparate objects often perform analogous tasks, albeit with differing characteristics or representations.

By discerning commonalities in functionality across diverse objects, users can streamline their approach to data processing and analysis within Grooper. Rather than treating each object in isolation, users can categorize them based on their functional similarities, thus simplifying management and enhancing efficiency.

This approach fosters a more holistic understanding of the data ecosystem within Grooper, empowering users to devise more effective strategies for data extraction, classification, and interpretation. By recognizing the underlying functional relationships between objects, users can optimize workflows, improve accuracy, and derive deeper insights from their data.

High Level Overview

This article is a high-level overview of all the objects in Grooper and how they're related. If you need more specific information on a particular object, click the hyperlink for that object (as listed in the category's "Related Objects" section) to go to an article with more information on it.

Batch Objects

In Grooper, "Batch Objects" represent the hierarchical structure of documents being processed and consist of:

Batch ...
Batch Folder and ...
Batch Page objects ...

... each serving a distinct function within this hierarchy but also being fundamentally related.

The relationship between these objects is hierarchical in nature. The Batch object is the top level. It contains:

  • Batch Folders and ...
  • Batch Pages

Batch Folders may contain either further Batch Folders (to represent subfolders or grouped documents) or Batch Pages (to represent individual pages of documents). This structured approach allows Grooper to efficiently manage and process documents at various levels of granularity — from a full batch down to individual pages.
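The hierarchy described above can be sketched as a small tree. This is only an illustration, not Grooper's API: the class names and the count_pages helper are hypothetical stand-ins for the Batch, Batch Folder and Batch Page nodes.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class BatchPage:
    """Hypothetical stand-in for a Batch Page (a single page)."""
    name: str


@dataclass
class BatchFolder:
    """Hypothetical stand-in for a Batch Folder (a document or a plain folder)."""
    name: str
    children: List[Union["BatchFolder", BatchPage]] = field(default_factory=list)


@dataclass
class Batch:
    """Hypothetical stand-in for the top-level Batch."""
    name: str
    children: List[Union[BatchFolder, BatchPage]] = field(default_factory=list)


def count_pages(node) -> int:
    """Count every page beneath a Batch or Batch Folder, at any depth."""
    if isinstance(node, BatchPage):
        return 1
    return sum(count_pages(child) for child in node.children)


# A folder used as a plain folder ("Mail") holding two folders used as documents.
invoice = BatchFolder("Invoice", [BatchPage("Page 1"), BatchPage("Page 2")])
contract = BatchFolder("Contract", [BatchPage("Page 1")])
batch = Batch("Example Batch", [BatchFolder("Mail", [invoice, contract])])

print(count_pages(batch))  # 3
```

The same function works at any level of granularity, from the full batch down to a single document.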

Related Objects

Batch

Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Batch Folder

The Batch Folder is an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or Batch Page nodes as children.

  • Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Batch Page

Batch Page nodes represent individual pages within a Batch. Batch Pages are created in one of two ways: (1) when images are scanned into a Batch using the Scan Viewer, or (2) when they are split from a PDF or TIFF file using the Split Pages activity.

  • Batch Pages are frequently referred to simply as "pages".

They are created in one of two ways:

  • Physical pages can be acquired in Grooper by scanning them via the Grooper Desktop application.
  • Digital documents are acquired in Grooper as whole objects and represented as Batch Folders. Applying the Split Pages activity on a Batch Folder that represents a digital document will expose Batch Page objects as direct children.

Batch Pages allow Grooper to process and store information at the page level, which is essential for operations that include Image Processing and recognition of text (see Recognize). They enable the system to manage and process each page independently. This is critical for workflows that require detailed page-specific actions or for Batches composed of documents with different processing requirements per page.

Content Type Objects

Types of Content Types

In Grooper, the "Content Type" nodes consist of:

Content Model ...
Content Category and ...
Document Type nodes.

These nodes create a classification taxonomy in Grooper. They define how documents are classified, what data to collect from a document, how different kinds of documents are related, and even how certain activities like Export should behave based on how a document is classified.

Content Types work together in Grooper to enable sophisticated document processing workflows. Once different types of documents are properly classified, their data can be extracted and they can be handled according to the rules and behaviors defined by the Document Types within a Content Model.

The relationship between these Content Types is established through a hierarchical inheritance system. Content Categories and Document Types are building blocks within a Content Model, which can be seen as the "tree". Content Categories act as the "branches". Document Types are the "leaves" of the hierarchy.

Content Types and document classification

Documents are classified by having a Content Type (usually a Document Type) assigned either by the Classify activity, manually by a user, or other mechanisms in Grooper.

The Content Model plays a special role in defining the "Classify Method" used to classify documents. Classify Methods define the logic for

Content Types and data extraction

"Data Elements" represent information written on the document and contain instructions on how to collect it.

Data Elements can be defined for each Content Type by adding a Data Model. Data Elements (including Data Fields, Data Sections and Data Tables) are added to these Data Models. Data Elements are inherited down the "tree" of the Content Type hierarchy.

  • Data Elements defined at the Content Model level are applied to all Content Types within the Content Model and will apply to the whole "tree".
  • Data Elements defined at the Content Category level are applied to all Content Types that exist within that specific "branch".
  • Data Elements defined on a Document Type will apply to that specific "leaf".


  • This is why documents must be "classified" in order to have their data extracted. It is the Content Type that determines which Data Model is used to collect data when the Extract activity runs.
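The inheritance rules above can be sketched in a few lines. This is a minimal illustration rather than Grooper's implementation: the ContentType class, the inherited_data_elements helper and the example names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentType:
    """Hypothetical node: a Content Model, Content Category, or Document Type."""
    name: str
    data_elements: List[str] = field(default_factory=list)
    parent: Optional["ContentType"] = None


def inherited_data_elements(content_type: ContentType) -> List[str]:
    """Collect Data Elements from the root of the tree down to this node."""
    chain = []
    node = content_type
    while node is not None:
        chain.append(node)
        node = node.parent
    elements = []
    for node in reversed(chain):  # tree first, then branches, then the leaf
        elements.extend(node.data_elements)
    return elements


model = ContentType("Accounting Model", ["Document Date"])
category = ContentType("Payables", ["Vendor Name"], parent=model)
invoice = ContentType("Invoice", ["Invoice Number"], parent=category)

print(inherited_data_elements(invoice))
# ['Document Date', 'Vendor Name', 'Invoice Number']
```

A document classified as "Invoice" would be extracted with all three Data Elements, while one assigned only the Content Model would get just "Document Date".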

Content Types and "Behaviors"

"Behaviors" are a set of different configurations that affect certain Activities and other areas of Grooper based on how a document is classified. They include:

  • Import Behaviors - Defining how documents and metadata are imported from CMIS Repositories based on their classification.
  • Export Behaviors - Defining how documents and data are exported based on their classification.
  • Labeling Behaviors - Defining how Label Sets are used for documents based on their classification.
  • PDF Data Mapping - Defining several PDF generation capabilities for documents based on their classification.
  • Indexing Behavior - Defining how documents are added to a Grooper search index based on their classification.

Behaviors also respect the Content Type hierarchy.

  • Behaviors defined at the Content Model level are applied to all Content Types within the Content Model, unless a child Content Type has its own Behavior configured. Content Category and Document Type Behavior configurations will override the Content Model configuration.
  • Behaviors defined at the Content Category level are applied to all Content Types within that branch, unless a child Content Type has its own Behavior configured. Child Content Category and Document Type Behavior configurations will override a parent Content Category configuration.
  • Behaviors defined at the Document Type level are applied to that Document Type only. Document Type Behavior configurations will override all parent Content Category and/or Content Model configurations.
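The override rules above amount to "the nearest configured Behavior wins". A minimal sketch, assuming each Behavior can be reduced to a single named setting; the TypeNode class, the effective_behavior helper and the example values are hypothetical, not Grooper's API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TypeNode:
    """Hypothetical Content Type node with optional Behavior settings."""
    name: str
    behaviors: Dict[str, str] = field(default_factory=dict)
    parent: Optional["TypeNode"] = None


def effective_behavior(content_type: TypeNode, behavior_name: str) -> Optional[str]:
    """Walk up from the node; the first configuration found overrides its parents."""
    node = content_type
    while node is not None:
        if behavior_name in node.behaviors:
            return node.behaviors[behavior_name]
        node = node.parent
    return None  # no Behavior configured anywhere in the chain


model = TypeNode("Accounting Model", {"Export": "export to archive"})
category = TypeNode("Payables", parent=model)
invoice = TypeNode("Invoice", {"Export": "export to ERP"}, parent=category)
receipt = TypeNode("Receipt", parent=category)

print(effective_behavior(invoice, "Export"))    # export to ERP
print(effective_behavior(receipt, "Export"))    # export to archive
print(effective_behavior(receipt, "Labeling"))  # None
```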


Related Node Types

Content Model

Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Content Category

A Content Category is a container for other Content Category or Document Type nodes in a Content Model. Content Categories are often used simply as organizational buckets for Content Models with large numbers of Document Types. However, Content Categories are also necessary to create branches in a Content Model's classification taxonomy, allowing for more complex Data Element inheritance and Behavior inheritance.

Document Type

Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

What about Form Types and Page Types?

Technically speaking, Form Types and Page Types are also Content Types, but they aren't typically used in the same way. Form Types and Page Types are created automatically when training example documents for classification. They hold the feature weighting data for documents.

  • Form Types
    • When a Document Type is trained for classification, the training samples are created as Form Types.
    • Form Types are generated automatically when training documents for Lexical classification (and less commonly for Visual classification).
  • Page Types
    • The Page Types are the individual pages of a Form Type. All training weightings are stored on the Page Types for each page of the training document.
    • Page Types are generated automatically when training documents for Lexical classification (and less commonly for Visual classification).


Data Element Objects

Types of Data Elements

The "Data Element" nodes in Grooper consist of:

Data Model ...
Data Field ...
Data Section ...
Data Table and ...
Data Column nodes.

Each of these nodes has its own function within Grooper's data extraction architecture, but they are also closely related to each other.

The relationship between these Data Elements is hierarchical and modular.

  • The Data Model acts as the overall blueprint for data extraction.
  • Data Sections structure the document into logical parts. Data Sections can also serve as simple organizational objects within a Data Model to bucket similar "Data Elements" together.
  • Data Tables are incorporated into the model to handle tabular data. Each Data Table comprises Data Columns which specify the format and rules for columnar data extraction.
  • Finally, Data Fields are the fundamental units of data of any kind representing individual pieces of non-repeated data within a document. The exception to this is when Data Fields are contained within a "multi instance" Data Section that occurs repeatedly within a document.

Related Node Types

Data Model

Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) define data extraction logic and settings for how data is reviewed in a Data Viewer.

Data Field

Data Fields represent a single value targeted for data extraction on a document. Data Fields are created as child nodes of a Data Model and/or Data Sections.

  • Data Fields are frequently referred to simply as "fields".

Data Section

A Data Section is a container for Data Elements in a Data Model. Data Sections can contain Data Fields, Data Tables, and even other Data Sections as child nodes, adding hierarchy to a Data Model. They serve two main purposes:

  1. They can simply act as organizational buckets for Data Elements in larger Data Models.
  2. By configuring its "Extract Method", a Data Section can subdivide larger and more complex documents into smaller parts to assist in extraction.
    • "Single Instance" sections define a division (or "record") that appears only once on a document.
    • "Multi-Instance" sections define a collection of repeating divisions (or "records").

Data Table

A Data Table is a Data Element specialized in extracting tabular data from documents (i.e. data formatted in rows and columns).

  • The Data Table itself defines the "Table Extract Method". This is configured to determine the logic used to locate and return the table's rows.
  • The table's columns are defined by adding Data Column nodes to the Data Table (as its children).

Data Column

Data Columns represent columns in a table extracted from a document. They are added as child nodes of a Data Table. They define the type of data each column holds along with its data extraction properties.

  • Data Columns are frequently referred to simply as "columns".
  • In the context of reviewing data in a Data Viewer, a single Data Column instance in a single Data Table row is most frequently called a "cell".


Extractor Objects

Connection Objects

In Grooper, "Connection Objects" play a vital role in integrating external data sources and repositories. They consist of:

CMIS Connection ...
CMIS Repository and ...
Data Connection objects.

Each of these objects serves a unique purpose while also being related through their collaborative use in connecting and managing data across various platforms and databases.

These Connection Objects are related in their collective ability to bridge Grooper with external data sources and content repositories.

  • The CMIS Connection object serves as the gateway to multiple content management systems.
  • The CMIS Repository object uses this connection to organize and manage document access for those systems.
  • The Data Connection object links Grooper to databases, allowing it to perform data lookups and synchronize with external structured data sources.

Together these Connection Objects enable Grooper to extend its data processing capabilities beyond its local domain and integrate seamlessly with external systems for end-to-end document and data management.

Related Objects

CMIS Connection

CMIS Connections in Grooper provide a standardized way of connecting to various content management systems (CMS).

  • For those that support the CMIS standard, the CMIS Connection connects to the CMS using the CMIS standard.
  • For those that do not, the CMIS Connection normalizes the connection and transfer protocols so the system can be used as if it were a CMIS platform.

This object allows Grooper to communicate with multiple external storage platforms, enabling access to documents and content that reside outside of Grooper's immediate environment.

CMIS Repository

CMIS Repositories represent a logical container for documents on an external storage platform that is accessed via a CMIS Connection. These objects facilitate the organization and retrieval of documents stored in a CMIS-compliant repository, enabling Grooper to work with documents as if they were within its local infrastructure. A CMIS Repository object is created as a "child" of the CMIS Connection object via the "Import" button found in the top-right of the UI after successfully configuring the CMIS Connection object and creating a connection to its destination. The CMIS Repository object is referenced for lookups, CMIS Import, and the Export activity.

Data Connection

Data Connections define the settings necessary to establish connectivity with a database. A Data Connection object holds the configuration details required for connecting to and interacting with a database. These interactions may include conducting lookups, exports, or other actions that relate to database management systems (DBMS). Once configured, a Data Connection object can be referenced by other components in Grooper for various DBMS-related activities.

Profile Objects

"Profile Objects" in Grooper serve as pre-configured settings templates used across various stages of document processing, such as scanning, image cleanup, and document separation. These objects, which include:

IP Profile ...
IP Group ...
IP Step ...
OCR Profile ...
Scanner Profile and ...
Separation Profile ...

... have their own individual functions but are also related by defining structured approaches to handling documents within Grooper.

By creating distinct profiles for each aspect of the document processing pipeline, Grooper allows for customization and optimization of each step. This standardizes settings across similar document types or processing requirements, which can contribute to consistency and efficiency in processing tasks. These "Profile Objects" collectively establish a comprehensive, repeatable, and optimized workflow for processing documents from the point of capture to the point of data extraction.

Related Objects

IP Profile

Image Processing (IP) Profiles detail the operations and parameters for image enhancement and cleanup. These operations improve the accuracy of further processing steps, like OCR for the Recognize activity, or classification.

IP Group

IP Groups are subsidiary objects within an IP Profile that create a hierarchical structure for organizing image processing commands. IP Groups may contain other IP Groups or IP Step objects.

IP Step

IP Steps are the basic units within an IP Profile that define a single image processing operation. IP Steps are performed sequentially within their parent IP Group or IP Profile.
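The nesting of IP Groups and the sequential order of IP Steps can be sketched as a depth-first walk. This assumes a nested IP Group runs in place, at its position among its siblings; the class names, the execution_order helper and the step names are hypothetical illustrations, not Grooper's API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class IPStep:
    """Hypothetical stand-in for an IP Step (one image processing operation)."""
    name: str


@dataclass
class IPGroup:
    """Hypothetical container; also used here for the IP Profile itself."""
    name: str
    children: List[Union["IPGroup", IPStep]] = field(default_factory=list)


def execution_order(container: IPGroup) -> Iterator[str]:
    """Yield step names in the order they would run, descending into groups."""
    for child in container.children:
        if isinstance(child, IPStep):
            yield child.name
        else:
            yield from execution_order(child)


profile = IPGroup("Example IP Profile", [
    IPStep("Deskew"),
    IPGroup("Cleanup", [IPStep("Remove Lines"), IPStep("Remove Specks")]),
    IPStep("Binarize"),
])

print(list(execution_order(profile)))
# ['Deskew', 'Remove Lines', 'Remove Specks', 'Binarize']
```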

OCR Profile

OCR Profiles configure the settings for optical character recognition (OCR) leveraged by the Recognize activity. "OCR" converts images of text into machine-encoded text. OCR Profile objects influence how effectively textual content is recognized and extracted from document images.

Scanner Profile

Scanner Profiles outline the specifications for scanning physical documents into digital forms. This includes settings like resolution, color mode, and any post-scan image processing or enhancement functions.

See Desktop Scanning in Grooper for more information.

Separation Profile

Separation Profiles contain rules and settings that determine how batches of scanned pages are separated into individual documents or sections, often using barcodes, blank pages, or patch codes as indicators for separation points.
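As a rough illustration of one separation rule, the sketch below splits a run of pages into documents wherever a blank page appears. It is not how a Separation Profile is configured in Grooper. It assumes each page is represented by its recognized text, that a page with no text counts as blank, and that the blank separator page is discarded; the separate_on_blank function is hypothetical.

```python
from typing import List


def separate_on_blank(pages: List[str]) -> List[List[str]]:
    """Group pages into documents, starting a new document after each blank page."""
    documents: List[List[str]] = []
    current: List[str] = []
    for page in pages:
        if page.strip() == "":
            # Blank page: close the current document and drop the separator.
            if current:
                documents.append(current)
                current = []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


pages = ["Invoice p1", "Invoice p2", "", "Contract p1"]
print(separate_on_blank(pages))
# [['Invoice p1', 'Invoice p2'], ['Contract p1']]
```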

Queue Objects

"Queue Objects" in Grooper are structures designed to manage and distribute tasks within the document processing workflow. There are two main types of queues:

Processing Queue and ...
Review Queue ...

... each with a distinct function but inherently interconnected as they both coordinate the flow of work through Grooper.

The relationship between Processing Queues and Review Queues lies in their roles in managing the workflow and task distribution in Grooper. Both facilitate the progression of document processing from automatic operations to those requiring human intervention.

  • Processing Queues handle the automation side of the operation, ensuring that machine tasks are efficiently allocated across the available resources.
  • Review Queues oversee the user-driven aspects of the workflow, particularly quality control and verification processes that require manual input.

Together, these queues ensure a smooth transition between automated and manual stages of document processing and help maintain order and efficiency within the system.

Related Objects

Processing Queue

Processing Queues are designed for tasks performed by machines, which include automated steps in the document processing lifecycle. Processing Queues are used to distribute machine tasks among different servers and control the concurrency or processing rate of these tasks.

  • For example, activities such as rendering documents or exporting data can be managed so that only one activity instance runs per machine or so multiple instances are processed concurrently, according to the queue configuration.

Review Queue

Review Queues are designated for human-performed tasks. They organize the Review tasks that require human attention and can distribute these tasks among different groups of users based on the queue's settings. A Review Queue can be assigned at the Batch Process level to filter work by an entire process, or to a Review activity at the Batch Process Step level to filter tasks at a more granular, step-based level.

Process Objects

"Process Objects" in Grooper, which include...

Batch Process and ...
Batch Process Step ...

... are closely related in managing and executing a sequence of steps designed to process a collection of documents known as a Batch.

Note: The icon for a Batch Process Step will change depending on how you add the object to a Batch Process. If you use the "Add" object-command, it will give the Batch Process Step the icon used above. If you use the "Add Activity" object-command, it will give the Batch Process Step an icon according to the activity chosen.
Below is an example of a Batch Process with several child Batch Process Steps that were added using the "Add Activity" object-command:
Batch Process
  • Split Pages
  • Recognize
  • Separate
  • Classify
  • Extract
  • Review
  • Export
  • Dispose Batch

A Batch Process consists of a series of Batch Process Steps meant to be executed in a particular sequence for a batch of documents. Before a Batch Process can be used in production, it must be "published". Publishing a Batch Process will create a read-only copy in the "Processes" folder of the node tree, making it accessible for production purposes.

In essence, a Batch Process defines the overall workflow for processing documents. It relies on Batch Process Steps to perform each action required during the process. Each Batch Process Step represents a discrete operation, or "activity", within the broader scope of the Batch Process. Batch Processes and Batch Process Steps work together to ensure that documents are handled in a consistent and controlled manner.

Related Objects

Batch Process

Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a Batch. Each step consists of a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps, which are added as child nodes.
  • A Batch Process is often referred to as simply a "process".

Batch Process Step

Batch Process Steps are specific actions within a Batch Process sequence. Each Batch Process Step performs an "Activity" specific to some document processing task. Each Activity is either a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Process Steps are frequently referred to as simply "steps".
  • Because a single Batch Process Step executes a single Activity configuration, they are often referred to by their referenced Activity as well. For example, a "Recognize step".
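The relationship between a Batch Process and its steps can be sketched as an ordered list in which each step knows whether a service or a person performs it. The step names are taken from the example Batch Process in this article; the BatchProcessStep class and the execution_plan helper are hypothetical, not Grooper's API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BatchProcessStep:
    """Hypothetical stand-in for a Batch Process Step (one Activity)."""
    activity: str

    @property
    def performer(self) -> str:
        # Review activities are done by people; everything else is treated
        # here as a Code Activity run by an Activity Processing service.
        if self.activity == "Review":
            return "human operator"
        return "Activity Processing service"


def execution_plan(steps: List[BatchProcessStep]) -> List[Tuple[int, str, str]]:
    """Number the steps in order and note who performs each one."""
    return [(i, s.activity, s.performer) for i, s in enumerate(steps, start=1)]


activities = ["Split Pages", "Recognize", "Separate", "Classify",
              "Extract", "Review", "Export", "Dispose Batch"]
process = [BatchProcessStep(name) for name in activities]

for number, activity, performer in execution_plan(process):
    print(f"{number}. {activity}: {performer}")
```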

Architecture Objects

In Grooper, "Architecture Objects" organize and oversee the infrastructure and framework of the Grooper repository. A "Grooper Repository" is a tree structure of nodes representing both configuration and content objects. These objects include the...

Root ...
Project ...
File Store and ...
Machine objects ...

... each with distinct roles but also working in conjunction to manage resources and information flow within the repository.

The relationship among these "Architecture Objects" is foundational to the operation and scalability of Grooper's document processing capabilities.

  • The Root object provides a base structure.
  • The Project object defines the processing and design resources.
  • The File Store offers a storage utility for files and content.
  • The Machine objects represent the hardware resources for performing processing tasks.

Together, they comprise the essential components that underpin the function and manageability of the Grooper ecosystem.

Related Objects

Root

The Root object represents the topmost element of the Grooper repository. It serves as the starting point from which all other objects branch out. It is the anchor point for all other structures within the repository and a necessary element for the organization and linkage of all other objects within Grooper.

Project

Projects are collections of resources and serve as the primary containers for design components within Grooper. The Project object is where various processing objects such as Content Models, Batch Processes, "Profile Objects", and more are organized and managed. It allows for the encapsulation and modularization of these resources for easier management and reusability.

File Store

File Stores are storage locations within Grooper where file content associated with nodes is saved. They are crucial for managing the content that forms the basis of Grooper's processing tasks, allowing for the storage and retrieval of documents, images, and other "files".

Machine

Machines represent servers that have connected to the Grooper repository. They allow for the management of Grooper Service Instances and serve as connection points for processing jobs to be executed on the server hardware. Machine objects are essential for the scaling of processing capabilities and for distributing processing loads across multiple servers.

Miscellaneous Objects

The following objects are related only in that they don't fit neatly into the groups defined above in this article.

(un)Related Objects

Control Sheet

Control Sheets in Grooper are special pages used to control various aspects of the document scanning process. Control Sheets can serve multiple functions such as:

  • separating and classifying documents
  • changing image settings dynamically
  • creating a new folder with specific Content Types
  • triggering other actions that affect how documents are handled as they pass through the scanning equipment

Control Sheets are pre-printed with barcodes or other markers that Grooper recognizes and uses to perform specific actions based on the presence of the sheet. For instance, when a Control Sheet instructs the creation of a new folder, it can influence the hierarchy within a batch. This enables the management and organization of documents without manual intervention during the Scan activity.

Overall, Control Sheets are an intelligent way to guide the scanning workflow. Control Sheets can ensure that batches of documents are organized and processed according to predefined rules, thereby automating the structuring of scanned content into logical units within Grooper.

Data Rule

Data Rules in Grooper define the logic for automated data manipulation which occurs after data has been extracted from documents. These rules are applied to normalize or otherwise prepare data for downstream processes. Data Rules ensure that extracted data conforms to expected formats or meets certain quality standards.

The execution of a Data Rule takes place during the Apply Rules activity. Data Rules can be applied at different scopes such as each individual type of "Data Element". The rule can be set to execute conditionally based on a Trigger expression. If the Trigger evaluates to true, the Data Rule's True Action is applied, and if false, its False Action is executed. Data Rules can recursively apply logic to the hierarchy of data within a document instance, enabling complex data transformation and normalization operations that reflect the structure of the extracted data.
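The Trigger / True Action / False Action flow, applied recursively over a data hierarchy, can be sketched as follows. This is a simplified illustration, not Grooper's rule engine: the DataInstance and DataRule classes, the apply_rule function and the "Currency" field are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Fields = Dict[str, str]


@dataclass
class DataInstance:
    """Hypothetical extracted data: field values plus nested child records."""
    fields: Fields
    children: List["DataInstance"] = field(default_factory=list)


@dataclass
class DataRule:
    """Hypothetical rule: a trigger plus optional true and false actions."""
    trigger: Callable[[Fields], bool]
    true_action: Optional[Callable[[Fields], None]] = None
    false_action: Optional[Callable[[Fields], None]] = None


def apply_rule(rule: DataRule, instance: DataInstance) -> None:
    """Run the rule on this level, then recurse into every child record."""
    action = rule.true_action if rule.trigger(instance.fields) else rule.false_action
    if action is not None:
        action(instance.fields)
    for child in instance.children:
        apply_rule(rule, child)


def set_default(fields: Fields) -> None:
    fields["Currency"] = "USD"


def normalize(fields: Fields) -> None:
    fields["Currency"] = fields["Currency"].upper()


# If the currency is missing, default it; otherwise normalize its casing.
rule = DataRule(
    trigger=lambda f: f.get("Currency", "") == "",
    true_action=set_default,
    false_action=normalize,
)

document = DataInstance({"Currency": ""}, [DataInstance({"Currency": "eur"})])
apply_rule(rule, document)
print(document.fields["Currency"], document.children[0].fields["Currency"])
# USD EUR
```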

Overall, Data Rules in Grooper simplify extractors by separating the data normalization logic from the extraction logic, allowing for flexible and powerful post-extraction data processing.

Lexicon

Lexicons in Grooper are dictionary objects that store a list of keys or key-value pairs. They serve as resources for various functionalities within Grooper, such as establishing lists of words, phrases, field values, translations, weightings, and other forms of information relevant to document processing. Lexicons can define local entries or import entries from other Lexicons. Lexicons can even import entries using a Data Connection. The entries in a Lexicon can be utilized in different areas of Grooper, such as data extraction, fuzzy weightings, or OCR repair, providing a reference point that enhances the accuracy and consistency of the software's operations.
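The idea of local entries combined with entries imported from other Lexicons can be sketched with plain dictionaries. This is an illustration only: the Lexicon class below is hypothetical, and the choice that a local entry wins over an imported entry with the same key is an assumption made for the sketch, not documented Grooper behavior.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Lexicon:
    """Hypothetical Lexicon: local key-value entries plus imported Lexicons."""
    local_entries: Dict[str, str] = field(default_factory=dict)
    imports: List["Lexicon"] = field(default_factory=list)

    def entries(self) -> Dict[str, str]:
        """Merge imported entries first, then local ones (assumed to win)."""
        merged: Dict[str, str] = {}
        for imported in self.imports:
            merged.update(imported.entries())
        merged.update(self.local_entries)
        return merged


states = Lexicon({"OK": "Oklahoma", "TX": "Texas"})
regional = Lexicon({"KS": "Kansas"}, imports=[states])

print(sorted(regional.entries().items()))
# [('KS', 'Kansas'), ('OK', 'Oklahoma'), ('TX', 'Texas')]
```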

Object Library

Object Libraries in Grooper are .NET libraries that contain code files for customizing the functionality of Grooper. These libraries are used for a range of customization and integration tasks, allowing users to extend Grooper's capabilities by:

  • adding custom activities that execute within Batch Processes
  • creating custom commands available during the Review activity
  • defining custom methods that can be called from expressions on Data Field and Batch Process Step objects
  • establishing custom services that perform automated background tasks at regular intervals

Resource File

A Resource File object in Grooper is essentially a file that is stored as part of a Grooper Project. It can include various types of files such as text files or XML schema files. Resource File objects are created by dragging and dropping a file onto a Project object within Grooper. These files become part of the Project's resources and can be referenced and utilized throughout the project for various purposes such as:

  • defining data structures
  • storing CSS style sheets used by multiple Content Models
  • keeping Project notes in Grooper
  • providing scripts
  • supplying any other additional information required during the processing of documents