Object Nomenclature (Concept)

From Grooper Wiki
Revision as of 16:47, 27 March 2024

A Grooper environment consists of many interrelated objects.

Mastery of a Grooper environment is greatly enhanced by understanding the many objects that can exist and how they are related.

About

In Grooper, understanding the objects within the platform involves recognizing how various elements can serve similar functions and therefore be grouped together based on their shared functionalities. This concept stems from the recognition that disparate objects often perform analogous tasks, albeit with differing characteristics or representations.

By discerning commonalities in functionality across diverse objects, users can streamline their approach to data processing and analysis within Grooper. Rather than treating each object in isolation, users can categorize them based on their functional similarities, thus simplifying management and enhancing efficiency.

This approach fosters a more holistic understanding of the data ecosystem within Grooper, empowering users to devise more effective strategies for data extraction, classification, and interpretation. By recognizing the underlying functional relationships between objects, users can optimize workflows, improve accuracy, and derive deeper insights from their data.

High Level Overview

This article is meant to be a high level overview of all the objects in Grooper and how they're related. If you need more specific information on a particular object, please click the hyperlink for that specific object (as listed in the category's "Related Objects" section) to be taken to an article giving more information on that object.

Batch Objects

In Grooper, "Batch Objects" represent the hierarchical structure of documents being processed and consist of:

  • Batch ...
  • Batch Folder and ...
  • Batch Page objects ...

... each serving a distinct function within this hierarchy but also being fundamentally related.

The relationship between these objects is hierarchical in nature. The Batch object is the top level. It contains:

  • Batch Folders and ...
  • Batch Pages

Batch Folders may contain either further Batch Folders (to represent subfolders or grouped documents) or Batch Pages (to represent individual pages of documents). This structured approach allows Grooper to efficiently manage and process documents at various levels of granularity — from a full batch down to individual pages.
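
The hierarchy described above can be pictured as a simple tree. The following Python sketch is purely illustrative: the class names mirror the object names in this article, but the classes, fields, and the count_pages helper are hypothetical and are not part of Grooper's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class BatchPage:
    """Hypothetical stand-in for a single page."""
    number: int


@dataclass
class BatchFolder:
    """Hypothetical stand-in for a folder or document; may hold folders or pages."""
    name: str
    children: List[Union["BatchFolder", BatchPage]] = field(default_factory=list)


@dataclass
class Batch:
    """Hypothetical stand-in for the top-level Batch."""
    name: str
    children: List[Union[BatchFolder, BatchPage]] = field(default_factory=list)


def count_pages(node) -> int:
    """Walk the tree and count every page beneath the given node."""
    if isinstance(node, BatchPage):
        return 1
    return sum(count_pages(child) for child in node.children)


# A batch holding one two-page document and one loose page.
invoice = BatchFolder("Invoice", [BatchPage(1), BatchPage(2)])
batch = Batch("Example Batch", [invoice, BatchPage(3)])
total_pages = count_pages(batch)
```

Because every level of the tree can be walked the same way, the same kind of traversal works whether it starts at the full batch, a single folder, or an individual page.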

Related Objects

Batch

The Batch object is a fundamental construct in Grooper's architecture as it encompasses the documents that are grouped together to be processed through Grooper's workflow mechanisms, following the steps dictated by the related Batch Process.

Batch Folder

A Batch Folder in Grooper is defined as a container object within a Batch that is used to represent and organize both folders and documents. It can hold other Batch Folders or Batch Page objects as children. The Batch Folder acts as an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents.

Batch Page

A Batch Page object in Grooper represents an individual page within a Batch. The Batch Page object is the most granular unit in the hierarchy of "Batch Objects" in Grooper. It is created in one of two ways:

  • Physical pages can be acquired in Grooper by scanning them via the Grooper Desktop application.
  • Digital documents are acquired in Grooper as whole objects and represented as Batch Folders. Applying the Split Pages activity on a Batch Folder that represents a digital document will expose Batch Page objects as direct children.

Batch Pages allow Grooper to process and store information at the page level, which is essential for operations that include Image Processing and recognition of text (see Recognize). They enable the system to manage and process each page independently. This is critical for workflows that require detailed page-specific actions or for Batches composed of documents with different processing requirements per page.

Content Type Objects

In Grooper, the "Content Type Objects" consist of:

  • Content Model ...
  • Content Category and ...
  • Document Type objects.

Each of these objects serves a distinct function within Grooper's content classification and is related to the others through hierarchical relationships.

The relationship between these objects is established through a hierarchical inheritance system. Content Categories and Document Types are building blocks within a Content Model, which can be seen as the "tree". Content Categories act as the "branches". Document Types are the "leaves" of the hierarchy.

"Data Elements" can be defined on each "Content Type Object" and are inherited down the "tree" of the hierarchy.

  • "Data Elements" defined at the Content Model level are applied to all "Content Types" within the Content Model.
  • "Data Elements" defined at the Content Category level are applied to all "Content Types" that exist within that specific "branch".
  • "Data Elements" defined on a Document Type will apply to that specific "leaf".
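
The inheritance rules above can be modeled in a few lines. This is a minimal sketch under stated assumptions: the ContentType class, its fields, and every example name below are hypothetical and only illustrate how elements defined higher in the tree reach the nodes beneath them. It does not reflect how Grooper stores or resolves "Data Elements" internally.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentType:
    """Hypothetical node in a Content Type tree (model, category, or document type)."""
    name: str
    data_elements: List[str] = field(default_factory=list)
    parent: Optional["ContentType"] = None

    def inherited_data_elements(self) -> List[str]:
        """Return elements from the root down to this node, parents first."""
        inherited = self.parent.inherited_data_elements() if self.parent else []
        return inherited + self.data_elements


# Illustrative names only; none of these come from a real Content Model.
model = ContentType("Accounting Model", ["Document Date"])
category = ContentType("Invoices", ["Invoice Number"], parent=model)
doc_type = ContentType("Vendor A Invoice", ["Vendor A Account Code"], parent=category)

effective = doc_type.inherited_data_elements()
```

In this sketch the "leaf" ends up with the model-level element, the category-level element, and its own element, while the model itself only sees what was defined on it directly.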

These "Content Type Objects" work together in Grooper to enable sophisticated document processing workflows. With different types of documents properly classified, they can have their data extracted and be handled according to the rules and behaviors defined by their respective Document Types within a Content Model hierarchy.

Related Objects

Content Model

A Content Model defines the taxonomy of document sets in terms of the Document Types it contains. It also defines the "Data Elements" that appear on each Content Category and Document Type. Content Models serve as the root of a "Content Type" hierarchy and are crucial for organizing the different types of documents that Grooper can recognize and process.

Content Category

A Content Category is a container within a Content Model that holds other Content Categories and Document Type objects. It allows for further classification and grouping of Document Types within a taxonomy, aiding in the logical structuring of complex document sets. Besides grouping Document Types together, Content Categories also serve to create new branches in a "Data Element" hierarchy.

Document Type

A Document Type represents a distinct type of document, like an invoice or contract. Document Types are created as children of a Content Model or a Content Category and are used to classify individual documents. Each Document Type in the hierarchy defines the "Data Elements" and Behaviors that apply to documents of that specific classification.

Data Element Objects

The "Data Element Objects" within Grooper consist of:

  • Data Model ...
  • Data Field ...
  • Data Section ...
  • Data Table and ...
  • Data Column objects.

Each of these objects has its own function within the data capture and organization framework. These objects are, however, all interconnected within Grooper's data extraction architecture.

The relationship between these "Data Element Objects" is hierarchical and modular.

  • The Data Model acts as the overall blueprint for data extraction.
  • Data Sections structure the document into logical parts. Data Sections can also serve as simple organizational objects within a Data Model to bucket similar "Data Elements" together.
  • Data Tables are incorporated into the model to handle tabular data. Each Data Table comprises Data Columns which specify the format and rules for columnar data extraction.
  • Finally, Data Fields are the fundamental units of data, each representing an individual piece of non-repeated data within a document. The exception is a Data Field contained within a "multi instance" Data Section, which occurs repeatedly within a document.

Related Objects

Data Model

A Data Model serves as the top-tier structure defining the taxonomy for "Data Elements" and is leveraged during the Extract activity to extract data from a Batch. It is a hierarchy of "Data Elements" that sets the stage for the organization, extraction logic, and review behavior of data collected from documents.

Data Field

A Data Field is designed to capture a single piece of information from a document, such as a name or date, which is a fundamental data point required from the content.

Data Section

A Data Section is a grouping mechanism for related Data Fields. Data Sections organize and segment them into logical divisions of a document based on the structure and semantics of the information the documents contain.

Data Table

A Data Table is utilized for extracting repeating data that's formatted in rows and columns, allowing for complex multi-instance data organization that would be present in table-formatted content.

Data Column

A Data Column is a child object of a Data Table, representing individual columns and defining the type of data each column holds along with its data extraction properties.

Extractor Objects

There are three types of "Extractor Objects" in Grooper:

  • Value Reader
  • Data Type
  • Field Class

All three of these objects perform a similar function. They are objects that are configured to return data from documents. However, they differ in their configuration and data extraction purpose.

"Extractor Objects" are tools to extract/return data. Ultimately, "Data Elements" are what collect data. They may use "Extractor Objects" to help collect data in a Data Model.

To that end, extractor objects serve three purposes:

  1. To be re-usable units of extraction
  2. To collate data
  3. To leverage machine learning algorithms to target data in the flow of text

Re-Usability

"Extractor Objects" are meant to be referenced either by other "Extractor Objects", or more importantly, by "Data Elements". For example, an individual Data Field can be configured on its own to collect a date value, such as the "Received Date" on an invoice. However, what if another Data Field is collecting a different date format, like the "Due Date" on the same invoice? In this case you would create one "Extractor Object", like a Value Reader, to collect any and all date formats. You could then have each Data Field reference that one Value Reader and further configure each individual Data Field to differentiate their specific date value.

Data Collation

Another example would be configuring a Data Type to target entire rows of information within a table of data. Several Value Reader "Extractor Objects" could be made as children of the Data Type, each targeting a specific value within the table row. The parent Data Type would then collate the results of its child Value Reader "Extractor Objects" into one result. A Data Table would then reference the Data Type to collect the appropriate rows of information.

Machine Learning

Many documents contain important pieces of information buried within the flow of text, like a legal document. These types of documents and the data they contain require an entirely different approach to extracting data than a highly structured document like an invoice. For these situations you can use a "trainable" "Extractor Object" known as a Field Class to leverage machine learning algorithms to target important information.

Extractor Objects vs Value Extractors

"Extractor Objects" should not be confused with "Value Extractors". There are many places in Grooper where extraction logic can be applied for one purpose or another. In these cases a "Value Extractor" is chosen to define the logic required to return a desired value. In fact, the "Extractor Objects" themselves each leverage specific "Value Extractors" to define their logic.

"Value Extractor" examples:

  • Pattern-Match uses regular expressions to return results.
  • Labeled OMR uses a regex and computer vision to return results for checkboxes.
  • Other "Value Extractors" may use a combination of "Value Extractors" that work together to return results in specific ways.
    • The Labeled Value "Value Extractor" defines a "Value Extractor" for both its Label Extractor and Value Extractor properties.

However, "Extractor Objects" are used when you need to reference them for their designated strengths:

  • re-usability
  • collation
  • machine learning

Related Objects

Value Reader

A Value Reader defines a single extraction operation. You set the type of extractor on the Value Reader that matches the specific data you're aiming to capture. For example, you would use the Pattern-Match "Value Extractor" to return data using a regular expression. You would use a Value Reader when you need to extract a single result or a list of simple results from a document.

Data Type

A Data Type in Grooper holds a collection of extractors and settings that manage how multiple matches from extractors are consolidated into a result set.

  • For example, if you're extracting a date that could appear in multiple formats within a document, you'd use various "Extractor Objects" (each capturing a different format) as children of a Data Type.

The Data Type also defines how to collate results from one or more extractors into a referenceable output. The simplest type of collation (Individual) would just return all individual extractors' results as a list of results.
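
The idea behind the simplest collation can be sketched outside of Grooper. The Python below is an assumption-laden illustration: it uses Python's re module rather than Grooper's extraction engine, and the pattern_extractor and collate_individual helpers are hypothetical names invented for this example. Each child extractor targets one date format, and the parent simply returns every child's results as one list.

```python
import re
from typing import Callable, List

# Each hypothetical child extractor is just a function returning matches.
Extractor = Callable[[str], List[str]]


def pattern_extractor(pattern: str) -> Extractor:
    """Build an extractor that returns every regex match in the text."""
    compiled = re.compile(pattern)
    return lambda text: [m.group(0) for m in compiled.finditer(text)]


def collate_individual(extractors: List[Extractor], text: str) -> List[str]:
    """Return every child extractor's results as one flat list."""
    results: List[str] = []
    for extract in extractors:
        results.extend(extract(text))
    return results


# Two child extractors, each targeting a different date format.
date_extractors = [
    pattern_extractor(r"\b\d{2}/\d{2}/\d{4}\b"),  # e.g. 03/27/2024
    pattern_extractor(r"\b\d{4}-\d{2}-\d{2}\b"),  # e.g. 2024-04-26
]

text = "Received 03/27/2024. Payment due 2024-04-26."
dates = collate_individual(date_extractors, text)
```

Note that this sketch orders results by child extractor rather than by position on the page, and it makes no attempt to combine results into rows or blocks, which is what other collation methods are for.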

Data Types are also used for recognizing complex 2D data structures, like address blocks or table rows. Different collation methods would be used in these cases to combine results in different ways.

Field Class

A Field Class is a trainable extractor that distinguishes between multiple instances of similar data within a document by understanding the context in which they occur. Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler "Extractor Objects" like Value Readers or Data Types.

Field Classes are most useful when attempting to find values within the flow of natural language. This method involves training with positive and negative examples to distinguish the right context. You'd opt for a Field Class when the value you're after is an entire clause within a contract, or a specific value defined within the flow of text.

Connection Objects

In Grooper, "Connection Objects" play a vital role in integrating external data sources and repositories. They consist of:

  • CMIS Connection ...
  • CMIS Repository and ...
  • Data Connection objects.

Each of these objects serves a unique purpose while also being related through their collaborative use in connecting and managing data across various platforms and databases.

These Connection Objects are related in their collective ability to bridge Grooper with external data sources and content repositories.

  • The CMIS Connection object serves as the gateway to multiple content management systems.
  • The CMIS Repository object uses this connection to organize and manage document access for those systems.
  • The Data Connection object links Grooper to databases, allowing it to perform data lookups and synchronize with external structured data sources.

Together these Connection Objects enable Grooper to extend its data processing capabilities beyond its local domain and integrate seamlessly with external systems for end-to-end document and data management.

Related Objects

CMIS Connection

The CMIS Connection in Grooper provides a standardized way of connecting to various content management systems (CMS).

  • For those that support the CMIS standard, the CMIS Connection connects to the CMS using the CMIS standard.
  • For those that do not, the CMIS Connection normalizes connection and transfer protocol as if they were a CMIS platform.

This object allows Grooper to communicate with multiple external storage platforms, enabling access to documents and content that reside outside of Grooper's immediate environment.

CMIS Repository

A CMIS Repository represents a logical container for documents on an external storage platform that is accessed via a CMIS Connection. This object facilitates the organization and retrieval of documents stored in a CMIS-compliant repository, enabling Grooper to work with documents as if they were within its local infrastructure. It is created as a "child" of the CMIS Connection object via the "Import" button found in the top-right of the UI after successfully configuring the CMIS Connection object and creating a connection to its destination. The CMIS Repository object is referenced for lookups, CMIS Import, and the Export activity.

Data Connection

A Data Connection defines the settings necessary to establish connectivity with a database. It holds the configuration details required for connecting to and interacting with a database. These interactions may include conducting lookups, exports, or other actions that relate to database management systems (DBMS). Once configured, a Data Connection object can be referenced by other components in Grooper for various DBMS-related activities.

Profile Objects

"Profile Objects" in Grooper serve as pre-configured settings templates used across various stages of document processing, such as scanning, image cleanup, and document separation. These objects, which include:

  • OCR Profile ...
  • Scanner Profile ...
  • Separation Profile and ...
  • IP Profile ...

... have their own individual functions but are also related by defining structured approaches to handling documents within Grooper.

By creating distinct profiles for each aspect of the document processing pipeline, Grooper allows for customization and optimization of each step. This standardizes settings across similar document types or processing requirements, which can contribute to consistency and efficiency in processing tasks. These "Profile Objects" collectively establish a comprehensive, repeatable, and optimized workflow for processing documents from the point of capture to the point of data extraction.

Related Objects

IP Profile

An Image Processing (IP) Profile details the operations and parameters for image enhancement and cleanup. These operations improve the accuracy of further processing steps, like OCR for the Recognize activity, or classification.

IP Group

An IP Group is a subsidiary object within an IP Profile that creates a hierarchical structure for organizing image processing commands. IP Groups may contain other IP Groups or IP Step objects.

IP Step

An IP Step is the basic unit within an IP Profile that defines a single image processing operation. IP Steps are performed sequentially within their parent IP Group or IP Profile.
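
The nesting and ordering described for IP Profiles, IP Groups, and IP Steps can be illustrated with a small tree walk. This is only a sketch: the IPStep and IPGroup classes and the execution_order helper are hypothetical, the profile is modeled as the outermost group for simplicity, and the step names are illustrative labels rather than a list of real Grooper commands.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class IPStep:
    """Hypothetical single image processing operation, identified by name."""
    name: str


@dataclass
class IPGroup:
    """Hypothetical container holding steps or further groups, in order."""
    name: str
    children: List[Union["IPGroup", IPStep]] = field(default_factory=list)


def execution_order(node: Union[IPGroup, IPStep]) -> List[str]:
    """Flatten the tree into the order its steps would run, top to bottom."""
    if isinstance(node, IPStep):
        return [node.name]
    order: List[str] = []
    for child in node.children:
        order.extend(execution_order(child))
    return order


# The profile is modeled here as the outermost group; step names are illustrative.
profile = IPGroup("Cleanup Profile", [
    IPStep("Deskew"),
    IPGroup("Noise Removal", [IPStep("Despeckle"), IPStep("Remove Lines")]),
    IPStep("Binarize"),
])
steps = execution_order(profile)
```

Grouping does not change the sequence in this model. It only organizes the steps, which still run one after another in the order they appear.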

OCR Profile

An OCR Profile configures the settings for optical character recognition (OCR) leveraged by the Recognize activity. "OCR" converts images of text into machine-encoded text. OCR Profile objects influence how effectively textual content is recognized and extracted from document images.

Scanner Profile

A Scanner Profile outlines the specifications for scanning physical documents into digital forms. This includes settings like resolution, color mode, and any post-scan image processing or enhancement functions.

See Desktop Scanning in Grooper for more information.

Separation Profile

Separation Profiles contain rules and settings that determine how batches of scanned pages are separated into individual documents or sections, often using barcodes, blank pages, or patch codes as indicators for separation points.

Queue Objects

"Queue Objects" in Grooper are structures designed to manage and distribute tasks within the document processing workflow. There are two main types of queues:

  • Processing Queue and ...
  • Review Queue ...

... each with a distinct function but inherently interconnected as they both coordinate the flow of work through Grooper.

Related Objects

Processing Queue

A Processing Queue is designed for tasks performed by machines, which include automated steps in the document processing lifecycle. Processing Queues are used to distribute machine tasks among different servers and control the concurrency or processing rate of these tasks.

  • For example, activities such as rendering documents or exporting data can be managed so that only one activity instance runs per machine or so multiple instances are processed concurrently, according to the queue configuration.

Review Queue

A Review Queue is designated for human-performed tasks. It organizes the Review tasks that require human attention and can distribute these tasks among different groups of users based on the queue's settings. Review Queues can be assigned at the Batch Process level to filter work by an entire process, or assigned to Review activities at the Batch Process Step level to filter tasks at a more granular, step-based level.

Processing Queues vs Review Queues

The relationship between Processing Queues and Review Queues lies in their roles in managing the workflow and task distribution in Grooper. Both facilitate the progression of document processing from automatic operations to those requiring human intervention.

  • Processing Queues handle the automation side of the operation, ensuring that machine tasks are efficiently allocated across the available resources.
  • Review Queues oversee the user-driven aspects of the workflow, particularly quality control and verification processes that require manual input.

Together, these queues ensure a smooth transition between automated and manual stages of document processing and help maintain order and efficiency within the system.

Process Objects

In Grooper, Batch Process and Batch Process Step objects are closely related: together they manage and execute a sequence of steps designed to process a collection of documents known as a Batch.

A Batch Process consists of a series of Batch Process Steps meant to be executed in a particular sequence for a batch of documents. Before a Batch Process can be used in production, it must be "published". Publishing a Batch Process will create a read-only copy in the "Processes" folder of the node tree, making it accessible for production purposes.

In essence, a Batch Process defines the overall workflow for processing documents. It relies on Batch Process Steps to perform each action required during the process. Each Batch Process Step represents a discrete operation, or "activity", within the broader scope of the Batch Process. Batch Processes and Batch Process Steps work together to ensure that documents are handled in a consistent and controlled manner.

Related Objects

Batch Process

A Batch Process is a crucial component in Grooper's architecture. A Batch Process orchestrates the document processing strategy and ensures each batch of documents is managed systematically and efficiently.

Batch Processes by themselves do nothing. Instead, the workflows they execute are designed by adding child Batch Process Steps.

Batch Process Step

A Batch Process Step is a specific action within the sequence defined by a Batch Process. A Batch Process Step plays a critical role in automating and managing the flow of documents through the various stages of processing within Grooper.

Architecture Objects

In Grooper, "Architecture Objects" organize and oversee the infrastructure and framework of the Grooper repository. A "Grooper Repository" is a tree structure of nodes representing both configuration and content objects. These objects include the ...

  • Root ...
  • Project ...
  • Filestore and ...
  • Machine objects ...

... each with distinct roles but also working in conjunction to manage resources and information flow within the repository.

The relationship among these "Architecture Objects" is foundational to the operation and scalability of Grooper's document processing capabilities.

  • The Root object provides a base structure.
  • The Project object defines the processing and design resources.
  • The Filestore offers a storage utility for files and content.
  • The Machine objects represent the hardware resources for performing processing tasks.

Together, they comprise the essential components that underpin the function and manageability of the Grooper ecosystem.

Related Objects

Root

The Root object represents the topmost element of the Grooper repository. It serves as the starting point from which all other objects branch out. It is the anchor point for all other structures within the repository and a necessary element for the organization and linkage of all other objects within Grooper.

Project

A Project is a collection of resources and serves as a primary container for design components within Grooper. The Project object is where various processing objects such as Content Models, Batch Processes, "Profile Objects", and more are organized and managed. It allows for the encapsulation and modularization of these resources for easier management and reusability.

Filestore

The Filestore is a storage location within Grooper where file content associated with nodes is saved. It's crucial for managing the content that forms the basis of Grooper's processing tasks, allowing for the storage and retrieval of documents, images, and other "files".

Machine

A Machine represents a server that has connected to the Grooper repository. It allows for the management of Grooper Service Instances and serves as a connection point for processing jobs to be executed on the server hardware. Machine objects are essential for the scaling of processing capabilities and for distributing processing loads across multiple servers.

Miscellaneous Objects

The following objects are related only in that they don't fit neatly into the groups defined above in this article.

Related Objects

Lexicon

A Lexicon in Grooper is a dictionary object that stores a list of keys or key-value pairs. It serves as a resource for various functionalities within Grooper, such as establishing lists of words, phrases, field values, translations, weightings, and other forms of information relevant to document processing. Lexicons can define local entries, import entries from other Lexicons, and even import entries using a Data Connection. The entries in a Lexicon can be utilized in different areas of Grooper, such as data extraction, fuzzy weightings, or OCR repair, providing a reference point that enhances the accuracy and consistency of the software's operations.

Data Rule

A Data Rule object in Grooper defines the logic for automated data manipulation which occurs after data has been extracted from documents. These rules are applied to normalize or otherwise prepare data for downstream processes. Data Rules ensure that extracted data conforms to expected formats or meets certain quality standards.

The execution of a Data Rule takes place during the Apply Rules activity. Data Rules can be applied at different scopes such as each individual type of "Data Element". The rule can be set to execute conditionally based on a Trigger expression. If the Trigger evaluates to true, the Data Rule's True Action is applied, and if false, its False Action is executed. Data Rules can recursively apply logic to the hierarchy of data within a document instance, enabling complex data transformation and normalization operations that reflect the structure of the extracted data.
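
The Trigger, True Action, and False Action relationship can be sketched as a small conditional. The code below is a hypothetical model written for this article: the DataRule class, its field names, and the two example actions are assumptions chosen for illustration, not Grooper's configuration surface, and the sketch applies to a single flat record rather than recursing through a data hierarchy.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Record = Dict[str, str]


@dataclass
class DataRule:
    """Hypothetical model of a rule: a trigger plus a true and a false action."""
    trigger: Callable[[Record], bool]
    true_action: Callable[[Record], None]
    false_action: Callable[[Record], None]

    def apply(self, record: Record) -> None:
        """Run the true action when the trigger holds, else the false action."""
        if self.trigger(record):
            self.true_action(record)
        else:
            self.false_action(record)


def normalize_state(record: Record) -> None:
    """Example true action: trim whitespace and upper-case the value."""
    record["State"] = record["State"].strip().upper()


def flag_missing(record: Record) -> None:
    """Example false action: mark the value as unknown."""
    record["State"] = "UNKNOWN"


rule = DataRule(
    trigger=lambda r: bool(r.get("State", "").strip()),
    true_action=normalize_state,
    false_action=flag_missing,
)

extracted = {"State": " ok "}
rule.apply(extracted)
```

Keeping the normalization in a rule like this, rather than inside the extractor, is the separation of extraction logic from post-extraction logic that this section describes.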

Overall, Data Rules in Grooper simplify extractors by separating the data normalization logic from the extraction logic, allowing for flexible and powerful post-extraction data processing.


Resource File

A Resource File object in Grooper is essentially a file that is stored as part of a Grooper Project. It can include various types of files such as text files or XML schema files. Resource File objects are created by dragging and dropping a file onto a Project object within Grooper. These files become part of the Project's resources and can be referenced and utilized throughout the project for various purposes such as:

  • defining data structures
  • storing CSS style sheets used by multiple Content Models
  • keeping Project notes in Grooper
  • providing scripts
  • supplying any other additional information required during the processing of documents

Object Library

An Object Library in Grooper is a .NET library that contains code files for customizing the functionality of Grooper. This library is used for a range of customization and integration tasks, allowing users to extend Grooper's capabilities by:

  • adding custom activities that execute within Batch Processes
  • creating custom commands available during the Review activity
  • defining custom methods that can be called from expressions on Data Field and Batch Process Step objects
  • establishing custom services that perform automated background tasks at regular intervals

Control Sheet

A Control Sheet in Grooper is a special page used to control various aspects of the document scanning process. Control Sheets can serve multiple functions such as:

  • separating and classifying documents
  • changing image settings dynamically
  • creating a new folder with specific Content Types
  • triggering other actions that affect how documents are handled as they pass through the scanning equipment

Control Sheets are pre-printed with barcodes or other markers that Grooper recognizes and uses to perform specific actions based on the presence of the sheet. For instance, when a Control Sheet instructs the creation of a new folder, it can influence the hierarchy within a Batch. This enables the management and organization of documents without manual intervention during the Scan activity.

Overall, Control Sheets are an intelligent way to guide the scanning workflow. Control Sheets can ensure that batches of documents are organized and processed according to predefined rules, thereby automating the structuring of scanned content into logical units within Grooper.