Object Nomenclature (Concept): Difference between revisions
added information to "Extractor Objects" section // via Wikitext Extension for VSCode |
Dgreenwood (talk | contribs) |
Revision as of 11:27, 14 March 2024

Mastery of a Grooper environment is greatly enhanced by understanding the many objects that can exist and how they are related.
About
In Grooper, understanding the objects within the platform involves recognizing how various elements can serve similar functions and therefore be grouped together based on their shared functionalities. This concept stems from the recognition that disparate objects often perform analogous tasks, albeit with differing characteristics or representations.
By discerning commonalities in functionality across diverse objects, users can streamline their approach to data processing and analysis within Grooper. Rather than treating each object in isolation, users can categorize them based on their functional similarities, thus simplifying management and enhancing efficiency.
This approach fosters a more holistic understanding of the data ecosystem within Grooper, empowering users to devise more effective strategies for data extraction, classification, and interpretation. By recognizing the underlying functional relationships between objects, users can optimize workflows, improve accuracy, and derive deeper insights from their data.
- Batch Objects
- Content Type Objects
- Data Element Objects
- Extractor Objects
- Connection Objects
- Profile Objects
- Queue Objects
- Process Objects
- Architecture Objects
- Miscellaneous Objects
Batch Objects
In Grooper, "Batch Objects" represent the hierarchical structure of documents being processed and consist of:
- Batch ...
- Batch Folder and ...
- Batch Page objects ...
... each serving a distinct function within this hierarchy but also being fundamentally related.
The relationship between these objects is hierarchical in nature. The Batch object is the top level. It contains:
- Batch Folders and ...
- Batch Pages
Batch Folders may contain either further Batch Folders (to represent subfolders or grouped documents) or Batch Pages (to represent individual pages of documents). This structured approach allows Grooper to efficiently manage and process documents at various levels of granularity — from a full batch down to individual pages.
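The hierarchy described above can be pictured as a small tree. The following is only an illustrative sketch in Python, not Grooper's object model: the class names mirror the article's terms, while the fields and the `count_pages` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class BatchPage:
    """Hypothetical stand-in for a Batch Page: the most granular unit."""
    number: int


@dataclass
class BatchFolder:
    """Hypothetical stand-in for a Batch Folder: holds folders or pages."""
    name: str
    children: List[Union["BatchFolder", BatchPage]] = field(default_factory=list)


@dataclass
class Batch:
    """Hypothetical stand-in for a Batch: the top level of the hierarchy."""
    name: str
    children: List[Union[BatchFolder, BatchPage]] = field(default_factory=list)


def count_pages(node) -> int:
    """Recursively count the pages beneath a batch, folder, or page."""
    if isinstance(node, BatchPage):
        return 1
    return sum(count_pages(child) for child in node.children)


invoice = BatchFolder("Invoice", [BatchPage(1), BatchPage(2)])
contracts = BatchFolder("Contracts", [BatchFolder("Contract A", [BatchPage(3)])])
batch = Batch("Daily Mail", [invoice, contracts])

print(count_pages(batch))  # 3
```

Because folders can nest, the same recursive walk works at any level of granularity, from the full batch down to a single folder.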
Related Objects
Batch
The Batch object is a fundamental construct in Grooper's architecture as it encompasses the documents that are grouped together to be processed through Grooper's workflow mechanisms, following the steps dictated by the related Batch Process.
Batch Folder
A Batch Folder in Grooper is defined as a container object within a Batch that is used to represent and organize both folders and documents. It can hold other Batch Folders or Batch Page objects as children. The Batch Folder acts as an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents.
Batch Page
A Batch Page object in Grooper represents an individual page within a Batch. The Batch Page object is the most granular unit in the hierarchy of "Batch Objects" in Grooper. It is created in one of two ways:
- Physical pages can be acquired in Grooper by scanning them via the Grooper Desktop application.
- Digital documents are acquired in Grooper as whole objects and represented as Batch Folders. Applying the Split Pages activity on a Batch Folder that represents a digital document will expose Batch Page objects as direct children.
Batch Pages allow Grooper to process and store information at the page level, which is essential for operations that include Image Processing and recognition of text (see Recognize). They enable the system to manage and process each page independently. This is critical for workflows that require detailed page-specific actions or for Batches composed of documents with different processing requirements per page.
Content Type Objects
In Grooper, the "Content Type Objects" consist of:
- Content Model ...
- Content Category and ...
- Document Type objects.
Each of these objects serves a distinct function within Grooper's content classification, and they are related to each other hierarchically.
The relationship between these objects is established through a hierarchical inheritance system. Content Categories and Document Types are building blocks within a Content Model, which can be seen as the "tree". Content Categories act as the "branches". Document Types are the "leaves" of the hierarchy.
Data Elements can be defined on each Content Type object and are inherited down the "tree" of the hierarchy.
- Data Elements defined at the Content Model level are applied to all Content Types within the Content Model.
- Data Elements defined at the Content Category level are applied to all Content Types that exist within that specific "branch".
- Data Elements defined on a Document Type will apply to that specific "leaf".
These "Content Type Objects" work together in Grooper to enable sophisticated document processing workflows. With different types of documents properly classified, they can have their data extracted and be handled according to the rules and behaviors defined by their respective Document Types within a Content Model hierarchy.
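The inheritance of Data Elements down the "tree" can be sketched as follows. This is a minimal illustration and not how Grooper stores Content Types: the `ContentType` class, its fields, and the sample element names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentType:
    """Hypothetical node: a Content Model, Content Category, or Document Type."""
    name: str
    data_elements: List[str] = field(default_factory=list)
    parent: Optional["ContentType"] = None

    def inherited_elements(self) -> List[str]:
        """Return elements defined on ancestors first, then this node's own."""
        inherited = self.parent.inherited_elements() if self.parent else []
        return inherited + self.data_elements


model = ContentType("Accounting", ["Received Date"])               # the "tree"
category = ContentType("Payables", ["Vendor Name"], parent=model)  # a "branch"
invoice = ContentType("Invoice", ["Invoice Number"], parent=category)  # a "leaf"
receipt = ContentType("Receipt", ["Store Name"], parent=model)         # a "leaf"

print(invoice.inherited_elements())
# ['Received Date', 'Vendor Name', 'Invoice Number']
print(receipt.inherited_elements())
# ['Received Date', 'Store Name']
```

The "Receipt" leaf sits outside the "Payables" branch, so it picks up only the element defined at the model level.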
Related Objects
Content Model
A Content Model defines the taxonomy of document sets in terms of the Document Types it contains. It also defines the Data Elements that appear on each Content Category and Document Type. Content Models serve as the root of a Content Type hierarchy and are crucial for organizing the different types of documents that Grooper can recognize and process.
Content Category
A Content Category is a container within a Content Model that holds other Content Categories and Document Type objects. It allows for further classification and grouping of Document Types within a taxonomy, aiding in the logical structuring of complex document sets. Besides grouping Document Types together, Content Categories also serve to create new branches in a Data Element hierarchy.
Document Type
A Document Type represents a distinct type of document, like an invoice or contract. Document Types are created as children of a Content Model or a Content Category and are used to classify individual documents. Each Document Type in the hierarchy defines the Data Elements and Behaviors that apply to documents of that specific classification.
Data Element Objects
The "Data Element Objects" within Grooper consist of:
- Data Field ...
- Data Section ...
- Data Table and ...
- Data Column objects.
Each of these objects has its own function within the data capture and organization framework. These objects are, however, all interconnected within Grooper's data extraction architecture.
The relationship between these "Data Element Objects" is hierarchical and modular. The Data Model acts as the overall blueprint for data extraction. Data Sections structure the document into logical parts. Data Tables are incorporated into the model to handle tabular data. Each Data Table comprises Data Columns, which specify the format and rules for columnar data extraction. Finally, Data Fields are the fundamental units of data, each representing an individual piece of non-repeated data within a document. The exception is a Data Field contained within a Data Section that occurs repeatedly within a document. The Data Model ties these elements together, dictating the inheritance of properties and the flow of data extraction processes.
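One way to picture this structure is as a nested outline. The sketch below is illustrative only: the dictionary layout, the element names, and the `element_paths` helper are assumptions for the example and do not reflect Grooper's storage format.

```python
from typing import Dict, List

# Hypothetical Data Model for an invoice, written as nested dictionaries.
data_model: Dict = {
    "type": "Data Model",
    "name": "Invoice",
    "children": [
        {"type": "Data Field", "name": "Invoice Number", "children": []},
        {
            "type": "Data Section",
            "name": "Remit To",
            "children": [
                {"type": "Data Field", "name": "Vendor Name", "children": []},
            ],
        },
        {
            "type": "Data Table",
            "name": "Line Items",
            "children": [
                {"type": "Data Column", "name": "Quantity", "children": []},
                {"type": "Data Column", "name": "Unit Price", "children": []},
            ],
        },
    ],
}


def element_paths(node: Dict, prefix: str = "") -> List[str]:
    """List the node and every descendant as slash-separated paths."""
    path = f"{prefix}/{node['name']}" if prefix else node["name"]
    paths = [path]
    for child in node["children"]:
        paths.extend(element_paths(child, path))
    return paths


for p in element_paths(data_model):
    print(p)
```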
Related Objects
Data Model
A Data Model serves as the top-tier structure defining the taxonomy for data elements and is leveraged during the Extract activity to extract data from a Batch. It is a hierarchy of Data Elements that sets the stage for the organization, extraction logic, and review behavior of data collected from documents.
Data Field
A Data Field is designed to capture a single piece of information from a document, such as a name or date, which is a fundamental data point required from the content.
Data Section
A Data Section is a grouping mechanism for related Data Fields. Data Sections organize and segment them into logical divisions of a document based on the structure and semantics of the information the documents contain.
Data Table
A Data Table is utilized for extracting repeating data that's formatted in rows and columns, allowing for complex multi-instance data organization that would be present in table-formatted content.
Data Column
A Data Column is a child object of a Data Table, representing individual columns and defining the type of data each column holds along with its data extraction properties.
Extractor Objects
There are three types of "Extractor Objects" in Grooper:
- Value Reader
- Data Type
- Field Class
All three of these objects perform a similar function. They are objects that are configured to return data from documents. However, they differ in their configuration and data extraction purpose.
Extractor Objects are tools for extracting data. Ultimately, Data Elements are what collect data. They may use Extractor Objects to help collect data in a Data Model.
To that end, Extractor Objects serve three purposes:
- To be re-usable units of extraction.
- To collate data.
- To leverage machine learning algorithms to target data in the flow of text.
Re-Usability
"Extractor Objects" are meant to be referenced either by other "Extractor Objects" or, more importantly, by Data Elements. For example, an individual Data Field can be configured on its own to collect a date value, such as the "Received Date" on an invoice. However, what if another Data Field is collecting a different date format, like the "Due Date" on the same invoice? In this case you would create one "Extractor Object", like a Value Reader, to collect any and all date formats. You could then have each Data Field reference that one Value Reader and further configure each individual Data Field to single out its specific date value.
Data Collation
Another example would be configuring a Data Type to target entire rows of information within a table of data. Several Value Reader "Extractor Objects" could be made as children of the Data Type, each targeting a specific value within the table row. The parent Data Type would then collate the results of its child Value Reader "Extractor Objects" into one result. A Data Table would then reference the Data Type to collect the appropriate rows of information.
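As a rough sketch of the collation idea, the example below joins three child patterns into one row-level pattern and returns one combined result per table row. It only illustrates the concept with Python's `re` module. The pattern names, the sample table, and the `collate_rows` helper are hypothetical, and Grooper's own collation settings are not shown here.

```python
import re

# Hypothetical child extractors, one per value in a table row.
CHILD_PATTERNS = {
    "quantity": r"(?P<quantity>\d+)",
    "description": r"(?P<description>[A-Za-z ]+?)",
    "amount": r"(?P<amount>\d+\.\d{2})",
}

# The parent combines its children, in order, into a single row-level pattern.
ROW_PATTERN = re.compile(
    r"^" + r"\s+".join(CHILD_PATTERNS.values()) + r"$", re.MULTILINE
)


def collate_rows(text: str) -> list:
    """Return one collated result (a dict of child values) per matching row."""
    return [m.groupdict() for m in ROW_PATTERN.finditer(text)]


table_text = """Qty Description Amount
2 Blue Widget 19.98
10 Steel Bolt 4.50"""

for row in collate_rows(table_text):
    print(row)
# {'quantity': '2', 'description': 'Blue Widget', 'amount': '19.98'}
# {'quantity': '10', 'description': 'Steel Bolt', 'amount': '4.50'}
```

The header line is skipped because it does not satisfy every child pattern in order, which is the point of collating at the row level.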
Machine Learning
Extractor Objects vs Extractor Types
"Extractor Objects" should not be confused with "Extractor Types". There are many places in Grooper where extraction logic can be applied for one purpose or another. In these cases an "Extractor Type" is chosen to define the logic required to return a desired value. In fact, the "Extractor Objects" themselves each leverage specific "Extractor Types" to define their logic.
- For example, Pattern-Match uses regex to return results.
- Labeled OMR uses regex and computer vision to return results for checkboxes.
- Other Extractor Types may use a combination of Extractor Types that work together to return results in specific ways.
However, "Extractor Objects" are used when you need to reference them for their designated strengths (re-usability, collation, or machine learning).
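The distinction can be sketched by treating an extractor type as a piece of logic and an extractor object as a named, referenceable wrapper around that logic. Everything below is a hypothetical Python analogy: `pattern_match` is only loosely modeled on the Pattern-Match type, and the `ValueReader` class is not Grooper's implementation.

```python
import re
from typing import Callable, List

# An "extractor type" is modeled here as any function from text to results.
ExtractorType = Callable[[str], List[str]]


def pattern_match(pattern: str) -> ExtractorType:
    """Hypothetical regex-based extractor type, loosely modeled on Pattern-Match."""
    compiled = re.compile(pattern)
    return lambda text: [m.group() for m in compiled.finditer(text)]


class ValueReader:
    """Hypothetical referenceable extractor object wrapping one extractor type."""

    def __init__(self, name: str, extractor_type: ExtractorType):
        self.name = name
        self.extractor_type = extractor_type

    def extract(self, text: str) -> List[str]:
        return self.extractor_type(text)


# Logic applied inline, in one place only.
inline_hits = pattern_match(r"\bPO-\d{5}\b")("Ref PO-00123 and PO-00456")
print(inline_hits)  # ['PO-00123', 'PO-00456']

# The same kind of logic wrapped in an object that other objects can reference.
po_reader = ValueReader("PO Number", pattern_match(r"\bPO-\d{5}\b"))
print(po_reader.extract("Order PO-99999"))  # ['PO-99999']
```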
Related Objects
Value Reader
A Value Reader defines a single extraction operation. You set the type of extractor on the Value Reader that matches the specific data you're aiming to capture. For example, you would use the Pattern-Match extractor type to return data using a regular expression. You would use a Value Reader when you need to extract a single result or a list of simple results from a document.
Data Type
A Data Type in Grooper holds a collection of extractors and settings that manage how multiple matches from extractors are consolidated into a result set.
For example, if you're extracting a date that could appear in multiple formats within a document, you'd use various child extractors (each capturing a different format). The Data Type would define how to collate those into a referenceable output.
The simplest type of collation (Individual collation) would just return all individual extractors' results as a list of results.
Data Types are also used for recognizing complex 2D data structures, like address blocks or table rows. Different collation methods would be used in these cases to combine results in different ways.
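The simplest case mentioned above, returning every child extractor's results as one list, might look like the following sketch. The three child patterns and the `collate_individual` helper are assumptions made for illustration. Sorting the results by position is also a choice made here, not a documented Grooper behavior.

```python
import re

# Hypothetical child extractors, each capturing one date format.
CHILD_EXTRACTORS = {
    "slashed": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "iso": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "long": re.compile(
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December) \d{1,2}, \d{4}\b"
    ),
}


def collate_individual(text: str) -> list:
    """Return every child's matches as one flat list, ordered by position."""
    hits = []
    for pattern in CHILD_EXTRACTORS.values():
        hits.extend((m.start(), m.group()) for m in pattern.finditer(text))
    return [value for _, value in sorted(hits)]


text = "Signed March 14, 2024; received 03/18/2024; due 2024-04-13."
print(collate_individual(text))
# ['March 14, 2024', '03/18/2024', '2024-04-13']
```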
Field Class
Field Classes are trainable classifiers that distinguish between multiple instances of similar data within a document by understanding the context in which they occur. Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler "Extractor Objects" like Value Readers or Data Types.
Field Classes are most useful when attempting to find values within the flow of natural language. This method involves training with positive and negative examples to distinguish the right context. You'd opt for a Field Class when the value you're after is an entire clause within a contract, or a specific value defined within the flow of text.
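To make the idea of training on positive and negative context concrete, here is a deliberately simple toy in Python. It is not the algorithm a Field Class uses, which this article does not describe. The word-count scoring, the three-word context window, and all function names are assumptions chosen only to show how surrounding text can separate two similar-looking values.

```python
import re
from collections import Counter


def context_words(text: str, start: int, end: int, window: int = 3) -> list:
    """Return up to `window` lowercase words on each side of a candidate value."""
    before = re.findall(r"[a-z]+", text[:start].lower())[-window:]
    after = re.findall(r"[a-z]+", text[end:].lower())[:window]
    return before + after


def train(examples: list) -> Counter:
    """Build word weights: +1 per positive-context word, -1 per negative."""
    weights = Counter()
    for text, value, is_positive in examples:
        start = text.index(value)
        for word in context_words(text, start, start + len(value)):
            weights[word] += 1 if is_positive else -1
    return weights


def best_candidate(text: str, pattern: str, weights: Counter) -> str:
    """Score each regex candidate by its context and return the top one."""
    scored = []
    for m in re.finditer(pattern, text):
        score = sum(weights[w] for w in context_words(text, m.start(), m.end()))
        scored.append((score, m.group()))
    return max(scored)[1]


AMOUNT = r"\$\d[\d,]*"
training = [
    ("the tenant shall pay monthly rent of $1,200 to the landlord", "$1,200", True),
    ("a security deposit of $2,400 is due at signing", "$2,400", False),
]
weights = train(training)

lease = "A deposit of $3,000 is required. Tenant shall pay rent of $1,500 monthly."
print(best_candidate(lease, AMOUNT, weights))  # $1,500
```

Both dollar amounts match the same pattern. Only the surrounding words ("rent", "monthly" versus "deposit") tell them apart, which is the situation the paragraph above describes.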
Connection Objects
Related Objects
CMIS Connection
CMIS Repository
Data Connection
Profile Objects
Related Objects
IP Profile
IP Group
IP Step
OCR Profile
Scanner Profile
Separation Profile
Queue Objects
Related Objects
Processing Queue
Review Queue
Process Objects
In Grooper, Batch Process and Batch Process Step objects are closely related in managing and executing a sequence of steps designed to process a collection of documents known as a Batch.
A Batch Process consists of a series of Batch Process Steps meant to be executed in a particular sequence for a batch of documents. Before a Batch Process can be used in production, it must be "published", which creates a read-only copy in the "Processes" folder of the node tree, making it accessible for production purposes.
In essence, a Batch Process defines the overall workflow for processing documents, but it relies on Batch Process Steps to perform each action required during the process. Each Batch Process Step represents a discrete operation, or "activity", within the broader scope of the Batch Process. Batch Processes and Batch Process Steps work together to ensure that documents are handled in a consistent and controlled manner.
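The relationship between a process, its ordered steps, and a published read-only copy can be sketched as below. This is an analogy in Python rather than Grooper's implementation. The class names follow the article's terms, but the `publish` and `run` methods and the placeholder activities are hypothetical.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class BatchProcessStep:
    """Hypothetical step: a named activity applied to a batch."""
    name: str
    activity: Callable[[List[str]], List[str]]


@dataclass
class BatchProcess:
    """Hypothetical working process: an ordered, editable series of steps."""
    name: str
    steps: List[BatchProcessStep] = field(default_factory=list)

    def publish(self) -> "PublishedProcess":
        """Freeze a copy of the current steps for production use."""
        return PublishedProcess(self.name, tuple(copy.copy(s) for s in self.steps))


@dataclass(frozen=True)
class PublishedProcess:
    """Read-only copy: the step tuple cannot be reassigned or appended to."""
    name: str
    steps: Tuple[BatchProcessStep, ...]

    def run(self, batch: List[str]) -> List[str]:
        for step in self.steps:  # steps execute strictly in sequence
            batch = step.activity(batch)
        return batch


working = BatchProcess("Invoices", [
    BatchProcessStep("Recognize", lambda docs: [d.strip() for d in docs]),
    BatchProcessStep("Extract", lambda docs: [f"Extracted:{d}" for d in docs]),
])
published = working.publish()

# A later edit to the working copy does not change the published copy.
working.steps.append(BatchProcessStep("Review", lambda docs: docs))

print(published.run(["  page one "]))            # ['Extracted:page one']
print(len(published.steps), len(working.steps))  # 2 3
```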
Related Objects
Batch Process
A Batch Process is a crucial component in Grooper's architecture. A Batch Process orchestrates the document processing strategy and ensures each batch of documents is managed systematically and efficiently.
Batch Process Step
A Batch Process Step is a specific action within the sequence defined by a Batch Process. A Batch Process Step plays a critical role in automating and managing the flow of documents through the various stages of processing within Grooper.