Data Model (Node Type)

From Grooper Wiki
Revision as of 11:05, 26 August 2024 by Randallkinard

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

This article is a stub. It contains minimal information on the topic and should be expanded.

Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The configuration of the Data Model and its child Data Elements defines the data extraction logic and the settings for how data is reviewed in a Data Viewer.

About

The Data Model defines the data structure for a Content Type. A Data Model can exist at any level of a Content Type hierarchy, which allows child Content Types to inherit Data Elements from their parents. The structure can be a simple list of data fields or a complex hierarchy of sections, subsections, tables, and fields.

The Data Model is leveraged by Grooper to extract data from a Batch. All extraction logic (e.g., referencing a Data Extractor to fill a field, performing a database lookup, or generating a calculated field expression) is set on the Data Model or on the Data Elements that belong to it. The Data Model also tells the Data Review activity how fields should appear and behave (e.g., whether a field is required before batch validation can be completed).
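The idea of attaching extraction logic and review expectations to each field can be sketched in Python. This is only an illustrative model, not Grooper's API: the `DataField` class, the `extract_all` and `missing_required` helpers, and the use of a regular expression as a stand-in for a Data Extractor are all assumptions made for this example, as is the sample document text.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class DataField:
    """Hypothetical model of a Data Field (not a Grooper class)."""
    name: str
    pattern: str            # regex standing in for a referenced Data Extractor
    required: bool = False  # review expectation: must be filled before validation

    def extract(self, text: str) -> Optional[str]:
        """Return the first captured value, or None if nothing matches."""
        match = re.search(self.pattern, text)
        return match.group(1) if match else None


def extract_all(fields: List[DataField], text: str) -> Dict[str, Optional[str]]:
    """Run every field's extraction logic against the document text."""
    return {f.name: f.extract(text) for f in fields}


def missing_required(fields: List[DataField], values: Dict[str, Optional[str]]) -> List[str]:
    """List required fields that are still empty after extraction."""
    return [f.name for f in fields if f.required and not values.get(f.name)]


# A flat "data model": just a list of fields for this sketch.
data_model = [
    DataField("Last Name", r"Last Name:\s*(\w+)", required=True),
    DataField("Employment Status", r"Employment Status:\s*(\w+)", required=True),
    DataField("Status Date", r"Status Date:\s*(\d{2}/\d{2}/\d{4})", required=True),
]

document_text = "Last Name: Rivera\nEmployment Status: Active"
values = extract_all(data_model, document_text)
problems = missing_required(data_model, values)
print(values)
print(problems)
```

In this sketch, the sample text has no status date, so `problems` flags "Status Date" as a required field a reviewer would still need to fill in.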

One Data Model can be created for each Content Type:

  • Content Model
  • Content Category
  • Document Type

Data Models also inherit Data Elements from parent Content Types. For example, if a Content Model's Data Model has a child Data Field named "Date" and a Content Category's Data Model has a child Data Field named "Time", the Content Category's Data Model will effectively have both "Date" and "Time" as fields. It has its own child field "Time" and inherits the parent field "Date" as well. A typical hierarchy illustrating this looks like the following:

  • Content Model - HR
    • Data Model - HR
      • Data Fields: First Name, Middle Name, Last Name, Employment Status, Status Date
    • Content Category - Benefits
      • Data Model - Benefits (inherits all data from the Content Model's Data Model as well as extracting its own data, such as...)
        • Data Fields: Eligible Date
      • Document Type - Health Insurance
        • Data Model - Health Insurance (inherits all data from the Content Model and the parent Content Category as well as extracting its own data, such as...)
          • Data Fields: Enrolled Date, Covered Parties

So, a document classified as a "Health Insurance" Document Type would have eight total Data Fields: two from its own Data Model (Enrolled Date and Covered Parties), one from the Data Model of its parent Content Category, "Benefits" (Eligible Date), and five from the Content Model's Data Model (First Name, Middle Name, Last Name, Employment Status, Status Date).
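The inheritance rule in the HR example can be modeled with a short Python sketch. This is not how Grooper is implemented or configured; the `ContentType` class and its `effective_fields` method are hypothetical names used only to show how Data Fields accumulate from the Content Model down to a Document Type.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentType:
    """Hypothetical stand-in for a Content Model, Content Category, or Document Type."""
    name: str
    own_fields: List[str] = field(default_factory=list)  # fields on this node's own Data Model
    parent: Optional["ContentType"] = None

    def effective_fields(self) -> List[str]:
        """Return inherited fields (root first) followed by this node's own fields."""
        inherited = self.parent.effective_fields() if self.parent else []
        return inherited + self.own_fields


# The HR example: Content Model -> Content Category -> Document Type.
hr = ContentType(
    "HR",
    ["First Name", "Middle Name", "Last Name", "Employment Status", "Status Date"],
)
benefits = ContentType("Benefits", ["Eligible Date"], parent=hr)
health_insurance = ContentType(
    "Health Insurance", ["Enrolled Date", "Covered Parties"], parent=benefits
)

fields = health_insurance.effective_fields()
print(len(fields))  # 8
print(fields)
```

Walking up the `parent` chain mirrors the description above: five fields come from the Content Model, one from the Content Category, and two from the Document Type itself.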

Data context can be critical when building the Data Type and Field Class extractors that populate a Data Model. For more information on this topic, visit the Data Context article.

Glossary

Activity: Grooper Activities define specific document processing operations done to a Batch, Batch Folder, or Batch Page. In a Batch Process, each Batch Process Step executes a single Activity (determined by the step's "Activity" property).

  • Batch Process Steps are frequently referred to by the name of their configured Activity followed by the word "step". For example: "Classify step".

Batch Folder: The Batch Folder is an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or Batch Page nodes as children.

  • Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Content Category: A Content Category is a container for other Content Category or Document Type nodes in a Content Model. Content Categories are often used simply as organizational buckets for Content Models with large numbers of Document Types. However, Content Categories are also necessary to create branches in a Content Model's classification taxonomy, allowing for more complex Data Element inheritance and Behavior inheritance.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Content Type: Content Types are a class of node types used to classify Batch Folders. They represent categories of documents (Content Models and Content Categories) or distinct types of documents (Document Types). Content Types serve an important role in defining Data Elements and Behaviors that apply to a document.

Data Context: Data Context refers to contextual information used to extract data, such as a label that identifies the value you want to collect.

Data Element: Data Elements are a class of node types used to collect data from a document. These include: Data Models, Data Sections, Data Fields, Data Tables, and Data Columns.

Data Extractor: Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Data Field: Data Fields represent a single value targeted for data extraction on a document. Data Fields are created as child nodes of a Data Model and/or Data Sections.

  • Data Fields are frequently referred to simply as "fields".

Data Model: Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The configuration of the Data Model and its child Data Elements defines the data extraction logic and the settings for how data is reviewed in a Data Viewer.

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift, and manipulate results further.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Field Class: Field Classes are NLP (natural language processing) based extractor nodes. They find values based on natural language context near each value. Training the extractor associates values, positively or negatively, with nearby text-based "features". During extraction, the extractor collects values based on these training weightings.

  • Field Classes are most useful when attempting to find values within the flow of natural language.
  • Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler "extractor nodes" like Value Readers or Data Types.
  • Advances in large language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with far less setup.

Review: Review is an Activity that allows user-attended review of Grooper's results. This allows human operators to validate processed Batch Page and Batch Folder content using specialized user interfaces called "Viewers". Different kinds of Viewers assist users in reviewing Grooper's image processing, document classification, and data extraction, and in operating document scanners.