Data Instance: Difference between revisions

From Grooper Wiki

Latest revision as of 15:13, 5 February 2026

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

A Data Instance is a unit of data within a document. Data Instances form a hierarchy defined by the document’s Data Model, from the document level down to individual Data Fields. They store extracted, entered, or calculated values along with associated metadata such as location and confidence.

Data Instances are the foundational objects Grooper uses to represent, organize, and manage data extracted from documents. Each one is composed primarily of two things: (1) an extracted value and (2) a location, meaning the data's position (coordinates) on the document, including the page it appears on.
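As a rough illustration of the value-plus-location idea, the sketch below models an instance in Python. It is not Grooper's API (the type names listed under Object Model info are Grooper's own); the names `InstanceSketch` and `Location`, their fields, and the coordinate units are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    """Hypothetical location: page number plus a bounding box in page units."""
    page: int
    left: float
    top: float
    width: float
    height: float


@dataclass
class InstanceSketch:
    """Hypothetical stand-in for a Data Instance: a value and where it was found."""
    value: str
    location: Optional[Location] = None  # entered or calculated values may have none
    confidence: float = 1.0

    def describe(self) -> str:
        if self.location is None:
            return f"{self.value!r} (no location)"
        loc = self.location
        return f"{self.value!r} on page {loc.page} at ({loc.left}, {loc.top})"


# An extracted value carries a location; a user-entered value may not.
invoice_no = InstanceSketch(
    "INV-1001", Location(page=1, left=5.2, top=1.1, width=1.4, height=0.2), 0.97
)
typed_in = InstanceSketch("Net 30")
print(invoice_no.describe())
print(typed_in.describe())
```

Keeping the location optional reflects the point that entered or calculated values may not have a position on the page.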

Knowing the extracted data's value and location is critical for Grooper Review users to validate extraction results in the Data Viewer and for Grooper Design users to validate extractor design in Tester tabs.

Furthermore, the spatial relationship between Data Instances is often critical for how certain Grooper extraction operations function. For example, the Labeled Value extractor works by correlating the spatial relationship between a text label and a value. Two Data Instances work together, one for the label and one for the value, to produce the extractor's final output.
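The label-to-value correlation can be pictured with a small Python sketch. This is a simplified stand-in, not the Labeled Value extractor's actual logic: the `Box` type, the `value_right_of` helper, the "same line, to the right" rule, and the `max_gap` threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    """Hypothetical instance: text plus the top-left corner and size of its box."""
    text: str
    page: int
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width


def value_right_of(label: Box, candidates: List[Box], max_gap: float = 3.0) -> Optional[Box]:
    """Return the closest candidate to the right of the label on the same line."""
    def same_line(c: Box) -> bool:
        # Vertical centres within half the label height count as the same line.
        label_mid = label.top + label.height / 2
        return abs((c.top + c.height / 2) - label_mid) <= label.height / 2

    matches = [
        c for c in candidates
        if c.page == label.page and same_line(c) and 0 <= c.left - label.right <= max_gap
    ]
    return min(matches, key=lambda c: c.left - label.right, default=None)


label = Box("Invoice Date:", 1, 1.0, 2.0, 1.2, 0.2)
candidates = [
    Box("01/15/2025", 1, 2.4, 2.02, 0.9, 0.2),  # same line, just right of the label
    Box("02/14/2025", 1, 2.4, 2.6, 0.9, 0.2),   # a different line
    Box("03/01/2025", 2, 2.4, 2.0, 0.9, 0.2),   # a different page
]
match = value_right_of(label, candidates)
print(match.text if match else None)
```

The point of the sketch is only that both the label and the value need a known location before they can be paired.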

Data Instances also form a hierarchical tree that reflects the structure of a Data Model. This hierarchy mirrors the logical and physical organization of the document's content. Each Data Instance corresponds to a specific element in the Data Model, such as a Data Field, Data Section, or Data Table. Coupling Data Instances to these Data Elements ensures that:

  • Extracted data values are mapped to the correct Data Model schema.
  • Extracted data locations are visible to human reviewers in a Document Viewer.

What are Data Instances?

A Data Instance is an object that holds a piece of extracted data, along with its metadata, location, confidence, and relationships to other data. Data Instances are created automatically by Grooper’s extractors (Value Extractors and Extractor Nodes) as documents are processed. They are not typically created or configured directly by end users, but are visible and editable in the Data Viewer UI in Review.

  • Data Instances are also visible in "Tester" tabs when testing extractors and Data Elements. The Data Inspector UI allows users to inspect Data Instances and their metadata in Tester tabs.

Purpose and usage

Data Instances serve several key purposes in Grooper:

  • Representation of extracted data: Every value, field, table, or section extracted from a document is stored as a Data Instance.
  • Organization and hierarchy: Data Instances are organized in a tree-like structure, reflecting the logical layout of the document and its Data Model.
  • User interaction: In the Data Viewer UI in Review, users can view, edit, and confirm Data Instances, ensuring data accuracy before export or further processing.
  • Automation and export: Data Instances are used by Grooper’s automation and export features to deliver structured data to downstream systems.

How Grooper utilizes Data Instances

Grooper leverages Data Instances throughout its data processing lifecycle:

  • Extraction: Data Instances are created and populated during the Extract activity, using the logic defined in the Data Model.
  • Validation: Each Data Instance tracks its validation status and error messages, supporting automatic and manual validation workflows.
  • Review: In the Data Viewer UI, users interact with Data Instances to review, edit, and confirm extracted data.
  • Automation: Data Instances are used to automate a variety of document processing tasks in Grooper.
    • Effectively, any Activity that uses extractors utilizes Data Instances at least indirectly, because all extractors return Data Instances as their results.
    • Instances formed by a Data Model are also used by features such as Data Rules, Lookup Specifications, and in expression environments.
  • Export: Structured data is exported from Grooper by traversing the Data Instance hierarchy, ensuring that all data is mapped and formatted according to the Data Model.
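The export step's traversal of the hierarchy can be sketched as a depth-first walk. The example below is a minimal illustration in Python, not Grooper's export code: the `Node` type, the `flatten` helper, and the slash-separated path format are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical instance node: a name, an optional value, and child instances."""
    name: str
    value: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def flatten(node: Node, prefix: str = "") -> Iterator[Tuple[str, str]]:
    """Depth-first walk yielding (path, value) for every node that holds a value."""
    path = f"{prefix}/{node.name}" if prefix else node.name
    if node.value is not None:
        yield path, node.value
    for child in node.children:
        yield from flatten(child, path)


document = Node("Invoice", children=[
    Node("Invoice Number", "INV-1001"),
    Node("Vendor", children=[
        Node("Name", "Acme Supply"),
        Node("City", "Tulsa"),
    ]),
])

rows = list(flatten(document))
for path, value in rows:
    print(f"{path} = {value}")
```

Because each path is derived from the node's position in the tree, every exported value stays tied to the schema element it belongs to.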

Types of Data Instances

Grooper’s Data Instance hierarchy mirrors the structure of the Data Model and the document itself. The primary types of Data Instances include:

  • Document Instance: Represents the extracted data for an entire document. It is the root of the Data Instance hierarchy (just as a Data Model is the root of a Data Element schema).
  • Field Instance: Represents the value of a Data Field, including validation state, alternate candidates, annotations, and metadata.
    • Field Instances may be children of a Document Instance or a Section Instance (just as Data Fields may be children of a Data Model or Data Section).
  • Section-related Instances: There are two Data Instances associated with Data Section extraction.
    • Section Instance: Represents a logical grouping of related fields, tables, or nested sections within a document (for example, "Patient Information" or "Line Items").
      • Section Instances may be children of a Document Instance or another Section Instance (just as Data Sections may be children of a Data Model or Data Section).
    • Section Instance Collection: Represents a collection of repeating Section Instances.
      • Section Instance Collections are used for multi-instance Data Sections (for example, multiple claims or line items).
      • Each result produced by a Section Extract method generates a Section Instance within the collection.
      • Single-instance Data Sections do not have a parent Section Instance Collection.
      • Section Instance Collections may be children of a Document Instance, Section Instance, or another Section Instance Collection (just as multi-instance Data Sections may be nested in the Data Model).
  • Table-related Instances: There are four Data Instances associated with Data Table extraction.
    • Table Instance: Represents an extracted table, including its rows, columns, and headers. It serves as the container instance for all table-related Data Instances.
      • Table Instances may be children of a Document Instance or Section Instance (just as Data Tables may be children of a Data Model or single-instance Data Section).
    • Table Row Instance: Represents a single row within a Table Instance.
      • Each Table Row Instance contains one Table Cell Instance per column and is a child of the Table Instance.
    • Table Cell Instance: A specialized type of Field Instance used for table extraction, representing Data Column values in extracted rows.
      • Represents the value of a single cell corresponding to a Data Column within a Table Row Instance.
    • Table Header Instance: Represents the header row(s) or column(s) of a Table Instance.
      • Table Header Instances support table extraction methods and are used for validation in the Data Viewer and Tester tabs.
  • Specialized Instances: These Data Instances support specialized extraction scenarios.
    • Labeled Instance: Represents a value with an associated label, commonly used when fields are explicitly labeled in the document.
      • Label Instance locations are outlined in blue in the Document Viewer.
      • Example: The Labeled Value extractor produces two instances—one for the Label Extractor and one for the Value Extractor.
    • Checkbox Instance: Represents a checkbox or similar binary field.
      • Checkbox Instances are detected using the Box Detection and Box Removal IP Commands.
      • Checked boxes are highlighted in green; unchecked boxes are highlighted in red.
  • Data Instance: The base class from which all other Data Instances inherit.

These Data Instance types and their hierarchical relationships allow Grooper to model documents of arbitrary complexity, including nested sections, repeating groups, and multi-level tables.

How Data Instances represent a document

When Grooper processes a document using the Extract activity, it builds a tree of Data Instances that reflects both the document’s content and the Data Model schema.

  • The root Document Instance contains one or more Field Instances representing each Data Field extracted from the document. These Field Instances may be direct children or further descended down the instance tree, depending on how the Data Model is architected.
  • When Data Sections are present in a Data Model, the root Document Instance contains one or more Section Instances representing logical areas of the document.
  • Section Instances may contain Field Instances, Table Instances, and nested Section Instances.
  • When Data Tables are present in a Data Model, Table Instances contain Table Row Instances, which in turn contain Table Cell Instances for each Data Column.
  • Section Instance Collections group multiple Section Instances when a Data Section is configured for repeating records (multi-instance mode).

This structure ensures that extracted data is precisely located, validated, and mapped to its intended schema element.
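To make the shape of such a tree concrete, the following Python sketch builds one for a form with a repeating "Claim" section. It is an illustration under assumptions, not Grooper's implementation: the `Inst` type and the `build_collection` and `outline` helpers are hypothetical, and the `kind` strings simply borrow the instance type names described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Inst:
    """Hypothetical instance node; `kind` names the instance type it stands for."""
    kind: str
    name: str
    value: str = ""
    children: List["Inst"] = field(default_factory=list)


def build_collection(section_name: str, results: List[Dict[str, str]]) -> Inst:
    """One Section Instance per extraction result, grouped under a collection."""
    collection = Inst("SectionInstanceCollection", section_name)
    for i, record in enumerate(results, start=1):
        section = Inst("SectionInstance", f"{section_name} {i}")
        section.children = [Inst("FieldInstance", k, v) for k, v in record.items()]
        collection.children.append(section)
    return collection


def outline(node: Inst, depth: int = 0) -> List[str]:
    """Render the tree as indented lines, one per instance."""
    label = f"{node.kind}: {node.name}" + (f" = {node.value}" if node.value else "")
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


claims = [{"Claim ID": "C-1", "Amount": "120.00"}, {"Claim ID": "C-2", "Amount": "75.50"}]
root = Inst("DocumentInstance", "Claims Form", children=[
    Inst("FieldInstance", "Patient Name", "J. Doe"),
    build_collection("Claim", claims),
])
print("\n".join(outline(root)))
```

The two claim records become two Section Instances under a single collection, while the single-instance "Patient Name" field hangs directly off the root.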

Key points

  • Data Instances are foundational for Grooper's extraction, validation and data modeling system.
  • Data Instances are created by the extraction process.
  • Data Instances also store user-entered and calculated values in a Data Model.
  • Each Data Instance stores its extracted (or entered or calculated) value, its location on the document (if available), its confidence, and other metadata key to data collection, validation, and user review.

Object Model info

Grooper Type Name

Grooper.Core.DataInstance

Inheritance

Grooper Object (Grooper.GrooperObject)
  Connected Object (Grooper.ConnectedObject)
    Embedded Object (Grooper.EmbeddedObject)
      Data Instance (Grooper.Core.DataInstance)

Derived Types

Data Instance (Grooper.Core.DataElement)
  Checkbox Instance (Grooper.Core.CheckBoxInstance)
  Data Element Instance (Grooper.Core.DataElementInstance)
    Element Container Instance (Grooper.Core.ElementContainerInstance)
      Document Instance (Grooper.Core.DocumentInstance)
      Section Instance (Grooper.Core.SectionInstance)
      Section Instance Collection (Grooper.Core.SectionInstanceCollection)
    Table Instance (Grooper.Core.TableInstance)
    Table Row Instance (Grooper.Core.TableRowInstance)
  Field Instance (Grooper.Core.FieldInstance)
    Table Cell Instance (Grooper.Core.TableCellInstance)
  Labeled Instance (Grooper.Core.FieldClassInstance)
  Table Header Instance (Grooper.Core.TableHeaderInstance)
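
The Derived Types listing can be read as a class tree. The Python sketch below mirrors it with empty classes so the parent/child relationships can be checked programmatically. These are not Grooper's types: the class names are borrowed from the listing, the nesting follows the indentation of the wiki's Derived Types listing (an assumption wherever that indentation is not visible), and the `ancestry` helper is hypothetical.

```python
from typing import List


class DataInstance: ...
class CheckBoxInstance(DataInstance): ...
class DataElementInstance(DataInstance): ...
class ElementContainerInstance(DataElementInstance): ...
class DocumentInstance(ElementContainerInstance): ...
class SectionInstance(ElementContainerInstance): ...
class SectionInstanceCollection(ElementContainerInstance): ...
class TableInstance(DataElementInstance): ...
class TableRowInstance(DataElementInstance): ...
class FieldInstance(DataInstance): ...
class TableCellInstance(FieldInstance): ...
class FieldClassInstance(DataInstance): ...  # listed above as "Labeled Instance"
class TableHeaderInstance(DataInstance): ...


def ancestry(cls: type) -> List[str]:
    """Class names from `cls` up to the base class, excluding Python's `object`."""
    return [c.__name__ for c in cls.__mro__ if c is not object]


print(" -> ".join(ancestry(TableCellInstance)))
print(" -> ".join(ancestry(DocumentInstance)))
# A Table Cell Instance is a specialized Field Instance.
print(issubclass(TableCellInstance, FieldInstance))
```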