Data Instance

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

A Data Instance is an encapsulation of text data within a document, returned by one of Grooper's extractors. Together, the Data Instances that extractors create form a hierarchy of the document's text data.

Data Instances are the foundational objects Grooper uses to represent, organize, and manage extracted data from documents. They are composed primarily of two things: (1) an extracted value and (2) a location, meaning the data's position (coordinates) on the document, including the page number it is on.
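
As a rough illustration of the value-plus-location idea, the following Python sketch models a data instance as a small record. The class name `SketchInstance` and all of its fields are hypothetical, chosen only for this example. They are not Grooper's actual object model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchInstance:
    """Hypothetical stand-in for a Data Instance: a value plus where it was found."""
    value: str     # the extracted text
    page: int      # page the text was found on (1-based in this sketch)
    left: float    # bounding box, in arbitrary page units (an assumption)
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        # Right edge of the bounding box.
        return self.left + self.width

    @property
    def bottom(self) -> float:
        # Bottom edge of the bounding box.
        return self.top + self.height


invoice_no = SketchInstance(value="INV-1001", page=1, left=4.0, top=1.0, width=1.5, height=0.25)
print(invoice_no.value, "on page", invoice_no.page, "at", (invoice_no.left, invoice_no.top))
```

Keeping the location next to the value is what lets a viewer highlight the result on the page image.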

Knowing the extracted data's value and location is critical for Grooper Review users to validate extraction results in the Data Viewer and for Grooper Design users to validate extractor design in Tester tabs.

Furthermore, the spatial relationship between Data Instances is often critical for how certain Grooper extraction operations function. For example, the Labeled Value extractor works by correlating the spatial relationship between a text label and a value. Two Data Instances work together, one for the label and one for the value, to produce the extractor's final output.
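
The label-to-value correlation described above can be sketched in Python as follows. This is only a simplified illustration that assumes the value sits on the same line, to the right of the label. The `Box` class, the `value_right_of` function, and the `max_gap` parameter are hypothetical and do not reflect how the Labeled Value extractor is actually implemented or configured.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    """Hypothetical instance: text plus a bounding box in arbitrary page units."""
    text: str
    left: float
    top: float
    width: float
    height: float

    @property
    def right(self) -> float:
        return self.left + self.width

    @property
    def center_y(self) -> float:
        return self.top + self.height / 2


def value_right_of(label: Box, candidates: List[Box], max_gap: float = 3.0) -> Optional[Box]:
    """Return the nearest candidate on the label's line and to its right, or None."""
    same_line = [
        c for c in candidates
        if c.left >= label.right                                  # starts after the label ends
        and c.left - label.right <= max_gap                       # not too far away
        and abs(c.center_y - label.center_y) <= label.height / 2  # roughly the same line
    ]
    return min(same_line, key=lambda c: c.left - label.right, default=None)


label = Box("Invoice Date:", left=1.0, top=2.0, width=1.2, height=0.2)
dates = [
    Box("01/15/2025", left=2.4, top=2.0, width=1.0, height=0.2),  # same line, right of the label
    Box("02/01/2025", left=2.4, top=5.0, width=1.0, height=0.2),  # much lower on the page
]
match = value_right_of(label, dates)
print(match.text if match else "no match")
```

Both dates would match a date pattern on their own. Only the location data makes it possible to pick the one that belongs to the label.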

Data Instances also form a hierarchical tree that reflects the structure of a Data Model. This hierarchical structure mirrors the logical and physical organization of document content. Each Data Instance corresponds to a specific element in the Data Model, such as a Data Field, Data Section, or Data Table. Coupling Data Instances to these Data Elements ensures:

  • Extracted data values are mapped to the correct Data Model schema.
  • Extracted data locations are visible to human reviewers in a Document Viewer.

What are Data Instances?

A Data Instance is an object that holds a piece of extracted data, along with its metadata, location, confidence, and relationships to other data. Data Instances are created automatically by Grooper’s extractors (Value Extractors and Extractor Nodes) as documents are processed. They are not typically created or configured directly by end users, but are visible and editable in the Data Viewer UI in Review.

  • Data Instances are also visible in "Tester" tabs when testing extractors and Data Elements. The Data Inspector UI allows users to inspect Data Instances and their metadata in Tester tabs.

Purpose and usage

Data Instances serve several key purposes in Grooper:

  • Representation of extracted data: Every value, field, table, or section extracted from a document is stored as a Data Instance.
  • Organization and hierarchy: Data Instances are organized in a tree-like structure, reflecting the logical layout of the document and its Data Model.
  • User interaction: In the Data Viewer UI, users can view, edit, and confirm Data Instances, ensuring data accuracy before export or further processing.
  • Automation and export: Data Instances are used by Grooper’s automation and export features to deliver structured data to downstream systems.

How Grooper utilizes Data Instances

Grooper leverages Data Instances throughout its data processing lifecycle:

  • Extraction: Data Instances are created and populated during the Extract activity, using the logic defined in the Data Model.
  • Validation: Each Data Instance tracks its validation status and error messages, supporting automatic and manual validation workflows.
  • Review: In the Data Viewer UI, users interact with Data Instances to review, edit, and confirm extracted data.
  • Automation: Data Instances are used to automate a variety of document processing tasks in Grooper.
    • Effectively, any Activity that uses extractors utilizes Data Instances at least indirectly, because all extractors return Data Instances as their result.
    • Instances formed by a Data Model are also used by features such as Data Rules and Lookup Specifications, and in expression environments.
  • Export: Structured data is exported from Grooper by traversing the Data Instance hierarchy, ensuring that all data is mapped and formatted according to the Data Model.
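
To make the validation and export steps above more concrete, here is a minimal Python sketch of a depth-first walk over an instance tree. The `Node` class, its `is_valid` flag, and the slash-separated path format are assumptions made for this example. Grooper's real export and validation features are configured in the product and are not shown here.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Node:
    """Hypothetical instance node: a name, an optional value, and child nodes."""
    name: str
    value: str = ""
    is_valid: bool = True
    children: List["Node"] = field(default_factory=list)


def walk(node: Node, path: str = "") -> Iterator[Tuple[str, Node]]:
    """Depth-first traversal yielding (path, node) pairs, parents before children."""
    here = f"{path}/{node.name}" if path else node.name
    yield here, node
    for child in node.children:
        yield from walk(child, here)


doc = Node("Invoice", children=[
    Node("Invoice Number", value="INV-1001"),
    Node("Vendor", children=[
        Node("Name", value="Acme Corp"),
        Node("Tax ID", value="", is_valid=False),  # flagged for review in this sketch
    ]),
])

# "Export": collect leaf values keyed by their position in the hierarchy.
export: Dict[str, str] = {p: n.value for p, n in walk(doc) if not n.children}
# "Validation": list every node that is currently flagged as invalid.
needs_review = [p for p, n in walk(doc) if not n.is_valid]

print(export)
print(needs_review)
```

Because each value is reached through its parents, the exported keys carry the same structure as the schema that produced them.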

Types of Data Instances

Grooper’s Data Instance hierarchy mirrors the structure of the Data Model and the document itself. The main types of Data Instances include:

  • Document Instance: Represents the entire document’s extracted data. It is the root of the Data Instance hierarchy for a document (just as a Data Model is the root of a Data Element schema).
  • Field Instance: Represents the value of a Data Field, including its validation status, alternate candidates, and annotations.
    • Field Instances may be children of a Document Instance or Section Instance (just as Data Fields may be children of a Data Model or Data Section).
  • Section-related Instances: There are two Data Instances related to Data Section extraction.
    • Section Instance: Represents a logical grouping of related fields, tables, or nested sections within a document, such as "Patient Information" or "Line Items".
      • Section Instances may be children of a Document Instance or Section Instance (just as Data Sections may be children of a Data Model or Data Section).
    • Section Instance Collection: Represents a collection of repeating Section Instances.
      • The Section Instance Collection is used for multi-instance Data Sections (e.g., multiple claims or line items). A multi-instance Data Section extracts one Section Instance per result produced by its Section Extract method. The Section Instance Collection serves as the parent container for these Section Instances.
      • Single-instance Data Sections will not have a parent Section Instance Collection in their Data Instance hierarchy.
      • Section Instance Collections may be children of a Document Instance, Section Instance Collection, or Section Instance (just as multi-instance Data Sections may be children of a Data Model, multi-instance Data Section or single-instance Data Section).
  • Table-related Instances: There are four Data Instances related to Data Table extraction.
    • Table Instance: Represents an extracted table, including its rows, columns, and headers. This is the parent instance or "container instance" for all Data Instances related to extracting a Data Table. Table Instances may be children of a Document Instance or Section Instance (just as Data Tables may be children of a Data Model or single-instance Data Section).
    • Table Row Instance: Represents a single row within a Table Instance, containing one Table Cell Instance for each column. Each Table Row Instance is a child of its Table Instance.
    • Table Cell Instance: Represents the value of a single cell in a Table Row Instance. Each Table Cell Instance corresponds to a Data Column in the Data Table. Each Table Cell Instance is a child of exactly one Table Row Instance.
    • Table Header Instance: Represents the header row(s) or column(s) of a Table Instance. The Table Header Instance aids in data extraction for many Table Extract Methods and is used for validation in the Data Viewer and Tester tabs.
  • Specialized Instances: The following Data Instances have specialized uses for various Value Extractors.
    • Labeled Instance: Represents a value with an associated label, often used for fields with explicit labels in the document.
      • Label Instance locations are outlined in blue on a Document Viewer.
      • Example: The Labeled Value extractor produces two instances when it returns data: one for its Label Extractor and one for its Value Extractor. It utilizes Label Instances for its Label Extractor component.
    • Checkbox Instance: Represents a checkbox or similar binary field extracted from the document. Checkbox Instances are determined by the checkbox detection found in the Box Detection and Box Removal IP Commands. Checkbox Instances are highlighted green when checked and red when not checked.
  • Data Instance: This is the base class from which all other Data Instance types inherit. Extractors (Value Extractors and Extractor Nodes) return a list of Data Instances as their results.
    • These Data Instance results are utilized to form a Data Model's Data Instance hierarchy.
    • These Data Instance results are also utilized by Grooper Activities, Classify Methods, Collation Providers, and other configurations that require an extractor to function.

These instances and their hierarchical relationships allow Grooper to model documents of arbitrary complexity, including nested sections, repeating groups, and multi-level tables.
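
The relationship between the base type and the more specific types, and the difference between single-instance and multi-instance sections, can be sketched in Python like this. The class names below only borrow the type names from the list above for readability. The constructors, the `add` helper, and the sample data are hypothetical and say nothing about Grooper's real classes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataInstance:
    """Base type in this sketch; every other type below derives from it."""
    name: str
    children: List["DataInstance"] = field(default_factory=list)

    def add(self, child: "DataInstance") -> "DataInstance":
        # Attach a child and return it so callers can keep building beneath it.
        self.children.append(child)
        return child


@dataclass
class FieldInstance(DataInstance):
    value: str = ""


class DocumentInstance(DataInstance):
    """Root of the tree for one document."""


class SectionInstance(DataInstance):
    """One logical grouping of fields, tables, or nested sections."""


class SectionInstanceCollection(DataInstance):
    """Parent container for the repeated instances of a multi-instance section."""


doc = DocumentInstance("Claim Form")
doc.add(FieldInstance("Form Date", value="2025-01-15"))

# Single-instance section: attached directly, with no collection in between.
patient = doc.add(SectionInstance("Patient Information"))
patient.add(FieldInstance("Patient Name", value="Jane Doe"))

# Multi-instance section: the collection holds one SectionInstance per result.
claims = doc.add(SectionInstanceCollection("Claims"))
for amount in ["120.00", "85.50"]:
    claim = claims.add(SectionInstance("Claim"))
    claim.add(FieldInstance("Amount", value=amount))

print([type(child).__name__ for child in doc.children])
print(len(claims.children), "claim sections")
```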

How Data Instances represent a document

When Grooper processes a document using the Extract activity, it builds a tree of Data Instances that reflects both the document’s content and the Data Model schema. For example:

  • The root Document Instance contains one or more Section Instances, each representing a logical part of the document.
  • Section Instances may contain Field Instances, Table Instances, or even nested Section Instances.
  • Table Instances contain Table Row Instances, which in turn contain Table Cell Instances for each Data Column.
  • Section Instance Collections group multiple Section Instances when a Data Section is configured for repeating records.

This structure ensures that every piece of extracted data is precisely located, validated, and mapped to its intended schema element.
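
The table portion of such a tree (table, then rows, then one cell per column) can be illustrated with a short Python sketch. The `Table`, `Row`, and `Cell` classes and the `render` helper are hypothetical. The choice to create an empty cell when a column has no extracted value is an assumption of this example, not a statement about Grooper's behavior.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cell:
    column: str  # name of the column this cell belongs to
    value: str


@dataclass
class Row:
    cells: List[Cell] = field(default_factory=list)


@dataclass
class Table:
    name: str
    columns: List[str]
    rows: List[Row] = field(default_factory=list)

    def add_row(self, values: Dict[str, str]) -> Row:
        # Build one cell per column; missing values become empty strings (assumption).
        row = Row([Cell(col, values.get(col, "")) for col in self.columns])
        self.rows.append(row)
        return row


def render(table: Table) -> List[str]:
    """Return an indented, line-by-line view of the table's hierarchy."""
    lines = [f"Table: {table.name}"]
    for number, row in enumerate(table.rows, start=1):
        lines.append(f"  Row {number}")
        for cell in row.cells:
            lines.append(f"    {cell.column} = {cell.value!r}")
    return lines


line_items = Table("Line Items", columns=["Description", "Qty", "Price"])
line_items.add_row({"Description": "Widget", "Qty": "2", "Price": "9.99"})
line_items.add_row({"Description": "Gadget", "Qty": "1"})  # no Price extracted

print("\n".join(render(line_items)))
```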

Summary

Data Instances are the backbone of Grooper’s data extraction and management capabilities. By organizing extracted data into a hierarchical structure that mirrors the Data Model and document layout, Grooper enables powerful validation, review, automation, and export workflows. Understanding Data Instances is essential for designing effective Data Models and achieving high-quality data extraction results.