Collecting Labels (Functionality)

From Grooper Wiki

This article is about the current version of Grooper (2025). Note that some content may still need to be updated.

WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Introduction

In Grooper, Label Sets are collections of text labels that map the words and phrases found on a document to specific Data Fields, Data Sections, and Data Tables in your Data Model. Label Sets are essential for extracting and classifying data from semi-structured documents, where the same information may appear under different headings or in different layouts. For a full overview, see the Label Sets article.

Before you can use Label Sets, you must first enable label-driven extraction by applying a Labeling Behavior to your Content Type. The Labeling Behavior activates label-based extractors and adds a "Labels" tab to the Design Page, where Label Sets are created and managed. For more information, see Labeling Behavior.

Once a Labeling Behavior has been applied to the Content Type, the next step is to collect labels. Collecting labels means identifying the text labels that actually appear on your sample documents and mapping each one to the appropriate Data Element in your Data Model. This article explains how to collect labels and why doing so is a critical step in Grooper’s extraction process.

What is a Label?

A Label in Grooper is a specific word or phrase found on a document that identifies a piece of data, such as "Invoice Number", "Date of Service", or "Total Due". Labels are used to locate and extract data by associating them with Data Fields, Data Sections, or Data Tables. Labels can vary between document types (e.g., "Inv #", "Bill No.", or "Invoice Number" may all refer to the same field).
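The idea that several label variants can refer to the same field can be pictured as a simple lookup. The following is a minimal conceptual sketch in Python, not Grooper's data model or API: the `LABEL_SET` dictionary, the `normalize` helper, and the `element_for` function are hypothetical names introduced only for illustration.

```python
# Conceptual sketch only. These names are hypothetical and are not part of Grooper.
# Each Data Element name maps to the label variants seen on sample documents.
LABEL_SET = {
    "Invoice Number": ["Invoice Number", "Inv #", "Bill No."],
    "Date of Service": ["Date of Service"],
    "Total Due": ["Total Due"],
}


def normalize(text):
    """Lowercase and collapse whitespace so minor formatting differences are ignored."""
    return " ".join(text.lower().split())


def build_lookup(label_set):
    """Invert the label set: normalized label text -> Data Element name."""
    lookup = {}
    for element, variants in label_set.items():
        for variant in variants:
            lookup[normalize(variant)] = element
    return lookup


LOOKUP = build_lookup(LABEL_SET)


def element_for(label_text):
    """Return the Data Element a label refers to, or None if the label is unknown."""
    return LOOKUP.get(normalize(label_text))
```

Under these assumptions, `element_for("Inv #")` and `element_for("Bill No.")` both resolve to the same element name, while a label that was never collected returns `None`.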

Labels are context-aware: their meaning is determined by their position and relationship to other elements on the document, a concept known as Data Context. By collecting labels, you help Grooper understand how to find and extract the right data, even when documents use different terminology.
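To illustrate why context matters, consider a document where the same label text ("Date") appears under two different section headings. The sketch below is a simplified assumption of how position relative to a section label could disambiguate the two. It is not how Grooper implements Data Context, and the `CONTEXT_LABELS` mapping, the element names, and the `resolve_labels` function are hypothetical.

```python
# Conceptual sketch only. Not Grooper's implementation of Data Context.
# Keys pair a section label with a field label; values are hypothetical element names.
CONTEXT_LABELS = {
    ("Billing", "Date"): "Billing Date",
    ("Shipping", "Date"): "Ship Date",
}


def resolve_labels(lines):
    """Walk the lines top to bottom, tracking the most recent section label seen."""
    sections = {section for section, _ in CONTEXT_LABELS}
    current_section = None
    found = []
    for line in lines:
        text = line.strip()
        if text in sections:
            # A section label changes the context for the labels that follow it.
            current_section = text
            continue
        label, separator, value = text.partition(":")
        element = CONTEXT_LABELS.get((current_section, label.strip()))
        if separator and element:
            found.append((element, value.strip()))
    return found
```

With the input lines `"Billing"`, `"Date: 01/02/2025"`, `"Shipping"`, `"Date: 01/05/2025"`, the two "Date" labels resolve to different elements because each is read in the context of the section label above it.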

How collecting labels works

Follow these steps to collect labels for your Label Sets:

  1. Navigate to your Content Model in the Grooper Design Studio.
  2. Select the desired Content Type (the document type you want to configure).
  3. Ensure that Labeling Behavior is enabled for this Content Type. If not, add it from the Behaviors section.
  4. Click on the Labels tab on the Design Page. This tab appears when Labeling Behavior is active.
  5. Open a sample document of the selected Content Type.
  6. For each Data Field, Data Section, or Data Table in your Data Model:
    1. Highlight or select the text label as it appears on the document (e.g., "Invoice Number").
    2. Assign the selected label to the corresponding Data Element.
    3. Repeat for all variations of the label that may appear on different documents (e.g., "Inv #", "Bill No.").
  7. Save your changes to update the Label Set for this Content Type.
  8. Repeat the process for each Content Type you wish to configure.
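Label collection itself is done in the Grooper user interface, but the bookkeeping behind the steps above can be sketched as a small data structure: one set of labels per Content Type, with each Data Element accumulating the variants collected for it. The `LabelCollector` class and its methods are hypothetical and exist only to illustrate the idea. They are not a Grooper API.

```python
from collections import defaultdict


class LabelCollector:
    """Conceptual sketch only. A hypothetical stand-in for collecting labels per Content Type."""

    def __init__(self):
        # content type -> data element -> label variants, in the order they were collected
        self._sets = defaultdict(lambda: defaultdict(list))

    def collect(self, content_type, element, label):
        """Record a label for a Data Element, skipping exact duplicates."""
        variants = self._sets[content_type][element]
        if label not in variants:
            variants.append(label)

    def label_set(self, content_type):
        """Return a plain-dict copy of the labels collected for one Content Type."""
        return {
            element: list(variants)
            for element, variants in self._sets[content_type].items()
        }
```

For example, collecting "Invoice Number" and "Inv #" for the same element of an assumed "Invoice" Content Type yields one entry with two variants, and collecting "Inv #" a second time changes nothing.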

Once labels are collected, Grooper’s label-aware extractors (such as Labeled Value, Labeled OMR, and Tabular Layout) can use them to accurately extract data, even from documents with varying layouts and terminology.
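As a rough illustration of what label-driven extraction means, the sketch below looks for each collected label in plain text and returns whatever follows it on the same line. This is a deliberately simplified assumption written with Python's standard `re` module. It does not reproduce the behavior of Labeled Value, Labeled OMR, or Tabular Layout, and `extract_labeled_values` is a hypothetical function name.

```python
import re


def extract_labeled_values(text, label_set):
    """Conceptual sketch only: return {element: value} for labels found in the text.

    The value is assumed to sit to the right of the label on the same line,
    optionally after a colon.
    """
    results = {}
    for element, variants in label_set.items():
        # Try longer variants first so a full label is preferred over a shorter one.
        for variant in sorted(variants, key=len, reverse=True):
            pattern = re.escape(variant) + r"[ \t]*:?[ \t]*(\S[^\n]*)"
            match = re.search(pattern, text, flags=re.IGNORECASE)
            if match:
                results[element] = match.group(1).strip()
                break
    return results


sample_text = "ACME Corp\nInv #: 10045\nTotal Due   $1,250.00\n"
sample_labels = {
    "Invoice Number": ["Invoice Number", "Inv #", "Bill No."],
    "Total Due": ["Total Due"],
}
extracted = extract_labeled_values(sample_text, sample_labels)
```

Here the sample document uses "Inv #" rather than "Invoice Number", yet the value is still found because that variant was collected for the same element.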

Related articles

For more information on the concepts and tools involved in collecting labels, see:

  • Data Context – How context determines the meaning of labels and data extraction.
  • Label Sets – Overview of Label Sets and their role in Grooper.
  • Labeling Behavior – How to enable and configure label-driven extraction.
  • Labeled Value – Extractor for reading labeled field values.
  • Labeled OMR – Extractor for labeled checkboxes and groups.
  • Tabular Layout – Extractor for tables using header, column, and footer labels.

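Collecting labels is a foundational step in Grooper’s semi-structured data extraction process. Once the labels for a document type have been collected, label-aware extractors can locate its data even when layouts and terminology vary, which makes it faster to bring new document types into a Content Model.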