Labelset-Based (Classification Method)

From Grooper Wiki
Revision as of 10:59, 30 January 2026 by Rpatton (talk | contribs) (// via Wikitext Extension for VSCode)

WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.


This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

"Labelset-Based" is a Classify Method that leverages the labels defined via a Labeling Behavior to classify Batch Folders.

Introduction

Labelset-Based Classification is a Classification method that uses the configured Label Sets for each Document Type to determine the best match. It analyzes a document's text for the presence of known labels (Data Fields, field headers, table column names, section titles, etc.) and scores each candidate type based on label coverage and quality. The highest-scoring Document Type is selected.

Use Labelset-Based Classification when your documents are semi-structured: they consistently use similar text labels to identify data, but layouts may vary by vendor, version, or template.

Compared to training-based or rules-only methods, Labelset-Based Classification focuses on label presence rather than learned features, making it fast to onboard new Document Types by creating or updating Label Sets.

How it works (brief):

  1. For each candidate Document Type in the selected Content Model scope, the method loads its Label Set.
  2. Optionally, it prescans the document text for label words, then executes the label readers to find matches.
  3. It computes a score based on the number of matched labels and their quality, applies classification rules, and selects the best-scoring Document Type.
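
The steps above can be illustrated with a minimal Python sketch. It is not Grooper's implementation: the `LabelSet` class, the `coverage_score` and `classify` functions, and the sample labels are all hypothetical, and the score here is plain label coverage (label quality and classification rules are left out).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class LabelSet:
    """Hypothetical stand-in for one Document Type's Label Set."""
    doc_type: str
    labels: List[str]


def coverage_score(label_set: LabelSet, text: str) -> float:
    """Fraction of the Label Set's labels found in the text (case-insensitive)."""
    if not label_set.labels:
        return 0.0
    lowered = text.lower()
    hits = sum(1 for label in label_set.labels if label.lower() in lowered)
    return hits / len(label_set.labels)


def classify(label_sets: List[LabelSet], text: str) -> Tuple[Optional[str], float]:
    """Return the best-scoring type and its score, or (None, 0.0) if nothing matches."""
    best_type, best_score = None, 0.0
    for label_set in label_sets:
        score = coverage_score(label_set, text)
        if score > best_score:
            best_type, best_score = label_set.doc_type, score
    return best_type, best_score


# Illustrative data only: two candidate types that share the "Total" label.
invoice = LabelSet("Invoice", ["Invoice Number", "Invoice Date", "Total"])
policy = LabelSet("Policy", ["Policy ID", "Insured Name", "Total"])
page_text = "INVOICE NUMBER: 1001\nInvoice Date: 2025-01-30\nTotal: 42.00"

result = classify([invoice, policy], page_text)
```

In this sketch the invoice text matches all three "Invoice" labels but only one of the three "Policy" labels, so "Invoice" is selected.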

When to use

Ideal use cases

  • Semi-structured sets where consistent labels (e.g., “Invoice Number”, “Policy ID”, “Patient Name”) appear even if formatting differs.
  • Solutions needing rapid onboarding of new templates by authoring Label Sets rather than retraining models.
  • Mixed layouts where label presence and quality reliably indicate Document Type.

Real-world example

  • Accounts payable: supplier invoices vary widely in layout, but common labels (such as “Invoice Number”, “Invoice Date”, “Total”, “Line Items”) are present. Labelset-Based Classification correctly identifies invoice types using those labels without extensive training data.

Prerequisites

  • A Content Model with one or more child Document Types in scope.
  • A Labeling Behavior configured on the parent Content Model or Content Category so Label Sets are available.
  • Each Document Type should have a defined Label Set with labels for its expected fields, sections, and tables configured.
  • Recognized text on the documents must be available for classification.

How to configure Labelset-Based Classification

Setting up Labelset-Based Classification involves collecting labels using a Labeling Behavior. Grooper will then use those collected labels to determine which Document Type to assign to which document.

How Labelset-Based Classification works

Step-by-step

  1. Navigate to the Content Model in your Project.
  2. In the Content Model properties, set "Classification Method" to Labelset-Based.
  3. Apply a Labeling Behavior on the Content Model or Content Category to enable Label Sets.
  4. Click over to the Labels tab.
  5. Define one or more labels for each key Data Field, Data Section, and/or Data Table.
    • For labels to be collected, each Data Field, Data Section, or Data Table must have a Labelset-aware Value Extractor set on it.
  6. Test your configuration:
    • Use a Classify Batch Process step to test and evaluate results.
    • Review candidates and final assignment in the Classification Tester.
    • If needed, refine Label Sets (add alternate label versions, clarify headers/footers, mark volatile labels) and retest.


Tips

  • Start with essential labels (headers, table columns) that reliably identify the type.
  • Include alternate label versions for common variations (e.g., “Invoice #”, “Inv No.”).
  • Leverage first-page labels when documents are highly identifiable on page 1.
  • Use "Page Scope - Classification" to speed up classification on large documents.
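
The tip about alternate label versions can be pictured as follows. This is a hedged sketch, not Grooper's matching logic: the `label_found` function and the sample alternates are assumptions made for illustration, and matching is a simple case-insensitive substring test.

```python
def label_found(alternates, text):
    """True if any alternate version of a label appears in the text."""
    lowered = text.lower()
    return any(alt.lower() in lowered for alt in alternates)


# One logical label with several alternate versions (illustrative values).
invoice_number = ["Invoice Number", "Invoice #", "Inv No."]

found_variant = label_found(invoice_number, "INV NO. 5521")
found_none = label_found(invoice_number, "Purchase Order 12")
```

The label counts as present when any one of its versions is found, so a supplier that prints "Inv No." is treated the same as one that prints "Invoice Number".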

Encountering False Positives

What happens if documents of two different Document Types contain exactly the same labels? From the labels alone, Grooper cannot determine which Document Type to assign, so we must give it additional information to tell the documents apart.

The demo below shows a couple of situations that can occur should two different documents contain the same labels:


The labels that have been set in our example are being used for extraction purposes. However, not all labels collected on a document have to be used for extraction. To aid in classification, we can add Custom Labels that are not tied to extraction, but will be considered when running classification.

To add a custom Label:

  1. Click over to the "Labels" tab.
  2. Select a document classified with the desired Document Type.
  3. Click inside the text box for whichever Data Element you want to be the parent of your custom label. Usually this will be the Data Model.
  4. At the top of the Labels panel, locate and click the "Add a New Label" icon.
  5. At the bottom of the drop-down, enter a name for the new Custom Label.
    • It's recommended to choose a name that describes the label's purpose such as "Classification Context".
  6. Collect a Custom Label that contains any text that is present on the current Document Type but won't be present on any other Document Type.
  7. Repeat the process for any other documents that need further context.
  8. Test your classification to confirm accurate results.
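
To see why a Custom Label resolves the ambiguity, consider this small Python sketch. It assumes a simple coverage score (fraction of labels found) rather than Grooper's actual scoring, and the `score` function, type names, and label text are hypothetical.

```python
def score(labels, text):
    """Fraction of labels found in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return sum(label.lower() in lowered for label in labels) / len(labels)


shared = ["Account Number", "Statement Date", "Balance"]
text = (
    "Acme Credit Union\n"
    "Account Number: 7\n"
    "Statement Date: 2025-01-30\n"
    "Balance: 10.00"
)

# With identical labels, both hypothetical Document Types score the same.
tie = (score(shared, text), score(shared, text))

# One custom label per type, using text that appears only on that type,
# breaks the tie.
type_a = shared + ["Acme Credit Union"]
type_b = shared + ["Northwind Bank"]
scores = {"Statement A": score(type_a, text), "Statement B": score(type_b, text)}
winner = max(scores, key=scores.get)
```

Here "Statement A" keeps full coverage while "Statement B" drops to three of four labels, so the document is no longer ambiguous.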

Using Volatile Labels

Sometimes a label appears on some documents of a Document Type but not on others. Because such a label is not reliably present, you may not want it to count toward classification. To exclude any label from classification, mark it as a Volatile Label.

  1. Navigate to the "Labels" tab on the Content Type.
  2. Click the thumbs up or thumbs down icon next to the label you want to set as a Volatile Label.
  3. In the pop-up menu, locate the Volatile property and click the checkbox to change from false to true.
  4. Save your changes. That label will no longer be used for Classification.
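
The effect of marking a label volatile can be sketched in Python. This is an illustration under assumptions, not Grooper's code: the `Label` class, its `volatile` flag, and `classification_score` are hypothetical, and the score is simple coverage over non-volatile labels.

```python
from dataclasses import dataclass


@dataclass
class Label:
    """Hypothetical label with a flag mirroring the Volatile property."""
    text: str
    volatile: bool = False


def classification_score(labels, text):
    """Coverage over non-volatile labels only; volatile labels are ignored."""
    lowered = text.lower()
    scored = [label for label in labels if not label.volatile]
    if not scored:
        return 0.0
    hits = sum(label.text.lower() in lowered for label in scored)
    return hits / len(scored)


text = "Invoice Number: 1001\nTotal: 42.00"  # this document has no PO Number

before = classification_score(
    [Label("Invoice Number"), Label("Total"), Label("PO Number")], text
)
after = classification_score(
    [Label("Invoice Number"), Label("Total"), Label("PO Number", volatile=True)], text
)
```

In this sketch the missing "PO Number" label lowers the score until it is marked volatile, after which it no longer affects the result.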

Properties overview

The following properties are specific to the Labelset-Based classification method. Property names are shown as they appear in the UI.

Labelset-Based

  • "Prescan Threshold"
    • Definition: The percentage of labelset words that must be present in the document to trigger full labelset extraction.
    • Remarks:
      • 0% (default): All Label Sets are fully executed for every document, ensuring maximum thoroughness.
      • Greater than 0%: Only Label Sets where at least the specified percentage of words are found in the document are executed, improving performance by skipping unlikely matches.
      • Example: If set to 0.5 (50%), a Label Set is fully executed only if at least half of its words are found in the document text.
      • Adjust to balance speed vs. thoroughness for your solution.
    • Purpose / use case: Use higher values to speed up classification in solutions with many document types or large batches, and lower values when completeness is more important than speed.
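
As a rough illustration of the prescan idea, the sketch below computes the fraction of a Label Set's distinct words that occur in the document text and compares it to the threshold. The `label_words` and `passes_prescan` functions and the word-splitting rule are assumptions; Grooper's actual tokenization and comparison may differ.

```python
import re


def label_words(labels):
    """Collect the distinct lowercase words used across a Label Set's labels."""
    words = set()
    for label in labels:
        words.update(re.findall(r"[a-z0-9]+", label.lower()))
    return words


def passes_prescan(labels, text, threshold):
    """True if the fraction of label words present in the text meets the threshold.

    threshold is a fraction (0.5 means 50%); a threshold of 0 always passes.
    """
    words = label_words(labels)
    if threshold <= 0 or not words:
        return True
    text_words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(words & text_words) / len(words) >= threshold


labels = ["Invoice Number", "Invoice Date", "Total"]  # 4 distinct words
text = "Invoice total due on receipt"  # contains "invoice" and "total"

at_half = passes_prescan(labels, text, 0.5)
at_three_quarters = passes_prescan(labels, text, 0.75)
at_zero = passes_prescan(labels, text, 0)
```

With two of the four label words present, this Label Set would be fully executed at a 50% threshold but skipped at 75%.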

Related settings affecting Labelset-Based Classification

  • Content Model – "Page Scope - Classification"
    • Definition: Limits the number of pages analyzed during classification.
    • Purpose / use case: Speeds up OCR and classification by focusing on the most relevant pages (e.g., first page).
  • Licensing – Classification Volume
    • Definition: Tracks the quantity of classifications performed (“documents”).
    • Purpose / use case: Ensure available classification volume covers your processing needs.

Testing and troubleshooting

If no type is selected:

  • Verify that the parent object has a Labeling Behavior.
  • Confirm each candidate Document Type has a non-empty Label Set.
  • Check OCR text availability; run or configure OCR earlier in the Batch Process.
  • Lower "Prescan Threshold" to avoid skipping valid types.

If classification is slow:

  • Increase "Prescan Threshold" to skip unlikely types.
  • Reduce "Page Scope - Classification" to analyze fewer pages.

If classification selects the wrong type:

  • Strengthen distinctive labels in each Label Set.
  • Add alternate versions for common label variations.
  • Review first-page labels if the first page is most indicative.
  • Use classification rules on Document Types (such as positive/negative extractors or page count constraints) to reinforce selection.

Keyboard and commands

  • To quickly test, use the Classify command on selected Batch Folders.
  • Use standard shortcuts for running steps, e.g., Alt + F5 in appropriate testing contexts (varies by tester UI; see local documentation).
  • For separation scenarios, verify first-page labels and related behaviors before running the Separate activity.

Summary

Labelset-Based Classification assigns Document Types to semi-structured documents by matching the known labels in each Document Type's Label Set. Setup requires only a Labeling Behavior and Label Sets, which makes onboarding new Document Types fast, and performance can be tuned with "Prescan Threshold" and "Page Scope - Classification".