Labelset-Based (Classification Method)

From Grooper Wiki

Revision as of 09:53, 28 January 2026

WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.


This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

"Labelset-Based" is a Classify Method that leverages the labels defined via a Labeling Behavior to classify Batch Folders.

Introduction

Labelset-Based Classification is a Classification method that uses the configured Label Sets for each Document Type to determine the best match. It analyzes a document's text for the presence of known labels (Data Fields, field headers, table column names, section titles, etc.) and scores each candidate type based on label coverage and quality. The highest-scoring Document Type is selected.

Use Labelset-Based Classification when your documents are semi-structured: they consistently use similar text labels to identify data, but layouts may vary by vendor, version, or template.

Compared to training-based or rules-only methods, Labelset-Based Classification focuses on label presence rather than learned features, making it fast to onboard new Document Types by creating or updating Label Sets.

How it works (brief):

  1. For each candidate Document Type in the selected Content Model scope, the method loads its Label Set.
  2. Optionally, it runs a quick prescan of the text for label words, then executes the label readers to find matches.
  3. It computes a score based on the number of matched labels and their quality, applies classification rules, and selects the best-scoring Document Type.
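
The three steps above can be sketched in a few lines of Python. This is an illustrative model only, not Grooper's implementation: the function names, the quality values (1.0 for an exact match, 0.9 for a case-insensitive match), the coverage-style score, and the min_score threshold standing in for "classification rules" are all assumptions made for the example.

```python
import re
from typing import Optional


def words(s: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", s.lower()))


def prescan(text: str, labels: list[str]) -> bool:
    """Cheap filter: does any word of any label occur in the text at all?"""
    text_words = words(text)
    return any(words(label) & text_words for label in labels)


def match_quality(text: str, label: str) -> float:
    """Assumed quality scale: exact match 1.0, case-insensitive 0.9, none 0.0."""
    if label in text:
        return 1.0
    if label.lower() in text.lower():
        return 0.9
    return 0.0


def score(text: str, labels: list[str]) -> float:
    """Quality-weighted share of the Label Set found in the text."""
    if not labels or not prescan(text, labels):
        return 0.0
    return sum(match_quality(text, label) for label in labels) / len(labels)


def classify(text: str, label_sets: dict[str, list[str]],
             min_score: float = 0.5) -> Optional[str]:
    """Return the best-scoring Document Type, or None if below min_score."""
    best_type, best_score = None, 0.0
    for doc_type, labels in label_sets.items():
        s = score(text, labels)
        if s > best_score:
            best_type, best_score = doc_type, s
    # min_score is a hypothetical stand-in for classification rules.
    return best_type if best_score >= min_score else None


label_sets = {
    "Invoice": ["Invoice Number", "Invoice Date", "Total"],
    "Policy": ["Policy ID", "Insured Name", "Premium"],
}
page = "INVOICE NUMBER: 1001\nInvoice Date: 2024-01-05\nTotal: $50.00"
print(classify(page, label_sets))
```

In this sketch a Document Type whose labels never appear is rejected by the prescan before any matching is attempted, which is the kind of shortcut the optional prescan step describes.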

When to use

Ideal use cases

  • Semi-structured sets where consistent labels (e.g., “Invoice Number”, “Policy ID”, “Patient Name”) appear even if formatting differs.
  • Solutions needing rapid onboarding of new templates by authoring Label Sets rather than retraining models.
  • Mixed layouts where label presence and quality reliably indicate Document Type.

Real-world example

  • Accounts payable: supplier invoices vary widely in layout, but common labels (such as “Invoice Number”, “Invoice Date”, “Total”, “Line Items”) are usually present. Labelset-Based Classification can identify the invoice Document Type from those labels without extensive training data.

Prerequisites

  • A Content Model with one or more child Document Types in scope.
  • A Labeling Behavior configured on the parent Content Model or Content Category so Label Sets are available.
  • Each Document Type should have a defined Label Set with labels for its expected fields, sections, and tables configured.
  • Recognized text on the documents must be available for classification.
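
The prerequisites above can be read as a pre-flight checklist. The following is a minimal sketch of such a check, assuming a plain dictionary stands in for the Content Model; the keys "document_types" and "labeling_behavior" and the function name are hypothetical and do not correspond to Grooper's object model or API.

```python
def check_prerequisites(model: dict, recognized_text: str) -> list[str]:
    """Return a list of unmet prerequisites (empty list means ready)."""
    problems = []

    # Hypothetical layout: {"document_types": {name: [labels...]}, ...}
    doc_types = model.get("document_types", {})
    if not doc_types:
        problems.append("no Document Types in scope")

    # Hypothetical flag standing in for a configured Labeling Behavior.
    if not model.get("labeling_behavior", False):
        problems.append("no Labeling Behavior configured")

    # Every Document Type needs a non-empty Label Set.
    for name, labels in doc_types.items():
        if not labels:
            problems.append(f"Document Type '{name}' has an empty Label Set")

    # Classification needs recognized text to search.
    if not recognized_text.strip():
        problems.append("no recognized text available")

    return problems


model = {
    "labeling_behavior": True,
    "document_types": {"Invoice": ["Invoice Number", "Total"], "Policy": []},
}
for problem in check_prerequisites(model, "Invoice Number: 1001"):
    print(problem)
```

Returning a list of problems instead of raising on the first one lets all missing prerequisites be reported in a single pass.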