Search Classifier (Classify Method)

"Search Classifier" is a Classify Method that classifies documents (Batch Folders) by finding similar documents in a document search index. The Search Classifier method uses an embeddings model and vector similarity to give an unclassified document the same Document Type as its closest match in the search index.

About

Search Classifier is a powerful classification method in Grooper that leverages AI-powered vector search to assign Document Types based on content similarity. This approach enables fast, deterministic, and highly accurate classification by comparing incoming documents to a pre-indexed set of documents using vector embeddings.

What is Search Classifier?

Search Classifier is a classification method that uses vector-based search technology to analyze and assign the most appropriate Document Type to each Batch Folder. Unlike traditional rules-based or machine learning classifiers (the Lexical method), Search Classifier relies on content similarity, comparing the content of new documents to those already indexed in an AI Search index.

What is Search Classifier for?

The Search Classifier is designed for organizations that want to automate document classification using the actual content of their documents, ensuring repeatable and auditable results. It is especially useful when:

Either you have a well-curated set of indexed documents representing each Document Type,
or you want to compare against a large set of documents in a document index that grows over time.
You want classification to be based strictly on similarity to existing, known documents.
You require deterministic results without the unpredictability of generative AI/large language models (LLMs).

How does Search Classifier work?

Search Classifier operates by translating the text content of each document into high-dimensional vectors using a configured embeddings model. Each vector is an array of numbers that represents the semantic meaning of a portion of the document (called a "chunk"). Then, the unclassified document's vectors are compared to document vectors in an AI Search index for similarity. The document is assigned the Document Type of its closest match in the index.

Search Classifier works by performing the following steps:

Vector embedding generation: The document is broken into chunks and an embeddings model generates a vector for each chunk.
Vector search query: Grooper sends a vector query to the AI Search index to retrieve the most similar indexed documents.
Result ranking: The vector query returns a list of documents with a "similarity score". The higher the score, the closer the similarity between the vectors, and the more similar the document is.
- The search results are ranked by cosine similarity between the vector embeddings.
Type assignment: The document is assigned the Document Type of its most similar match in the search index.
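
The ranking and assignment steps above can be sketched in a few lines of Python. This is only an illustration of ranking by cosine similarity, not Grooper's implementation: the in-memory `INDEX`, its tiny three-dimensional vectors, and the `classify` function are all hypothetical stand-ins for a real search index and a real embeddings model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical stand-in for indexed documents (real embeddings have
# hundreds or thousands of dimensions, not three).
INDEX = [
    {"doc_type": "Invoice", "vector": [0.9, 0.1, 0.0]},
    {"doc_type": "Contract", "vector": [0.1, 0.9, 0.2]},
    {"doc_type": "Invoice", "vector": [0.8, 0.2, 0.1]},
]

def classify(query_vector, index):
    """Rank indexed entries by similarity and return the top match's type."""
    ranked = sorted(
        index,
        key=lambda entry: cosine_similarity(query_vector, entry["vector"]),
        reverse=True,
    )
    return ranked[0]["doc_type"]

print(classify([0.85, 0.15, 0.05], INDEX))
```

Because the score depends only on the two vectors being compared, running the same query against the same index returns the same ranking every time.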

General configuration and use

To use the Search Classifier in Grooper, follow these steps:

1. Add an LLM Connector to the Grooper Repository

An LLM Connector is required to access embeddings models. The Search Classifier method will not function without an embeddings model. For detailed information on how to add an LLM Connector and information about different LLM Providers, visit the LLM Connector article.

2. Configure Indexing Behavior and create the search index

An Indexing Behavior is required to select the embeddings model Search Classifier uses, add documents to an AI Search index, and ensure documents are indexed with vector embeddings.

Be sure to:

Add an Indexing Behavior to the Content Model (or appropriate Content Type).
Under "Options", enable "Vector Search".
Using the "Embeddings Model" editor, select an embeddings model.
Right click the Content Model (or Content Type where the Indexing Behavior is configured).
Execute the "Search > Create Search Index" command to create the search index.
- Configuring an Indexing Behavior to take advantage of the Search Page or AI Assistants involves more than is described here. For more in-depth information on Indexing Behaviors, please visit the Indexing Behavior article.

3. Add documents to the search index

You will need to add documents to the search index before Search Classifier can work. There should be at least one example of each Document Type in the search index. There are several ways to add documents to a search index.

The simplest (but most manual) way is to:

Right click a document and execute the "Assign Document Type" command.
Using the "Content Type" dropdown, select the appropriate Document Type.
Press "Execute" to assign the document its Document Type.
Once it has a Document Type, right click the document again and execute the "Search > Add to Index" command.

For information on how to execute the Add to Index command in a Batch Process, the "bulk" Submit Indexing Job command, and the Indexing Service, visit this section of the AI Search article.

4. Configure Search Classifier

Search Classifier is a Classify Method. It can be selected and configured using a Content Model's "Classification Method" editor.

On the Content Model, select "Classification Method" in the property grid. Using its dropdown editor, select "Search Classifier".
(Optional) Configure the "Filter" and "k" properties.
- Filter - Restricts which indexed documents are considered (using an OData filter expression)
  - Enter a valid OData filter expression referencing fields in your search index.
  - The filter is applied to all vector search queries performed by the Search Classifier.
  - If left blank, all indexed documents are eligible for matching.
  - Example: Restrict classification to documents where a "Category" field is set to "Contracts".
    - Category eq 'Contracts'
- k - Controls how many similar documents are retrieved for each classification attempt.
  - The default is 10, which retrieves the 10 most similar indexed items for each vector query.
  - Increasing 'k' may improve classification accuracy by considering more candidates, but can increase processing time and the risk of false positives.
  - Lower values may speed up classification but could miss relevant matches in large or diverse indexes.
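
As a rough illustration of how "Filter" and "k" shape the candidate set, the sketch below narrows a toy index with a predicate and then keeps only the k best matches. It makes several assumptions: the `top_k` function, the `keep` predicate, and the in-memory `INDEX` are hypothetical, and the Python lambda merely imitates what the OData expression `Category eq 'Contracts'` would do on the search service. It does not parse OData or call any real search API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical indexed items with a filterable "category" field.
INDEX = [
    {"id": "a", "category": "Contracts", "doc_type": "Lease", "vector": [1.0, 0.0]},
    {"id": "b", "category": "Contracts", "doc_type": "NDA", "vector": [0.0, 1.0]},
    {"id": "c", "category": "Invoices", "doc_type": "Invoice", "vector": [0.9, 0.1]},
]

def top_k(query_vector, index, k=10, keep=lambda entry: True):
    """Return up to k (score, entry) pairs that pass the filter, best first."""
    candidates = [entry for entry in index if keep(entry)]
    scored = [(cosine_similarity(query_vector, e["vector"]), e) for e in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

query = [0.9, 0.1]

# No filter: every indexed item is eligible, so the invoice is the top hit.
unfiltered = top_k(query, INDEX, k=1)

# Imitates the filter "Category eq 'Contracts'": the invoice is never considered.
contracts_only = top_k(query, INDEX, k=10, keep=lambda e: e["category"] == "Contracts")

print([e["id"] for _, e in unfiltered])
print([e["id"] for _, e in contracts_only])
```

In this toy data the filter changes the outcome: the closest overall match is excluded, so the best remaining candidate comes from the "Contracts" items.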

5. Run Classification

When the Classify step/activity runs on the document, Search Classifier generates embeddings for the document, compares them to document embeddings in the search index, gathers the most similar items, determines which one is most similar, and assigns the Batch Folder the same Document Type.

When testing a Classify step in a Batch Process (using the Tester tab), you can review Diagnostics to help validate classification results. This will also help you determine if you need to refine the documents in your index used for classification (either by adding more documents to the search index or by adding/adjusting a Filter for Search Classifier).

The Execution Log.txt diagnostic summarizes hits and classification candidates, showing match scores for potential Document Types.
The Search Results.json diagnostic contains the raw results returned from the search index for each classification attempt.

Best practices

Ensure your Indexing Behavior is properly configured and the index is populated with high-quality, representative documents.
Use the "Filter" property to improve accuracy or enforce business rules.
Adjust the "k" property to balance accuracy and performance.
Review diagnostic outputs to validate classification results and refine your index as needed.

Search Classifier: More deterministic than LLM Classifier

The Search Classifier method does not use large language models (LLMs) or "generative AI". It only uses embeddings models to collect vector embeddings for each document. Classification is based solely on vector similarity to indexed content, making results predictable, repeatable, and easy to audit.

Embeddings models are "deterministic".

They produce the same output every time, given the same input under the same conditions (i.e. using the same model).
The same vectors will be collected from the same document as long as the same embeddings model is used every time it runs on a document.
Vector comparisons are similarly deterministic. The equations used to measure the similarities between vectors on document A and vectors on document B are going to be the same every time that similarity score is measured.

Large language models (and generative AI as a whole) are not "deterministic".

Generative models use sampling techniques to produce varied outputs.
While it's likely an LLM will generate similar responses to the same input, there is no guarantee the response will be identical every time it runs on a document.
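
The difference can be illustrated with a toy sketch. Neither function below is a real model: `toy_embed` is a hypothetical stand-in that derives numbers from a hash purely to show a function whose output depends only on its input, and `toy_generate` is a hypothetical stand-in that picks an answer by weighted random sampling, which is the general reason generative output can vary between runs.

```python
import hashlib
import random

def toy_embed(text, dims=4):
    """Deterministic: the same text always maps to the same list of numbers."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255 for byte in digest[:dims]]

def toy_generate(options, weights):
    """Sampled: the most likely option usually wins, but not always."""
    return random.choices(options, weights=weights, k=1)[0]

text = "Invoice number 1001, total due 250.00"

# Same input, same output, on every call.
print(toy_embed(text) == toy_embed(text))

# Same input, but the answer can differ from call to call.
options = ["Invoice", "Receipt", "Statement"]
print([toy_generate(options, [0.6, 0.3, 0.1]) for _ in range(5)])
```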

Knowing this, you can contrast the "Search Classifier" method with the "LLM Classifier" method.

The Search Classifier method uses embeddings models, vector comparison, and a search index to determine a document's similarity to documents already present in a search index. An unclassified document is given the same Document Type as its closest match in the search index.
- Search Classifier is deterministic.
- It requires an embeddings model and an AI Search index to operate.
The LLM Classifier method uses a large language model to choose from a list of possible Document Types. The LLM is handed the document and the list of Document Types, and is (more or less) asked the question "What Document Type is this?". The LLM then generates a response, from which Grooper assigns the document a Document Type.
- LLM Classifier is not deterministic (it is generative).
- It requires only a large language model to operate.

Best Practices: How to avoid index alignment errors

Related concepts

Content Model: The root object in Grooper that organizes Document Types and classification logic.
Document Type: The object in Grooper that represents a distinct kind of document. Critical for document classification and extraction.
Content Type: The object assigned by classification. Most typically this is a Document Type.
Batch Folder: The item being classified.
Classify: The activity that applies classification to Batch Folders.
Indexing Behavior: The configuration that enables vector search and connects to your search index.