Search Classifier (Classify Method)

"Search Classifier" is a Classify Method that classifies documents (Batch Folders) by finding similar documents in a document search index. The Search Classifier method uses an embeddings model and vector similarity to give an unclassified document the same Document Type as its closest match in the search index.

About

Search Classifier is a powerful classification method in Grooper that leverages AI-powered vector search to assign Document Types based on content similarity. This approach enables fast, deterministic, and highly accurate classification by comparing incoming documents to a pre-indexed set of documents using vector embeddings.

What is Search Classifier?

Search Classifier is a classification method that uses vector-based search technology to analyze and assign the most appropriate Document Type to each Batch Folder. Unlike traditional rules-based or machine learning classifiers (the Lexical method), Search Classifier relies on content similarity, comparing the content of new documents to those already indexed in an AI Search index.

What is Search Classifier for?

The Search Classifier is designed for organizations that want to automate document classification using the actual content of their documents, ensuring repeatable and auditable results. It is especially useful when:

You have a well-curated set of indexed documents representing each Document Type.
Or, you want to compare against a large set of documents in a document index that grows over time.
You want classification to be based strictly on similarity to existing, known documents.
You require deterministic results without the unpredictability of generative AI/large language models (LLMs).

How does Search Classifier work?

Search Classifier operates by translating the text content of each document into a high-dimensional vector using a configured embeddings model. Each vector is an array of numbers that represents the semantic meaning of a portion of the document (called a "chunk"). Then, vectors on the unclassified document are compared to document vectors in an AI Search index for similarity. The document is assigned the Document Type of its closest match in the index.

Search Classifier works by performing the following steps:

Vector embedding generation: The document is broken into chunks and an embeddings model generates a vector for each chunk.
Vector search query: Grooper sends a vector query to the AI Search index to retrieve the most similar indexed documents.
Result ranking: The vector query returns a list of documents with a "similarity score". The higher the score, the closer the similarity between the vectors, and the more similar the document is.
- The search results are ranked by cosine similarity between the vector embeddings.
Type assignment: The document is assigned the Content Type (most typically a Document Type) of its most similar match in the search index.
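The ranking and assignment steps can be illustrated with a minimal Python sketch. This is not Grooper's implementation. The `index` list of (Document Type, vector) pairs, the `classify` function, and the three-dimensional toy vectors are all hypothetical stand-ins for a real search index and real embeddings, which have far more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(query_vector, index, k=10):
    """Return (document_type, score) for the closest of the top-k matches.

    `index` is a hypothetical list of (document_type, vector) pairs
    standing in for the search index.
    """
    scored = [(cosine_similarity(query_vector, vec), doc_type)
              for doc_type, vec in index]
    # Higher cosine similarity means a closer match, so sort descending.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_k = scored[:k]
    if not top_k:
        return None, 0.0
    best_score, best_type = top_k[0]
    return best_type, best_score

# Toy "embeddings" for three indexed documents (illustrative values only).
index = [
    ("Invoice", [0.9, 0.1, 0.0]),
    ("Contract", [0.1, 0.9, 0.2]),
    ("Receipt", [0.7, 0.0, 0.7]),
]

doc_type, score = classify([1.0, 0.2, 0.0], index, k=2)
print(doc_type)  # Invoice
```

The unclassified vector points in nearly the same direction as the "Invoice" example, so that Document Type wins.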

General configuration and use

To use the Search Classifier in Grooper, follow these steps:

1. Add an LLM Connector to the Grooper Repository

An LLM Connector is required to access embeddings models. The Search Classifier method will not function without an embeddings model. For detailed information on how to add an LLM Connector and information about different LLM Providers, visit the LLM Connector article.

2. Configure Indexing Behavior and create the search index

An Indexing Behavior is required to select the embeddings model Search Classifier uses, add documents to an AI Search index, and ensure documents are indexed with vector embeddings.

Be sure to:

Add an Indexing Behavior to the Content Model (or appropriate Content Type).
Under "Options", enable "Vector Search".
Using the "Embeddings Model" editor, select an embeddings model.
Right click the Content Model (or Content Type where the Indexing Behavior is configured).
Execute the "Search > Create Search Index" command to create the search index.
- There's more to configuring an Indexing Behavior than described here to take advantage of the Search Page or AI Assistants. For more in depth information on Indexing Behaviors, please visit the Indexing Behavior article.

3. Add documents to the search index

You will need to add documents to the search index before Search Classifier can work. At least one example of each Document Type should be added to the search index. There are several ways to add documents to a search index.

The simplest (but most manual) way is to:

Right click a document and execute the "Assign Document Type" command.
Using the "Content Type" dropdown, select the appropriate Document Type.
Press "Execute" to assign the document its Document Type.
Once it has a Document Type, right click the document again and execute the "Search > Add to Index" command.

For information on how to execute the Add to Index command in a Batch Process, the "bulk" Submit Indexing Job command, and the Indexing Service, visit this section of the AI Search article.

4. Configure Search Classifier

Search Classifier is a Classify Method. It can be selected and configured using a Content Model's "Classification Method" editor.

On the Content Model, select "Classification Method" in the property grid. Using its dropdown editor, select "Search Classifier".
(Optional) Configure the "Filter" and "k" properties.
- Filter - Restricts which indexed documents are considered (using an OData filter expression)
  - Enter a valid OData filter expression referencing fields in your search index.
  - The filter is applied to all vector search queries performed by the Search Classifier.
  - If left blank, all indexed documents are eligible for matching.
  - Example: Restrict classification to documents where a "Category" field is set to "Contracts".
    - Category eq 'Contracts'
- k - Controls how many similar documents are retrieved for each classification attempt.
  - The default is 10 and will retrieve the 10 most similar indexed items for each vector query.
  - Increasing 'k' may improve classification accuracy by considering more candidates, but can increase processing time and the risk of false positives.
  - Lower values may speed up classification but could miss relevant matches in large or diverse indexes.
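As a rough illustration of how "Filter" and "k" interact, the Python sketch below first drops candidates that fail a filter and then keeps the k highest-scoring ones. It is a simplification under stated assumptions: the `retrieve` function, the dictionary layout, the Document Type names, and the scores are hypothetical, and a plain equality check on "Category" stands in for a real OData filter expression.

```python
def retrieve(indexed_docs, k=10, category=None):
    """Mimic 'Filter' then 'k': drop documents that fail the filter,
    then keep the k highest-scoring ones."""
    eligible = [d for d in indexed_docs
                if category is None or d["Category"] == category]
    ranked = sorted(eligible, key=lambda d: d["score"], reverse=True)
    return ranked[:k]

# Hypothetical search hits with precomputed similarity scores.
indexed_docs = [
    {"id": "a", "Category": "Contracts", "type": "Lease", "score": 0.91},
    {"id": "b", "Category": "Invoices", "type": "Invoice", "score": 0.95},
    {"id": "c", "Category": "Contracts", "type": "NDA", "score": 0.88},
    {"id": "d", "Category": "Contracts", "type": "Lease", "score": 0.70},
]

unfiltered = retrieve(indexed_docs, k=2)
filtered = retrieve(indexed_docs, k=2, category="Contracts")

print([d["id"] for d in unfiltered])  # ['b', 'a']
print([d["id"] for d in filtered])    # ['a', 'c']
```

With no filter, the invoice is the top hit. With the filter in place, it is never considered, no matter how similar it is.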

5. Run Classification

When the Classify step/activity runs on the document, Search Classifier generates embeddings for the document, compares them to document embeddings in the search index, retrieves the most similar items, determines which document is most similar, and assigns the Batch Folder the same Document Type.

When testing a Classify step in a Batch Process (using the Tester tab), you can review Diagnostics to help validate classification results. This will also help you determine if you need to refine the documents in your index used for classification (either by adding more documents to the search index or by adding/adjusting a Filter for Search Classifier).

The Execution Log.txt diagnostic summarizes hits and classification candidates, showing match scores for potential Document Types.
The Search Results.json diagnostic contains the raw results returned from the search index for each classification attempt.

Best practices

Ensure your Indexing Behavior is properly configured and the index is populated with high-quality, representative documents.
Use the "Filter" property to improve accuracy or enforce business rules.
Adjust the "k" property to balance accuracy and performance.
Review diagnostic outputs to validate classification results and refine your index as needed.

How to avoid index alignment errors

The Search Classifier method is unique in that it uses both documents stored in Grooper Batches and search index data in AI Search to classify documents. For things to run smoothly, the search index and the documents used for classification in Grooper need to be aligned. If they are not aligned, Search Classifier can error out when attempting to classify documents.

Misalignment Example 1: A document is deleted without updating the search index.
Misalignment Example 2: A document's Document Type is changed without updating the search index.
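To make the two cases concrete, here is a small Python sketch that compares a snapshot of the repository against a snapshot of the search index. It is purely illustrative: the `find_misalignments` function and the two dictionaries (document ID mapped to Document Type) are hypothetical and do not reflect how Grooper or the search index stores this information.

```python
def find_misalignments(repository, index):
    """Compare two {document_id: document_type} snapshots.

    Returns (deleted, retyped):
      deleted - IDs still in the index but gone from the repository
      retyped - IDs whose Document Type differs between the two
    """
    deleted = sorted(doc_id for doc_id in index if doc_id not in repository)
    retyped = sorted(doc_id for doc_id in index
                     if doc_id in repository
                     and repository[doc_id] != index[doc_id])
    return deleted, retyped

# Hypothetical snapshots: doc-2 was deleted, doc-3 was reclassified.
repository = {"doc-1": "Invoice", "doc-3": "Contract"}
index = {"doc-1": "Invoice", "doc-2": "Receipt", "doc-3": "Invoice"}

deleted, retyped = find_misalignments(repository, index)
print(deleted)  # ['doc-2']
print(retyped)  # ['doc-3']
```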

To avoid errors during a Classify step, you should keep your search index aligned with documents the Search Classifier method uses to determine a document's Document Type. This can be done in one of two ways:

Filter Search Classifier to a dedicated set of classification examples.
Search Classifier's "Filter" property allows users to pick a subset of documents in the search index for classification. If you restrict this to a known set of documents that will never be deleted or changed, you can avoid these types of index alignment errors even when other/new documents are more in flux.
Keep the Indexing Service on.
The Indexing Service will continually index documents as they are brought into Grooper, have their data changed or are deleted. This service runs in the background, periodically polling the Grooper Repository to align documents with the search index.

Filter Search Classifier

Do this if you have a small set of documents you want to use for classification examples.

This scenario presumes you have a dedicated set of documents you want to use for classification. These documents will stay in one or more "Classification Example" Batches you create. You should add a hidden Boolean field to your Data Model called something like "Is Training Example". This field should be "False" by default and manually set to "True" by Grooper designers for documents in the "Classification Example" Batches. Then, you can use Search Classifier's Filter property to only use documents whose "Is Training Example" field is "True". This will ensure the Search Classifier never compares unclassified documents to anything in the search index besides documents in the "Classification Example" Batches.

The general steps for this setup would be as follows:

Add a Data Field named "Is Training Example" to the root Data Model for your Content Model (or whichever Content Type has the Indexing Behavior configured).
Make it a Boolean field by setting its "Value Type" to "Boolean".
Make its default value "false" by setting its "Default Value" property to False.

Gather up documents that best represent each of the Document Types in your Content Model.
- You can organize them however you want in your Grooper Repository. However, a good practice is to create one or more "Classification Example" Batches in the Test branch of the Batches folder.
Assign them their correct Document Type.
For each document, change the "Is Training Example" field from False to True.
Add these documents to the search index.
- Ex: Right click each Batch Folder and execute the "Search > Add to Index" command.
Navigate to the Content Model and expand the "Search Classifier" (Classification Method) properties.
Select the "Filter" property and enter this expression: Is_Training_Example eq true
- true must be in all lower case letters.
- Is_Training_Example is whatever you named your Boolean Data Field, with its name cleansed to meet the search filter's requirements.
Make the "Is Training Example" Data Field a hidden field by setting its "Visible" property to False.
- This is optional but encouraged. Doing this will obscure this property from Review users and prevent them from accidentally editing it.

This will ensure the Search Classifier only compares documents to those whose "Is Training Example" field is set to "true" when attempting to classify new documents.
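The Python sketch below shows one way to derive a filter expression like the one above from a Data Field's display name. The cleansing rule is an assumption: this article only confirms that "Is Training Example" becomes Is_Training_Example, so replacing every run of other characters with a single underscore is a guess at the general rule, and both helper functions are hypothetical.

```python
import re

def cleanse_field_name(name):
    """Assumed cleansing rule: replace each run of characters other than
    letters, digits, and underscores with a single underscore."""
    return re.sub(r"[^A-Za-z0-9_]+", "_", name.strip())

def bool_filter(field_name, value):
    """Build an 'eq' filter for a Boolean field.

    The Boolean literal is written in lower case ('true' / 'false'),
    as the filter expression requires.
    """
    literal = "true" if value else "false"
    return f"{cleanse_field_name(field_name)} eq {literal}"

print(bool_filter("Is Training Example", True))
# Is_Training_Example eq true
```

Check the resulting field name against the fields actually present in your search index before relying on it.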

Enable the Indexing Service

Do this if you want to classify according to all documents in the search index.

The Indexing Service is a Grooper service that runs in the background, continuously updating your search index(es). Installing this service will keep your index up to date as documents flow through a Grooper Repository. It continuously polls the Grooper Repository, looking for documents that need to be added to or removed from the search index, or whose search index data needs to be changed.

Importantly, having an Indexing Service running will help avoid issues with Search Classifier throwing errors when a document in the Grooper Repository is misaligned with its values in the search index. The Indexing Service will automatically update the search index when:

Documents are assigned a Content Type (e.g. Document Type)
A document's Content Type changes.
A document is deleted.

With an up-to-date search index that accurately matches documents in the Grooper Repository, Search Classifier will not throw an error due to an index being misaligned with documents in Grooper.

You can still use the filtering method described in the previous section and run an Indexing Service.
Be aware: There will be a brief time between the Indexing Service's polling cycles and the time it takes to index new entries where the Grooper Repository is technically misaligned with the search index. However, this is unlikely to cause issues in real world scenarios.

To install an Indexing Service:

Open Grooper Command Console (GCC).
- GCC must be run as an administrator.
- GCC can be accessed from the Windows Start menu.
- Or, the executable gcc.exe can be found in the Grooper install directory.
Use the following GCC command to install the Indexing Service:
```
services install <connectionNo> IndexingService <userName> <password>
```
- <connectionNo> is a required parameter. Enter the connection number for the Grooper Repository using the service. If you don't know the connection number, enter the connections list command for a list of all Grooper Repository connections.
- <userName> is a required parameter. Enter the user name to run the service under. This user must have the "Log on as Service" permission in Windows.
- <password> is a required parameter. Enter the password for the provided user name.
  - Enter ? to be prompted for the password. This will mask the entered password.
After attempting the install, GCC will present an installation log. At the end of this log, it will inform you if:
- The service was successfully installed.
- Or, the service installation FAILED.
Verify you also have an Activity Processing service installed.
- After polling the Grooper Repository, the Indexing Service creates a "Processing Job" to update the search index. An Activity Processing service needs to be running in order to start and complete the Processing Job.
- To install an Activity Processing service, execute this command in GCC:
```
services install <connectionNo> ActivityProcessing <userName> <password> [threadCount] [queueName]
```
  - [threadCount] and [queueName] are optional.
  - If you want to specify a specific thread count, replace [threadCount] with an appropriate integer. Not setting an integer here will assume the default setting of "multiple" threads.
  - If you want to specify a Processing Queue, replace [queueName] with a Processing Queue's name. Leaving this blank will assume the Default processing queue.
Start both the Indexing Service and Activity Processing services.
- Either from GCC with the services start command or from the Design page using the "Machines" node.

Search Classifier: More deterministic than LLM Classifier

The Search Classifier method does not use large language models (LLMs) or "generative AI". It only uses embeddings models to collect vector embeddings for each document. Classification is based solely on vector similarity to indexed content, making results predictable, repeatable, and easy to audit.

Embeddings models are "deterministic".

They produce the same output every time, given the same input under the same conditions (i.e. using the same model).
The same vectors will be collected from the same document as long as the same embeddings model is used every time it runs on a document.
Vector comparisons are similarly deterministic. The equations used to measure the similarities between vectors on document A and vectors on document B are going to be the same every time that similarity score is measured.

Large language models (and generative AI as a whole) are not deterministic.

Generative models use sampling techniques to produce varied outputs.
While it's likely an LLM will generate similar responses to the same input, there is no guarantee the response will be identical every time it runs on a document.
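The contrast can be shown with a toy Python sketch. It does not call a real embeddings model or LLM: the vectors, labels, and weights are made-up values, and `random.choices` merely stands in for the sampling step a generative model performs.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc_a = [0.2, 0.7, 0.1]
doc_b = [0.3, 0.6, 0.2]

# Deterministic: the same two vectors give the same score on every run.
scores = {cosine_similarity(doc_a, doc_b) for _ in range(100)}
print(len(scores))  # 1

# Sampling: drawing from a probability distribution can give a
# different answer from one draw to the next.
labels = ["Invoice", "Receipt"]
weights = [0.6, 0.4]
samples = {random.choices(labels, weights=weights)[0] for _ in range(200)}
print(sorted(samples))  # output can vary between draws
```

One hundred similarity measurements collapse to a single distinct value, while the sampled labels are not guaranteed to agree with each other.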

Knowing this, you can contrast the "Search Classifier" method with the "LLM Classifier" method.

The Search Classifier method uses embeddings models, vector comparison, and a search index to determine a document's similarity to documents already present in a search index. An unclassified document is given the same Document Type as its closest match in the search index.
- Search Classifier is deterministic.
- It requires an embeddings model and an AI Search index to operate.
The LLM Classifier method uses a large language model to choose from a list of possible Document Types. The LLM is handed the document, the list of Document Types and is (more or less) asked the question "What Document Type is this?". The LLM then generates a response, from which Grooper assigns the document a Document Type.
- LLM Classifier is not deterministic (it is generative).
- It requires only a large language model to operate.

Related concepts

Content Model: The root object in Grooper that organizes Document Types and classification logic.
Document Type: The object in Grooper that represents a distinct kind of document. Critical for document classification and extraction.
Content Type: The object assigned by classification. Most typically this is a Document Type.
Batch Folder: The item being classified.
Classify: The activity that applies classification to Batch Folders.
Indexing Behavior: The configuration that enables vector search and connects to your search index.