GPT Embeddings (Classify Method)

From Grooper Wiki

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

This article is a stub. It contains minimal information on the topic and should be expanded.

BE AWARE: GPT Embeddings is obsolete as of version 2025. The LLM Classifier and Search Classifier methods are the new and improved AI-enabled classification methods. GPT Embeddings is a Classify Method that uses an OpenAI embeddings model and trained document samples to tell one document from another.

GPT Embeddings should be considered a BETA feature.

  • This feature was recently added by the development team without a specific use case in mind.
  • Rather, it was developed in response to ChatGPT's growing popularity.
  • While it should work in theory, with no specific use case originating the feature, it has not been extensively tested.
  • As new use cases emerge that are suited for this feature, this section's documentation will be expanded.

An embedding is a vector (list) of numbers. You can measure how related two embeddings are by the distance between their vectors. A small distance between embeddings suggests they are highly related. A large distance between embeddings suggests they are less related.
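As a rough illustration of comparing embeddings by distance, the sketch below computes cosine distance between small, made-up vectors. Cosine distance is only one common choice of measure and is an assumption here; this article does not state which distance Grooper uses. The three-number vectors and their names are hypothetical stand-ins for real embeddings, which are much longer.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy embeddings (real ones have hundreds or thousands of values).
invoice_a = [0.9, 0.1, 0.0]
invoice_b = [0.8, 0.2, 0.1]
contract = [0.0, 0.2, 0.9]

near = cosine_distance(invoice_a, invoice_b)  # small distance: closely related
far = cosine_distance(invoice_a, contract)    # large distance: less related
print(near < far)
```

Here the two invoice-like vectors point in nearly the same direction, so their distance is close to 0, while the contract-like vector is nearly at a right angle to them and its distance is close to 1.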

When using GPT Embeddings to classify documents, you will train the Content Model by giving Grooper example documents for each Document Type. The embeddings model generates embeddings for each Document Type from the text content of its trained documents. When documents are classified (using the Classify activity), embeddings from the unclassified document are compared to the trained embedding values for each Document Type. Each document is then assigned the Document Type with the most similar embeddings.
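The train-then-compare flow described above can be sketched in a few lines. This is a conceptual illustration only, not Grooper's implementation: averaging each Document Type's sample embeddings into a single vector, scoring with cosine similarity, and the names `mean_vector`, `classify`, and `trained` are all assumptions made for the example. The vectors are made-up values rather than output from an OpenAI model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def classify(doc_embedding, trained):
    """Return (document_type, score) for the most similar trained type.

    trained maps a Document Type name to the embeddings of its samples.
    """
    best_type, best_score = None, float("-inf")
    for doc_type, samples in trained.items():
        score = cosine_similarity(doc_embedding, mean_vector(samples))
        if score > best_score:
            best_type, best_score = doc_type, score
    return best_type, best_score

# Hypothetical trained embeddings, two samples per Document Type.
trained = {
    "Invoice": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "Contract": [[0.0, 0.2, 0.9], [0.1, 0.1, 0.8]],
}

doc_type, score = classify([0.85, 0.15, 0.05], trained)
print(doc_type)
```

The unclassified vector sits close to the invoice samples, so it is assigned "Invoice". A real system would also need a minimum similarity threshold to leave poor matches unclassified, which this sketch omits.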

For more information on embeddings, see OpenAI's embeddings documentation.

Please be aware that embeddings requests have a maximum number of input tokens. This means there is a cutoff point for longer documents: text beyond the limit is not represented in the embedding. How many input tokens are available depends on the embeddings model you're using.

  • OpenAI recommends using the "text-embedding-ada-002" model for embeddings.
  • This model has 8191 maximum input tokens available.
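To make the cutoff concrete, the sketch below truncates a long document to the 8191-token limit quoted above. It is a simplified illustration: splitting on whitespace is a crude stand-in for a real tokenizer (actual token counts will differ), and `rough_tokenize` and `truncate_tokens` are hypothetical helper names, not part of Grooper or the OpenAI API.

```python
MAX_INPUT_TOKENS = 8191  # limit quoted above for "text-embedding-ada-002"

def rough_tokenize(text):
    """Crude stand-in tokenizer: one token per whitespace-separated word."""
    return text.split()

def truncate_tokens(tokens, max_tokens=MAX_INPUT_TOKENS):
    """Keep the first max_tokens tokens; everything after is dropped."""
    return tokens[:max_tokens]

# A hypothetical long document of 10,000 "tokens".
long_text = "word " * 10000
tokens = rough_tokenize(long_text)

kept = truncate_tokens(tokens)
dropped = len(tokens) - len(kept)
print(len(kept), dropped)
```

In this example the last 1,809 tokens never reach the embeddings model, so content that appears only near the end of a long document cannot influence its classification.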

Glossary

Classification Method:

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

Classify: Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Document Type.

  • Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
  • Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

GPT Embeddings: BE AWARE: GPT Embeddings is obsolete as of version 2025. The LLM Classifier and Search Classifier methods are the new and improved AI-enabled classification methods. GPT Embeddings is a Classify Method that uses an OpenAI embeddings model and trained document samples to tell one document from another.