GPT Embeddings (Classification Method)

From Grooper Wiki

LEGACY TECHNOLOGY DETECTED!

GPT Embeddings is now obsolete. The Search Classifier method is an improved version of this Classification Method.

The GPT Embeddings method will be removed from Grooper in version 2025.

GPT Embeddings is a Classification Method that uses an OpenAI embeddings model and trained document samples to tell one document from another.

GPT Embeddings should be considered a BETA feature.

  • This feature was recently added by the development team without a specific use case in mind.
  • Rather, it was developed in response to ChatGPT's growing popularity.
  • While it should work in theory, with no specific use case originating the feature, it has not been extensively tested.
  • As new use cases emerge that are suited for this feature, this section's documentation will be expanded.

An embedding is a vector (list) of numbers. You can determine how related two embeddings are by the distance between their vectors. A small distance between embeddings suggests they are highly related. A large distance between embeddings suggests they are less related.
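To make the idea concrete, here is a minimal sketch of comparing embeddings with cosine similarity, one common way of measuring the "distance" between vectors. The toy 3-dimensional vectors and document names are invented for illustration; real embeddings models return vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: values near 1.0 mean highly related vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional embeddings (purely illustrative values).
invoice = [0.9, 0.1, 0.2]
receipt = [0.8, 0.2, 0.3]
contract = [0.1, 0.9, 0.7]

print(cosine_similarity(invoice, receipt))   # high -> highly related
print(cosine_similarity(invoice, contract))  # lower -> less related
```

An invoice and a receipt point in nearly the same direction, so their similarity score is high; the invoice and the contract diverge, so their score is lower.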

When using GPT Embeddings to classify documents, you will train the Content Model by giving Grooper example documents for each Document Type. The GPT model will assign the Document Types embeddings based on the text content from each trained document. When documents are classified (using the Classify activity), embeddings from the unclassified document are compared to the trained embedding values for each Document Type. Documents are then assigned the Document Type with the most similar embeddings.
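The comparison step described above can be sketched as a nearest-embedding lookup. This is a hedged illustration, not Grooper's actual implementation: the `trained` dictionary stands in for the embeddings stored per Document Type, and the toy vectors would in practice come from the OpenAI embeddings model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical trained embeddings, one per Document Type (toy 3-d vectors).
trained = {
    "Invoice":  [0.9, 0.1, 0.2],
    "Contract": [0.1, 0.9, 0.7],
}

def classify(doc_embedding, trained_types):
    """Assign the Document Type whose trained embedding is most similar."""
    return max(trained_types,
               key=lambda t: cosine_similarity(doc_embedding, trained_types[t]))

unclassified = [0.8, 0.2, 0.3]
print(classify(unclassified, trained))  # -> Invoice
```

The unclassified document's embedding sits closest to the "Invoice" vector, so it is assigned that Document Type.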

For more information on embeddings, visit the following OpenAI documentation:

Please be aware that embeddings models have a maximum number of input tokens per request. This means there is a cutoff point for longer documents: text beyond the limit is not represented in the embedding. How many input tokens are available depends on the GPT model you're using.

  • OpenAI recommends using the "text-embedding-ada-002" model for embeddings.
  • This model has 8191 maximum input tokens available.
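A rough sketch of guarding a document's text against that 8191-token limit is shown below. The 4-characters-per-token ratio is only a common rule of thumb for English text, not an exact count; a real implementation would count tokens with the model's actual tokenizer before sending the request.

```python
MAX_INPUT_TOKENS = 8191   # limit for text-embedding-ada-002
CHARS_PER_TOKEN = 4       # rough heuristic, not an exact tokenizer

def truncate_for_embedding(text: str) -> str:
    """Trim text to an approximate character budget for the token limit."""
    limit = MAX_INPUT_TOKENS * CHARS_PER_TOKEN
    return text[:limit]

long_doc = "lorem ipsum " * 10_000          # ~120,000 characters
print(len(truncate_for_embedding(long_doc)))  # 32764 characters kept
```

Anything cut off by this truncation contributes nothing to the document's embedding, which is why very long documents may classify less reliably.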