Document Quoting (Property)

From Grooper Wiki

This article is about the current version of Grooper.


Document Quoting is a property of the AI Extract Fill Method that limits the text fed to the AI, reducing the number of tokens consumed. Controlling exactly what text is supplied not only reduces the monetary cost of using the AI, but also the time cost of running the Fill Method.

Data Values

This setting supplies the AI with extracted document data in JSON format. Recall that AI Extract always runs after normal Grooper extraction, so by that point the document has a "DocumentData.json" file attached to it. A standard Grooper Extractor Type can then be run against this JSON instead of the full text of the document.
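As a toy illustration of the idea (the field names and JSON shape below are invented for the example, not Grooper's actual DocumentData.json schema), running a pattern-based extractor against serialized JSON rather than the full document text might look like this:

```python
import json
import re

# Hypothetical extracted data, standing in for DocumentData.json contents
# produced by a prior extraction pass.
document_data = {
    "InvoiceNumber": "INV-10442",
    "InvoiceDate": "2025-03-14",
    "Total": "1,284.50",
}

# Serialize the extracted data; this short JSON string, not the full
# document text, is what gets searched (and ultimately quoted to the AI).
json_text = json.dumps(document_data, indent=2)

# A simple pattern-based extractor run against the JSON.
match = re.search(r'"InvoiceNumber":\s*"([^"]+)"', json_text)
print(match.group(1))  # → INV-10442
```

The JSON string is typically a small fraction of the size of the original document text, which is where the token savings come from.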

Extracted

This setting allows the use of a standard Grooper Extractor Type to define a specific string of text from the document. The AI is fed only the result returned by the Extractor Type instead of the full text of the document.

Semantic

This setting exposes a set of properties related to text embedding chunks. Suppose you are looking for the existence of a specific type of clause within a contractual legal document. With this method you can provide the model an example (or several) of the clause in question. These examples are then compared against the embedded "chunks" of the document's text, instead of the entire text of the document.

Model

This property is a selection of text embedding models. Text embedding models convert text into numerical vectors that capture semantic meaning. These vectors represent words, sentences, or documents in a high-dimensional space, preserving contextual relationships. They are beneficial because they enable efficient comparison, clustering, and retrieval of text data, improve the performance of natural language processing tasks like classification and translation, and facilitate understanding and generation of human language by machines.

Text embedding models are beneficial for pricing token consumption with LLMs like ChatGPT because they create compact, efficient representations of text. This reduces the number of tokens required to process and understand the input, lowering costs. Additionally, embeddings allow for effective semantic search and context management, minimizing unnecessary token usage and optimizing interactions for cost-efficiency.
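The comparison step behind all of this can be illustrated with a minimal cosine-similarity sketch. The four-dimensional vectors below are stand-ins for real model output (text-embedding-3-small, for instance, returns 1536-dimensional vectors):

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding vectors: close to 1.0 means
    the underlying texts point in nearly the same semantic direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings of a clause example,
# a chunk that resembles it, and an unrelated chunk.
clause_example  = [0.9, 0.1, 0.3, 0.0]
similar_chunk   = [0.8, 0.2, 0.4, 0.1]
unrelated_chunk = [0.0, 0.9, 0.0, 0.8]

print(cosine_similarity(clause_example, similar_chunk) >
      cosine_similarity(clause_example, unrelated_chunk))  # → True
```

Because each chunk is reduced to a fixed-length vector, comparing a query against every chunk is cheap, and only the best-matching chunks need to be quoted to the LLM.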

Please visit this article for a more comprehensive understanding of the following information.

text-embedding-3-large

Size and Complexity

As indicated by "large", this is a large model, which means it has more parameters compared to smaller models. This generally leads to better performance in generating embeddings but requires more computational resources.

Performance

Typically, larger models capture more intricate patterns and relationships in the data, leading to higher quality embeddings.

Use Cases

Suitable for applications where the highest possible embedding quality is required, and computational resources are available to support the larger model size.

text-embedding-3-small

Size and Complexity

As indicated by "small", this is a smaller version of the model, with fewer parameters compared to the "large" version.

Performance

While it may not capture as many intricate details as the large model, it still provides good quality embeddings but with less computational overhead.

Use Cases

Ideal for scenarios where computational resources are limited or where faster inference times are needed, and a slight compromise on embedding quality is acceptable.

text-embedding-ada-002

Size and Complexity

"ada" generally indicates a specific architecture within the OpenAI models. The "002" version suggests it's an iteration or version of the ada architecture.

Performance

This model balances performance and efficiency, often tailored for a mix of good quality embeddings and reasonable computational requirements.

Use Cases

Often used for a broad range of applications where a balance between computational efficiency and embedding quality is desired.

In summary:

  • text-embedding-3-large: Best quality embeddings, high computational requirements.
  • text-embedding-3-small: Good quality embeddings, lower computational requirements.
  • text-embedding-ada-002: Balanced approach, good for general-purpose use cases.

The choice among these models depends on the specific needs of your application, including the quality of embeddings required and the available computational resources.

Queries

This property is a collection of one or more natural language queries to be executed against the document. Simply put, these are the examples given to the AI to compare against. So in the case of looking for a type of clause, this would be an example of that clause. You may provide more than one example, and if so, the highest-scoring "example" (query) will be used.
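A rough sketch of that selection step, using made-up similarity scores in place of real embedding comparisons (the query strings and score values are purely illustrative):

```python
# One row per query (example clause), one score per document chunk.
# In practice each score would be a cosine similarity between the
# query's embedding and a chunk's embedding.
scores = {
    "indemnification clause, version A": [0.41, 0.88, 0.12],
    "indemnification clause, version B": [0.35, 0.79, 0.20],
}

# For each query take its best chunk score, then keep the query
# whose best match scores highest overall.
best_query = max(scores, key=lambda q: max(scores[q]))
best_chunk_index = max(range(len(scores[best_query])),
                       key=lambda i: scores[best_query][i])

print(best_query)       # → indemnification clause, version A
print(best_chunk_index) # → 1
```

Providing several phrasings of the same clause raises the odds that at least one query lands close to how the document actually words it.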

Preprocessing

Please visit the Preprocessing article for more information.

Chunking

This property is a collection of properties that control document chunking. "Chunking" in the context of text processing refers to dividing a large piece of text into smaller segments or "chunks." This can be useful for various natural language processing tasks, such as text embedding, where processing the entire text at once may not be feasible or efficient.

Offset

This property refers to the starting position of a chunk relative to the original text. When chunking text, you may define offsets to specify where each chunk begins. This helps in keeping track of the position of each chunk within the original text.

Overlap

This property refers to the amount of text that is shared between consecutive chunks. Overlapping can help ensure that important contextual information is preserved across chunks, which can be particularly useful in tasks like embedding, where the context around a piece of text can be important for understanding its meaning.

Chunk Size

This property refers to the length of each chunk, typically defined in terms of the number of words, sentences, or characters. Choosing an appropriate chunk size is crucial, as too small chunks may lose context, while too large chunks may be difficult to process efficiently.

Example

Imagine you have a long document that you want to chunk for embedding:

  • Original Text: "This is a sample document. It contains several sentences. We will chunk this document for processing."
  • Chunk Size: 10 words
  • Overlap: 5 words

Chunks

  • Chunk 1: "This is a sample document. It contains several sentences. We"
  • Chunk 2: "It contains several sentences. We will chunk this document for"
  • Chunk 3: "will chunk this document for processing."

In this example:

  • Offset: The starting position of each chunk in the original text; with an overlap of 5, each chunk starts 5 words after the previous one.
  • Overlap: 5 words are shared between consecutive chunks to maintain context.
  • Chunk Size: Each chunk contains up to 10 words.

Benefits of Chunking with Offset and Overlap

  • Context Preservation: Overlapping chunks ensure that important context is not lost between chunks.
  • Efficiency: Smaller chunks can be processed more efficiently, making it feasible to handle larger documents.
  • Flexibility: Adjusting chunk size and overlap allows for fine-tuning based on the specific needs of the task.

By carefully setting the chunk size, offset, and overlap, you can optimize the chunking process for your particular application, ensuring that important information is retained while making the text manageable for processing.
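The sliding-window behavior described above can be sketched in a few lines of Python. This is a simplification that splits on whitespace only; real chunkers typically operate on tokens rather than words:

```python
def chunk_words(text, size, overlap):
    """Split text into chunks of `size` words, each sharing
    `overlap` words with the previous chunk."""
    words = text.split()
    step = size - overlap  # each chunk's offset advances by this much
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the final chunk already reaches the end of the text
    return chunks

text = ("This is a sample document. It contains several sentences. "
        "We will chunk this document for processing.")
for i, chunk in enumerate(chunk_words(text, size=10, overlap=5), 1):
    print(f"Chunk {i}: {chunk}")
```

Raising `overlap` preserves more context across chunk boundaries at the cost of embedding more (partially redundant) text.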

Paragraph Padding

This property specifies the number of adjacent paragraphs to include in the quote provided to the AI. A value of 0 pads the quote to include all paragraphs that overlap the matching chunk. Higher values add that many paragraphs before and after the overlapping paragraphs.
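A minimal sketch of that padding logic, assuming the document has been split into a paragraph list and the overlapping paragraphs are known by index (the function name and signature are illustrative, not Grooper internals):

```python
def pad_paragraphs(paragraphs, overlapping, padding):
    """Return the paragraphs to quote: those overlapping the matching
    chunk, plus `padding` paragraphs on each side.

    `overlapping` is a (first, last) pair of paragraph indices;
    padding=0 returns just the overlapping paragraphs.
    """
    first, last = overlapping
    start = max(0, first - padding)              # don't run off the front
    end = min(len(paragraphs), last + 1 + padding)  # or off the back
    return paragraphs[start:end]

paragraphs = ["P0", "P1", "P2", "P3", "P4", "P5"]
print(pad_paragraphs(paragraphs, (2, 3), padding=0))  # → ['P2', 'P3']
print(pad_paragraphs(paragraphs, (2, 3), padding=1))  # → ['P1', 'P2', 'P3', 'P4']
```

A little padding gives the AI the sentences surrounding the match, which often carry the context needed to interpret it correctly.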