Clause Detection (Section Extract Method)

Clause Detection is a Data Section Extract Method. It leverages LLM text embedding models to compare supplied samples of text against the text of a document, returning the "chunk" of text the AI determines most closely resembles the supplied samples.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2024). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About


This setting exposes a set of properties related to text embedding chunks. Perhaps you are looking for the existence of a specific type of clause within a contractual legal document. With this method you can provide the model an example (or several) of the type of clause in question. These examples are then compared against the embedded "chunks" of the document's text, rather than against the entire text of the document.

Model

This property is a selection of text embedding models. Text embedding models convert text into numerical vectors that capture semantic meaning. These vectors represent words, sentences, or documents in a high-dimensional space, preserving contextual relationships. They are beneficial because they enable efficient comparison, clustering, and retrieval of text data, improve the performance of natural language processing tasks like classification and translation, and facilitate understanding and generation of human language by machines.

Text embedding models also help control token costs when working with LLMs like ChatGPT. Embedding calls are considerably cheaper than chat completions, and embeddings enable effective semantic search and context management: by locating only the relevant portion of a document, you can send the chat model a short excerpt instead of the full text, minimizing unnecessary token usage and optimizing interactions for cost-efficiency.
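To make the idea concrete, here is a minimal sketch (in Python, outside of Grooper) of what an embedding model does and how two embeddings are compared. It assumes OpenAI's Python client, since the models listed below are OpenAI's, with an OPENAI_API_KEY set in the environment. Cosine similarity is a standard way to compare embedding vectors; the exact scoring Grooper uses internally is not documented here.

```python
# A minimal sketch (not Grooper code) of what a text embedding model does,
# using OpenAI's Python client. Assumes OPENAI_API_KEY is set in the environment.
import math
from openai import OpenAI

client = OpenAI()

def embed(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Convert text into a numerical vector that captures its semantic meaning."""
    response = client.embeddings.create(model=model, input=text)
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Score how semantically similar two vectors are (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Semantically related sentences score higher than unrelated ones,
# even when they share few or no words.
lease = embed("Lessor hereby grants and leases the premises to Lessee.")
grant = embed("The owner conveys the property rights to the tenant.")
other = embed("The quarterly earnings report exceeded expectations.")
print(cosine_similarity(lease, grant))  # relatively high
print(cosine_similarity(lease, other))  # relatively low
```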

Please visit this article for a more comprehensive understanding of the following information.

text-embedding-3-large

Size and Complexity

As indicated by "large", this is a large model, which means it has more parameters compared to smaller models. This generally leads to better performance in generating embeddings but requires more computational resources.

Performance

Typically, larger models capture more intricate patterns and relationships in the data, leading to higher quality embeddings.

Use Cases

Suitable for applications where the highest possible embedding quality is required, and computational resources are available to support the larger model size.

text-embedding-3-small

Size and Complexity

As indicated by "small", this is a smaller version of the model, with fewer parameters compared to the "large" version.

Performance

While it may not capture as many intricate details as the large model, it still provides good quality embeddings but with less computational overhead.

Use Cases

Ideal for scenarios where computational resources are limited or where faster inference times are needed, and a slight compromise on embedding quality is acceptable.

text-embedding-ada-002

Size and Complexity

"ada" generally indicates a specific architecture within the OpenAI models. The "002" version suggests it's an iteration or version of the ada architecture.

Performance

This model balances performance and efficiency, often tailored for a mix of good quality embeddings and reasonable computational requirements.

Use Cases

Often used for a broad range of applications where a balance between computational efficiency and embedding quality is desired.

In summary:

  • text-embedding-3-large: Best quality embeddings, high computational requirements.
  • text-embedding-3-small: Good quality embeddings, lower computational requirements.
  • text-embedding-ada-002: Balanced approach, good for general-purpose use cases.

The choice among these models depends on the specific needs of your application, including the quality of embeddings required and the available computational resources.
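One concrete way to see the size difference is to embed the same input with each model and compare the vector dimensions, a rough proxy for capacity. This snippet reuses the client from the sketch above; the dimension counts shown are the published sizes for these OpenAI models.

```python
# Request the same input from each model and compare vector dimensions.
# Reuses the OpenAI client from the earlier sketch.
for model in ("text-embedding-3-large", "text-embedding-3-small", "text-embedding-ada-002"):
    vector = client.embeddings.create(model=model, input="sample text").data[0].embedding
    print(f"{model}: {len(vector)} dimensions")
# text-embedding-3-large: 3072 dimensions
# text-embedding-3-small: 1536 dimensions
# text-embedding-ada-002: 1536 dimensions
```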

Queries

This property is a collection of one or more natural language queries to be executed against the document. Simply put, these are the examples given to the AI to compare against. So in the case of looking for a type of clause, this would be an example of that clause. You may provide more than one example, and if so, the highest-scoring example (query) will be used.
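In terms of the earlier sketch, scoring multiple queries against a chunk might look like the following. This is a hypothetical illustration reusing embed() and cosine_similarity() from above, not Grooper's internal code: each query is embedded, compared against the chunk, and the best score wins.

```python
# Hypothetical illustration: each query is embedded and compared against a
# chunk; the highest-scoring query is the one that gets used.
def best_query_score(queries: list[str], chunk_text: str) -> tuple[str, float]:
    chunk_vector = embed(chunk_text)
    scores = [(q, cosine_similarity(embed(q), chunk_vector)) for q in queries]
    return max(scores, key=lambda pair: pair[1])

queries = [
    "Lessor hereby grants, leases, and lets unto Lessee...",
    "The owner conveys oil and gas rights to the operator...",
]
query, score = best_query_score(queries, "some chunk of document text")
print(f"Best query scored {score:.3f}")
```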

Preprocessing

Please visit the Preprocessing article for more information.

Chunking

This property is a collection of properties that control document chunking. "Chunking" in the context of text processing refers to dividing a large piece of text into smaller segments or "chunks." This can be useful for various natural language processing tasks, such as text embedding, where processing the entire text at once may not be feasible or efficient.

Offset

This property refers to the starting position of a chunk relative to the original text. When chunking text, you may define offsets to specify where each chunk begins. This helps in keeping track of the position of each chunk within the original text.

Overlap

This property refers to the amount of text that is shared between consecutive chunks. Overlapping can help ensure that important contextual information is preserved across chunks, which can be particularly useful in tasks like embedding, where the context around a piece of text can be important for understanding its meaning.

Chunk Size

This property refers to the length of each chunk, typically defined in terms of the number of words, sentences, or characters. Choosing an appropriate chunk size is crucial, as too small chunks may lose context, while too large chunks may be difficult to process efficiently.

Example

Imagine you have a long document that you want to chunk for embedding:

  • Original Text: "This is a sample document. It contains several sentences. We will chunk this document for processing."
  • Chunk Size: 10 words
  • Overlap: 5 words

Chunks

  • Chunk 1: "This is a sample document. It contains several sentences. We"
  • Chunk 2: "It contains several sentences. We will chunk this document for"
  • Chunk 3: "will chunk this document for processing."

In this example:

  • Offset: The starting position of each chunk within the original text (word 1, word 6, and word 11, respectively, since consecutive chunks start 5 words apart).
  • Overlap: 5 words are shared between consecutive chunks to maintain context.
  • Chunk Size: Each chunk contains up to 10 words.

Benefits of Chunking with Offset and Overlap

  • Context Preservation: Overlapping chunks ensure that important context is not lost between chunks.
  • Efficiency: Smaller chunks can be processed more efficiently, making it feasible to handle larger documents.
  • Flexibility: Adjusting chunk size and overlap allows for fine-tuning based on the specific needs of the task.

By carefully setting the chunk size, offset, and overlap, you can optimize the chunking process for your particular application, ensuring that important information is retained while making the text manageable for processing.
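A minimal word-based chunker, written as an illustration rather than Grooper's actual implementation, shows how chunk size and overlap interact and reproduces the example above.

```python
# A minimal word-based chunker (an illustration, not Grooper's implementation).
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[tuple[int, str]]:
    """Return (offset, chunk) pairs; offset is the chunk's starting word index."""
    words = text.split()
    step = chunk_size - overlap  # advance by size minus overlap each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append((start, " ".join(words[start:start + chunk_size])))
        if start + chunk_size >= len(words):
            break  # the final chunk reached the end of the text
    return chunks

text = ("This is a sample document. It contains several sentences. "
        "We will chunk this document for processing.")
for offset, chunk in chunk_words(text, chunk_size=10, overlap=5):
    print(f"offset {offset}: {chunk}")
# offset 0: This is a sample document. It contains several sentences. We
# offset 5: It contains several sentences. We will chunk this document for
# offset 10: will chunk this document for processing.
```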

Paragraph Padding

This property specifies the number of adjacent paragraphs to include in the query provided to the AI. A value of 0 will pad the quote to include all paragraphs which overlap the matching chunk. Higher values indicate the number of additional paragraphs to include before and after the overlapping paragraphs.
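The padding logic can be pictured with a small sketch (a hypothetical helper, not Grooper's API), where paragraphs and the matching chunk are represented as character spans:

```python
# Hypothetical helper (not Grooper's API) illustrating paragraph padding.
# Paragraphs and the matching chunk are given as (start, end) character spans.
def pad_paragraphs(paragraph_spans: list[tuple[int, int]],
                   chunk_span: tuple[int, int],
                   padding: int) -> list[tuple[int, int]]:
    chunk_start, chunk_end = chunk_span
    # Padding of 0: include every paragraph that overlaps the matching chunk.
    overlapping = [i for i, (start, end) in enumerate(paragraph_spans)
                   if start < chunk_end and end > chunk_start]
    first, last = overlapping[0], overlapping[-1]
    # Higher values add that many paragraphs before and after the overlap.
    first = max(0, first - padding)
    last = min(len(paragraph_spans) - 1, last + padding)
    return paragraph_spans[first:last + 1]
```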


How To

In the following walkthrough we are going to set up Clause Detection on the Data Section of a provided project. The Data Section, acting as a "container" for several descendant Data Fields, will leverage the AI Extract Fill Method to collect the data for those fields.

The document we will be extracting against, found in the provided Batch, spans several pages and contains a large amount of text. Given that fact, it would be costly to run AI Extract against the entire text of the document. To solve this problem we will use the Data Section for one of its key functions: defining a subset of data within the document. In so doing we will drastically reduce the amount of text given to the LLM, greatly reducing both the tokens consumed and the time taken to run extraction.

Because we do not know the exact wording of the clause that will define the structure of our Data Section, attempting to define that structure via pattern matching can prove quite challenging. This is where Clause Detection comes into play. We can provide a sample of what the language of the clause we are looking for might be like. This sample is leveraged by a text embedding model (which, as we learned above, is faster and cheaper than standard chatbot queries) to find a clause within the text of the document that is highly similar to the sample.
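Tying the earlier sketches together, the whole idea might look like this outside of Grooper. It is an illustration of the technique, not Grooper's internal code; chunk_words(), embed(), and cosine_similarity() are the helpers defined in the sketches above, and the chunk size and overlap values are arbitrary.

```python
# An end-to-end illustration of the Clause Detection idea (not Grooper's code).
# chunk_words(), embed(), and cosine_similarity() are from the earlier sketches.
def detect_clause(document_text: str, queries: list[str],
                  chunk_size: int = 100, overlap: int = 20) -> str:
    """Return the chunk of the document most similar to any sample query."""
    query_vectors = [embed(q) for q in queries]  # embed each sample once
    best_chunk, best_score = "", -1.0
    for _offset, chunk in chunk_words(document_text, chunk_size, overlap):
        chunk_vector = embed(chunk)
        score = max(cosine_similarity(qv, chunk_vector) for qv in query_vectors)
        if score > best_score:
            best_chunk, best_score = chunk, score
    return best_chunk

# Only the winning chunk (plus any paragraph padding) is then handed to the
# chat model for field extraction, keeping token consumption low.
```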

In so doing we will not only be leveraging AI to easily extract the data we are after, but we will also be using AI to make using AI more efficient.


  1. Select the "Granting Clause" Data Section from the provided Project.
  2. Click the drop-down for the Extract Method property.
  3. Select Clause Detection from the drop-down menu.


  1. Expand the sub-properties and click the ellipsis button for the Model property.
  2. In the "Model" window select text-embedding-3-large. Feel free to experiment with the other models.


  1. Click the ellipsis button on the Queries property.
  2. In the "Queries" window click the "Add" button.
  3. This will add an entry to the "Sample Content".
  4. Click the ellipsis button on the Sample Content property.
  5. In the "Sample Content" window add the provided sample clause.
    AGREEMENT, Made and entered into this [Effective Date], by and between [Lessor Name] whose address is [Lessor Address] hereinafter called Lessor and [Lessee Name] whose address is [Lessee Address] hereinafter called Lessee. Lessor hereby grants, leases, and lets unto Lessee, for the purpose of investigating, exploring, drilling, developing, and producing oil, gas, and other hydrocarbons, and storing, handling, and transporting the same, all the oil and gas rights and interests in and under the land described as follows: [Legal Description of Property], containing approximately [Number of Acres] acres, more or less (hereinafter referred to as the "Leased Premises").


  1. Click the "Tester" tab.
  2. Be sure to select the document from the supplied Batch in the Batch Viewer.
  3. Click the "Test" button.
  4. View the extracted results in the Data Model Preview and see the highlighting in the Document Viewer.