Clause Detection (Section Extract Method)

From Grooper Wiki
You may download the ZIP(s) below and upload it into your own Grooper environment (version 2024). The first contains one or more '''Batches''' of sample documents.  The second contains one or more '''Projects''' with resources used in examples throughout this article.  
* [[Media:2024_Wiki_Clause-Detection_Batch.zip]]
* [[Media:2024_Wiki_Clause-Detection_Project.zip]]



Revision as of 13:19, 31 July 2024

This article is about the current version of Grooper.

Note that some content may still need to be updated.


Clause Detection is a Data Section Extract Method. It leverages LLM text embedding models to compare supplied samples of text against the text of a document, returning the "chunk" of text the AI determines most closely resembles the supplied samples.


Glossary

About


How To

In the following walkthrough we are going to set up Clause Detection on the Data Section of a provided project. The Data Section, being a "container" of several descendant Data Fields, will leverage the AI Extract Fill Method to collect the data for those fields.

The document we will be extracting against from the provided Batch consists of several pages and a substantial amount of text. Given that, it would be costly to run AI Extract against the entire text of the document. To solve this problem, we will use the Data Section for one of its key functions: defining a subset of data within the document. In so doing we will drastically reduce the amount of text given to the LLM, greatly reducing both the tokens consumed and the time taken to run extraction.
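To get a feel for the savings, here is a minimal, hypothetical sketch. The ~4-characters-per-token figure is a common rule of thumb for English text, not a Grooper-specific value, and the document sizes are made-up stand-ins:

```python
def rough_tokens(text: str) -> int:
    # Rough rule of thumb for English text: ~4 characters per token.
    # Real token counts depend on the specific LLM's tokenizer.
    return len(text) // 4

# Stand-ins for a multi-page document vs. the section Clause Detection isolates.
full_document = "x" * 60_000
clause_section = "y" * 1_500

print(rough_tokens(full_document))   # tokens sent if we extract against everything
print(rough_tokens(clause_section))  # tokens sent if we extract against the section only
```

Restricting AI Extract to the section's text sends a small fraction of the tokens, which is where the cost and time savings come from.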

Because we do not know the exact wording of the clause that will define the structure of our Data Section, it can prove quite challenging to define that structure via pattern matching. This is where Clause Detection comes into play. We can provide a sample of what the language of the clause we are looking for might look like. This sample is leveraged by a text embeddings model (which, as noted above, is faster and cheaper than standard chatbot queries) to find a clause within the text of the document that is highly similar to the sample.
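The idea can be sketched as a toy example. The bag-of-words "embedding" below is only a stand-in so the sketch is self-contained; Clause Detection uses a real LLM text embedding model, and the sample and chunk texts here are invented:

```python
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    # Toy stand-in for an embedding: a bag-of-words count vector.
    # A real setup would call an LLM text embedding model instead.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar_chunk(sample: str, chunks: list[str]) -> str:
    # Return the chunk of document text closest to the supplied sample.
    sample_vec = embed(sample)
    return max(chunks, key=lambda c: cosine(sample_vec, embed(c)))

sample = "Tenant shall indemnify and hold harmless the Landlord from all claims."
chunks = [
    "Rent is due on the first day of each calendar month.",
    "Lessee agrees to indemnify and hold harmless the Lessor against any claims.",
    "This agreement is governed by the laws of the State of Oklahoma.",
]
print(most_similar_chunk(sample, chunks))
```

Note that the winning chunk need not match the sample word for word; with a real embedding model, even a clause phrased quite differently can score highest, which is exactly why this approach beats pattern matching here.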

In so doing, we will not only leverage AI to easily extract the data we are after, but also use AI to make using AI more efficient.