Clause Detection (Section Extract Method)

Revision as of 08:57, 3 August 2024

2026 BETA

This article covers new or changed functionality in the current or upcoming beta version of Grooper. Features are subject to change before version 2026's GA release. Configuration and functionality may differ from later beta builds and the final 2026 release.

Clause Detection is a Data Section Extract Method. It uses LLM text embedding models to compare supplied samples of text against the text of a document, and returns the "chunk" of text that the model scores as most similar to the supplied samples.
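To make the idea of a "chunk" concrete, the following Python sketch splits document text into overlapping windows of sentences. How Grooper actually divides a document into chunks is not described in this article, so the function name split_into_chunks, the window size, and the overlap are all assumptions chosen for illustration.

```python
import re


def split_into_chunks(text, sentences_per_chunk=3, overlap=1):
    """Split text into overlapping chunks of whole sentences.

    Hypothetical helper: the chunk size and overlap are illustrative
    assumptions, not Grooper's actual chunking settings.
    """
    if overlap >= sentences_per_chunk:
        raise ValueError("overlap must be smaller than sentences_per_chunk")
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    step = sentences_per_chunk - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + sentences_per_chunk]))
        if start + sentences_per_chunk >= len(sentences):
            break  # this window already reached the end of the text
    return chunks
```

Overlapping windows are used here so that a clause straddling a chunk boundary still appears whole in at least one chunk.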

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2024). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

Glossary

AI Extract: AI Extract is a Fill Method that leverages a Large Language Model (LLM) to return extraction results to Data Elements in a Data Model or Data Section. This mechanism provides powerful AI-based data extraction with minimal setup.

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Data Field: Data Fields represent a single value targeted for data extraction on a document. Data Fields are created as child nodes of a Data Model and/or Data Sections.

Data Fields are frequently referred to simply as "fields".

Data Model: Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) defines data extraction logic and settings for how data is reviewed in a Data Viewer.

Data Section: A Data Section is a container for Data Elements in a Data Model. Data Sections can contain Data Fields, Data Tables, and even other Data Sections as child nodes, adding hierarchy to a Data Model. They serve two main purposes:

They can simply act as organizational buckets for Data Elements in larger Data Models.
By configuring its "Extract Method", a Data Section can subdivide larger and more complex documents into smaller parts to assist in extraction.
- "Single Instance" sections define a division (or "record") that appears only once on a document.
- "Multi-Instance" sections define a collection of repeating divisions (or "records").

Document Viewer: The Grooper Document Viewer is the portal to your documents. It is the UI that allows you to see a Batch Folder's (or a Batch Page's) image, text content, and more.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Fill Method: Fill Methods provide various mechanisms for populating child Data Elements of a Data Model, Data Section or Data Table. Fill Methods can be added to these nodes using their "Fill Methods" property and editor.

Fill Methods are secondary extraction operations. They populate descendant Data Elements after normal extraction when the Extract activity runs.

Preprocessing:

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as Content Models, Batch Processes, and profile objects are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Section Extract Method: The Extract Method property of a Data Section defines a "Section Extract Method" which specifies how section instances will be identified and extracted.

About

How To

In the following walkthrough, we will set up Clause Detection on a Data Section in the provided Project. The Data Section, a "container" for several descendant Data Fields, will use the AI Extract Fill Method to collect the data for those fields.

The document we will extract from in the provided Batch spans several pages and contains a large amount of text, so running AI Extract against the document's entire text would be costly. To solve this problem, we will use the Data Section for one of its key functions: defining a subset of the text within the document. Doing so drastically reduces the amount of text given to the LLM, which in turn greatly reduces the tokens consumed and the time taken to run extraction.
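As a rough illustration of the savings, the Python sketch below compares an estimated token count for a whole document against that of a single clause. The figure of about four characters per token is only a common rule of thumb for English text, and actual counts depend on the model's tokenizer. The function names estimate_tokens and token_savings are hypothetical and are not part of Grooper.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate.

    Assumption: about 4 characters per token, a rule of thumb for
    English text. Real counts depend on the model's tokenizer.
    """
    if not text:
        return 0
    return max(1, len(text) // chars_per_token)


def token_savings(full_text, section_text):
    """Return (full_tokens, section_tokens, fraction_saved)."""
    full = estimate_tokens(full_text)
    section = estimate_tokens(section_text)
    saved = 0.0 if full == 0 else 1 - section / full
    return full, section, saved


# Example with placeholder text: a 40,000-character document
# versus an 800-character clause extracted from it.
full, section, saved = token_savings("x" * 40_000, "x" * 800)
print(f"{full} tokens vs {section} tokens ({saved:.0%} fewer)")
```

The exact numbers matter less than the proportion: the text sent to the LLM shrinks roughly in line with the size of the section relative to the full document.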

Because we do not know the exact wording of the clause that will define our Data Section, identifying it by pattern matching can be quite challenging. This is where Clause Detection comes into play. We can provide a sample of what the language of the clause we are looking for may be like. A text embeddings model (which is generally faster and cheaper than standard chatbot queries) uses this sample to find a clause in the document's text that is highly similar to it.
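The similarity search described here can be sketched in a few lines of Python. This is a minimal illustration, not Grooper's implementation: toy_embed is a stand-in that counts words, whereas a real embedding model such as text-embedding-3-large produces dense vectors that capture meaning rather than just shared vocabulary. All function names and the example chunks are hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_matching_chunk(sample, chunks):
    """Return (chunk, score) for the chunk most similar to the sample.

    Expects a non-empty list of chunks.
    """
    sample_vec = toy_embed(sample)
    scored = [(c, cosine_similarity(sample_vec, toy_embed(c))) for c in chunks]
    return max(scored, key=lambda pair: pair[1])


sample = "Lessor hereby grants, leases, and lets unto Lessee the oil and gas rights"
chunks = [
    "This lease shall remain in force for a primary term of three years.",
    "Lessor does hereby grant, lease, and let unto Lessee all oil and gas rights in the land.",
    "Lessee shall pay royalties on production within sixty days.",
]
chunk, score = best_matching_chunk(sample, chunks)
print(chunk)
```

Note that the sample does not need to match the clause word for word. The second chunk wins even though it says "grant, lease, and let" rather than "grants, leases, and lets", which is the behavior that makes a sample-based approach more forgiving than pattern matching.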

In so doing we will not only be leveraging AI to easily extract the data we are after, but we will also be using AI to make using AI more efficient.

Select the "Granting Clause" Data Section from the provided Project.
Click the drop-down for the Extract Method property.
Select Clause Detection from the drop-down menu.

Expand the sub-properties and click the ellipsis button for the Model property.
In the "Model" window select text-embedding-3-large. Feel free to experiment with the other models.

Click the ellipsis button on the Queries property.
In the "Queries" window click the "Add" button.
This will add an entry to the "Sample Content".
Click the ellipsis button on the Sample Content property.

In the "Sample Content" window add the provided sample clause.

AGREEMENT, Made and entered into this [Effective Date], by and between [Lessor Name] whose address is [Lessor Address] hereinafter called Lessor and [Lessee Name] whose address is [Lessee Address] hereinafter called Lessee. Lessor hereby grants, leases, and lets unto Lessee, for the purpose of investigating, exploring, drilling, developing, and producing oil, gas, and other hydrocarbons, and storing, handling, and transporting the same, all the oil and gas rights and interests in and under the land described as follows: [Legal Description of Property], containing approximately [Number of Acres] acres, more or less (hereinafter referred to as the "Leased Premises").

Click the "Tester" tab.
Be sure to select the document from the supplied Batch in the Batch Viewer.
Click the "Test" button.
View the extracted results in the Data Model Preview and see the highlighting in the Document Viewer.
