2024:AI Extract (Fill Method)

From Grooper Wiki
Revision as of 08:03, 26 August 2024 by Randallkinard (talk | contribs)

2025 BETA

This article covers new or changed functionality in the current or upcoming beta version of Grooper. Features are subject to change before version 2025's GA release. Configuration and functionality may differ from later beta builds and the final 2025 release.

AI Extract is a Fill Method that leverages a Large Language Model (LLM) to return extraction results to Data Elements in a Data Model or Data Section. This mechanism provides powerful AI-based data extraction with minimal setup.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2024). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

The goal of AI Extract is to extract data simply by describing it, taking advantage of the "self-descriptive" nature of the "container" Data Elements (Data Model, Data Section, or Data Table) themselves. This Fill Method presents an AI chatbot with all or part of the document content and asks it to generate JSON data matching the structure of descendant Data Elements. That JSON data is then used to populate descendant Data Sections, Data Tables, and Data Fields.

More Data. Less Work.

The Big Idea

Use an LLM to extract data by simply describing the data to be extracted.

  • Let the Data Model be self-descriptive.
  • The AI should “know” what to extract based on how Data Elements in the Data Model are named.
  • Allows Grooper to extract a Data Model with a fraction of the setup.
  • No need for hand-crafted extractors.


AI Extract: The First Fill Method

AI Extract is the first Fill Method in Grooper.

  • Fill Methods can be configured on “container” Data Elements
    • Data Model, Data Section, Data Table
    • Be aware: AI Extract currently has no known use on a Data Table.
  • Fill Methods are secondary extraction operations which populate child Data Elements. They run after extraction.
  • Ex: A Fill Method is set on a Data Model.
    • First, all its child Data Elements run their configured Value Extractors and extract methods.
    • Then, the Fill Method runs. It can be configured to overwrite extraction results or to fill only blank fields.


Make LLMs Do The Work

  • Minimal configuration is required to start getting data back.
  • In many cases, well-named Data Elements are all that is required.
  • Where needed, Descriptions can be added to Data Elements to provide special instructions.
  • How it works:
    • Prompts the LLM with some document content.
    • Asks the LLM to generate JSON matching the Data Element structure.
    • Parses the resulting JSON data into sections, tables, and fields.
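The mechanism described in the bullets above can be illustrated with a minimal Python sketch. This is an assumption about the general technique, not Grooper's actual implementation: the function names, prompt wording, and simulated reply are all invented for illustration. The idea is that element names (plus optional Descriptions) form a JSON "schema" in the prompt, and the LLM's JSON reply is parsed back into field values.

```python
import json

# Hypothetical sketch of an AI Extract-style mechanism. The element names
# (and optional Descriptions) are the only "extractor" configuration: they
# are turned into a JSON structure the LLM is asked to fill in.
def build_prompt(document_text, elements):
    """Compose an extraction prompt from self-descriptive element names."""
    schema = {name: desc or "value as it appears on the document"
              for name, desc in elements.items()}
    return ("Extract the following fields from the document below. "
            "Respond only with JSON matching this structure:\n"
            + json.dumps(schema, indent=2)
            + "\n\nDocument:\n" + document_text)

def parse_response(raw_json, elements):
    """Populate fields from the LLM's JSON reply, ignoring unknown keys."""
    data = json.loads(raw_json)
    return {name: data.get(name) for name in elements}

elements = {
    "InvoiceNumber": None,
    "CustPO": 'If "Customer P.O." is "SEE BELOW", use the "AFE #" instead.',
}
prompt = build_prompt("Invoice No: 1234 ...", elements)
# A real call would go through the configured LLM Connector; here the
# chatbot's reply is simulated:
fields = parse_response('{"InvoiceNumber": "1234", "CustPO": "A-77"}', elements)
```

Note how the optional Description rides along in the schema as a per-element instruction, which is the role Descriptions play later in this article.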


AI Extract Pros and Cons

Understanding the benefits and drawbacks of AI Extract can help you determine if using it is right for your needs.

Pros

  • Instant time to value
  • Ease of use is off the charts. Just define the Data Model and go.
  • Fills in a full Data Model with fewer API calls than Ask AI
  • Has some result highlighting (defined by “Alignment” settings)

Cons

  • LLM responses can be unpredictable
  • LLM responses can be inaccurate
  • It is easy to become over-confident when using this method. Errors tend to surface when you look closely.
  • Larger documents pose more of a challenge to fully extract
  • Slower than traditional Grooper extractors (particularly for larger documents)
  • Result highlighting is not perfect

How To

The following walkthrough uses an invoice from the supplied materials. Invoices are easy to understand, and the setup will be straightforward as well.

HOWEVER, PLEASE BE AWARE:

  1. LLM chatbots are typically best suited for documents with natural language flow, such as legal contracts. Invoices, which we use in this example, are instead a type of structured data document. That said, if a structured document does use natural language to describe information (such as labels next to values), Grooper can get usable data from LLM-based extraction methods like AI Extract, though it may not be able to do so for all structured and semi-structured documents.
  2. LLM chatbots do not always return reliable information, as they do not inherently "understand" information. They simply predict the most likely response, one word after the other. This can lead to inaccurate responses. LLM-based extraction methods will typically provide quick results with little setup, but may not be 100% accurate.
  3. Relying on accurate information will typically involve human review to verify data integrity. An important aspect of reviewing data in Grooper is the highlighting of returned results in the Document Viewer. At best, it is difficult to highlight data from a chatbot accurately, because the chatbot returns no character-coordinate information as part of its delivered data. Some properties in the AI Extract configuration can aid this highlighting, but it isn't perfect. At worst, it can be impossible to highlight data from a chatbot at all.
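The highlighting problem in point 3 can be made concrete with a small sketch. This shows the general technique of locating an LLM's answer in page text after the fact; it is an assumption for illustration, not Grooper's Alignment implementation:

```python
def align(document_text, value):
    """Naive alignment: search for the returned value verbatim in the page
    text. Returns (start, end) character offsets usable for highlighting,
    or None when the LLM reworded or reformatted the value."""
    idx = document_text.find(value)
    return (idx, idx + len(value)) if idx >= 0 else None

text = "Invoice No: 1234\nTotal Due: $512.00"
span = align(text, "1234")         # found verbatim -> offsets to highlight
missing = align(text, "05/12/24")  # a reformatted date -> None, no highlight
```

When the chatbot normalizes a value (reformats a date, strips a currency symbol, summarizes a clause), no verbatim match exists, which is why result highlighting from LLM responses can never be guaranteed.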

Establish an LLM Connector

First, we need to establish an LLM Connector within the Options property on the Root object.

Please visit the LLM Connector article for more information.

Configure AI Extract Fill Method

With an LLM Connector established we can now configure our Data Model with the AI Extract Fill Method.

  1. Select the Data Model from the provided Project.
  2. Click the ellipsis button on the Fill Methods property.
  3. Click the "Add" button in the "Fill Methods" window.
  4. Choose "AI Extract" from the drop-down menu.


  1. Click the ellipsis button on the Model property of the newly added AI Extract Fill Method.
  2. Select "gpt-4o" in the "Model" window. As of this writing, it is the most accurate model and the best choice.
    • Feel free to experiment with the other models to test results.


  1. In the Parameters property group, lower the Temperature property to 0.2. This can help the AI be less "creative" with its responses.
    • You may want to go as low as 0 to completely eliminate "creativity".
    • Please see the Parameters article for more information.
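For context, the Model and Temperature settings chosen above correspond to familiar LLM API parameters. A hypothetical OpenAI-style request payload might look like the following (illustrative only; Grooper builds its own requests through the LLM Connector, and the message contents here are invented):

```python
# Illustrative request payload; the model name and temperature mirror the
# values selected in the steps above.
request = {
    "model": "gpt-4o",
    "temperature": 0.2,  # low temperature -> less "creative", more repeatable
    "messages": [
        {"role": "system",
         "content": "Respond only with JSON matching the requested structure."},
        {"role": "user",
         "content": "Extract InvoiceNumber and InvoiceDate from the document text."},
    ],
}
```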


  1. Click the ellipsis button on the Instructions property.
  2. Write a prompt in the "Instructions" window.
    • Prompts written here should be considered as "global" prompts for the entire model, not specific to individual Data Elements.


  1. To extract only specific elements, click the ellipsis button on the Included Elements property.
  2. In the "Included Elements" window, choose the desired elements.
    • Leaving this property at its default (blank) will consider all Data Elements. Choosing specific Data Elements will include only those selected as part of the prompt given to the AI.


  1. Document Quoting controls the text fed to the AI.
  2. Preprocessing controls the text supplied to the AI by adding or removing control characters.
  3. Alignment controls the highlighting of results in the Data Model Preview.
  4. Please see the Alignment article for more information.

Adjustments and Results

Let's now look at how the Description property on Data Elements can be leveraged as a prompt to give better results, and then see the final output of extraction.

  1. Select the "LineItems" Data Table.
  2. Click the ellipsis button on the Description property.
  3. Add instructions in the "Description" window.
This is the table for the line items ordered on the invoice.


  1. Select the "custPo" Data Field.
  2. Click the ellipsis button on the Description property.
  3. Add instructions in the "Description" window.
If the "Customer P.O." has a result of "SEE BELOW", use the "AFE #" instead.

  1. Click on the Data Model.
  2. Click on the "Tester" tab.
  3. Click the "Test" button.
  4. Notice the successfully extracted results in the Data Model Preview.

Glossary

AI Extract: AI Extract is a Fill Method that leverages a Large Language Model (LLM) to return extraction results to Data Elements in a Data Model or Data Section. This mechanism provides powerful AI-based data extraction with minimal setup.

Alignment: "Alignment" refers to how Grooper highlights text from an AI response on a document in a Document Viewer. Alignment properties can be configured to alter how Grooper highlights results when using LLM-based extraction methods, such as AI Extract.

Data Element: Data Elements are a class of node types used to collect data from a document. These include: Data Models, Data Sections, Data Fields, Data Tables, and Data Columns.

Data Field: Data Fields represent a single value targeted for data extraction on a document. Data Fields are created as child nodes of a Data Model and/or Data Section.

  • Data Fields are frequently referred to simply as "fields".

Data Model: Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) defines data extraction logic and settings for how data is reviewed in a Data Viewer.

Data Section: A Data Section is a container for Data Elements in a Data Model. Data Sections can contain Data Fields, Data Tables, and even other Data Sections as child nodes, adding hierarchy to a Data Model. They serve two main purposes:

  1. They can simply act as organizational buckets for Data Elements in larger Data Models.
  2. By configuring its "Extract Method", a Data Section can subdivide larger and more complex documents into smaller parts to assist in extraction.
    • "Single Instance" sections define a division (or "record") that appears only once on a document.
    • "Multi-Instance" sections define a collection of repeating divisions (or "records").

Data Table: A Data Table is a Data Element specialized in extracting tabular data from documents (i.e. data formatted in rows and columns).

  • The Data Table itself defines the "Table Extract Method". This is configured to determine the logic used to locate and return the table's rows.
  • The table's columns are defined by adding Data Column nodes to the Data Table (as its children).

Document Quoting: Document Quoting controls the document text fed to the AI when using LLM-based extraction methods, such as AI Extract.

Document Viewer: The Grooper Document Viewer is the portal to your documents. It is the UI that allows you to see a Batch Folder's (or a Batch Page's) image, text content, and more.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Fill Method: Fill Methods provide various mechanisms for populating child Data Elements of a Data Model, Data Section, or Data Table. Fill Methods can be added to these nodes using their "Fill Methods" property and editor.

  • Fill Methods are secondary extraction operations. They populate descendant Data Elements after normal extraction when the Extract activity runs.

LLM Connector: LLM Connector is a Repository Option that enables large language model (LLM) powered AI features for a Grooper Repository.

Parameters: Parameters is a collection of properties used in the configuration of LLM constructs. Temperature, TopP, Presence Penalty, and Frequency Penalty are parameters that influence text generation in models. Temperature and TopP control the diversity and probability distribution of generated text, while Presence Penalty and Frequency Penalty help manage repetition by discouraging the reuse of words or phrases.
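To see why a low Temperature makes output more repeatable, here is a short Python sketch of temperature-scaled sampling. This is the standard formulation used across language models generally, not anything specific to Grooper or any one model: logits are divided by the temperature before the softmax, so low temperatures concentrate probability on the most likely token.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax into probabilities."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
spread = softmax_with_temperature(logits, 1.0)  # probability mass spread out
peaked = softmax_with_temperature(logits, 0.2)  # mass piles onto the top token
```

At a temperature near 0 the distribution approaches a deterministic pick of the highest-probability token, which is why lowering Temperature reduces "creativity" in the responses.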

Preprocessing: Preprocessing controls the text supplied to the AI by adding or removing control characters.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects, such as Content Models, Batch Processes, and profile objects, are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Root: The Grooper Root node is the topmost element of the Grooper Repository. All other nodes in a Grooper Repository are its children/descendants. The Grooper Root also stores several settings that apply to the Grooper Repository, including the license serial number or license service URL and Repository Options.