AI Extract (Fill Method)


This article was migrated from an older version and has not been updated for the current version of Grooper.

This tag will be removed upon article review and update.

This article is about the current version of Grooper.

Note that some content may still need to be updated.


AI Extract is a Fill Method that leverages a Large Language Model (LLM) to return extraction results to Data Elements in a Data Model or Data Section. This mechanism provides powerful AI-based data extraction with minimal setup.

NOTE: There is no known use for configuring AI Extract on a Data Table at this time. However, if the Data Table is a child of a Data Section or Data Model, AI Extract will still attempt to populate it.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2024). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

The goal of AI Extract is to extract data simply by describing it. It relies on the "self-descriptive" nature of the "container" Data Element (Data Model, Data Section, or Data Table) it is configured on. This Fill Method presents an AI chatbot with all or part of the document content and asks it to generate JSON data matching the structure of the descendant Data Elements. The JSON data is then used to populate descendant Data Sections, Data Tables, and Data Fields.
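As a rough illustration of what "JSON data matching the structure of descendant Data Elements" could look like, the Python sketch below builds an empty JSON skeleton from a nested description of a Data Model. The element names ("invoiceNumber", "Vendor", and so on) and the dictionary layout are assumptions made for this example. It does not show the actual request or response format Grooper uses.

```python
import json

# Hypothetical Data Model layout: a Data Field maps to None, a Data Section
# maps to a nested dict, and a Data Table maps to a list with one row template.
data_model = {
    "invoiceNumber": None,
    "custPo": None,
    "Vendor": {"name": None, "address": None},  # Data Section
    "LineItems": [{"description": None, "quantity": None, "amount": None}],  # Data Table
}

def skeleton(element):
    """Return an empty, JSON-ready skeleton mirroring the element structure."""
    if isinstance(element, dict):
        return {name: skeleton(child) for name, child in element.items()}
    if isinstance(element, list):
        return [skeleton(element[0])]
    return ""  # a blank value for the LLM to fill in

empty_json = json.dumps(skeleton(data_model), indent=2)
print(empty_json)
```

Because the skeleton is derived from element names alone, well-chosen names carry most of the "description" of what should be extracted.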

More Data. Less Work.

The Big Idea

Use an LLM to extract data by simply describing the data to be extracted.

  • Let the Data Model be self-descriptive.
  • The AI should “know” what to extract based on how Data Elements in the Data Model are named.
  • Allows Grooper to extract a Data Model with a fraction of the setup.
  • No need for hand-crafted extractors.


AI Extract: The First Fill Method

AI Extract is the first Fill Method in Grooper.

  • Fill Methods can be configured on “container” Data Elements
    • Data Model, Data Section, Data Table
    • Be aware: AI Extract currently has no known use on a Data Table.
  • Fill Methods are secondary extraction operations which populate child Data Elements. They run after extraction.
  • Ex: A Fill Method is set on a Data Model.
    • First, all its child Data Elements run their configured Value Extractors and extract methods.
    • Then, the Fill Method runs. It can be configured to overwrite extraction results or only fill blank fields.
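The difference between overwriting extraction results and filling only blank fields can be sketched as a simple merge. This is an illustrative Python model, not Grooper code: the function name, the `overwrite` flag, and the sample values are all hypothetical.

```python
def apply_fill(extracted, filled, overwrite=False):
    """Merge Fill Method results into values already produced by extraction.

    `overwrite` is a hypothetical flag standing in for the choice between
    overwriting results and filling blanks; it is not a Grooper property name.
    """
    merged = dict(extracted)
    for name, value in filled.items():
        is_blank = merged.get(name) in (None, "")
        if overwrite or is_blank:
            merged[name] = value
    return merged

# Values from the regular extractors (run first), then from the Fill Method.
extracted = {"invoiceNumber": "INV-1001", "custPo": "", "total": None}
filled = {"invoiceNumber": "INV-1O01", "custPo": "PO-77", "total": "125.00"}

print(apply_fill(extracted, filled))                  # blanks filled only
print(apply_fill(extracted, filled, overwrite=True))  # everything replaced
```

In the fill-blanks case the extractor's "INV-1001" survives. With overwriting enabled, the Fill Method's value replaces it, including any mistakes the LLM made.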


Make LLMs Do The Work

  • Minimal configuration is required to start getting data back.
  • In many cases, well-named Data Elements are all that is required.
  • If necessary, Descriptions can be added to Data Elements to give special instructions.
  • How it works:
    • Prompts the LLM with some document content.
    • Asks the LLM to generate JSON matching the Data Element structure.
    • Parses the resulting JSON data into sections, tables, and fields.
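The three "How it works" steps can be sketched end to end in Python. This is a minimal sketch under stated assumptions: `fake_llm` is a stand-in that returns a canned reply instead of calling a real model, and the prompt wording, function names, and element names are invented for illustration rather than taken from Grooper.

```python
import json

def build_prompt(document_text, skeleton):
    """Combine document content with the JSON structure the LLM should fill."""
    return (
        "Extract the following data from the document and reply with JSON only.\n"
        f"JSON structure:\n{json.dumps(skeleton)}\n"
        f"Document:\n{document_text}"
    )

def fake_llm(prompt):
    """Stand-in for a real chat completion call; returns a canned reply."""
    return ('{"invoiceNumber": "INV-1001", '
            '"LineItems": [{"description": "Valve", "amount": "125.00"}]}')

def parse_response(reply, skeleton):
    """Parse the reply, keeping only requested keys, split into fields and tables."""
    data = json.loads(reply)
    fields = {k: v for k, v in data.items()
              if k in skeleton and not isinstance(v, list)}
    tables = {k: v for k, v in data.items()
              if k in skeleton and isinstance(v, list)}
    return fields, tables

skeleton = {"invoiceNumber": "", "LineItems": [{"description": "", "amount": ""}]}
prompt = build_prompt("Invoice No: INV-1001 ... Valve 125.00", skeleton)
fields, tables = parse_response(fake_llm(prompt), skeleton)
print(fields)
print(tables)
```

Filtering the reply against the requested structure is one way to guard against an LLM returning keys that were never asked for.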


AI Extract Pros and Cons

Understanding the benefits and drawbacks of AI Extract can help you determine if using it is right for your needs.

Pros

  • Instant time to value
  • Ease of use is off the charts. Just define the Data Model and go.
  • Fills in a full Data Model with fewer API calls than Ask AI
  • Has some result highlighting (defined by “Alignment” settings)

Cons

  • LLM responses can be unpredictable
  • LLM responses can be inaccurate
  • Easy to become over-confident in the results. Close inspection almost always turns up errors
  • Larger documents pose more of a challenge to fully extract
  • Slower than traditional Grooper extractors (particularly for larger documents)
  • Result highlighting is not perfect

How To

The following walkthrough uses an invoice from the supplied materials. Invoices are easy to understand, and the setup is easy too.

HOWEVER, PLEASE BE AWARE:

  1. LLM chatbots are typically best suited to documents with natural language flow, such as legal contracts. The invoices used in this example are structured documents rather than natural language documents. Even so, when a structured document uses natural language to describe information (such as a label next to a value), Grooper can get usable data from LLM-based extraction methods like AI Extract. This will not hold for all structured and semi-structured documents.
  2. LLM chatbots do not always return reliable information because they do not inherently "understand" it. They simply predict the most likely response, one word after another. This can lead to inaccurate responses. LLM-based extraction methods typically provide quick results with little setup but may not be 100% accurate.
  3. Relying on extracted data typically requires human review to verify data integrity. An important part of reviewing data in Grooper is the highlighting of returned results in the Document Viewer. Highlighting data from a chatbot is difficult at best, because the chatbot returns no character coordinate information with its data. Some AI Extract properties can aid highlighting, but the result isn't perfect. At worst, it can be impossible to highlight chatbot data at all.

Establish an LLM Connector

First, we need to establish an LLM Connector within the Options property on the Root object.

Please visit the LLM Connector article for more information.

Configure AI Extract Fill Method

With an LLM Connector established we can now configure our Data Model with the AI Extract Fill Method.

  1. Select the Data Model from the provided Project.
  2. Click the ellipsis button on the Fill Methods property.
  3. Click the "Add" button in the "Fill Methods" window.
  4. Choose "AI Extract" from the drop-down menu.


  1. Click the ellipsis button on the Model property of the newly added AI Extract Fill Method.
  2. Select "gpt-4o" in the "Model" window. At the time of writing, it was the most accurate model available and the recommended choice.
    • Feel free to experiment with the other models to test results.


  1. In the Parameters property group, lower the Temperature property to 0.2. This can help the AI be less "creative" with its responses.
    • You may want to go as low as 0 to reduce "creativity" as far as the model allows.
    • Please see the Parameters article for more information.
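To see why a lower Temperature makes responses less "creative", it helps to look at how temperature conventionally rescales a model's token scores before sampling. The Python sketch below uses made-up scores for three candidate tokens. Treating a temperature of 0 as "always pick the top token" is an assumption here, since providers handle that case in their own ways.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Turn raw token scores into probabilities at a given temperature."""
    if temperature <= 0:
        # Assumption: treat 0 as "always pick the top-scoring token".
        top = max(scores)
        picks = [1.0 if s == top else 0.0 for s in scores]
        return [p / sum(picks) for p in picks]
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (1.0, 0.2, 0.0):
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature drops, probability concentrates on the highest-scoring token, so the model is less likely to pick an unusual continuation.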


  1. Click the ellipsis button on the Instructions property.
  2. Write a prompt in the "Instructions" window.
    • Prompts written here should be considered as "global" prompts for the entire model, not specific to individual Data Elements.


  1. If you want to choose specific elements to extract you can do so. Click the ellipsis button on the Included Elements property.
  2. In the "Included Elements" window you can choose specific elements.
    • Leaving this property at its default (blank) includes all Data Elements. Choosing specific Data Elements limits the prompt given to the AI to those selected.
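The behavior of Included Elements can be modeled as a simple filter: blank means "everything", otherwise only the selected elements are described to the AI. The Python function below is a hypothetical illustration of that rule, and all element names other than "custPo" and "LineItems" are invented.

```python
def select_elements(all_elements, included=None):
    """Return the element names that would be described in the prompt.

    An empty or missing `included` list means every element is considered.
    """
    if not included:
        return list(all_elements)
    wanted = set(included)
    # Keep the Data Model's own ordering rather than the selection order.
    return [name for name in all_elements if name in wanted]

all_elements = ["invoiceNumber", "custPo", "total", "LineItems"]

print(select_elements(all_elements))                         # all four elements
print(select_elements(all_elements, ["LineItems", "custPo"]))  # only the two selected
```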


  1. Document Quoting controls the text fed to the AI.
  2. Preprocessing controls the text supplied to the AI by adding or removing control characters.
  3. ... Alignment controls the highlighting of results in the Data Model Preview.
    • Please see the Alignment article for more information.
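Because an LLM returns values without character coordinates, highlighting requires finding each returned value in the document text after the fact. The Python sketch below shows one simple way this could be done, matching case-insensitively and tolerating line breaks. It is not Grooper's Alignment algorithm, and the function name and sample text are assumptions for illustration.

```python
import re

def locate(value, document_text):
    """Return (start, end) character offsets of `value` in the text, or None.

    Matching ignores case and treats any run of whitespace as equivalent,
    which tolerates line breaks introduced by OCR or page layout.
    """
    tokens = value.split()
    if not tokens:
        return None
    pattern = r"\s+".join(re.escape(tok) for tok in tokens)
    match = re.search(pattern, document_text, flags=re.IGNORECASE)
    return match.span() if match else None

text = "Invoice No: INV-1001\nBill To: Acme\nDrilling Co.\nTotal Due: $125.00"

print(locate("Acme Drilling Co.", text))      # found despite the line break
print(locate("125.00", text))                 # found as an exact substring
print(locate("Acme Drilling Company", text))  # None: the LLM reworded the value
```

The last call shows why highlighting can fail outright: if the LLM normalizes or rewords a value, there may be no matching text on the page to highlight.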

Adjustments and Results

Let's now look at how the Description property on Data Elements can serve as a prompt to improve results, and then review the final extraction output.

  1. Select the "LineItems" Data Table.
  2. Click the ellipsis button on the Description property.
  3. Add instructions in the "Description" window.
    • This is the table for the line items ordered on the invoice.


  1. Select the "custPo" Data Field.
  2. Click the ellipsis button on the Description property.
  3. Add instructions in the "Description" window.
    • If the "Customer P.O." has a result of "SEE BELOW", use the "AFE #" instead.
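One way to picture how Descriptions act as per-element prompts is to render each Data Element as a line of instructions, appending the Description when one exists. The Python sketch below is an assumption about the general idea, not the prompt Grooper actually sends. "custPo" and "LineItems" come from the example Project, while "invoiceNumber" is a hypothetical field with no Description.

```python
def describe_elements(elements):
    """Render element names and optional Descriptions as prompt lines."""
    lines = []
    for name, description in elements.items():
        if description:
            lines.append(f"- {name}: {description}")
        else:
            lines.append(f"- {name}")  # the name alone must be self-descriptive
    return "\n".join(lines)

elements = {
    "invoiceNumber": "",  # hypothetical field with no Description
    "custPo": 'If the "Customer P.O." has a result of "SEE BELOW", use the "AFE #" instead.',
    "LineItems": "This is the table for the line items ordered on the invoice.",
}

print(describe_elements(elements))
```

Elements without a Description rely entirely on their names, which is why well-named Data Elements matter.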

  1. Click on the Data Model.
  2. Click on the "Tester" tab.
  3. Click the "Test" button.
  4. Notice the successfully extracted results in the Data Model Preview.