Fine-Tuning for AI Extract

From Grooper Wiki
Revision as of 10:42, 24 July 2025 by Dgreenwood (talk | contribs)

This article is about the current version of Grooper.



Fine-tuning is the process of further training a large language model (LLM) on a specific dataset to make it more specialized for a particular task or domain. This allows the model to adapt its general language understanding to better handle the unique vocabulary, style, and structure of the domain it's fine-tuned on.


In Grooper, you can easily start fine-tuning a model based on a Data Model that will facilitate better extraction when using AI Extract.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2024). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About


Info on fine-tuning from OpenAI

This is an excerpt from OpenAI's "model optimization" documentation.

OpenAI models are already pre-trained to perform across a broad range of subjects and tasks. Fine-tuning lets you take an OpenAI base model, provide the kinds of inputs and outputs you expect in your application, and get a model that excels in the tasks you'll use it for.

Fine-tuning can be a time-consuming process, but it can also enable a model to consistently format responses in a certain way or handle novel inputs. You can use fine-tuning with prompt engineering to realize a few more benefits over prompting alone:

  • You can provide more example inputs and outputs than could fit within the context window of a single request, enabling the model to handle a wider variety of prompts.

  • You can use shorter prompts with fewer examples and context data, which saves on token costs at scale and can be lower latency.
  • You can train on proprietary or sensitive data without having to include it via examples in every request.
  • You can train a smaller, cheaper, faster model to excel at a particular task where a larger model is not cost-effective.

How Fine-Tuning Works

  • Base Model: The process starts with a large, general-purpose language model that has already been pre-trained on a vast corpus of text data.
  • Custom Dataset: You provide a dataset specific to your application. This dataset should align with the tasks you want the model to perform.
    • In Grooper's case, this will be a "fine-tuning file" generated from a set of documents with idealized extracted data. These documents should have their extracted data reviewed and manually corrected before generating the fine-tuning file.
  • Training: The base model is further trained on this dataset. This will modify the base model's internal parameters so it can better generate content related to the dataset.
  • Evaluation: After fine-tuning, users should evaluate the model to ensure it performs well on the target task.
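
The custom dataset step above can be made concrete. OpenAI's chat fine-tuning format stores each reviewed document as one JSON line pairing the prompt with the idealized response. The sketch below is illustrative only: the field names, document text, and prompt wording are hypothetical, not Grooper's actual prompt format.

```python
import json

# One training example in OpenAI's chat fine-tuning format: the extraction
# prompt plus the idealized (human-reviewed) answer. Field names and
# document text here are hypothetical.
example = {
    "messages": [
        {"role": "system",
         "content": "Extract Invoice Number and Invoice Total as JSON."},
        {"role": "user",
         "content": "INVOICE #4821 ... Amount Due: $1,250.00"},
        {"role": "assistant",
         "content": json.dumps({"Invoice Number": "4821",
                                "Invoice Total": "1250.00"})},
    ]
}

# A fine-tuning file is JSON Lines (JSONL): one example object per line.
jsonl_line = json.dumps(example)
print(jsonl_line)
```

A real fine-tuning file contains one such line per reviewed document; the more varied and accurate the examples, the better the fine-tuned model generalizes.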

Benefits

General benefits to fine-tuning

  • Improved Performance: Fine-tuning makes the model more accurate and efficient for the specific tasks it was trained for, such as answering questions about a product, generating legal documents, or responding in a customer support chat.
  • Faster Development: By using a pre-trained model as a base, fine-tuning significantly reduces the time and computational resources needed compared to training a model from scratch.
  • Customization: It allows developers to incorporate their proprietary data, which the general model might not have seen before, into the model's understanding.

Benefits to fine-tuning for AI Extract

AI Extract is Grooper's large-scale LLM-based data extraction tool. After fine-tuning a model, users can select that fine-tuned model when choosing the LLM AI Extract uses. This can have a significant impact on improving the accuracy and efficiency of data extraction processes. Here's how fine-tuning can improve AI Extract's capabilities:

  • Customization according to target data: By fine-tuning models with specific data types, patterns, and documents that are regularly processed by your organization, Grooper can better learn and understand how to extract data in those contexts. This would allow it to adapt its extraction techniques to more accurately capture the fields and formats specific to your use cases, reducing the need for manual configuration or rule-setting.
  • Handling variations: Fine-tuning can help Grooper handle more complex and varied document formats. For example, if your Data Models involve extracting information from forms or tables with slight variations, a fine-tuned model would be better at recognizing those differences and still extracting the correct data fields. It essentially allows Grooper to generalize and adapt to more variations in the data without losing accuracy.
  • Increased efficiency with domain-specific knowledge: If Grooper is extracting data for a specialized domain (such as legal, financial, or medical documents), fine-tuning the model on domain-specific language can significantly improve Grooper's ability to understand and extract complex or technical information accurately. This leads to fewer false positives and less manual correction.
  • Reduction of errors and manual overrides: Fine-tuning Grooper's models based on historical data can reduce the number of errors during extraction, as the model becomes more familiar with the nuances of your data structure. Over time, this leads to a decrease in manual validation or correction, streamlining the process and saving effort.
  • Continuous improvement: As your data evolves, Grooper can remain up-to-date by continuously fine-tuning on new document types or formats. This helps it stay accurate even as the characteristics of your data change, which is especially important for businesses dealing with dynamic, ever-changing sources of information.

General steps to creating a fine-tuned model from Grooper

Grooper fine-tunes models by starting a "fine-tuning job" using the OpenAI API (or any API that follows their standard). There are five basic steps:

  1. Review documents for idealized results.
    After running one or more Batches through Extract, you should manually correct each document's data using a Review step's Data Viewer. This ensures the data used to fine-tune a model is accurate. These should be your "gold standard" documents.
  2. Execute the "Build Fine Tuning File" command.
    This is found on any Data Container (Data Model, Data Section, Data Table) configured with AI Extract. You will select one or more Batches with previously reviewed documents to build the fine-tuning file. When finished, the command will create a JSONL file you will use to start the fine-tuning job.
    • The Batch must have at least ten (10) documents. The fine-tuning job will fail if fewer than 10 documents are used to build the file.
    • Grooper creates the JSONL file in the parent Content Type's Local Resources Folder. You will need to add the Local Resources Folder to the Content Type before executing the "Build Fine Tuning File" command if it does not already exist.
  3. Select the fine-tuning file and execute the "Start Fine Tuning Job" command.
    The "Build Fine Tuning File" command will place the fine-tuning file in the parent Content Type's folder_data Local Resources Folder. Right click it to execute the "Start Fine Tuning Job" command.
  4. Wait for the fine-tuning job to complete.
    This may take a while. There is no way to monitor the fine-tuning job's progress directly from Grooper. However, you can monitor an OpenAI fine-tuning job from the OpenAI API Platform's Fine-tuning page.
  5. Select the fine-tuned model from Grooper.
    Once the fine-tuned model is successfully created, you can now select it when configuring AI Extract!
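
Steps 3 through 5 can be sketched with the OpenAI Python SDK to show what happens behind Grooper's commands. This is illustrative only: Grooper issues these API calls for you, and the base model name and file name below are assumptions. The polling helper is written so the API calls are injected, since a real job can take minutes to hours.

```python
import time

# Terminal states reported by the OpenAI fine-tuning API.
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch_status, poll_seconds=60, sleep=time.sleep):
    """Poll fetch_status() until the fine-tuning job reaches a terminal
    state, then return that final status string."""
    while True:
        status = fetch_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)

# Hypothetical usage with the official SDK (requires OPENAI_API_KEY):
#
#   from openai import OpenAI
#   client = OpenAI()
#   upload = client.files.create(file=open("fine_tuning.jsonl", "rb"),
#                                purpose="fine-tune")
#   job = client.fine_tuning.jobs.create(training_file=upload.id,
#                                        model="gpt-4o-mini-2024-07-18")
#   final = wait_for_job(
#       lambda: client.fine_tuning.jobs.retrieve(job.id).status)
#   # On success, client.fine_tuning.jobs.retrieve(job.id).fine_tuned_model
#   # holds the model name you then select when configuring AI Extract.
```

As the article notes, Grooper does not expose job progress directly; the OpenAI Platform's Fine-tuning page (or `jobs.retrieve` as above) is where the status lives until the job reaches a terminal state.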