Fine-Tuning for AI Extract

Fine-tuning is the process of further training a large language model (LLM) on a specific dataset to make it more specialized for a particular task or domain. This allows the model to adapt its general language understanding to better handle the unique vocabulary, style, and structure of the domain it's fine-tuned on.


In Grooper, you can easily start fine-tuning a model based on a Data Model that will facilitate better extraction when using AI Extract.

You may download the ZIPs below and upload them into your own Grooper environment (version 2024). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

Info on fine-tuning from OpenAI

This is an excerpt from OpenAI's "model optimization" documentation.

OpenAI models are already pre-trained to perform across a broad range of subjects and tasks. Fine-tuning lets you take an OpenAI base model, provide the kinds of inputs and outputs you expect in your application, and get a model that excels in the tasks you'll use it for.

Fine-tuning can be a time-consuming process, but it can also enable a model to consistently format responses in a certain way or handle novel inputs. You can use fine-tuning with prompt engineering to realize a few more benefits over prompting alone:

  • You can provide more example inputs and outputs than could fit within the context window of a single request, enabling the model to handle a wider variety of prompts.
  • You can use shorter prompts with fewer examples and context data, which saves on token costs at scale and can reduce latency.
  • You can train on proprietary or sensitive data without having to include it via examples in every request.
  • You can train a smaller, cheaper, faster model to excel at a particular task where a larger model is not cost-effective.

How Fine-Tuning Works

  • Base Model: The process starts with a base LLM (large language model). This is a general-purpose model that has already been pre-trained on a vast corpus of text data (such as gpt-3.5-turbo or gpt-4o-mini).
  • Idealized Examples: You provide a set of example prompts with ideal responses for your use case. These idealized examples better inform the base model how to respond.
    • In Grooper's case, this will be a "fine-tuning file" generated from a set of documents with idealized extracted data. These documents should have their extracted data reviewed and manually corrected before generating the fine-tuning file. (A sample training example is sketched after this list.)
  • Training: The base model is then trained on this set of examples. Training adjusts the base model's internal parameters so it generates responses closer to the idealized examples.
  • Evaluation: After fine-tuning, users should evaluate the model to ensure it performs well on the target task.
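
For illustration, here is roughly what one training example might look like in OpenAI's chat fine-tuning format. The field values are hypothetical (a minimal invoice-style sketch, not the exact prompts Grooper generates), and the example is pretty-printed here for readability; in the actual JSONL file, each example occupies a single line:

```jsonl
{"messages": [
  {"role": "system", "content": "Extract the requested fields from the document text and return them as JSON."},
  {"role": "user", "content": "INVOICE #4471 ... Bill To: Acme Corp ... Total Due: $1,250.00"},
  {"role": "assistant", "content": "{\"InvoiceNumber\": \"4471\", \"BillTo\": \"Acme Corp\", \"TotalDue\": \"1250.00\"}"}
]}
```

The assistant message holds the idealized output. This is why reviewing and correcting extracted data before building the file matters: the model is trained to reproduce exactly what these messages contain.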

Benefits

General benefits to fine-tuning

  • Improved Performance: Fine-tuning makes the model more accurate and efficient for the specific tasks it was trained for, such as answering questions about a product, generating legal documents, or responding in a customer support chat.
  • Faster Development: By using a pre-trained model as a base, fine-tuning significantly reduces the time and computational resources needed compared to training a model from scratch.
  • Customization: It allows developers to incorporate their proprietary data, which the general model might not have seen before, into the model's understanding.


Benefits to fine-tuning for AI Extract

AI Extract is Grooper's large-scale LLM-based data extraction tool. After fine-tuning a model, users can select that fine-tuned model when choosing the LLM AI Extract uses. This can have a significant impact on improving the accuracy and efficiency of data extraction processes. Here's how fine-tuning can improve AI Extract's capabilities:

  • Customization according to target data: By fine-tuning models with specific data types, patterns, and documents that are regularly processed by your organization, Grooper can better learn and understand how to extract data in those contexts. This allows it to adapt its extraction techniques to more accurately capture the fields and formats specific to your use cases, reducing the need for manual configuration or rule-setting.
  • Handling variations: Fine-tuning can help Grooper handle more complex and varied document formats. For example, if your Data Models involve extracting information from forms or tables with slight variations, a fine-tuned model would be better at recognizing those differences and still extracting the correct data fields. It essentially allows Grooper to generalize and adapt to more variations in the data without losing accuracy.
  • Increased efficiency with domain-specific knowledge: If Grooper is extracting data for a specialized domain (such as legal, financial, or medical documents), fine-tuning the model on domain-specific language can significantly improve Grooper's ability to understand and extract complex or technical information accurately. This leads to fewer false positives and less manual correction.
  • Reduction of errors and manual overrides: Fine-tuning Grooper's models based on historical data can reduce the number of errors during extraction, as the model becomes more familiar with the nuances of your data structure. Over time, this leads to a decrease in manual validation or correction, streamlining the process and saving effort.
  • Continuous improvement: As your data evolves, Grooper can remain up-to-date by continuously fine-tuning on new document types or formats. This helps it stay accurate even as the characteristics of your data change, which is especially important for businesses dealing with dynamic, ever-changing sources of information.

General steps to creating a fine-tuned model from Grooper

The "Build Fine Tuning File" command creates a JSONL fine-tuning file in the Local Resources Folder. From here, execute the "Start Fine Tuning Job" to create a fine-tuned model.

Grooper fine-tunes models by starting a "fine-tuning job" using the OpenAI API (or any OpenAI-compatible API). There are five basic steps (a sketch of the equivalent API calls follows this list):

  1. Review documents for idealized results.
    After running one or more Batches through Extract, you should manually correct each document's data using a Review step's Data Viewer. This ensures the data used to fine-tune a model is accurate. These should be your "gold standard" documents.
  2. Execute the "Build Fine Tuning File" command.
    This is found on any Data Container (Data Model, Data Section, Data Table) configured with AI Extract. You will select one or more Batches with previously reviewed documents to build the fine-tuning file. When finished, the command will create a JSONL file you will use to start the fine-tuning job.
    • The Batch must have at least ten (10) documents. The fine-tuning job will fail if fewer than 10 documents are used to build the file.
    • Grooper creates the JSONL file in the parent Content Type's Local Resources Folder. You will need to add the Local Resources Folder to the Content Type before executing the "Build Fine Tuning File" command if it does not already exist.
  3. Select the fine-tuning file and execute the "Start Fine Tuning Job" command.
    The "Build Fine Tuning File" command will place the fine-tuning file in the parent Content Type's folder_data Local Resources Folder. Right click it to execute the "Start Fine Tuning Job" command.
  4. Wait for the fine-tuning job to complete.
    This may take a while. There is no way to monitor the fine-tuning job's progress directly from Grooper. However, you can monitor an OpenAI fine-tuning job from the OpenAI API Platform's Fine-tuning page.
  5. Select the fine-tuned model from Grooper.
    Once the fine-tuned model is successfully created, you can now select it when configuring AI Extract!
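
Grooper performs the upload and job submission for you. For reference, here is a minimal sketch of the equivalent raw calls using OpenAI's official Python SDK (the file name and suffix below are hypothetical):

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

# 1. Upload the JSONL fine-tuning file that Grooper built.
training_file = client.files.create(
    file=open("InvoiceModel-FineTuning.jsonl", "rb"),  # hypothetical file name
    purpose="fine-tune",
)

# 2. Start a fine-tuning job against a base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",  # a base model that supports fine-tuning
    suffix="invoice-model",          # corresponds to Grooper's "Name Suffix"
)
print(job.id, job.status)
```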


How To

Create a fine-tuning file and start a fine-tuning job

Prerequisites

  1. Ensure you have an LLM Connector added to your Grooper Repository.
  2. Get one or more Batches ready for fine-tuning.
    • These should be Batches with ideal data. Any errors in their extracted Data Model should be manually reviewed and corrected.
    • You must have a minimum of 10 examples to fine-tune a model. Ensure the Batch has at least 10 documents.
  3. Ensure you have a Local Resources Folder added to the Data Model's (or Data Section's) parent Content Type.
    • The fine-tuning file is created as a Resource File that lives in this Local Resources Folder. If the folder does not exist, the "Build Fine Tuning File" command will fail.

Build the fine-tuning file

  1. Right-click the Data Model (or Data Section) where AI Extract is configured.
  2. Select "Fine Tuning > Build Fine Tuning File"
  3. (Only if necessary) Select the "Fill Method Name".
    • In most cases, you will not need to configure this property. It will be auto-populated with the name of the selected Data Element's Fill Method ("AI Extract" unless you changed it).
    • In rarer cases where multiple Fill Methods are added to a Data Model/Data Section, this dropdown will allow you to choose which one you want to use for fine-tuning.
  4. Select one or more Batches for fine-tuning.
  5. Press the "Execute" button.
  6. A Resource File will be created for the fine-tuning file in the Data Model's (or Data Section's) parent Content Type's Local Resources Folder.
    • The fine-tuning file is a JSONL file. Each line in the JSONL file is a separate JSON object, one for each document in the Batch (or Batches). Each JSON object is an example conversation that will be used to better inform a base model how to respond, and thus improve AI Extract's response.
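
Before starting the job, you can sanity-check the generated file outside Grooper. Below is a minimal sketch (the file name is hypothetical) that verifies every line parses as a JSON object with a messages array and that the file meets the OpenAI limits discussed later in this article:

```python
import json
import os

PATH = "InvoiceModel-FineTuning.jsonl"  # hypothetical file name

examples = 0
with open(PATH, encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)  # raises an error if the line is not valid JSON
        assert "messages" in record, f"line {line_no}: missing 'messages' array"
        examples += 1

size_mb = os.path.getsize(PATH) / (1024 ** 2)
print(f"{examples} examples, {size_mb:.1f} MB")
assert examples >= 10, "OpenAI requires at least 10 training examples"
assert size_mb <= 512, "OpenAI rejects fine-tuning files larger than 512 MB"
```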

Start the fine-tuning job

  1. Locate the fine-tuning file. It is a Resource File in the Data Model's (or Data Section's) parent Content Type's Local Resources Folder.
  2. Right-click the fine-tuning file.
  3. Select "Fine Tuning > Start Fine Tuning Job".
  4. Open the "Base Model" editor and select a base LLM model from the list.
  5. Enter a "Name Suffix" of your choosing. This will be appended to the fine-tuned model's name.
  6. Press "Execute".
  7. This will submit the fine-tuning file to the OpenAI API (or an OpenAI compatible API) to start a fine-tuning job.
    • It may take a while for the fine-tuning job to complete. There is no way to monitor the fine-tuning job's progress directly from Grooper. However, you can monitor an OpenAI fine-tuning job from the OpenAI API Platform's Fine-tuning page.
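
If you prefer to check status programmatically rather than through the web page, the OpenAI Python SDK exposes the same information. A minimal sketch (the job ID is hypothetical):

```python
from openai import OpenAI

client = OpenAI()

# List recent fine-tuning jobs to find yours.
for job in client.fine_tuning.jobs.list(limit=5):
    print(job.id, job.model, job.status)

# Check a specific job; poll until it reports "succeeded" or "failed".
job = client.fine_tuning.jobs.retrieve("ftjob-abc123")  # hypothetical job ID
print(job.status)            # e.g. "running", "succeeded", "failed"
print(job.fine_tuned_model)  # populated once the job succeeds
```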

Delete a fine-tuned model

You can delete any fine-tuned models directly from Grooper using the "Delete Fine Tuned Model" command.

  1. Right-click any fine-tuning file in Grooper.
  2. Select "Fine Tuning > Delete Fine Tuned Model".
  3. Open the "Model" editor (Press the "..." button).
  4. Select the fine-tuned model you want to delete.
    • Fine-tuned models created in OpenAI follow this naming convention:
      ft:{modelName}:{organizationName}:{nameSuffix}:{autoGeneratedId}
  5. Press "OK".
  6. Press "Execute" to delete the fine-tuned model.

OpenAI API limitations

OpenAI has a few limitations when it comes to submitting and completing a fine-tuning job. When starting a fine-tuning job from Grooper, it is best to be aware of these ahead of time.

You must have at least 10 training examples

The fine-tuning file must have at least 10 examples (10 lines in the JSONL file).

  • The fine-tuning job will fail if you have fewer than 10 examples.
  • For Grooper, an "example" is effectively a "data reviewed" document (a Batch Folder with extracted data that has been manually corrected using a Data Viewer).
  • This means the Batch you use for fine-tuning must have at least 10 documents in it.

Fine-tuning file size limits

At the time this article was written, the JSONL fine-tuning file cannot exceed 512 MB.

  • The fine-tuning job will fail if the fine-tuning file exceeds this size limit.
  • If your training dataset exceeds 512 MB, you can split it up into multiple files (see the sketch after this list). Effectively you will fine-tune in passes, fine-tuning a fine-tuned model with each subsequent JSONL file.
  • Be aware, the total number of tokens across all examples will still affect training cost and duration.
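
Splitting can be done with a few lines of Python because the JSONL format is line-oriented. A minimal sketch (file names are hypothetical; the 500 MB target leaves headroom under the 512 MB limit) that never breaks an example across files:

```python
MAX_BYTES = 500 * 1024 ** 2  # stay safely under OpenAI's 512 MB limit

part, written, out = 1, 0, None
with open("fine-tuning.jsonl", "rb") as src:  # hypothetical input file
    for line in src:
        # Start a new part file whenever the next example would overflow it.
        if out is None or written + len(line) > MAX_BYTES:
            if out:
                out.close()
            out = open(f"fine-tuning-part{part}.jsonl", "wb")
            part, written = part + 1, 0
        out.write(line)
        written += len(line)
if out:
    out.close()
# Note: each resulting part must still contain at least 10 examples.
```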

Token limits

Each base model has token limits that define how much content can be included in each training example. Each training example must fit within the model's "context window". This includes both the prompt (any system messages and the user message) and the expected completion (the assistant message).

  • 1 token ≈ 4 characters of English text, as a rough rule of thumb.
  • Each model has its own max tokens per example (with newer models typically having larger max token limits). You can estimate an example's token count as sketched after this list.
  • If an example's token count exceeds a base model's context window, it will be truncated to fit within it. This may result in poor tuning.
    • If all examples are above the context window limit, the fine-tuning job will fail.
    • There have also been cases observed where a single example exceeding the context window limit causes the fine-tuning job to fail. Most likely, the example was truncated in such a way that it produced an improperly formatted example.
  • There is no overall token limit for the entire training file. As long as each example falls below the model's fine-tuning token limit, and the fine-tuning file is under 512 MB, OpenAI will start the fine-tuning job.
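
To estimate token counts before submitting a job, you can tokenize each example with OpenAI's tiktoken library. A minimal sketch (the encoding name, file name, and per-example limit are assumptions; exact accounting also adds a small per-message overhead that this sketch ignores):

```python
import json

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by recent GPT models

with open("fine-tuning.jsonl", encoding="utf-8") as f:  # hypothetical file name
    for line_no, line in enumerate(f, start=1):
        example = json.loads(line)
        # Sum the tokens of every message in the example conversation.
        tokens = sum(len(enc.encode(m["content"])) for m in example["messages"])
        if tokens > 65536:  # substitute your base model's per-example limit
            print(f"example {line_no}: {tokens} tokens; may be truncated")
```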