Grooper and On-Prem AI

You’ve heard about AI’s ability to make work easier. AI can get data from your documents that you couldn’t get before. It saves you time, letting you extract information you simply didn’t have time to capture before. And it means you don’t need a super genius in charge of your system.

Is it true? Absolutely. And Grooper can help you do it.

But you’re also worried about your data leaving your network. Information security, privacy, and fair use concerns have led many organizations to distrust or outright dismiss LLM services, such as those offered by OpenAI and Microsoft.

Does that mean you can’t use AI without data leaving your network? Nope. You can deploy an LLM on-prem. And Grooper can connect to it.

In this article, we will:

  • Give some background on LLM services and how data is transmitted to those services
  • Differentiate between cloud hosted LLM services and LLM services hosted on-premises
  • Share our results testing Grooper using an on-prem LLM service.

Background

Grooper's AI-enabled features have revolutionized intelligent document processing. It’s never been easier to extract data from both structured and unstructured documents. These features rely on a type of artificial intelligence model called a large language model (LLM). These models have become so advanced that they do most of the heavy lifting for you. In many cases, the complicated extraction logic a Grooper designer would have to “hard code” into a Data Model is totally replaced by the LLM. The hours it used to take to design a system can be reduced to mere seconds.

AI-enabled features offer several key benefits when compared to traditional Grooper solution design:

  • AI saves users time setting up a Grooper solution.
  • Configuration is simple. Less technical training is required for Grooper designers to test and deploy AI-enabled features.
  • AI gets good results from a wide range of document types in a wide range of industries.
  • You can start seeing results in a fraction of the time.

What’s an LLM service?

Most LLMs operate via "chat completions" sent to an LLM service. This is the process by which the model responds to a prompt of some kind. A user asks a question. The LLM responds. Online LLM services (such as OpenAI API) facilitate these chat completions.

It's essentially a three-step process (a minimal request/response sketch in Python follows the list):

  1. A prompt is sent to the LLM service.
    • This prompt asks the LLM some kind of question or otherwise prompts it for a response. Additional system messages may be included to provide additional context required to answer the question.
    • Generic Example: You ask ChatGPT a question like “What’s the best movie of all time?”
    • Grooper Example: When AI Extract runs, it sends the document’s text along with instructions containing the Data Model schema and any user-defined instructions.
  2. The LLM service’s model evaluates the prompt and generates its response.
    • Based on the input prompt and the model’s training, it predicts the most likely response. It uses patterns it learned during its training to do this.
    • Because LLMs have been trained on massive amounts of natural language, they are adept at generating responses that (1) are generally accurate and (2) appear as if a human wrote the response.
    • Because LLMs have been trained on massive amounts of structured data as well, LLMs are also adept at responding in structured formats that Grooper can parse for its own consumption.
  3. The LLM service responds with the model’s response.
    • Generic Example: ChatGPT responds to your question about the best movie of all time based on its training data. Answers will vary, but it will probably mention The Godfather, Citizen Kane, or Casablanca. Those films are often cited in “best of” lists.
    • Grooper Example: When AI Extract gets the response back, Grooper parses the response into the various fields in the Data Model.
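
To make that round trip concrete, below is a minimal sketch of a chat completion request and response in Python. The endpoint URL, API key, model name, and prompt text are placeholders, not Grooper's actual configuration; the point is the shape of the exchange.

  # Minimal sketch of a chat completion round trip against an OpenAI-style
  # endpoint. URL, key, model name, and prompt are placeholders.
  import requests

  API_URL = "https://api.openai.com/v1/chat/completions"  # or any compatible endpoint
  API_KEY = "YOUR_API_KEY"

  payload = {
      "model": "gpt-4o-mini",  # placeholder model name
      "messages": [
          # Step 1: the prompt. A system message supplies context; a user
          # message asks the actual question.
          {"role": "system", "content": "You extract fields from documents and reply in JSON."},
          {"role": "user", "content": "Invoice text: ...\nReturn the invoice number and total as JSON."},
      ],
  }

  # Step 2 happens server-side: the model evaluates the prompt and generates a response.
  response = requests.post(
      API_URL,
      headers={"Authorization": f"Bearer {API_KEY}"},
      json=payload,
      timeout=60,
  )

  # Step 3: the service returns the model's response, which the caller parses.
  reply = response.json()["choices"][0]["message"]["content"]
  print(reply)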

Cloud vs on-prem LLM services

When it comes to integrating with an LLM service, there are two big options:

  • Cloud-based LLM services
  • On-premises LLM services

Cloud-based LLM services are hosted by third-party providers, such as OpenAI or Microsoft. They are accessed through APIs using web calls over the internet. They are easy to connect to, often requiring just an API key. Grooper can connect to the following cloud-based LLM providers:

  • OpenAI API
  • Microsoft Azure AI Foundry models
  • OpenAI compatible services
    • This is any LLM service provider that conforms to the OpenAI API standard.
    • Compatible APIs must use "chat/completions" and "embeddings" endpoints like OpenAI API to interoperate with Grooper's LLM features (see the client sketch after this list).
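
In practice, "OpenAI compatible" means the same client code can talk to any such service just by changing the base URL. Here is a minimal sketch using the official openai Python package; the endpoint URL, API key, and model names are placeholders.

  # Sketch: the same OpenAI client works against any compatible service by
  # swapping the base URL. The URL, key, and model names are placeholders.
  from openai import OpenAI

  client = OpenAI(
      base_url="https://llm.example.com/v1",  # any OpenAI-compatible endpoint
      api_key="YOUR_API_KEY",
  )

  # "chat/completions" endpoint
  chat = client.chat.completions.create(
      model="provider-model-name",
      messages=[{"role": "user", "content": "Summarize this paragraph: ..."}],
  )
  print(chat.choices[0].message.content)

  # "embeddings" endpoint
  emb = client.embeddings.create(
      model="provider-embedding-model",
      input="Text to embed for similarity search",
  )
  print(len(emb.data[0].embedding))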

On-premises LLM services are deployed in your own infrastructure within your own network. This gives users full control over data sent to and from the model. Hosting models on-prem is a powerful way to maintain control over data and may be required in certain industries due to regulations or compliance standards.

Grooper can connect to any on-prem LLM service that conforms to the OpenAI API standard. This service must use “chat/completions” and “embeddings” endpoints like OpenAI API to interoperate with Grooper’s LLM features. This may require wrapping an open-source LLM with an API that is compatible with OpenAI’s.

When evaluating cloud vs on-prem LLM services, there are three key considerations:

  • Resources and cost
  • Model availability
  • Data privacy and retention

Resources and cost

Large language models (LLMs) consume exceptional amounts of resources, both in terms of hardware and power. High-performance GPUs are critical for LLM responses. The GPU executes the model’s response during a process called “inference”. Without high-performance GPUs, inference (and therefore processing throughput) will be abysmally slow. High-core-count CPUs are also a must to coordinate how tasks are routed to the GPU. Large amounts of RAM are needed to load the model's parameters and input data before passing it to the GPU. Typically, the more parameters a model has, the larger it is, and therefore, the more RAM you need.

  • Cloud-based LLM services use their own infrastructure.
  • Cloud-based LLM services pass that cost to you by charging for inputs to and outputs from the LLM service.
    • Costs are calculated using "tokens", which are fragments of words.
    • Each model has different costs associated with input tokens (prompt text going in) and output tokens (response text coming out). A rough cost calculation example follows this list.
  • On-prem installations require you to provide the infrastructure required to run the LLM operations.
    • The hardware required to sufficiently run an LLM service effectively amounts to a supercomputer.
    • This can be a significant cost up front (tens to hundreds of thousands of dollars per machine running an LLM service).
  • On-prem installations do not have any costs associated with running the LLM service besides the electricity it takes to power the hardware.
    • Be aware the GPUs required to host an LLM service draw substantially more power than a standard computer.
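
As a rough illustration of how token-based cloud pricing adds up, here is a back-of-envelope estimate. The volumes and per-token prices below are hypothetical placeholders chosen only to show the arithmetic; check your provider's current price sheet for real rates.

  # Back-of-envelope token cost estimate. Volumes and prices are HYPOTHETICAL
  # placeholders, not any provider's actual rates.
  docs_per_month = 100_000
  input_tokens_per_doc = 4_000      # document text + schema + instructions
  output_tokens_per_doc = 500       # structured extraction response

  price_per_million_input = 0.50    # USD, hypothetical
  price_per_million_output = 2.00   # USD, hypothetical

  input_cost = docs_per_month * input_tokens_per_doc / 1_000_000 * price_per_million_input
  output_cost = docs_per_month * output_tokens_per_doc / 1_000_000 * price_per_million_output

  print(f"Estimated monthly cost: ${input_cost + output_cost:,.2f}")
  # -> Estimated monthly cost: $300.00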

Model availability

When evaluating a cloud or on-prem LLM service, it’s important to know what models you can utilize.

Cloud LLM services will offer their own proprietary LLM models or a mix of proprietary and open-source models. Grooper primarily integrates with the OpenAI API and models deployed using Azure AI Foundry.

  • OpenAI gives you access to OpenAI’s gpt model series.
  • Azure AI Foundry gives you access to proprietary models (such as OpenAI’s gpt models and xAI’s Grok models) as well as several open-source models (such as Mistral, DeepSeek, Qwen and other models).

On-prem LLM services will need to deploy an open-source LLM model. You cannot deploy proprietary models on hardware running in your local environment (unless deploying an open-source version of a proprietary model, such as OpenAI's open-weight models). Furthermore, to interoperate with Grooper’s AI-enabled features, the on-prem LLM service must conform to the OpenAI API standard.

Most open-source models do not ship with an OpenAI-compatible API out of the box. When deploying an on-prem LLM service, these models will need to be wrapped or adapted to follow the OpenAI API standard. To use most of Grooper’s AI-enabled features, they will need a "chat/completions" endpoint and message formatting (system, user and assistant roles) like OpenAI’s gpt models. Embeddings models will need to be deployed with a usable "embeddings" endpoint. Tools like OpenRouter, LM Studio, Ollama and vLLM allow developers to expose open-source models via an API compatible with OpenAI. Libraries like FastAPI and Transformers (from Hugging Face) can help wrap models to mimic OpenAI’s format.
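
For illustration only, the sketch below shows the general shape of such a wrapper: a FastAPI app exposing a "chat/completions" style endpoint backed by a Hugging Face Transformers pipeline. It is a bare-bones outline, not a production server; dedicated servers like vLLM, LM Studio, or Ollama handle batching, streaming, and GPU scheduling for you and are usually the better choice.

  # Bare-bones sketch of wrapping a local Hugging Face model behind an
  # OpenAI-style "chat/completions" endpoint. Not production code.
  from fastapi import FastAPI
  from pydantic import BaseModel
  from transformers import pipeline

  app = FastAPI()
  # Any locally downloadable chat-tuned model could be substituted here.
  generator = pipeline("text-generation", model="Qwen/Qwen2.5-32B-Instruct")

  class Message(BaseModel):
      role: str          # "system", "user", or "assistant"
      content: str

  class ChatRequest(BaseModel):
      model: str
      messages: list[Message]
      max_tokens: int = 512

  @app.post("/v1/chat/completions")
  def chat_completions(req: ChatRequest):
      # Recent transformers chat pipelines accept the OpenAI-style message list directly.
      messages = [m.model_dump() for m in req.messages]
      output = generator(messages, max_new_tokens=req.max_tokens)
      reply = output[0]["generated_text"][-1]["content"]
      # Return only the slice of the OpenAI response schema a client needs to read.
      return {
          "object": "chat.completion",
          "model": req.model,
          "choices": [{
              "index": 0,
              "message": {"role": "assistant", "content": reply},
              "finish_reason": "stop",
          }],
      }

Served with an ASGI server such as uvicorn, this exposes a local URL that any OpenAI-compatible client can target.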

Data privacy and retention

This is easily the biggest reason users deploy an on-prem LLM service as opposed to using a cloud LLM service. Because standing up an on-prem LLM service is a significant investment, it’s important to know the facts about OpenAI and Microsoft Azure’s privacy policies.

OpenAI API privacy summary

  • Grooper integrates with the OpenAI API, not ChatGPT.
  • API traffic is encrypted in transit and at rest.
  • OpenAI complies with SOC 2 Type II, ISO/IEC 27001, California Consumer Privacy Act (CCPA) and the EU’s General Data Protection Regulation (GDPR) standards.
  • API data is not used to train OpenAI models (unless users explicitly allow OpenAI to do so). This includes prompts and other inputs and completions and other outputs.
  • OpenAI may retain API data for 30 days to monitor for abuse. Data is deleted after 30 days unless flagged for investigation.
  • OpenAI does offer a “zero data retention” option for enterprise customers or dedicated deployments. Users will need to contact the OpenAI sales team to enter into this agreement.

Azure AI Foundry privacy summary

  • Models are deployed in the user’s tenant. Users control infrastructure and access policies for managed deployments.
  • Azure’s core privacy and security commitments apply to Azure AI Foundry.
  • AI Foundry traffic is encrypted in transit and at rest.
  • API data is not used to train models.
  • Microsoft does not store prompts and other inputs or completions and other outputs (unless you explicitly enable logging or monitoring).
  • Microsoft complies with SOC 2 Type II, ISO/IEC 27001, California Consumer Privacy Act (CCPA) and the EU’s General Data Protection Regulation (GDPR) standards.
  • Users can deploy models in private networks, enforce data residency, and use role-based access control.
  • Azure AI Foundry can deploy OpenAI models as well. Azure OpenAI offers stricter data isolation and enterprise-grade privacy controls compared to OpenAI’s public API.

If OpenAI or Microsoft Azure’s privacy policies are insufficient for your organization, deploying an on-premises LLM service may be the option for you. The key benefit of hosting an LLM service on-prem is that it stays totally under your control in your own infrastructure. The drawbacks are high upfront costs and the greater technical expertise required to set up and troubleshoot the solution.

Summary of on-prem LLM deployment pros and cons

Pros

  • Reduced risk of exposing data to third parties.
  • Data stays in your infrastructure under your network controls. It can operate in disconnected environments (no internet access).
  • No dependency on internet bandwidth or cloud service availability. With enough processing power, local deployments can reduce latency compared to cloud LLM providers.
  • All costs are tied to hardware and maintenance. This can lead to potential savings for high-volume usage.

Cons

  • Setup requires significant investment up front. GPUs, storage, cooling and other infrastructure costs associated with setup on a machine capable of running an LLM service are quite high.
  • A good deal of technical expertise is required to set up the environment. Further technical expertise is required to best optimize and troubleshoot the system.

On-prem LLM test results

To ensure an on-prem LLM environment will run at production levels, Grooper ran a test at our headquarters. We wanted to make sure that an on-prem LLM service running AI Extract on several hundred documents would run at least as fast as Azure OpenAI (which we currently use for our own production LLM service). Running a high-performance GPU-accelerated server at half capacity, we were able to match the speeds we’ve seen from processing with Azure OpenAI. Furthermore, the open-source model we used delivered just as accurate results for our test cases.

Key considerations

In executing this test, we had two primary considerations we needed to evaluate.

  1. It must be worth the cost. On-prem LLM installations are expensive. This is a significant investment for any organization. Grooper must be able to run effectively and efficiently on such expensive hardware.
  2. Open-source models must perform adequately. While proprietary models and LLM services like OpenAI are regarded as the “gold standard” of LLM-centric processing, you cannot use them when deploying an on-prem LLM service. For this test to be successful, open-source models should perform on par with proprietary models like OpenAI’s gpt series.

Hardware specifications

AI server: HPE Cray XD670

  • 8x NVIDIA H200 GPUs
  • 2TB DDR5-560R ECC
  • 2x Intel Xeon Platinum 8593Q CPUs, 64 cores each @ 2.2 GHz

Grooper processing farm:

  • 13x Microsoft Windows Server 2019 Datacenter VMs
  • 2x Intel Xeon Silver 4216 CPUs, 4 cores each @ 2.1 GHz per VM
  • 16GB RAM per VM
  • 52 total concurrent threads

LLM specifications

Open-source model: Qwen2.5-32B-Instruct

  • Developed by Alibaba Cloud as part of the Qwen 2.5 generation.
  • 32 billion parameters. More parameters generally mean a larger, more capable model (and higher hardware requirements to run it).
  • The "Instruct" variant is fine tuned to follow user instructions more accurately and naturally. Better equipped to handle instructions written in natural language, making it ideal for chatbots, assistants and other interactive applications.
  • Default context length is 32K tokens (can be extended up to 128K tokens). Affects how much document text/instructions/other input data can be submitted to the model (see the token-counting sketch after this list).
  • Can generate up to 8K output tokens. Affects how long the model’s response can be.
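
To check whether a given document plus instructions will fit inside that context window, you can count tokens with the model's own tokenizer before sending anything. Below is a minimal sketch using the Hugging Face tokenizer for this model; the file path and instruction text are placeholders.

  # Sketch: counting tokens to see whether a document fits in a 32K context window.
  from transformers import AutoTokenizer

  tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-32B-Instruct")

  document_text = open("sample_document.txt", encoding="utf-8").read()  # placeholder file
  instructions = "Extract the invoice number, date, and total as JSON."  # example prompt text

  token_count = len(tokenizer.encode(instructions + "\n" + document_text))
  context_limit = 32_768  # default context length; extensible to 128K

  print(f"{token_count} tokens used of {context_limit}")
  if token_count > context_limit:
      print("Document will need to be split or truncated before extraction.")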

Testing methodology

Setup

  • We deployed Qwen2.5-32B-Instruct on our HPE Cray XD670 server.
  • We used vLLM to deploy the model and expose OpenAI-compatible endpoints (an illustrative launch command and client check follow this list).
  • We enabled 4 of the 8 GPUs.
  • Each GPU was throttled to run at 88% capacity (per manufacturer guidelines, 10-15% of GPU capacity is reserved for the internal/native processes required to run the GPUs effectively).
  • Activity Processing services on the Grooper processing farm were enabled with a total of 52 threads running concurrently.
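
For reference, the sketch below shows how this kind of deployment is typically wired up: vLLM serving the model across 4 GPUs with an OpenAI-compatible endpoint, plus a quick client-side check that the endpoint responds. The launch flags and URL are illustrative of the approach rather than an exact record of our configuration; note that vLLM's --gpu-memory-utilization flag caps GPU memory use, which is related to, but not the same as, the capacity throttling described above.

  # Illustrative only -- launch flags and URL are examples, not our exact config.
  #
  # Serve the model with vLLM's OpenAI-compatible server across 4 GPUs, e.g.:
  #   python -m vllm.entrypoints.openai.api_server \
  #       --model Qwen/Qwen2.5-32B-Instruct \
  #       --tensor-parallel-size 4 \
  #       --gpu-memory-utilization 0.88
  #
  # Then point any OpenAI-style client at the local endpoint:
  from openai import OpenAI

  client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

  check = client.chat.completions.create(
      model="Qwen/Qwen2.5-32B-Instruct",
      messages=[{"role": "user", "content": "Reply with the single word: ready"}],
  )
  print(check.choices[0].message.content)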

Test #1: Speed test

  • The purpose of this test was to compare the speed at which our on-prem LLM service executed AI Extract tasks compared to Azure OpenAI.
  • On-prem model: Qwen2.5-32B-Instruct
  • Azure OpenAI model: gpt-5-mini
  • 600 documents were run through AI Extract.
    • These documents contained correspondence with various state tax forms attached.
    • They ranged from 2 to 8 pages in length.
    • A total of 10 Data Fields were collected.
  • Results: The on-prem service processed documents 15-20% faster than Azure OpenAI.
    • Using the Data Model tester tabs was also observed to be faster. The normal latency users experience when testing Data Models using AI Extract was functionally gone.
  • Extrapolation (a quick sanity check on this arithmetic follows the list):
    • With 4 of 8 GPUs, the AI server could process 172,800 documents per day.
    • With all 8 GPUs, the AI server could process 345,600 documents per day.
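
A quick sanity check on those figures: 172,800 documents per day works out to 2 documents per second on 4 GPUs, and the 8-GPU number simply assumes throughput scales linearly with GPU count.

  # Sanity check on the extrapolated throughput figures above.
  seconds_per_day = 24 * 60 * 60                 # 86,400

  docs_per_second_4_gpus = 172_800 / seconds_per_day
  print(docs_per_second_4_gpus)                  # 2.0 documents per second

  # The 8-GPU figure assumes throughput scales linearly with GPU count.
  docs_per_day_8_gpus = 172_800 * 2
  print(docs_per_day_8_gpus)                     # 345,600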

Test #2: Accuracy test

  • The purpose of this test was to compare the accuracy of AI Extract run using our on-prem LLM service with AI Extract run using Azure OpenAI.
    • On-prem model: Qwen2.5-32B-Instruct
    • Azure OpenAI model: gpt-5-mini
  • Tested Data Models using AI Extract on a variety of extraction use cases.
    • Invoices
    • Payroll Tax Notices
    • EOBs
    • Oil and gas leases (Note: Only a portion of the larger Data Model was tested. However, the most complex Data Section was tested)
  • Results: Data extraction output was comparable to OpenAI in most cases
    • For all use cases there were minor differences.
    • In some documents, there was a slight accuracy decrease. However, we made no changes to the user-defined AI Extract prompts in this test. With minimal prompt adjustments (in the AI Extract instructions), we expect fully comparable results.

Conclusions

These tests confirm an on-prem LLM deployment can be as performant as one using a cloud LLM service.

  • HPE Cray XD670 lives up to its hype. It is a high-performance AI supercomputer. Even at half capacity, it matched Azure OpenAI in terms of speed.
  • Qwen2.5-32B-Instruct performed admirably.
    • It has a good "size to performance" ratio. Given its compact size compared to other open-source model options, its performance is quite good.
    • Grooper’s built-in instructions appear to work just fine. There was no need to reconfigure instructions to get Qwen to respond well.
    • AI Extract results are comparable to OpenAI models. With adjustments to instructions injected into the prompt, results will be better aligned with OpenAI models.
    • It demonstrates good results in a variety of use cases.