Ingest and Index (Simple Functionality)

From Grooper Wiki

This article is about the current version of Grooper.



You may download the ZIP files below for use in your own Grooper environment (version 2025). These are Project ZIP files.

Introduction

Ingest and Index (Simple Functionality) demonstrates how to take documents from an external source, process them with OCR and AI Extract, and ultimately add them to a searchable index within Grooper. This article brings together several core capabilities—document ingestion, text recognition, AI-driven data extraction, and search indexing—into a single, end-to-end workflow.

The intention of this article is to provide a foundational example of how Grooper can transform raw documents into searchable, structured, and retrievable content. Using a generic Content Model and Data Model, the Project captures common document information while also preparing both the document text and extracted data for indexing. This allows users to not only store documents, but also search and retrieve them using Grooper’s Search Page, including support for modern vector-based (semantic) search.

To support this workflow, the article walks through the required configuration across multiple components of Grooper. This includes setting up an LLM Connector for AI Extract, configuring Azure DI OCR for text recognition, defining indexing behavior (including optional vector embeddings and chunking strategies), and preparing a Batch Process that imports, processes, reviews, and indexes documents.

By the end of this guide, readers will understand how Grooper connects ingestion, AI extraction, and search into a unified pipeline—demonstrating how documents move from raw files to fully indexed assets that can be queried and explored through Grooper’s search capabilities.

Setup for AI Extract

This portion of the article focuses on configuring Grooper’s AI Extract capability so documents can be analyzed by a Large Language Model (LLM) and mapped into a Data Model. It involves setting up an LLM Connector within the Grooper Repository and selecting an appropriate model through the Data Model’s Fill Methods.

The goal of this configuration is to enable Grooper to interpret document content and populate generic fields—such as document identifiers, dates, and party information—without relying on rigid, template-based extraction. This setup establishes the connection between Grooper and the external LLM provider, ensuring AI Extract can execute during Batch Processing.

  1. Select the Root node, then click the ellipsis button for the Options property to open the Options editor.
  2. Add an LLM Connector and configure it.
    • The most important setting is the Service Provider property: choose a service provider and complete its connection configuration.
  3. Expand the Node Tree and select the Data Model from the provided "Ingest and Index (File Import)" Project, then click the ellipsis button for the Fill Methods property to open the "Fill Methods" editor.
  4. Expand the Generator sub-properties and select the desired model for the Model property.
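Conceptually, AI Extract sends document text to the configured LLM and maps the reply onto the Data Model's fields. The sketch below is illustrative Python only, not Grooper code: the field names, prompt wording, and the stubbed model reply are all assumptions standing in for the generic Data Model and a real LLM call.

```python
import json

# Hypothetical generic fields like those in the article's Data Model;
# these names are illustrative, not taken from Grooper.
FIELDS = ["Document Number", "Document Date", "Party Name"]

def build_extract_prompt(document_text, fields=FIELDS):
    """Build an AI-Extract-style prompt asking an LLM to return
    the requested fields as one JSON object."""
    field_list = ", ".join(f'"{f}"' for f in fields)
    return (
        "Extract the following fields from the document and respond "
        f"with a single JSON object with keys {field_list}. "
        "Use null for any field not present.\n\n"
        f"Document:\n{document_text}"
    )

def parse_extract_response(raw_response, fields=FIELDS):
    """Parse the model's JSON reply into a field dictionary,
    tolerating keys the model omitted."""
    data = json.loads(raw_response)
    return {f: data.get(f) for f in fields}

# Stubbed reply standing in for a real LLM call:
reply = '{"Document Number": "INV-1001", "Document Date": "2025-03-01"}'
print(parse_extract_response(reply))
```

Because the model decides where each value lives in the text, no positional template is needed, which is the advantage L27 describes over rigid, template-based extraction.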

Setup for Azure DI OCR

This section covers configuring the Azure DI OCR Profile, which is responsible for converting image-based content into machine-readable text. By supplying an Azure Document Intelligence API key and selecting the matching region, Grooper can leverage Azure DI's OCR engine to process scanned or image-only documents.

This step ensures that all documents—whether they contain embedded text or not—have usable text content for downstream processing. OCR output is critical not only for AI Extract, but also for search indexing, as it provides the textual data that both extraction models and search engines rely on.

  1. Select the Root node, then click the ellipsis button for the Options property to open the Options editor.
  2. In the "Options" editor, add an "Azure Document Intelligence" option, then properly configure it.
    • The most important property is the API Key.
  3. Expand the Node Tree and right-click the "Azure OCR" OCR Profile from the provided "Ingest and Index (File Import)" Project, then select "Rename" from the pop-out menu.
  4. Set the New Name property to "Azure DI OCR".
  5. Right-click the OCR Engine property, then select "Reset" from the pop-out menu.
  6. Set the OCR Engine property to "Azure DI OCR".
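The role OCR plays here can be sketched as a simple decision: pages with embedded text are used as-is, while image-only pages are sent to the OCR engine so every page ends up with indexable text. This is a minimal Python illustration under assumed names; the threshold, page structure, and stub engine are not Grooper internals.

```python
def needs_ocr(page_text, min_chars=20):
    """Heuristic: treat a page as image-only (needing OCR) when it has
    little or no embedded text. The threshold is an illustrative choice."""
    return len(page_text.strip()) < min_chars

def text_for_indexing(pages, ocr_engine):
    """Return one text string per page, running OCR only where needed so
    extraction and search always have usable text to work with."""
    return [ocr_engine(p["image"]) if needs_ocr(p["text"]) else p["text"]
            for p in pages]

def fake_ocr(image):
    """Stub standing in for a real OCR service such as Azure DI."""
    return f"<recognized text of {image}>"

pages = [{"text": "An invoice with embedded text ...", "image": "p1.png"},
         {"text": "", "image": "p2.png"}]
print(text_for_indexing(pages, fake_ocr))
```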

Setup for Search Index

This portion explains how to configure Grooper’s AI Search capabilities to index processed documents and make them searchable. It includes adding and configuring an AI Search option in the repository and defining Indexing Behavior on the Content Model.

Key elements of this setup include selecting an embeddings model for vector-based search and optionally enabling chunked indexing for large or text-heavy documents. Once configured, a search index is created and linked to the Content Model, allowing Grooper to store both document content and associated metadata in a way that supports fast and flexible retrieval.
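Chunked indexing, mentioned above for large or text-heavy documents, amounts to splitting the document text into overlapping windows so each chunk can be embedded and searched independently. The sketch below assumes character-based chunks for simplicity; real indexers (including vector stores) often chunk by tokens instead, and the sizes shown are arbitrary.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character chunks with overlap, so a
    sentence spanning a boundary still appears whole in some chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start, step = [], 0, chunk_size - overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

sample = "x" * 450
print([len(c) for c in chunk_text(sample)])  # → [200, 200, 150]
```

The overlap is the key design choice: without it, a sentence split across a chunk boundary would match neither chunk well during vector search.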

Final setup

The final section brings all components together into a complete, operational workflow. It covers preparing the necessary services (such as Activity Processing and Import Watcher), publishing the Batch Process, and configuring document ingestion from a file system.

This Batch Process orchestrates the full pipeline: importing documents, performing OCR, executing AI Extract, pausing for user validation in Review, and finally adding documents to the search index. After indexing is complete, documents can be queried and retrieved through the Grooper Search page.
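The Batch Process described above is, in essence, an ordered sequence of stages applied to a batch. The following sketch only mirrors that ordering in plain Python; the stage functions and batch dictionary are placeholders, not Grooper APIs.

```python
# Each stage marks the batch as having completed that step; the names
# mirror the article's Batch Process but the logic is illustrative.
def import_documents(batch): batch["imported"] = True; return batch
def run_ocr(batch): batch["text_ready"] = True; return batch
def ai_extract(batch): batch["fields"] = {}; return batch
def review(batch): batch["reviewed"] = True; return batch
def add_to_index(batch): batch["indexed"] = True; return batch

PIPELINE = [import_documents, run_ocr, ai_extract, review, add_to_index]

def run_pipeline(batch):
    """Run every stage in order, as the Batch Process does."""
    for stage in PIPELINE:
        batch = stage(batch)
    return batch

print(run_pipeline({"name": "demo batch"}))
```

Note the ordering constraint the pipeline encodes: OCR must precede AI Extract (the model needs text), and Review precedes indexing so only validated documents become searchable.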

This portion emphasizes how individual configurations—AI Extract, OCR, and indexing—work together as a cohesive system, enabling a seamless transition from raw document ingestion to fully searchable, structured content.

Considering emails and scanning

In this final section, we take a quick look at the other two provided sample Projects to see how their Batch Processes differ when documents arrive via email or from a scanner rather than a file system import.

For More Information