Ingest and Index (Simple Functionality)


This article is about the current version of Grooper (2025). Note that some content may still need to be updated.

You may download the ZIP files below for use in your own Grooper environment (version 2025). These are Project ZIP files.

Introduction

Ingest and Index (Simple Functionality) demonstrates how to take documents from an external source, process them with OCR and AI Extract, and ultimately add them to a searchable index within Grooper. This article brings together several core capabilities—document ingestion, text recognition, AI-driven data extraction, and search indexing—into a single, end-to-end workflow.

The intention of this article is to provide a foundational example of how Grooper can transform raw documents into searchable, structured, and retrievable content. Using a generic Content Model and Data Model, the Project captures common document information while also preparing both the document text and extracted data for indexing. This allows users to not only store documents, but also search and retrieve them using Grooper’s Search Page, including support for modern vector-based (semantic) search.

To support this workflow, the article walks through the required configuration across multiple components of Grooper. This includes setting up an LLM Connector for AI Extract, configuring Azure DI OCR for text recognition, defining indexing behavior (including optional vector embeddings and chunking strategies), and preparing a Batch Process that imports, processes, reviews, and indexes documents.

By the end of this guide, readers will understand how Grooper connects ingestion, AI extraction, and search into a unified pipeline—demonstrating how documents move from raw files to fully indexed assets that can be queried and explored through Grooper’s search capabilities.

Setup for AI Extract

This portion of the article focuses on configuring Grooper’s AI Extract capability so documents can be analyzed by a Large Language Model (LLM) and mapped into a Data Model. It involves setting up an LLM Connector within the Grooper Repository and selecting an appropriate model through the Data Model’s Fill Methods.

The goal of this configuration is to enable Grooper to interpret document content and populate generic fields—such as document identifiers, dates, and party information—without relying on rigid, template-based extraction. This setup establishes the connection between Grooper and the external LLM provider, ensuring AI Extract can execute during Batch Processing.

  1. Select the Root node, then click the ellipsis button for the Options property to open the Options editor.
  2. Add an LLM Connector and configure it.
    • The most important setting is the Service Provider property: choose a provider and supply its connection details.
  3. Expand the Node Tree and select the Data Model from the provided "Ingest and Index (File Import)" Project, then click the ellipsis button for the Fill Methods property to open the "Fill Methods" editor.
  4. Expand the Generator sub-properties and select the desired model for the Model property.
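
Grooper issues the LLM call itself once the LLM Connector and Fill Method are configured, but a minimal sketch of the kind of request AI Extract makes can help clarify what the Model property controls. The sketch below assumes an OpenAI-compatible provider, an assumed model name, and an invented field list ("Document ID", "Document Date", "Parties"); Grooper's actual prompts, schemas, and connector plumbing differ.

```python
# Conceptual sketch only -- Grooper's AI Extract builds and sends this kind of
# request internally. Provider, model name, and field list are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

document_text = open("sample_document.txt", encoding="utf-8").read()

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; use whatever you chose in the Model property
    response_format={"type": "json_object"},
    messages=[
        {"role": "system",
         "content": "Extract the requested fields from the document and reply with JSON only."},
        {"role": "user",
         "content": "Fields: Document ID, Document Date, Parties.\n\nDocument:\n" + document_text},
    ],
)

print(response.choices[0].message.content)  # e.g. {"Document ID": "...", ...}
```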

Setup for Azure DI OCR

This section covers configuring the Azure DI OCR Profile, which is responsible for converting image-based content into machine-readable text. By supplying an Azure Computer Vision API key and selecting the matching region, Grooper can use Azure DI's OCR engine to process scanned or image-only documents.

This step ensures that all documents—whether they contain embedded text or not—have usable text content for downstream processing. OCR output is critical not only for AI Extract, but also for search indexing, as it provides the textual data that both extraction models and search engines rely on.

  1. Select the Root node, then click the ellipsis button for the Options property to open the Options editor.
  2. In the "Options" editor, add an "Azure Document Intelligence" option, then properly configure it.
    • The most important property is the API Key.
  3. Expand the Node Tree and right-click the "Azure OCR" OCR Profile from the provided "Ingest and Index (File Import)" Project, then select "Rename" from the pop-out menu.
  4. Set the New Name property to "Azure DI OCR".
  5. Right-click the OCR Engine property, then select "Reset" from the pop-out menu.
  6. Set the OCR Engine property to "Azure DI OCR".
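
If you want to sanity-check your Azure Document Intelligence key and endpoint outside Grooper before running Recognize, a small standalone call against the prebuilt "read" model works well. The sketch below uses the azure-ai-formrecognizer Python package; the endpoint, key, and file name are placeholders, and this is not how Grooper invokes the service internally.

```python
# Standalone sanity check for an Azure Document Intelligence key and endpoint.
# Grooper's Azure DI OCR engine performs the equivalent work during Recognize.
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"  # placeholder
key = "<your-api-key>"                                             # placeholder

client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))

with open("scanned_page.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-read", document=f)

result = poller.result()
print(result.content[:500])  # first 500 characters of recognized text
```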

Setup for Search Index

This portion explains how to configure Grooper’s AI Search capabilities to index processed documents and make them searchable. It includes adding and configuring an AI Search option in the repository and defining Indexing Behavior on the Content Model.

Key elements of this setup include selecting an embeddings model for vector-based search and optionally enabling chunked indexing for large or text-heavy documents. Once configured, a search index is created and linked to the Content Model, allowing Grooper to store both document content and associated metadata in a way that supports fast and flexible retrieval.
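
To make the Embeddings Model and Chunking Method properties concrete before walking through the steps, here is a rough sketch of fixed-size chunking and embedding. It illustrates the underlying idea only; the chunk size, overlap, and embedding model are assumptions, not Grooper's implementation.

```python
# Illustration of chunking + embedding -- not Grooper's internal implementation.
from openai import OpenAI

client = OpenAI()

def chunk(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into overlapping fixed-size chunks (sizes are assumptions)."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text), 1), step)]

document_text = open("sample_document.txt", encoding="utf-8").read()
chunks = chunk(document_text)

# One embedding vector per chunk; these vectors are what a vector index stores.
embeddings = client.embeddings.create(
    model="text-embedding-3-small",  # assumed embeddings model
    input=chunks,
)
print(len(chunks), "chunks,", len(embeddings.data[0].embedding), "dimensions each")
```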

  1. Select the Root node from the Node Tree, then click the ellipsis button for the Options property to open the "Options" editor.
  2. Add an "AI Search" option, then be sure to properly configure the "AI Search" option.
    • You'll need to provide a URL and API Key.
  3. Expand the Node Tree and select the "Indexed Documents (Imported)" Content Model from the provided "Ingest and Index (File Import)" Project. Then, click the ellipsis button for the Behaviors property to open the Behaviors editor.
  4. Select the "Indexing Behavior" Behavior. Expand the Vector Search sub-properties and select an Embeddings Model.
    • You may also choose to set the Chunking Method property if you are using a large document set.
  5. Right-click the "Indexed Documents (Imported)" Content Model, then select "Search" > "Create Search Index" from the pop-out menu.
  6. Click the "Execute" button. This creates the Search Index in Azure, where indexed content can be stored and queried later.

Final setup

The final section brings all components together into a complete, operational workflow. It covers preparing the necessary services (such as Activity Processing and Import Watcher), publishing the Batch Process, and configuring document ingestion from a file system.

This Batch Process orchestrates the full pipeline: importing documents, performing OCR, executing AI Extract, pausing for user validation in Review, and finally adding documents to the search index. After indexing is complete, documents can be queried and retrieved through the Grooper Search page.

This portion emphasizes how individual configurations—AI Extract, OCR, and indexing—work together as a cohesive system, enabling a seamless transition from raw document ingestion to fully searchable, structured content.

  1. Select the Machines folder node. Verify an Activity Processing and Import Watcher Service are installed and running.
    • These are needed if you wish to run a Batch through production in an automated fashion by starting with an import. For our purposes, we'll be using the Batch Process Step tester tabs to check each step individually.
  2. Expand the Node Tree to the Test folder of the Batches node, then add a new "Test" Batch. Add the document or documents you wish to test with.
    • In this example we'll use a single document to test.
  3. Expand the Node Tree and select the "Split Pages" Batch Process Step from the provided "Ingest and Index (File Import)" Project, then click the Activity Tester tab.
  4. Click the "Select Batch" button in the Batch Viewer, then be sure to select the Batch you recently created.
  5. Select the Folder Level 1 Batch Folder, or folders in the Batch Viewer, then click the "Test Activity" button.
    • If you have an Activity Processing service running, you can instead use the "Submit Job" button. This will be true for all steps moving forward.
  6. Select the "Recognize" Batch Process Step from the Node Tree, then expand the Batch Folder contents in the Batch viewer.
  7. Select the Batch Page in the Batch Viewer, then click the "Test Activity" button.
  8. Select the "Extract" Batch Process Step from the Node Tree.
  9. Select the Folder Level 1 Batch Folder from the Batch Viewer, then click the "Test Activity" button.
  10. Select the "Review" Batch Process Step from the Node Tree.
  11. Select the Batch root from the Batch Viewer, then click the "Test Activity" button.
  12. Review the extracted data in the Data Viewer, then click the "Back to Design Page" button.
  13. Select the "Add to Index" Batch Process Step from the Node Tree.
  14. Select the Folder Level 1 Batch Folder from the Batch Viewer, then click the "Test Activity" button.
  15. Click the "Search Page" button.
  16. Select the appropriate Search Index from the drop-down selector in the top right of the UI.
  17. You can now run queries and find documents with the Search Page.
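
The Search Page is the intended way to query the index, but if you ever want to confirm that documents landed in Azure AI Search, a direct keyword query against the service looks like the sketch below. The index name and API version are assumptions carried over from the earlier index sketch.

```python
# Hedged illustration of querying the index directly; normally you would just
# use Grooper's Search Page. Index name and API version are assumptions.
import requests

service = "https://<your-search-service>.search.windows.net"  # placeholder
api_key = "<your-query-api-key>"                               # placeholder

resp = requests.post(
    f"{service}/indexes/indexed-documents/docs/search",
    params={"api-version": "2023-11-01"},
    headers={"api-key": api_key, "Content-Type": "application/json"},
    json={"search": "lease agreement 2024", "top": 5},
)
resp.raise_for_status()
for doc in resp.json()["value"]:
    print(doc.get("id"), "-", doc.get("content", "")[:80])
```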

Considering emails and scanning

In this final section we'll take a quick look at the other two provided sample Projects and see how their Batch Processes differ when considering email processing and scanning.

  1. A Project similar to the "File Import" Project is provided, but it is suited for email processing.
  2. Before the "Split Pages" Batch Process Step are several Batch Process Steps that are specific to email processing.
    • Feel free to look at the configuration of these steps to learn more about them. Not all of these steps are needed for every type of email processing; this is a generic Batch Process built as a "one size fits all" scenario. To use this Batch Process you'll need to use the Imports Page with a CMIS connection configured for your email system.
  3. There is also a Project provided that is specific to scanning documents.
  4. Split Pages is not needed for this type of processing, but a Review activity with the Scan Viewer is, as well as an Image Processing activity to clean up the scanned pages.
    • You'll also notice there is a Separate activity for turning the loose pages into Batch Folders. Keep in mind, in order to use the Scan Viewer, you will need Grooper Desktop installed on the system that will be doing the scanning.

More information on email processing and scanning can be found at these links:

  • Conditioning Emails
  • Email Processing
  • Scanning (Scan Viewer)

For More Information

  • AI Extract
  • Activity Processing
  • Azure DI OCR
  • Batch
  • Batch Folder
  • Batch Page
  • Batch Process
  • Batch Process Step
  • Behaviors
  • Content Model
  • Data Model
  • Extract
  • Fill Method
  • Import Watcher
  • LLM Connector
  • Machine
  • Node Tree
  • OCR Profile
  • Project
  • Recognize
  • Repository
  • Review
  • Root
  • Search Page
  • Split Pages