2024:AI Search and the Search Page

Revision as of 08:24, 20 August 2024

2025 BETA

This article covers new or changed functionality in the current or upcoming beta version of Grooper. Features are subject to change before version 2025's GA release. Configuration and functionality may differ from later beta builds and the final 2025 release.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2024). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

Glossary

About

Put simply, Azure AI Search will make it easier to store and retrieve your documents in Grooper. To understand how, let's first understand what Grooper has been.

Historically Grooper has been a transient platform for document processing:

documents come in
data is collected from those documents
the data and documents are pushed out of Grooper to some place

It has never been a place to store documents and/or their data.

While it has been possible to keep Batches and their content in Grooper it has never been a best practice, nor has it been convenient to do so. You could, theoretically, devise some kind of hierarchical folder and naming convention by which you organize Batches in the node tree, but this is very time consuming and is probably not even that useful. Say you wanted to retrieve all "Invoices" that have a "Total Amount" over "$1,000.00". Without "indexing" the documents and their data, and the ability to "query" that index, this would be extremely time consuming at best, even if they're nicely organized. The criteria by which you organize something one day might not align with the method by which you choose to search for them later.

By using Grooper's implementation of Azure AI Search you will be able to quickly and efficiently index your documents and their data to allow for ease of retrieval as well as gain a deeper understanding of them.

Microsoft Azure AI Search

Azure AI Search, formerly known as Azure Cognitive Search, is a cloud-based search-as-a-service solution provided by Microsoft Azure. It has allowed our developers to build a sophisticated search experience into Grooper. Here are some key features and capabilities:

Full-Text Search: Azure AI Search supports full-text search with capabilities like faceting, filtering, and scoring, allowing users to search through large volumes of text efficiently.
Customizable Indexing: Developers can define custom indexes tailored to their specific data schema. This flexibility allows for a more relevant and precise search experience.
Scalability: The service can scale up or down based on the workload, making it suitable for applications of all sizes.
Security and Compliance: Azure AI Search ensures data security and compliance with industry standards, offering features like role-based access control (RBAC), data encryption, and integration with Active Directory.
APIs and SDKs: Azure AI Search provides REST APIs and client libraries for various programming languages, making it easy to integrate with different types of applications.

Integration with Grooper

API Integration: Grooper can leverage Azure AI Search's REST APIs to automate the indexing of documents and retrieval of search results. This integration can be built into Grooper's workflow to ensure seamless data processing and search capabilities.
Security and Compliance: Both Grooper and Azure AI Search offer robust security features. Integrating these ensures that document processing and search operations are secure and compliant with industry standards.
Indexing Processed Documents: Once Grooper processes and extracts data from documents, this data can be sent to Azure AI Search for indexing. This allows users to search through the processed data quickly and efficiently.
- Indexing is an intake process that loads content into Azure AI Search service and makes it searchable. Through Azure AI Search, inbound text is processed into tokens and stored in inverted indexes, and inbound vectors are stored in vector indexes. The document format that Azure AI Search can index is JSON.
Querying Indexed Documents and Data: Once Azure AI Search has indexed documents and their data from Grooper, users can leverage powerful query syntax like Lucene and OData to efficiently retrieve the information from their documents.
- Querying can happen once an index is populated with searchable content, when Grooper sends query requests to a search service and handles responses. All query execution is over a search index that you control.
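
To make the "document format is JSON" point concrete, here is a minimal sketch of the batch envelope that Azure AI Search's "Index Documents" REST operation accepts: a `value` array in which each document carries an `@search.action`. The field names and the `id` key are hypothetical, and this is not necessarily the payload Grooper itself sends.

```python
import json


def build_index_batch(documents, action="mergeOrUpload"):
    """Wrap plain dicts in the batch envelope used by the Azure AI Search
    "Index Documents" REST operation (POST .../indexes/<name>/docs/index)."""
    return {"value": [{"@search.action": action, **doc} for doc in documents]}


# Hypothetical document: the key field and field names are illustrative only.
sample = {
    "id": "doc-001",
    "Type_Name": "Invoice",
    "Invoice_No": "8675309",
    "Total_Amount": 1250.00,
}

batch = build_index_batch([sample])
print(json.dumps(batch, indent=2))
```

The `mergeOrUpload` action updates a document if its key already exists and adds it otherwise, which suits repeated indexing of the same document.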

How To

Using Azure AI Search will require a few setup steps:

Create an Azure AI Search Service
- This is the only step done outside of Grooper.
Configure the AI Search Repository Option
Configure an Indexing Behavior on a Content Type
Create the Search Index
Index Documents and Data from Grooper
Use the Search Page

Create an Azure AI Search Service

The following article from Microsoft instructs users how to create a Search Service:
https://learn.microsoft.com/en-us/azure/search/search-create-service-portal
Microsoft's full AI Search documentation is found here:
https://learn.microsoft.com/en-us/azure/search/
You will need the Azure AI Search service's "URL" and either "Primary admin key" or "Secondary admin key" for the next step. These values can be found by accessing the Azure Search service from the Azure portal (portal.azure.com).
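
Before entering the URL and admin key in Grooper, it can help to confirm they work together. The following is a minimal sketch that builds a "list indexes" request against the Azure AI Search REST API. The service URL, key, and `api-version` value shown are placeholder assumptions; substitute the values from your own Azure portal.

```python
import urllib.request

API_VERSION = "2023-11-01"  # assumption: substitute an api-version your service supports


def build_list_indexes_request(service_url, admin_key):
    """Build (but do not send) a GET request that lists the service's indexes."""
    url = f"{service_url.rstrip('/')}/indexes?api-version={API_VERSION}"
    return urllib.request.Request(url, headers={"api-key": admin_key}, method="GET")


# Placeholder values: substitute the URL and admin key from the Azure portal.
request = build_list_indexes_request(
    "https://example-service.search.windows.net/", "<admin-key>"
)
print(request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. A JSON list of indexes in the response indicates the URL and key are valid; an authentication error suggests one of them is wrong.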

Configure the AI Search Repository Option

To search documents in Grooper, we use Azure's AI Search service. In order to connect to an Azure AI Search service, the AI Search option must be added to the list of Repository Options in Grooper. Here, users will enter the Azure AI Search URL endpoint where calls are issued and an admin's API key. Both of these can be obtained from the Microsoft Azure portal once you have added an Azure AI Search resource.

With AI Search added to your Grooper Repository, you will be able to add an Indexing Behavior to one or more Content Types, create a search index, index documents and search them using the Search Page.

Select the root object in the node tree.
Click the ellipsis button on the Options property.
Click the "Add" button in the "Options" window.
Select AI Search from the drop-down menu.

Enter the Azure AI Search URL into the URL property.
Add your Azure AI Search API key to the API Key property.
Click the "OK" button to close the "Options" window.
Click the "Save" button to save all changes.

Configure an Indexing Behavior on a Content Type

Before indexing documents, you must add an Indexing Behavior to the Content Types you want to index. Most typically, this will be done on a Content Model. All child Document Types will inherit the Indexing Behavior and its configuration. (More complicated Content Models may require Indexing Behaviors configured on multiple Content Types.)

The Indexing Behavior defines:

General

The index's Name in Azure.
Which documents are added to the index.
- Only documents that are classified as the Indexing Behavior's Content Type OR any of its child Content Types will be indexed.
- In other words, when set on a Content Model only documents classified as one of its Document Types will be indexed.
What Included Elements are added to the search index (including which Data Elements from a Data Model are included, if any).
What Built in Fields are added to the search index. Note, if you leverage any of these built in fields and also want to use Included Elements there cannot be naming conflicts between the Included Elements and the Built in Fields. The Built in Fields are typical meta-data points including:
- Content: Index the full text content of the document. This would be the text generated by the Recognize activity.
- Attachment Name: Index the document's attachment filename. This would be the original name of the file as it existed before being acquired by Grooper.
- Type Name: Index the name of the document's Content Type.
- Page Count: Index the number of pages within the document.
- Flag Message: Index the flag message associated with the document. This would include auto-generated messages like whether or not "required" fields were empty, a type of validation error, or even null.
- Path: Index the path in the "Batches" folder of the node tree where the document exists.
- All: Enable all Built in Fields.
Page Limit: The maximum number of pages to include when indexing the full text content of a document.
Flatten: Specifies that the search index should be flattened. "Flattening" a search index generally refers to the process of transforming a hierarchical or nested data structure into a flat, non-hierarchical structure. In the context of a search index, this could involve several different actions depending on the specific needs and the data structure being indexed.
Auto Index: If set, specifies that the Indexing Service should automatically add new documents to this search index. When not set, the Indexing Service will still remove deleted documents and update changed documents already in the search index.
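
As an illustration of the Flatten property described above, the sketch below shows one common meaning of "flattening": collapsing nested sections into a single level of fields. The underscore-joined naming and the sample data are assumptions for illustration; the article does not specify exactly how Grooper names or arranges flattened fields.

```python
def flatten(record, parent_key="", sep="_"):
    """Collapse nested dicts into a single level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested sections, carrying the parent name along.
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


# Hypothetical nested data resembling a Data Model with a nested section.
nested = {
    "Invoice_No": "8675309",
    "Vendor": {"Name": "Acme", "Address": {"City": "Tulsa"}},
}
flat = flatten(nested)
print(flat)
```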

Search Page Options

Access List: If set, specifies a restricted set of users who may search this index. If not set, all authenticated users may search this index.
AI Analysts: An optional list of AI Analysts available for chat sessions regarding the search result set.
Generators: An optional list of AI Generators to be available for generating documents from the search result set. This is a collection of LLM Models, Instructions, and Examples that define how an AI would structure said documents.

Select the Content Model from the provided Project.
Click the ellipsis button for the Behaviors property.
In the "Behaviors" window click the "Add" button.
Select Indexing Behavior from the drop-down menu.

An Indexing Behavior will be added to the collection.
For our purposes the bulk of the properties can be left to their default setting. The only thing we'll change is the Included Elements property. Click the ellipsis button for this property.
In the "Included Elements" window ALT+LeftClick the Content Model to select it and all child elements.
Click the "OK" button to close the "Included Elements" window.
Click the "OK" button to close the "Behaviors" window.
Be sure to save all changes.

Create the Search Index

This will create the search index in Azure. Without the search index created, documents can't be added to an index. This only needs to be done once per index.

Right-click on the Content Model from the provided Project.
Choose "Search > Create Search Index" from the pop-out menu.
Click the "Execute" button in the "Create Search Index" window.

If you navigate to your Azure AI Search resource in a web browser...
...and go to your indexes...
...you will see a new index named after the Name property of the Indexing Behavior of the Content Type this command was used against.

Index Documents and Data from Grooper

With the search index created you can now add data to the search index. Documents must be classified in Grooper before they can be indexed. Only Document Types/Content Types inheriting an Indexing Behavior are eligible for indexing.

"Add to Index" Batch Folder Object Command

This is the most manual way of adding to the search index. This is an object command done on a per document basis (or via multi-selecting) in a Batch Viewer.

Select the provided Batch from the node tree.
Click on the "Viewer" tab.
Notice that the document in this Batch is classified as a Document Type that is inheriting from the Content Model with the Indexing Behavior.
Right-click on the document from the provided Batch in the Batch Viewer.
Select "Search > Add To Index" from the pop-out menu.
Click "Execute" in the "Add to Index" window.

Navigate to your Azure AI Search resource in a web browser.
Click the "Indexes" button.
Click on the listed index in use.

Click the "Search" button to perform an open query.
You should see the JSON results of the performed query in the "Results" portion of the site.

"Submit Indexing Job" Content Type Object Command

This is another manual approach as it also involves an object command. Because this command is applied to a Content Type, however, it will index all documents that are classified as that Content Type or inherit from it.

First things first, we need to make sure an Activity Processing Service is running for our repository.

If you click on the "Machines" folder in the node tree...
...and select the Grooper server where your repository is hosted...
...you should see an Activity Processing Service is installed and running.
- If you do not see the appropriate service for your repository of Grooper, please visit the Grooper Command Console article for information on installing the appropriate service.

Once you've confirmed an Activity Processing Service is installed and running, you can use the "Submit Indexing Job" object command.

Right-click the Content Model from the provided Project.
Select "Search > Submit Indexing Job" from the pop-out menu.
Take note of the "Added Documents", "Updated Documents", and "Deleted Documents" properties.
Click the “Execute” button in the “Submit Indexing Job” window.
- If there are no “added”, “updated”, or “deleted” documents, the window will close and no job will be submitted.
- If there are, however, an “Indexing Job” will be created and the active “Activity Processing Service” will complete the tasks of the job to update the index.

Execute Activity with "Add to Index" Command

This is an automated approach as it will create an "Indexing Job" as part of a Batch Process. This will perform the exact same command as the "Add to Index" object command explained earlier. When an Execute step is reached in a Batch Process a job will be created with a task for each document in scope.

Keep in mind, if the document in scope does not need to be added, updated, or deleted, no task will be created. If that is true for all documents in scope, no job will be created.
Also keep in mind that an Activity Processing Service will need to be installed and running for the given repository in order for the job to be picked up and worked.

Right-click the provided Project.
Select "Add > Batch Process" from the pop-out menu.
Name the Batch Process.
Click the "Execute" button from the "Add" window.

Right-click on the newly created Batch Process.
Select "Add Activity > Utilities > Execute" from the pop-out menu.
The step will be named based on the Activity chosen.
Click the "Execute" button in the "Add Activity" window.

Set the Scope and Folder Level properties.

For our purposes a Scope of Folder, and a Folder Level of 1 are accurate.

Click the ellipsis button for the Steps property.
Click the "Add" button in the "Steps" window.
Choose Execute Command from the drop-down window.

This will add an "Execute Command" to the collection of "Steps".
Click the drop-down button for the Command property.
Select Batch-Folder > Add to Index from the drop-down menu.
Click the "OK" button from the "Steps" window.
Be sure to save all changes.

Click on the "Activity Tester" tab.
Select the document from the provided Batch in the "Batch Viewer".
Click the "Start" button to create an "Indexing Job" for the selected document.

Indexing Service

This is the most automated way to index documents. The Indexing Service will periodically poll the Grooper database to determine if classified documents that inherit from a Content Type with an Indexing Behavior need to be added, updated, or deleted. If any do, it will submit an "Indexing Job" with tasks for each document that needs to be added, updated, or deleted.

Keep in mind how the Auto Index property of an Indexing Behavior described above affects this service.

If set: will add, update, and/or remove documents from the index

If not set: will only remove deleted documents or update changed documents already in the search index
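
The effect of Auto Index can be sketched as a simple comparison between what is in the repository and what is in the search index. This is a simplified illustration only: the function name, the use of document ids, and the version numbers used as change markers are all hypothetical, and the article does not describe how Grooper actually detects changes.

```python
def plan_index_changes(repo_docs, indexed_docs, auto_index):
    """Return (to_add, to_update, to_delete) lists of document ids.

    repo_docs and indexed_docs map a document id to a change marker
    (here, a hypothetical version number).
    """
    # New documents are only added when Auto Index is set.
    to_add = sorted(set(repo_docs) - set(indexed_docs)) if auto_index else []
    # Changed documents already in the index are always updated.
    to_update = sorted(
        doc_id
        for doc_id in set(repo_docs) & set(indexed_docs)
        if repo_docs[doc_id] != indexed_docs[doc_id]
    )
    # Documents deleted from the repository are always removed.
    to_delete = sorted(set(indexed_docs) - set(repo_docs))
    return to_add, to_update, to_delete


repo = {"a": 2, "b": 1, "c": 1}   # documents currently in the repository
index = {"a": 1, "b": 1, "d": 1}  # documents currently in the search index

print(plan_index_changes(repo, index, auto_index=True))
print(plan_index_changes(repo, index, auto_index=False))
```

With Auto Index set, document "c" is added; without it, only the update to "a" and the removal of "d" are planned.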

Also keep in mind:

an Indexing Service will need to be installed and running for the given repository in order for the "Indexing Job" to be created.

an Activity Processing Service will need to be installed and running for the given repository in order for the job to be picked up and worked

First things first, we need to make sure an Indexing Service and an Activity Processing Service are running for our repository.

If you click on the "Machines" folder in the node tree...
...and select the Grooper server where your repository is hosted...
...you should see an Indexing Service and an Activity Processing Service are installed and running.
- If you do not see the requisite services for your repository of Grooper, please visit the Grooper Command Console article for information on installing the appropriate services.

At this point there's nothing really left to do but let the service poll the database and look for updates to submit to the index.

Use the Search Page

Once you've got indexed documents, you can start searching for documents in the search index! The Search page allows you to find documents in your search index.

The Search page allows you to build a search query using four components:

Search: This is the only required parameter. Here, you will enter your search terms, using the Lucene query syntax.
Filter: An optional filter to set inclusion/restriction criteria for documents returned, using the OData syntax.
Select: Optionally selects which fields you want displayed for each document.
Order By: Optionally orders the list of documents returned.

Search

The Search configuration searches the full text of each document in the index. This uses the Lucene query syntax to return documents. For a simple search query, just enter a word or phrase (enclosed in quotes "") in the Search editor. Grooper will return a list of any documents with that word or phrase in their text data.

Lucene also supports several advanced querying features, including:

Wildcard searches: ? and *
Use ? for a single wildcard character and * for multiple wildcard characters.
Fuzzy matching: searchTerm~
Fuzzy search can only be applied to terms. Fuzzy searched phrases should not be enclosed in quotes. Azure's full fuzzy search documentation can be found here: https://learn.microsoft.com/en-us/azure/search/search-query-fuzzy
Regular expression matching: /regex/
Enclose a regex pattern in forward slashes to incorporate it into the Lucene query. For example, /\d{3}[a-z]/
Boolean operators: AND OR NOT
Boolean operators can help improve the precision of a search query.
Field searching: fieldName:searchExpression
Search built in fields and extracted Data Model values. For example, Invoice_No:8* would return any document whose extracted "Invoice No" field started with the number "8"

Azure's full documentation of Lucene query syntax can be found here: https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax
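
Because characters such as `*`, `?`, `~`, `:` and `/` carry special meaning in Lucene, a literal search term containing them needs to be backslash-escaped. The sketch below shows a small helper for that, plus a few query strings matching the features listed above. The helper names and the sample terms are hypothetical; only `Invoice_No:8*` comes from this article.

```python
import re

# Characters with special meaning in the full Lucene query syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_lucene(text):
    """Backslash-escape Lucene special characters in a literal search term."""
    return _LUCENE_SPECIAL.sub(r"\\\1", text)


def field_query(field, expression):
    """Build a fieldName:searchExpression clause."""
    return f"{field}:{expression}"


phrase = '"' + escape_lucene("net 30") + '"'

queries = [
    "invoice*",                        # wildcard: any term starting with "invoice"
    "grooper~",                        # fuzzy match on a single term
    field_query("Invoice_No", "8*"),   # field search, as in the example above
    phrase + " AND NOT draft",         # quoted phrase plus Boolean operators
    escape_lucene("PO-1234 (rev:2)"),  # literal text containing special characters
]
for query in queries:
    print(query)
```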

Filter

First you search, then you filter. The Filter parameter specifies criteria for documents to be included or excluded from the search results. This gives users an excellent mechanism to further fine tune their search query. Commonly, users will want to filter a search set based on the field values. Both built in index fields and/or values extracted from a Data Model can be incorporated into the filter criteria.

Azure AI Search uses the OData syntax to define filter expressions. Azure's full OData syntax documentation can be found here: https://learn.microsoft.com/en-us/azure/search/search-query-odata-filter
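
As a rough illustration, the earlier example of finding invoices with a total over $1,000.00 could be written as an OData filter along these lines. The field names `Type_Name` and `Total_Amount` are assumptions about how the index fields are named, and `Total_Amount` would need to be indexed as a numeric type for `gt` to apply. The helper functions are hypothetical conveniences, not part of Grooper or Azure.

```python
def odata_string(value):
    """Quote a string literal for OData, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def all_of(*clauses):
    """Join filter clauses with the OData `and` operator."""
    return " and ".join(f"({clause})" for clause in clauses)


filter_expr = all_of(
    f"Type_Name eq {odata_string('Invoice')}",  # string equality
    "Total_Amount gt 1000",                     # numeric comparison
)
print(filter_expr)
```

The comparison operators `eq`, `ne`, `gt`, `lt`, `ge` and `le`, combined with `and`, `or` and `not`, cover most field-value filters.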

Select

The Select parameter defines what field data is returned in the result list. You can select any of the built in fields or Data Elements defined in the Indexing Behavior. This can be exceptionally helpful when navigating indexes with a large number of fields. Multiple fields can be selected using a comma separated list (e.g. Field1,Field2,Field3)

Order By

Order By is an optional parameter that will define how the search results are sorted.

Any field in the index can be used to sort results.
The field's value type will determine how items are sorted.
- String values are sorted alphabetically.
- Datetime values are sorted by oldest or newest date.
- Numerical value types are sorted smallest to largest or largest to smallest.
Sort order can be ascending or descending.
- Add asc after the field's name to sort in ascending order. This is the default direction.
- Add desc after the field's name to sort in descending order.
Multiple fields may be used to sort results.
- Separate each sort expression with a comma (e.g. Field1 desc,Field2)
- The leftmost field will be used to sort the full result list first, then it's sub-sorted by the next, then sub-sub-sorted by the next, and so on.
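
The four Search page components correspond closely to parameters of Azure AI Search's "Search Documents" REST operation: `search`, `filter`, `select` and `orderby`. The sketch below builds such a request without sending it. The service URL, index name, key, field names and `api-version` are placeholder assumptions, and the mapping is an illustration of the underlying API rather than a description of Grooper's internal implementation.

```python
import json
import urllib.request

API_VERSION = "2023-11-01"  # assumption: substitute an api-version your service supports


def build_search_request(service_url, index_name, api_key, search,
                         filter_expr=None, select=None, order_by=None):
    """Build (but do not send) a POST request for the Search Documents operation."""
    body = {"search": search, "queryType": "full"}  # "full" enables Lucene syntax
    if filter_expr:
        body["filter"] = filter_expr
    if select:
        body["select"] = select
    if order_by:
        body["orderby"] = order_by
    url = (f"{service_url.rstrip('/')}/indexes/{index_name}"
           f"/docs/search?api-version={API_VERSION}")
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


# Placeholder service, index, key and field names.
request = build_search_request(
    "https://example-service.search.windows.net",
    "invoices",
    "<api-key>",
    search="Invoice_No:8*",
    filter_expr="Total_Amount gt 1000",
    select="Invoice_No,Total_Amount",
    order_by="Total_Amount desc,Invoice_No",
)
payload = json.loads(request.data)
print(json.dumps(payload, indent=2))
```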
