AI Search
Revision as of 14:58, 23 July 2024
2025 BETA
This article covers new or changed functionality in the current or upcoming beta version of Grooper. Features are subject to change before version 2025's GA release. Configuration and functionality may differ from later beta builds and the final 2025 release.
AI Search is a Grooper Repository Option. Enabling this option creates an efficient and effective document search and retrieval mechanism in Grooper using Azure's AI Search service.
About
AI Search is a Grooper Repository Option. When enabled from the Grooper Root, this gives users the ability to create a document search and retrieval mechanism in Grooper. This integrates Grooper with an existing Azure AI Search service. With this integration, Grooper can:
- Create a search index for documents of a specified Content Type (Content Model)
- Add documents assigned this Content Type (Document Types in the Content Model) to the search index.
- Use the Search page to search for documents in the search index.
- Simple full text search is supported.
- More advanced search querying is supported through Azure's implementation of the Lucene query syntax and OData filter syntax.
Basic AI Search Setup
Before you can start using the Search page to search for documents, there's some basic setup you need to perform. Some of these steps are performed outside of Grooper. Most are performed inside of Grooper.
Outside of Grooper
- Create an AI Search service in Azure.
- The following article from Microsoft instructs users how to create a Search Service:
- Microsoft's full AI Search documentation is found here:
- You will need the Azure AI Search service's "URL" and either "Primary admin key" or "Secondary admin key" for the next step. These values can be found by accessing the Azure Search service from the Azure portal (portal.azure.com).
Inside of Grooper
- Add AI Search to the Grooper Root node's Repository Options. Enter the URL and admin key for the Azure AI Search service (copied from Azure).
- Add an Indexing Behavior on a Content Model.
- Documents must be classified in Grooper before they can be indexed. Only Document Types/Content Types inheriting an Indexing Behavior are eligible for indexing.
- Create the search index. To do this, right-click the Content Model and select "Search > Create Search Index"
- This creates the search index in Azure. Without the search index created, documents can't be added to an index. This only needs to be done once per index.
- Submit an "Indexing Job" to index any documents classified using the Content Model currently in the Grooper Repository. To do this, right-click the Content Model and select "Search > Submit Indexing Job".
- BE AWARE: An Activity Processing service must be running to execute the Indexing Job.
- This is one of many ways to index documents using AI Search. For a full list (including ways to automate document indexing) see below.
Repository Options: AI Search
Repository Options are new to Grooper 2024. They add new functionality to the whole Grooper Repository. These optional features are added using the Options property editor on the Grooper Root node.
To search documents in Grooper, we use Azure's AI Search service. In order to connect to an Azure AI Search service, the AI Search option must be added to the list of Repository Options in Grooper. Here, users will enter the Azure AI Search URL endpoint where calls are issued and an admin's API key. Both of these can be obtained from the Microsoft Azure portal once you have added an Azure AI Search resource.
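Before entering the URL and admin key in Grooper, it can help to confirm the pair works against the Azure AI Search REST API. The following is a minimal sketch, not part of Grooper: it builds (but does not send) a "list indexes" request. The service URL, the key placeholder and the API version shown are assumptions for illustration; substitute the values from your own Azure portal.

```python
import urllib.request

# Assumed API version for illustration; use one your service supports.
API_VERSION = "2023-11-01"


def build_list_indexes_request(service_url: str, admin_key: str) -> urllib.request.Request:
    """Build a GET request that lists the indexes on an Azure AI Search service.

    The request is only constructed here, not sent.
    """
    url = f"{service_url.rstrip('/')}/indexes?api-version={API_VERSION}"
    # Azure AI Search accepts the admin key in the "api-key" request header.
    return urllib.request.Request(url, headers={"api-key": admin_key}, method="GET")


# Hypothetical values; copy the real ones from the Azure portal.
request = build_list_indexes_request(
    "https://example-service.search.windows.net/", "<admin-key>"
)
print(request.full_url)

# To actually call the service (requires network access and a valid key):
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```

A successful response suggests the same URL and key should be usable in the AI Search Repository Option.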
With AI Search added to your Grooper Repository, you will be able to add an Indexing Behavior to one or more Content Types, create a search index, index documents and search them using the Search Page.
Indexing documents for search
Before documents can be searched, they must be indexed. The search index holds the content you want to search. This includes each document's full OCR or native text obtained from the Recognize activity and can optionally include Data Model results collected from the Extract activity. We use the Azure AI Search Service to create search indexes according to an Indexing Behavior defined for Content Types in Grooper. Documents are made searchable by adding them to a search index. Once indexed, you can search for documents using Grooper's Search page.
The Indexing Behavior: Defines the search index
Before indexing documents, you must add an Indexing Behavior to the Content Types you want to index. Most typically, this will be done on a Content Model. All child Document Types will inherit the Indexing Behavior and its configuration. More complicated Content Models may require Indexing Behaviors configured on multiple Content Types.
The Indexing Behavior defines:
- The index's name in Azure.
- Which documents are added to the index.
- Only documents that are classified as the Indexing Behavior's Content Type OR any of its child Content Types will be indexed.
- In other words, when set on a Content Model, only documents classified as one of its Document Types will be indexed.
- What fields are added to the search index (including which Data Elements from a Data Model are included, if any).
- Any options for the search index in the Grooper Search page (including access restrictions to the search index).
⚠ BE AWARE: Once an Indexing Behavior is added to a Content Type, you must use the "Create Search Index" command to create the index in Azure. Do this by right-clicking the Content Type and choosing "Search > Create Search Index".
With the Indexing Behavior defined, and the search index created, now you can start indexing documents.
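The inheritance rule described above (a document is eligible when its Content Type, or any parent Content Type, carries an Indexing Behavior) can be sketched as a walk up the Content Type tree. This is an illustration only: the ContentType class and its has_indexing_behavior flag are hypothetical stand-ins, not Grooper's object model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContentType:
    """Hypothetical stand-in for a Content Model, Content Category or Document Type."""
    name: str
    parent: Optional["ContentType"] = None
    has_indexing_behavior: bool = False


def inherits_indexing_behavior(content_type: Optional[ContentType]) -> bool:
    """Return True if this Content Type or any ancestor defines an Indexing Behavior."""
    node = content_type
    while node is not None:
        if node.has_indexing_behavior:
            return True
        node = node.parent
    return False


# Example tree: the behavior is set once on the Content Model.
invoices_model = ContentType("Invoices", has_indexing_behavior=True)
invoice_type = ContentType("Invoice", parent=invoices_model)
unrelated_type = ContentType("Unrelated Form")

print(inherits_indexing_behavior(invoice_type))    # True: inherited from the model
print(inherits_indexing_behavior(unrelated_type))  # False: no behavior in its ancestry
print(inherits_indexing_behavior(None))            # False: unclassified document
```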
Adding documents to the search index
Documents may be added to a search index in one of the following ways:
- Using the "Add to Index" command.
- This is the most "manual" way of doing things.
- Select one or more documents, right-click them and select "Search > Add to Index" to add only the selected documents to the search index.
- Documents may also be manually removed from the search index in this way by using the "Remove From Index" command.
- Using the "Submit Indexing Job" command.
- This is a manual way of indexing all existing documents for the Content Model.
- The Indexing Job will add newly classified documents to the index, update the index if changes are made (to their extracted data for example), and remove documents from the index if they've been deleted.
- Select the Content Model, right-click it and select "Search > Submit Indexing Job".
- BE AWARE: An Activity Processing service must be running to execute the Indexing Job.
- Using an Execute activity in a Batch Process to apply the "Add to Index" command to all documents in a Batch.
- This is one way to automate document indexing.
- Bear in mind, if documents or their data change after this step would run, they would still need to be re-indexed after changes are made.
- Running the Grooper Indexing Service to index documents automatically in the background.
- This is the most automated way to index documents.
- The Grooper Indexing Service periodically polls the Grooper database to determine if the index needs to be updated. If it does, it will submit an "Indexing Job".
- The Indexing Job will add newly classified documents to the index, update the index if changes are made (to their extracted data for example), and remove documents from the index if they've been deleted.
- The Indexing Behavior's Auto Index property must also be enabled for the Indexing Service to submit Indexing Jobs.
- BE AWARE: An Activity Processing service must be running to execute the Indexing Job(s).
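The three outcomes of an Indexing Job described above (add new documents, update changed ones, remove deleted ones) amount to comparing the repository's current state against the index. The sketch below illustrates that comparison under stated assumptions: the function name, the document ids and the "fingerprint" strings are hypothetical, and this is not how Grooper is known to implement it.

```python
from typing import Dict, List


def plan_index_changes(repository: Dict[str, str], index: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two snapshots and decide what an indexing pass would need to do.

    Both arguments map a document id to a fingerprint of its content
    (for example a hash of its text and extracted data).
    """
    to_add = sorted(doc_id for doc_id in repository if doc_id not in index)
    to_update = sorted(
        doc_id for doc_id in repository
        if doc_id in index and repository[doc_id] != index[doc_id]
    )
    to_remove = sorted(doc_id for doc_id in index if doc_id not in repository)
    return {"add": to_add, "update": to_update, "remove": to_remove}


# Hypothetical snapshots: doc-2 was re-extracted, doc-3 was deleted, doc-4 is new.
repository_state = {"doc-1": "v1", "doc-2": "v2", "doc-4": "v1"}
index_state = {"doc-1": "v1", "doc-2": "v1", "doc-3": "v1"}

plan = plan_index_changes(repository_state, index_state)
print(plan)  # {'add': ['doc-4'], 'update': ['doc-2'], 'remove': ['doc-3']}
```

This also shows why a one-time "Add to Index" step in a Batch Process can leave the index stale: nothing re-runs the comparison after later changes.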
The Search Page
Once you've got indexed documents, you can start searching for documents in the search index! The Search page allows you to find documents in your search index.
The Search page allows you to build a search query using four components:
- Search: This is the only required parameter. Here, you will enter your search terms, using the Lucene query syntax.
- Filter: An optional filter to set inclusion/restriction criteria for documents returned, using the OData syntax.
- Select: Optionally selects which fields you want displayed for each document.
- Order By: Optionally orders the list of documents returned.
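The four components map closely onto parameters of the Azure AI Search "Search Documents" REST request. As a rough sketch, assuming the Search page passes these values through largely unchanged (which this article does not confirm), the request body might be assembled like this. The field names Invoice_No and Invoice_Total are hypothetical.

```python
import json
from typing import Dict, List, Optional


def build_search_body(
    search: str,
    filter_expr: Optional[str] = None,
    select: Optional[List[str]] = None,
    order_by: Optional[List[str]] = None,
) -> Dict[str, str]:
    """Assemble a JSON body for Azure's Search Documents (POST) request."""
    # queryType "full" selects the full Lucene query syntax.
    body = {"search": search, "queryType": "full"}
    if filter_expr:
        body["filter"] = filter_expr          # OData filter expression
    if select:
        body["select"] = ",".join(select)     # comma-separated field list
    if order_by:
        body["orderby"] = ",".join(order_by)  # comma-separated sort expressions
    return body


body = build_search_body(
    search='"past due" AND Invoice_No:8*',
    filter_expr="Invoice_Total gt 1000",
    select=["Invoice_No", "Invoice_Total"],
    order_by=["Invoice_Total desc"],
)
print(json.dumps(body, indent=2))
```

Only the search value is required; the other three keys are omitted when not supplied, mirroring the optional parameters on the Search page.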
Search
The Search configuration searches the full text of each document in the index. This uses the Lucene query syntax to return documents. For a simple search query, just enter a word or phrase (enclosed in quotes "") in the Search editor. Grooper will return a list of any documents with that word or phrase in their text data.
Lucene also supports several advanced querying features, including:
- Wildcard searches: ? and *
- Use ? for a single wildcard character and * for multiple wildcard characters.
- Fuzzy matching: searchTerm~
- Fuzzy search can only be applied to terms, so do not enclose a fuzzy-searched phrase in quotes; append ~ to each term instead. Azure's full fuzzy search documentation can be found here: https://learn.microsoft.com/en-us/azure/search/search-query-fuzzy
- Regular expression matching: /regex/
- Enclose a regex pattern in forward slashes to incorporate it into the Lucene query. For example, /\d{3}[a-z]/
- Boolean operators: AND, OR, NOT
- Boolean operators can help improve the precision of a search query.
- Field searching: fieldName:searchExpression
- Search built-in fields and extracted Data Model values. For example, Invoice_No:8* would return any document whose extracted "Invoice No" field starts with the number "8".
Azure's full documentation of Lucene query syntax can be found here: https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax
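When search terms come from user input or another system, characters such as : or / can be misread as Lucene operators. The helpers below are a small sketch for composing query strings for the Search editor; the function names are made up for this example, and the escape list follows the special characters listed in Azure's Lucene syntax documentation.

```python
import re

# Characters with special meaning in the full Lucene syntax.
_LUCENE_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_term(text: str) -> str:
    """Prefix each special character with a backslash so it is matched literally."""
    return _LUCENE_SPECIAL.sub(r"\\\1", text)


def phrase(text: str) -> str:
    """Wrap text in double quotes for an exact phrase search."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def fuzzy(term: str) -> str:
    """Append ~ to a single term to request fuzzy matching."""
    return escape_term(term) + "~"


def field(name: str, expression: str) -> str:
    """Scope a search expression to one field: fieldName:searchExpression."""
    return f"{name}:{expression}"


# Invoice_No is a hypothetical extracted field name.
query = " AND ".join([phrase("past due"), fuzzy("invoise"), field("Invoice_No", "8*")])
print(query)                 # "past due" AND invoise~ AND Invoice_No:8*
print(escape_term("AC/DC"))  # AC\/DC
```

Note that escape_term is deliberately not applied to the wildcard expression 8*, since escaping the asterisk would turn it into a literal character.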
Filter
First you search, then you filter. The Filter parameter specifies criteria for documents to be included or excluded from the search results. This gives users a mechanism to further fine-tune their search query. Commonly, users will want to filter a search set based on field values. Built-in index fields and values extracted from a Data Model can both be incorporated into the filter criteria.
Azure AI Search uses the OData syntax to define filter expressions. Azure's full OData syntax documentation can be found here: https://learn.microsoft.com/en-us/azure/search/search-query-odata-filter
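A filter expression is a plain string made of comparisons (eq, ne, gt, ge, lt, le) joined with and/or. The sketch below shows one way to compose such strings safely, including the OData rule that a single quote inside a string literal is doubled. The helper names and the fields Vendor_Name and Invoice_Total are hypothetical.

```python
from typing import Union

_COMPARISON_OPERATORS = {"eq", "ne", "gt", "ge", "lt", "le"}


def odata_literal(value: Union[str, bool, int, float, None]) -> str:
    """Render a Python value as an OData literal."""
    if value is None:
        return "null"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, (int, float)):
        return str(value)
    # String literals use single quotes; embedded single quotes are doubled.
    return "'" + value.replace("'", "''") + "'"


def compare(field_name: str, operator: str, value: Union[str, bool, int, float, None]) -> str:
    """Build one comparison such as: Invoice_Total gt 1000"""
    if operator not in _COMPARISON_OPERATORS:
        raise ValueError(f"unsupported operator: {operator}")
    return f"{field_name} {operator} {odata_literal(value)}"


def all_of(*expressions: str) -> str:
    """Join expressions so every one must hold."""
    return " and ".join(expressions)


filter_text = all_of(
    compare("Vendor_Name", "eq", "O'Brien Supply"),
    compare("Invoice_Total", "gt", 1000),
)
print(filter_text)  # Vendor_Name eq 'O''Brien Supply' and Invoice_Total gt 1000
```

The resulting string can be pasted into the Filter editor as-is.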
Select
The Select parameter defines what field data is returned in the result list. You can select any of the built-in fields or Data Elements defined in the Indexing Behavior. This can be exceptionally helpful when navigating indexes with a large number of fields. Multiple fields can be selected using a comma-separated list (e.g. Field1,Field2,Field3).
Order By
Order By is an optional parameter that will define how the search results are sorted.
- Any field in the index can be used to sort results.
- The field's value type will determine how items are sorted.
- String values are sorted alphabetically.
- Datetime values are sorted by oldest or newest date.
- Numerical value types are sorted smallest to largest or largest to smallest.
- Sort order can be ascending or descending.
- Add asc after the field's name to sort in ascending order. This is the default direction.
- Add desc after the field's name to sort in descending order.
- Multiple fields may be used to sort results.
- Separate each sort expression with a comma (e.g. Field1 desc,Field2).
- The leftmost field will be used to sort the full result list first, then it's sub-sorted by the next, then sub-sub-sorted by the next, and so on.
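To make the multi-field ordering concrete, the following sketch parses an Order By expression and applies the same ordering to a small in-memory list. It only illustrates the sort semantics described above; it is not how Azure evaluates the expression, and the Vendor and Total fields are hypothetical.

```python
from typing import Dict, List, Tuple


def parse_order_by(expression: str) -> List[Tuple[str, bool]]:
    """Split 'Field1 desc,Field2' into (field, descending) pairs."""
    clauses = []
    for part in expression.split(","):
        tokens = part.split()
        if not tokens:
            continue  # ignore empty segments such as a trailing comma
        direction = tokens[1].lower() if len(tokens) > 1 else "asc"  # asc is the default
        if direction not in ("asc", "desc"):
            raise ValueError(f"unknown sort direction: {direction}")
        clauses.append((tokens[0], direction == "desc"))
    return clauses


def sort_rows(rows: List[Dict], expression: str) -> List[Dict]:
    """Sort rows so the leftmost field takes priority, then the next, and so on."""
    result = list(rows)
    # Python's sort is stable, so sorting by the least significant field first
    # and the leftmost field last yields the combined ordering.
    for field_name, descending in reversed(parse_order_by(expression)):
        result.sort(key=lambda row: row[field_name], reverse=descending)
    return result


rows = [
    {"Vendor": "Acme", "Total": 50},
    {"Vendor": "Bolt", "Total": 200},
    {"Vendor": "Acme", "Total": 75},
]
ordered = sort_rows(rows, "Vendor,Total desc")
print([(row["Vendor"], row["Total"]) for row in ordered])
# [('Acme', 75), ('Acme', 50), ('Bolt', 200)]
```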