AI Search and the Search Page
This article is about an older version of Grooper (2024). Information may be out of date and UI elements may have changed.


You may download the ZIP(s) below and upload them into your own Grooper environment (version 2024). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

Put simply, Azure AI Search will make it easier to store and retrieve your documents in Grooper. To understand how, let's first understand what Grooper has been.

Historically, Grooper has been a transient platform for document processing:

  • documents come in
  • data is collected from those documents
  • the data and documents are pushed out of Grooper to some place

It has never been a place to store documents and/or their data.

While it has been possible to keep Batches and their content in Grooper, it has never been a best practice, nor has it been convenient to do so. You could, theoretically, devise some hierarchical folder and naming convention by which to organize Batches in the node tree, but this is time consuming and probably not even that useful. Say you wanted to retrieve all "Invoices" that have a "Total Amount" over "$1,000.00". Without "indexing" the documents and their data, and the ability to "query" that index, this would be extremely time consuming at best, even if they're nicely organized. The criteria by which you organize documents one day might not align with the way you choose to search for them later.

By using Grooper's implementation of Azure AI Search you will be able to quickly and efficiently index your documents and their data to allow for ease of retrieval as well as gain a deeper understanding of them.

Microsoft Azure AI Search

Azure AI Search, formerly known as Azure Cognitive Search, is a cloud-based search-as-a-service solution provided by Microsoft Azure. It has allowed our developers to build a sophisticated search experience into Grooper. Here are some key features and capabilities:

  • Full-Text Search: Azure AI Search supports full-text search with capabilities like faceting, filtering, and scoring, allowing users to search through large volumes of text efficiently.
  • Customizable Indexing: Developers can define custom indexes tailored to their specific data schema. This flexibility allows for a more relevant and precise search experience.
  • Scalability: The service can scale up or down based on the workload, making it suitable for applications of all sizes.
  • Security and Compliance: Azure AI Search ensures data security and compliance with industry standards, offering features like role-based access control (RBAC), data encryption, and integration with Active Directory.
  • APIs and SDKs: Azure AI Search provides REST APIs and client libraries for various programming languages, making it easy to integrate with different types of applications.

Need to set up your own AI Search service in Azure? Check out our quickstart guide to get started.


Integration with Grooper

  • API Integration: Grooper can leverage Azure AI Search's REST APIs to automate the indexing of documents and retrieval of search results. This integration can be built into Grooper's workflow to ensure seamless data processing and search capabilities.
  • Security and Compliance: Both Grooper and Azure AI Search offer robust security features. Integrating these ensures that document processing and search operations are secure and compliant with industry standards.
  • Indexing Processed Documents: Once Grooper processes and extracts data from documents, this data can be sent to Azure AI Search for indexing. This allows users to search through the processed data quickly and efficiently.
    • Indexing is an intake process that loads content into the Azure AI Search service and makes it searchable. Through Azure AI Search, inbound text is processed into tokens and stored in inverted indexes, and inbound vectors are stored in vector indexes. The document format that Azure AI Search can index is JSON (a minimal sketch of this index-then-query exchange follows this list).
  • Querying Indexed Documents and Data: Once Azure AI Search has indexed documents and their data from Grooper, users can leverage powerful query syntax like Lucene and OData to efficiently retrieve the information from their documents.
    • Querying can happen once an index is populated with searchable content, when Grooper sends query requests to a search service and handles responses. All query execution is over a search index that you control.
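
To make the index-then-query exchange above concrete, here is a minimal sketch of what it looks like at the REST level when done outside of Grooper (Grooper performs the equivalent calls for you). The endpoint, API key, index name, and field names (id, content, invoiceNo) are placeholder assumptions, not names Grooper generates.

import requests

# Placeholder connection details -- substitute your own Azure AI Search values.
ENDPOINT = "https://my-search-service.search.windows.net"
API_KEY = "<admin-api-key>"
INDEX = "invoices-index"
API_VERSION = "2023-11-01"
HEADERS = {"Content-Type": "application/json", "api-key": API_KEY}

# 1. Indexing: push one JSON document into an existing index.
upload_body = {
    "value": [
        {
            "@search.action": "upload",   # add the document, or overwrite it if it exists
            "id": "doc-0001",             # the index's key field
            "content": "Invoice from Acme Corp. Total Amount $1,250.00 ...",
            "invoiceNo": "10045",
        }
    ]
}
r = requests.post(f"{ENDPOINT}/indexes/{INDEX}/docs/index",
                  params={"api-version": API_VERSION}, headers=HEADERS, json=upload_body)
r.raise_for_status()

# 2. Querying: full-text search over the index that was just populated.
query_body = {"search": "acme", "select": "id,invoiceNo"}
r = requests.post(f"{ENDPOINT}/indexes/{INDEX}/docs/search",
                  params={"api-version": API_VERSION}, headers=HEADERS, json=query_body)
print(r.json()["value"])   # matching documents, returned as JSON

The upload call tokenizes and stores the document's text in the inverted index; the search call then retrieves it by querying that index.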

How To

Integrating Azure AI Search with Grooper will require a few setup steps:

  1. Create an Azure AI Search Service
    • This is the only step done outside of Grooper.
  2. Configure the AI Search Repository Option
  3. Configure an Indexing Behavior on a Content Type
  4. Create the search index
  5. Index documents and data from Grooper

Create an Azure AI Search Service

Configure the AI Search Repository Option

To search documents in Grooper, we use Azure's AI Search service. In order to connect to an Azure AI Search service, the AI Search option must be added to the list of Repository Options in Grooper. Here, users will enter the Azure AI Search URL endpoint where calls are issued and an admin's API key. Both of these can be obtained from the Microsoft Azure portal once you have added an Azure AI Search resource.

With AI Search added to your Grooper Repository, you will be able to add an Indexing Behavior to one or more Content Types, create a search index, index documents and search them using the Search Page.

  1. Select the root object in the node tree.
  2. Click the ellipsis button on the Options property.
  3. Click the "Add" button in the "Options" window.
  4. Select AI Search from the drop-down menu.


  1. Enter the Azure AI Search URL into the URL property.
  2. Add your Azure AI Search API key to the API Key property.
  3. Click the "OK" button to close the "Options" window.
  4. Click the "Save" button to save all changes.

Configure an Indexing Behavior on a Content Type

An Indexing Behavior allows documents (Batch Folders) to be indexed via AI Search. Once indexed, users can search for and retrieve documents from the Search Page.

Before indexing documents, you must add an Indexing Behavior to the Content Types you want to index. Most typically, this will be done on a Content Model. All child Document Types will inherit the Indexing Behavior and its configuration (More complicated Content Models may require Indexing Behaviors configured on multiple Content Types).


The Indexing Behavior defines:

General

  • The index's Name in Azure.
    • Be aware, Azure has some naming rules for its index names and metadata field names. These can be found at the link below.
    https://learn.microsoft.com/en-us/rest/api/searchservice/naming-rules
  • Which documents are added to the index.
    • Only documents that are classified as the Indexing Behavior's Content Type OR any of its child Content Types will be indexed.
    • In other words, when set on a Content Model only documents classified as one of its Document Types will be indexed.
  • What Included Elements are added to the search index (including which Data Elements from a Data Model are included, if any).
  • What Built in Fields are added to the search index. Note, if you leverage any of these built in fields and also want to use Included Elements, there cannot be naming conflicts between the Included Elements and the Built in Fields. The Built in Fields are typical metadata points including:
    • Content: Index the full text content of the document. This would be the text generated by the Recognize activity.
    • Attachment Name: Index the document's attachment filename. This would be the original name of the file as it existed before being acquired by Grooper.
    • Type Name: Index the name of the document's Content Type.
    • Page Count: Index the number of pages within the document.
    • Flag Message: Index the flag message associated with the document. This would include auto-generated messages like whether or not "required" fields were empty, a type of validation error, or even null.
    • Path: Index the path in the "Batches" folder of the node tree where the document exists.
    • All: Enable all Built in Fields.
  • Page Limit: The maximum number of pages to include when indexing the full text content of a document.
  • Flatten: Specifies that the search index should be flattened. "Flattening" a search index generally refers to the process of transforming a hierarchical or nested data structure into a flat, non-hierarchical structure. In the context of a search index, this could involve several different actions depending on the specific needs and the data structure being indexed.
  • Auto Index: If set to True, specifies that the Indexing Service should automatically add new documents to this search index, remove deleted documents from the index and update "changed" documents present in the search index. When set to False (default setting), the Indexing Service will not add new documents to the search index. However, it will still remove deleted documents and update any "changed" documents already in the search index.
    • A "changed" document is one whose index metadata changes. If the data of any of the Included Elements or the Built in Fields change, the Indexing Service will update the documents index data.
    • "Changed" document example: An "Invoice Number" Data Field is one of the Included Elements for the Indexing Behavior. A document is already present in the search index before it is extracted. Then, the document runs through the Extract step of a Batch Process, populating the "Invoice Number" field in its Data Model. This would constitute a "change" and the index would be updated for the document.

Search Page Options

  • Access List: If set, specifies a restricted set of users who may search this index. If not set, all authenticated users may search this index.
  • AI Analysts: An optional list of AI Analysts available for chat sessions regarding the search result set.
  • Generators: An optional list of AI Generators to be available for generating documents from the search result set. This is a collection of LLM Models, Instructions, and Examples that define how an AI would structure said documents.


  1. Select the Content Model from the provided Project.
  2. Click the ellipsis button for the Behaviors property.
  3. In the "Behaviors" window click the "Add" button.
  4. Select Indexing Behavior from the drop-down menu.


  1. An Indexing Behavior will be added to the collection.
  2. For our purposes the bulk of the properties can be left to their default setting. The only thing we'll change is the Included Elements property. Click the ellipsis button for this property.
  3. In the "Included Elements" window ALT+LeftClick the Content Model to select it and all child elements.
  4. Click the "OK" button to close the "Included Elements" window.
  5. Click the "OK" button to close the "Behaviors" window.
  6. Be sure to save all changes.

The Name property determines the index's name in Azure. This name:

  • Must be all lower case.
  • Can only contain letters, numbers, dashes (-) or underscores (_)
  • Cannot contain consecutive dashes or underscores.
  • Must be between 2 and 128 characters long (a quick validation sketch follows this list).
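
If you want to sanity-check a candidate Name before saving the Indexing Behavior, a small helper like the one below applies the rules as stated above. This is a hypothetical sketch; Azure's authoritative naming rules are at the link in the General section.

import re

def is_valid_index_name(name: str) -> bool:
    """Check a candidate index name against the rules listed above."""
    if not (2 <= len(name) <= 128):
        return False
    if not re.fullmatch(r"[a-z0-9_-]+", name):    # lower-case letters, numbers, dashes, underscores only
        return False
    if "--" in name or "__" in name:              # no consecutive dashes or underscores
        return False
    return True

print(is_valid_index_name("invoices-index"))      # True
print(is_valid_index_name("Invoices Index"))      # False (uppercase letter and a space)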

Create the search index

This will create the search index in Azure. Without the search index created, documents can't be added to an index. This only needs to be done once per index.
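
For orientation only: the "Create Search Index" command performs the index-creation call against Azure for you. Done by hand through the REST API, creating an index looks roughly like the following sketch; the index name and field list here are placeholder assumptions, not the schema Grooper actually generates from your Indexing Behavior.

import requests

ENDPOINT = "https://my-search-service.search.windows.net"   # placeholder
API_KEY = "<admin-api-key>"                                 # placeholder
HEADERS = {"Content-Type": "application/json", "api-key": API_KEY}

index_definition = {
    "name": "invoices-index",
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},                # document key
        {"name": "content", "type": "Edm.String", "searchable": True},    # full text content
        {"name": "invoiceNo", "type": "Edm.String", "searchable": True, "filterable": True},
        {"name": "invoiceDate", "type": "Edm.DateTimeOffset", "filterable": True, "sortable": True},
        {"name": "totalAmount", "type": "Edm.Double", "filterable": True, "sortable": True},
    ],
}
# PUT creates the index if it does not exist (or updates it if it does).
r = requests.put(f"{ENDPOINT}/indexes/{index_definition['name']}",
                 params={"api-version": "2023-11-01"}, headers=HEADERS,
                 json=index_definition)
r.raise_for_status()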

  1. Right-click on the Content Model from the provided Project.
  2. Choose "Search > Create Search Index" from the pop-out menu.
  3. Click the "Execute" button in the "Create Search Index" window.


  1. If you navigate to your Azure AI Search resource in a web browser...
  2. ...and go to your indexes...
  3. ...you will see a new index named after the Name property of the Indexing Behavior of the Content Type this command was used against.

Index documents and data from Grooper

With the search index created you can now add data to the search index. Documents must be classified in Grooper before they can be indexed. Only Document Types/Content Types inheriting an Indexing Behavior are eligible for indexing.

There are four ways to index documents (each ultimately results in the same kind of index update, sketched after this list):

  1. "Add to Index" Batch Folder Object Command - This is a right-click command applied to a single document. It will index a single document. This method is best for one-off testing and submitting small numbers of documents to an index.
  2. The "Submit Indexing Job" command - This is a right-click command executed from the Content Model/Content Type configured with the Indexing Behavior. It will index all documents currently in the Grooper Repository that inherit the Content Type's Indexing Behavior. This method is useful to index a large number of documents that already exist somewhere in a Grooper Repository.
  3. Execute Activity with "Add to Index" command - This gives us a way to index documents in a Batch Process. When configured with an "Add to Index" command, the Execute activity will apply that command to each document in a Batch (therefore indexing them). This is a great way to automate document indexing at a specific point within a Batch Process's flow.
  4. Indexing Service - This is the "set it and forget it" method for document indexing. The Indexing Service is a Grooper service that runs in the background, periodically polling the Grooper Repository for new documents that need to be indexed, documents whose data has changed to update the index and documents that have been deleted to be removed from the index. This is a great way to automate document indexing in the background.
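
Whichever method you choose, the service-side effect is the same: Grooper sends each document's JSON metadata to the index. As a rough sketch of the equivalent raw REST call (the field names, key values, endpoint, and index name are placeholders), adding, updating, and deleting index entries all go through the same documents endpoint:

import requests

ENDPOINT = "https://my-search-service.search.windows.net"   # placeholder
API_KEY = "<admin-api-key>"                                 # placeholder
INDEX = "invoices-index"                                    # placeholder
HEADERS = {"Content-Type": "application/json", "api-key": API_KEY}

batch = {
    "value": [
        # "upload" adds a new document or overwrites an existing one (an update).
        {"@search.action": "upload", "id": "doc-0001",
         "invoiceNo": "10045", "totalAmount": 1250.00},
        # "delete" removes a document from the index by its key.
        {"@search.action": "delete", "id": "doc-0099"},
    ]
}
r = requests.post(f"{ENDPOINT}/indexes/{INDEX}/docs/index",
                  params={"api-version": "2023-11-01"}, headers=HEADERS, json=batch)
r.raise_for_status()
print(r.json())   # per-document status for each action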

"Add to Index" Batch Folder Object Command

This is the most manual way of adding to the search index. This is an object command done on a per document basis (or via multi-selecting) in a Batch Viewer.


  1. Select the provided Batch from the node tree.
  2. Click on the "Viewer" tab.
  3. Notice that the document in this Batch is classified as a Document Type that is inheriting from the Content Model with the Indexing Behavior.
  4. Right-click on the document from the provided Batch in the Batch Viewer.
  5. Select "Search > Add To Index" from the pop-out menu.
  6. Click "Execute" in the "Add to Index" window.


Optional: How to verify your index in Azure
  1. Navigate to your Azure AI Search resource in a web browser.
  2. Click the "Indexes" button.
  3. Click on the listed index in use.


  1. Click the "Search" button to perform an open query.
  2. You should see the JSON results of the performed query in the "Results" portion of the site.

The "Submit Indexing Job" command

This is another manual approach as it also involves an object command. Because this command is applied to a Content Type, however, it will index all documents that are classified as that Content Type or inherit from it.


First things first, we need to make sure an Activity Processing Service is running for our repository.

  1. If you click on the "Machines" folder in the node tree...
  2. ...and select the Grooper server where your repository is hosted...
  3. ...you should see an Activity Processing Service is installed and running.
    • If you do not see the appropriate service for your repository of Grooper, please visit the Grooper Command Console article for information on installing the appropriate service.


Once you've confirmed an Activity Processing Service is installed and running, you can use the "Submit Indexing Job" object command.

  1. Right-click the Content Model from the provided Project.
  2. Select "Search > Submit Indexing Job" from the pop-out menu.
  3. Take note of the "Added Documents", "Updated Documents", and "Deleted Documents" properties.
  4. Click the "Execute" button in the "Submit Indexing Job" window.
    • If there are no "added", "updated", or "deleted" documents, the window will close and no job will be submitted.
    • If there are, however, an "Indexing Job" will be created and the active Activity Processing Service will complete the tasks of the job to update the index.

Execute Activity with "Add to Index" command

This is an automated approach as it will create an "Indexing Job" as part of a Batch Process. This will perform the exact same command as the "Add to Index" object command explained earlier. When an Execute step is reached in a Batch Process a job will be created with a task for each document in scope.

  • If the document in scope does not need to be added, updated, or deleted, no task will be created. If that is true for all documents in scope, no job will be created.
  • An Activity Processing Service will need to be installed and running for the given repository in order for the job to be picked up and worked.


  1. Right-click the provided Project.
  2. Select "Add > Batch Process" from the pop-out menu.
  3. Name the Batch Process.
  4. Click the "Execute" button from the "Add" window.


  1. Right-click on the newly created Batch Process.
  2. Select "Add Activity > Utilities > Execute" from the pop-out menu.
  3. The step will be named based on the Activity chosen.
  4. Click the "Execute" button in the "Add Activity" window.


  1. Set the Scope and Folder Level properties.
    • For our purposes, a Scope of Folder and a Folder Level of 1 are appropriate.
  2. Click the ellipsis button for the Steps property.
  3. Click the "Add" button in the "Steps" window.
  4. Choose Execute Command from the drop-down window.


  1. This will add an "Execute Command" to the collection of "Steps".
  2. Click the drop-down button for the Command property.
  3. Select Batch-Folder > Add to Index from the drop-down menu.
  4. Click the "OK" button from the "Steps" window.
  5. Be sure to save all changes.


  1. Click on the "Activity Tester" tab.
  2. Select the document from the provided Batch in the "Batch Viewer".
  3. Click the "Start" button to create an "Indexing Job" for the selected document.

Indexing Service (Our best practice method)

This is the most automated way to index documents. The Indexing Service will periodically poll the Grooper database to determine if classified documents that inherit from a Content Type with an Indexing Behavior need to be added, updated, or deleted. If any do, it will submit an "Indexing Job" with tasks for each document that needs to be added, updated, or deleted.

Keep in mind how the Auto Index property of an Indexing Behavior described above affects this service.

  • If set: will add, update, and/or remove documents from the index
  • If not set: will only remove deleted documents or update changed documents already in the search index

Also keep in mind:

  • an Indexing Service will need to be installed and running for the given repository in order for the "Indexing Job" to be created.
  • an Activity Processing Service will need to be installed and running for the given repository in order for the job to be picked up and worked.


First things first, we need to make sure an Indexing Service and an Activity Processing Service are running for our repository.

  1. If you click on the "Machines" folder in the node tree...
  2. ...and select the Grooper server where your repository is hosted...
  3. ...you should see an Indexing Service and an Activity Processing Service installed and running.
    • If you do not see the requisite services for your Grooper repository, please visit the Grooper Command Console article for information on installing the appropriate services.


At this point there's nothing really left to do but let the service poll the database and look for updates to submit to the index.

Search Page

Once you've got indexed documents, you can start searching for documents in the search index! The Search Page allows you to find documents in your search index.


The image below gives an overview of the Search Page interface.

  • Use the following inputs to perform the query shown in this image and get a result with the setup configured so far:
    • Search:
      invoiceNO: 1*
    • Filter:
      invoiceDate ge 2022-01-01 and invoiceDate le 2024-01-31
    • Select:
      totalAmount
    • Order By:
      poNumber
  • Once the provided inputs are entered into the appropriate fields, click the magnifying glass button to perform the query.
  • Once the query has been executed, select the returned document from the bottom portion of the Search page's UI, the portion that displays query results.
  • Once selected you should see the result appear in the document viewer.

Continue reading for more information on these individual fields of the Search page.


The Search Page allows you to build a search query using four components:

  • Search: This is the only required parameter. Here, you will enter your search terms, using the Lucene query syntax.
  • Filter: An optional filter to set inclusion/restriction criteria for documents returned, using the OData syntax.
  • Select: Optionally selects which fields you want displayed for each document.
  • Order By: Optionally orders the list of documents returned.

Search

The Search configuration searches the full text of each document in the index. This uses the Lucene query syntax to return documents. For a simple search query, just enter a word or phrase (enclosed in quotes "") in the Search editor. Grooper will return a list of any documents with that word or phrase in their text data.

Lucene also supports several advanced querying features, including:

  • Wildcard searches: ? and *
    Use ? for a single wildcard character and * for multiple wildcard characters.
  • Fuzzy matching: searchTerm~
    Fuzzy search can only be applied to single words (not phrases in quotes). Azure's full fuzzy search documentation can be found here: https://learn.microsoft.com/en-us/azure/search/search-query-fuzzy
    • Azure's implementation of "fuzzy matching" is not the same as Grooper's. Terms are matched based on a character edit distance of 0-2.
      • grooper~0 would only match "grooper" exactly.
      • grooper~ or grooper~1 would match any word that was up to one character different. For example, "trooper" "groopr" or "groopers".
      • grooper~2 would match any word that was up to two characters different. For example, "trouper" "looper" "groop" or "grooperey".
  • Boolean operators: AND OR NOT
    Boolean operators can help improve the precision of a search query.
  • Field searching: fieldName:searchExpression
    Search built in fields and extracted Data Model values. For example, Invoice_No:8* would return any document whose extracted "Invoice No" field started with the number "8"
  • Regular expression matching: /regex/
    Enclose a regex pattern in forward slashes to incorporate it into the Lucene query. For example, /[0-9]{3}[a-z]/
    • Lucene regex searches are matched against single words/terms.
    • Lucene regex does not use the Perl Compatible Regular Expressions (PCRE) library. Most notably, this means it does not use single-letter character classes, such as \d to match a single digit. Instead, enter the full character class in brackets, such as [0-9] to match a single digit.

Azure's full documentation of Lucene query syntax can be found here: https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax
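
For reference, these Lucene features are available when a query is issued against the index with the full query syntax enabled. Below is a minimal sketch of sending one of the example queries above directly over REST (the endpoint, key, and index name are placeholders; the Search Page builds an equivalent request for you).

import requests

ENDPOINT = "https://my-search-service.search.windows.net"   # placeholder
API_KEY = "<query-api-key>"                                 # placeholder
INDEX = "invoices-index"                                    # placeholder

body = {
    "queryType": "full",                        # enable the full Lucene syntax described above
    "search": "Invoice_No:8* AND grooper~1",    # fielded wildcard search combined with a fuzzy term
}
r = requests.post(f"{ENDPOINT}/indexes/{INDEX}/docs/search",
                  params={"api-version": "2023-11-01"},
                  headers={"Content-Type": "application/json", "api-key": API_KEY},
                  json=body)
print(r.json()["value"])   # matching documents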

Filter

First you search, then you filter. The Filter parameter specifies criteria for documents to be included or excluded from the search results. This gives users an excellent mechanism to further fine tune their search query. Commonly, users will want to filter a search set based on the field values. Both built in index fields and/or values extracted from a Data Model can be incorporated into the filter criteria.

Azure AI Search uses the OData syntax to define filter expressions. Azure's full OData syntax documentation can be found here: https://learn.microsoft.com/en-us/azure/search/search-query-odata-filter

The query parameters entered above in the example included a filter:
invoiceDate ge 2022-01-01 and invoiceDate le 2024-01-31
This is OData syntax, slightly shortened: the Search page UI saves you from having to write the $filter= prefix yourself. Written out fully in OData, it would look like the following:
$filter=invoiceDate ge 2022-01-01 and invoiceDate le 2024-01-31

This query therefore is filtering by the "invoiceDate" Data Element for results on or after January 1st, 2022 and on or before January 31st, 2024.

Select

The Select parameter defines what field data is returned in the result list. You can select any of the built in fields or Data Elements defined in the Indexing Behavior. This can be exceptionally helpful when navigating indexes with a large number of fields. Multiple fields can be selected using a comma separated list (e.g. Field1,Field2,Field3)

Like the Filter parameter above, Select is shortening the OData syntax for us. In the example it was:
totalAmount
In strict OData syntax it would be:
$select=totalAmount

The above query will only display the "totalAmount" Data Element in the search results.

Order By

Order By is an optional parameter that will define how the search results are sorted.

  • Any field in the index can be used to sort results.
  • The field's value type will determine how items are sorted.
    • String values are sorted alphabetically.
    • Datetime values are sorted by oldest or newest date.
    • Numerical value types are sorted smallest to largest or largest to smallest.
  • Sort order can be ascending or descending.
    • Add asc after the field's name to sort in ascending order. This is the default direction.
    • Add desc after the field's name to sort in descending order.
  • Multiple fields may be used to sort results.
    • Separate each sort expression with a comma (e.g. Field1 desc,Field2)
    • The leftmost field will be used to sort the full result list first, then it's sub-sorted by the next, then sub-sub-sorted by the next, and so on.
Like the Filter and Select parameters above, Order By is shortening the OData syntax for us. In the example it was:
poNumber
In strict OData syntax it would be:
$orderby=poNumber
The above query will order the returned documents by the value of the "poNumber" Data Element. A direction was not included, so the default of "ascending" is used, which would translate to:
$orderby=poNumber asc
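
Putting the four parameters together: behind the scenes, the Search, Filter, Select, and Order By inputs correspond roughly to a single query request against the index. The following is a hedged sketch of what an equivalent raw REST request might look like, reusing the example values from this article (the endpoint, key, and index name are placeholders).

import requests

ENDPOINT = "https://my-search-service.search.windows.net"   # placeholder
API_KEY = "<query-api-key>"                                 # placeholder
INDEX = "invoices-index"                                    # placeholder
HEADERS = {"Content-Type": "application/json", "api-key": API_KEY}

# The four Search Page parameters map roughly onto one search request body.
body = {
    "queryType": "full",                                                    # full Lucene syntax
    "search": "invoiceNO: 1*",                                              # Search
    "filter": "invoiceDate ge 2022-01-01 and invoiceDate le 2024-01-31",    # Filter
    "select": "totalAmount",                                                # Select
    "orderby": "poNumber asc",                                              # Order By
}
r = requests.post(f"{ENDPOINT}/indexes/{INDEX}/docs/search",
                  params={"api-version": "2023-11-01"}, headers=HEADERS, json=body)
for doc in r.json()["value"]:
    print(doc)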

Search Page Commands

There are several new commands users can execute from the Search page. These commands give users a new way of starting and continuing work in Grooper. They can be divided into three sets: "search query commands", "result set commands", and "document commands".

Search Query Commands

These commands are accessed above the query editor. They are used in conjunction with written queries.

  • Execute Query - This will execute the query written in the Search, Filter, Order By, and Select parameters.
  • Clear Query - This will clear all query parameters.
  • AI Generate - This will leverage AI Generators defined on the Indexing Behavior associated with this index to allow users to craft documents based on the results of the query.
  • Favorites - This will allow users to store and retrieve queries they use frequently.

Result Set Commands

These commands can be accessed from a dropdown list in the Search page UI. They can be applied to the entire result set or a selection from the result set.

  • Create Batch - Creates a Batch from the result set and submits an Import Job to start processing it.
  • Submit Job - Submits a Processing Job for documents in the result set. This command is intended for "on demand" activity processing.
  • Analyst Chat - Select an AI Analyst to start a chat session with the result set.
  • Download - Download a document, generated from the result set. May be one of the following:
    • Download PDF - Generates a single bookmarked PDF with optional search hit highlights.
    • Download ZIP - Generates a ZIP file containing each document in the result set.
    • Download CSV - Generates a CSV file from the result set's data fields.
    • Download Custom - Generates a custom document using an "AI Generator"

Document Commands

These commands can be accessed from the Search page when right-clicking a document in the result list.

  • Go to Document - Navigates users with Design page permissions to that document in the Grooper node tree.
  • Review Document - Opens the document in a Review Viewer with a Data View, Folder View and Thumbnail View.
  • Copy Link - Creates a URL link to the document. When clicking the link users will be taken to a Review Viewer with a Data View, Folder View and Thumbnail View.

More on Lucene

The Lucene query language is a powerful and flexible search language used for querying full-text search engines based on the Apache Lucene library. Lucene provides the foundation for various search platforms, including Elasticsearch, Solr, and most importantly for Grooper, Azure AI Search. The query language allows users to perform complex searches using a syntax that supports a range of operators and expressions.

Key Features of Lucene Query Language:

  • Boolean Operators: Use AND, OR, and NOT to combine or exclude terms.
  • Field-specific Searches: Query specific fields in the indexed documents (e.g., title:Azure AND content:AI).
  • Wildcards: Use * (matches multiple characters) and ? (matches a single character) within terms.
  • Phrase Searches: Use quotes to search for exact phrases (e.g., "artificial intelligence").
  • Proximity Searches: Find terms within a certain distance from each other (e.g., "cloud computing"~5).
  • Fuzzy Searches: Use the tilde ~ symbol to find terms with similar spellings (e.g., search~).
  • Range Searches: Search within a range of values (e.g., date:[20230101 TO 20231231]).

How Azure AI Search Uses Lucene Query Language

  • Query Syntax: Azure AI Search allows users to write queries using Lucene syntax directly in the search requests. This enables precise and complex searches, including filtering, scoring, and relevance tuning.
  • Fielded Searches: Azure AI Search supports querying specific fields in your index, much like Lucene. For example, you can search for documents where a certain field matches a given term (e.g., fieldName:searchTerm).
  • Boolean Logic: Users can combine multiple search criteria using Boolean operators to refine search results. This is useful in narrowing down search results by combining conditions (e.g., category:Technology AND date:[20230101 TO 20231231]).
  • Faceting and Filtering: Azure AI Search leverages Lucene's capabilities to perform faceted searches, which allow users to filter results by different categories or fields (e.g., price ranges, ratings).
  • Scoring Profiles: Azure AI Search allows customization of result ranking by defining scoring profiles, where Lucene query clauses can influence the scoring based on certain fields or conditions.
  • Highlighting: Azure AI Search can highlight the parts of the document that match the search query, using Lucene's query syntax to determine what to highlight.
  • Integration with REST APIs: Users can pass Lucene queries in the search parameter when interacting with Azure AI Search's REST APIs. This enables developers to craft specific search experiences directly within their applications.

Examples

Let's consider a scenario where you have a series of invoice documents indexed in Azure AI Search. The documents include key data points like invoice_id, vendor_name, invoice_date, amount, status, and line_items.
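
For these examples, assume each indexed invoice is a JSON document shaped roughly like the following. This is a hypothetical sketch using the field names from the scenario above; line_items holds a collection of sub-objects.

# A hypothetical indexed invoice document (written here as a Python literal of the JSON shape).
invoice_doc = {
    "invoice_id": "INV-10045",
    "vendor_name": "Acme Corp.",
    "invoice_date": "2024-01-15",
    "amount": 12500.00,
    "status": "Unpaid",
    "line_items": [
        {"description": "laptop", "quantity": 2, "unit_price": 1200.00},
        {"description": "docking station", "quantity": 2, "unit_price": 150.00},
    ],
}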

Example 1: Finding All Invoices from a Specific Vendor

Suppose you want to find all invoices from the vendor Acme Corp.

  • Lucene Query:
    vendor_name:"Acme Corp."
  • Explanation:
This query searches for all documents where the vendor_name field exactly matches "Acme Corp.".

Example 2: Finding Invoices Above a Certain Amount

Let's say you need to find all invoices where the amount is greater than $10,000.

  • Lucene Query:
    amount:[10000 TO *]
  • Explanation:
This query searches for all invoices where the amount is greater than or equal to 10,000. The * wildcard indicates there is no upper limit in this range.

Example 3: Finding Unpaid Invoices within a Date Range

You want to find all invoices with the status of Unpaid that were issued in January 2024.

  • Lucene Query:
    status:Unpaid AND invoice_date:[20240101 TO 20240131]
  • Explanation:
The query searches for invoices that have a status of Unpaid and an invoice_date between January 1, 2024, and January 31, 2024.

Example 4: Searching for Specific Items in Line Items

Suppose you want to find invoices that include a line item with the description laptop.

  • Lucene Query:
    line_items.description:laptop
  • Explanation:
This query looks into the line_items table (think of tables as containing arrays of information) and searches for any line item where the description field contains the word laptop.

Example 5: Combining Multiple Conditions

Let's say you want to find all invoices from Acme Corp. issued after July 1, 2024, with an amount greater than $5,000.

  • Lucene Query:
    vendor_name:"Acme Corp." AND invoice_date:[20240701 TO *] AND amount:[5000 TO *]
  • Explanation:
This query combines multiple conditions to filter invoices from Acme Corp. where the invoice_date is after July 1, 2024, and the amount is greater than $5,000.

These examples demonstrate how you can leverage the Lucene query language to perform detailed searches on your indexed invoice documents in Azure AI Search. By using these queries, you can quickly and effectively find specific documents that match complex criteria.

In essence, Azure AI Search uses the Lucene query language to enable complex and customizable search functionality, giving developers and users the ability to craft tailored search queries that meet their specific needs.

More on OData

The OData (Open Data Protocol) query language is a standardized protocol for querying and updating data, particularly in web services. It is built on RESTful principles and allows for querying data in a simple and consistent way across various data sources. OData is widely used in services like Azure AI Search, where it complements other query languages, such as Lucene, by providing additional filtering, sorting, and pagination capabilities.

Key Features of OData Query Language

  • Filtering ($filter): Apply conditions to retrieve only the data that matches specified criteria.
    OData's $filter option allows you to apply precise filters to your search results. For example, you can filter results based on date ranges, numerical values, or text matches. This is especially useful when you want to refine search results according to specific conditions.
  • Ordering ($orderby): Sort the results based on specified fields.
    With $orderby, you can sort search results by one or more fields, either in ascending or descending order. This is useful when you need to order search results by relevance, date, or other criteria.
  • Selection ($select): Specify which fields to include in the response.
    OData's $select allows you to specify which fields to include in the search results. This can reduce the payload size by only returning the necessary fields from the documents.
  • Top and Skip ($top and $skip): Control pagination by specifying the number of records to return and the offset.
    OData provides $top and $skip parameters to control pagination in search results. $top specifies how many results to return, while $skip determines how many records to skip. This is useful for handling large datasets where you want to present data in smaller chunks.
  • Expanding ($expand): Retrieve related entities in a single query (useful for relationships in data models).
  • Counting ($count): Get the total number of records matching a query.
    The $count option enables you to retrieve the total number of documents that match a query without retrieving the actual documents. This can be combined with other query options to get insights into the dataset size.
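
In a raw REST call against Azure AI Search, these OData options appear as query-string parameters alongside the search text; the Search Page's Filter, Order By, and Select inputs correspond to $filter, $orderby, and $select. Below is a minimal sketch combining several of them (the endpoint, key, and index name are placeholders).

import requests

ENDPOINT = "https://my-search-service.search.windows.net"   # placeholder
API_KEY = "<query-api-key>"                                 # placeholder
INDEX = "invoices-index"                                    # placeholder

params = {
    "api-version": "2023-11-01",
    "search": "*",                                          # match everything, then refine with OData
    "$filter": "status eq 'Unpaid'",
    "$orderby": "amount desc",
    "$select": "invoice_id,vendor_name,amount",
    "$top": 10,
    "$skip": 20,
    "$count": "true",
}
r = requests.get(f"{ENDPOINT}/indexes/{INDEX}/docs",
                 headers={"api-key": API_KEY}, params=params)
data = r.json()
print(data.get("@odata.count"), "total matches")            # total count when $count=true
for doc in data["value"]:
    print(doc)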

Examples

Filter Invoices by Status and Date Range

  • OData Query:
    $filter=status eq 'Unpaid' and invoice_date ge 2024-01-01 and invoice_date le 2024-01-31
  • Explanation:
This query filters for invoices where the status is Unpaid and the invoice_date is between January 1, 2024, and January 31, 2024.

Sort Invoices by Amount in Descending Order

  • OData Query:
    $orderby=amount desc
  • Explanation:
This query sorts the invoices by the amount field in descending order, showing the highest amounts first.

Select Specific Fields from the Results

  • OData Query:
    $select=invoice_id,vendor_name,amount
  • Explanation:
This query returns only the invoice_id, vendor_name, and amount fields in the results, omitting other fields.

Pagination with Top and Skip

  • OData Query:
    $top=10&$skip=20
  • Explanation:
This query retrieves the third page of results, with 10 results per page (i.e., results 21-30).

Counting Matching Documents

  • OData Query:
    $count=true&$filter=status eq 'Paid'
  • Explanation:
This query returns the count of documents where the status is Paid without retrieving the documents themselves.

In Azure AI Search, OData is used to refine, organize, and manage the results of search queries. It works alongside Lucene to offer a robust and flexible querying mechanism, making it easier to handle complex data retrieval scenarios in search applications.