What's New in Grooper 2024
Revision as of 13:51, 17 June 2024
WORK IN PROGRESS!! Please excuse our mess. This article is under construction. |
Grooper version 2024 is here!
Moving Grooper fully into the web
Deploying Grooper over a web server is a more distributable, more secure, and more modern experience. Version 2022 started Grooper's foray into web development with a web client for user operated Review tasks. Versions 2023 and 2023.1 expanded our web client to incorporate all aspects of Grooper in the web client. Version 2024 fully cements our commitment to moving Grooper to a web-based application.
Thick client removal
In 2024, there is no longer a Grooper thick client (aka "Windows client"). There is only the Grooper web client. This opens Grooper up to several advantages for cloud-based app development and cloud-based deployments.
All thick client Grooper applications have an equivalent in the Grooper web client. Most of these are now pages you will navigate to from the web client. For those unfamiliar with the Grooper web client, refer to the table below for the web client equivalent versions of thick client apps in version 2024.
Former thick client application → Current web client equivalent
- Grooper Design Studio → The Design page
- Grooper Dashboard → The Batches page
- Grooper Attended Client → The Tasks page
- Grooper Kiosk → The Stats page (displaying stats queries in a browser window)
- Grooper Config → Grooper Command Console (GCC)
- Grooper Unattended Client → Replaced by "gcc services host" command in GCC
Grooper Command Console
Grooper Command Console (or GCC) is a replacement for the thick client administrative application, Grooper Config. Previous functions performed by Grooper Config can be accomplished in Grooper Command Console. This includes:
- Connecting to Grooper Repositories
- Installing and managing Grooper Services
- Managing licensing for self-hosted installations
Grooper Command Console is a command line utility. All functions are performed using command line commands rather than a "point and click" user interface. Users of previous versions may find the change jarring at first, but the command line interface has several advantages:
- Most administrative functions are accomplished with a single command. In Grooper Config, the same function would take several clicks. Once you are familiar with the GCC commands, Grooper Command Console ends up saving you time.
- Commands can be easily scripted. There was no easy way to procedurally execute the functions of Grooper Config, like creating a Grooper Repository or spinning up new Grooper services. GCC commands allow you to do this.
- Scaling services is much easier. In previous versions of Grooper, we ran proof-of-concept tests to ensure Grooper can scale in cloud deployments (such as using auto-scaling in Amazon AWS instances). However, in older Grooper versions, scaling Activity Processing services was somewhat clunky. Using GCC commands to spin up services makes this process much simpler. Grooper Command Console also has specific commands to make scaling with Docker containers simpler.
For more information about Grooper Command Console, visit the Grooper Command Console article.
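To illustrate the scripting advantage, here is a minimal Python sketch that assembles GCC command lines so they can be run in sequence. This is an assumption-laden illustration: apart from "gcc services host" (named in the table above), the subcommand and flag names are hypothetical placeholders, not documented GCC syntax.

```python
import subprocess

def gcc(*args, dry_run=True):
    """Build (and optionally run) a Grooper Command Console invocation.

    All subcommand names except "services host" (named in the table
    above) are hypothetical placeholders; consult the GCC documentation
    for the real syntax.
    """
    cmd = ["gcc", *args]
    if dry_run:
        return cmd  # return the command line instead of executing it
    return subprocess.run(cmd, check=True, capture_output=True, text=True)

# Script several administrative steps in sequence -- something that was
# not practical with the point-and-click Grooper Config application.
commands = [
    gcc("repository", "connect", "--server", "SQLHOST", "--database", "Grooper"),
    gcc("services", "host"),
]
```

Because the steps are plain commands, the same sequence can be dropped into a provisioning script or container entrypoint, which is what makes auto-scaling deployments simpler.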
Improved web UI: New icons!
Improved integrations with Large Language Models (LLMs)
Innovations in Large Language Models (or LLMs) have changed the landscape of artificial intelligence. Companies like OpenAI have developed LLM-based technologies, such as the GPT models behind ChatGPT, that are highly effective at natural language processing. Being fully committed to advancing our capabilities through new AI integrations, Grooper has vastly improved what we can do with LLM providers such as OpenAI.
Repository Options: LLM Connector
Repository Options are new to Grooper 2024. They add new functionality to the whole Grooper Repository. These optional features are added using the Options property editor on the Grooper Root node.
OpenAI and other LLM integrations are made by adding an LLM Connector to the list of Repository Options. The LLM Connector provides connectivity to LLMs like OpenAI's GPT models. This allows access to Grooper features that leverage LLM chatbots (discussed in further detail below).
Currently there are two LLM provider types:
- OpenAI - Connects Grooper to the OpenAI API or an OpenAI-compatible clone (used for hosting GPT models on local servers)
- Azure - Connects Grooper to individual chat or embeddings endpoints available in Microsoft Azure
New and improved LLM-based extraction techniques
First and foremost, in 2024 you will see new and improved ways to extract data from your documents using LLMs. Because LLMs are so good at processing natural language, setup for these new extraction techniques takes a fraction of the time of traditional extractors in Grooper.
New in 2024 you will find:
- AI Extract: A "Fill Method" designed to extract a full Data Model with little configuration necessary.
- Clause Detection: A new Data Section extract method designed to find clauses of a certain type in a contract.
- Ask AI: This extractor type replaces the deprecated "GPT Complete" extractor, with new functionality that allows table extraction using responses from LLMs.
AI Extract
AI Extract introduces the concept of a Fill Method in Grooper. Fill Methods are configured on "container elements", like Data Models, Data Sections and Data Tables. The Fill Method runs after data extraction. It will fill in the Data Model using whatever method is configured (Fill Methods can be configured to overwrite initial extraction results or only supplement them).
AI Extract is the first Fill Method in Grooper. It uses chat responses from Large Language Models (like OpenAI's GPT models), to fill in a Data Model. We have designed this fill method to be as simple as possible to get data back from a chat response and into fields in the Data Model. In many cases, all you need to do is add Data Elements to a Data Model to return results.
AI Extract uses the Data Elements' names, data types (string, datetime, etc.) and (optionally) descriptions to craft a prompt sent to an LLM chatbot. Then, it parses the response, populating fields, sections and even table cells. As long as the Data Elements' names are descriptive ("Invoice Number" for an invoice number located on an invoice, for example), that is often all you need to locate the value. With no further configuration necessary, this is the fastest-to-deploy method of extracting data in Grooper to date.
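The prompt-and-parse flow described above can be sketched in a few lines of Python. This is not Grooper's implementation, just an illustration: the schema, prompt wording, and JSON response shape are all assumptions standing in for what AI Extract derives from a real Data Model and a real chat response.

```python
import json

# Hypothetical schema mirroring a Grooper Data Model: each Data Element
# contributes its name, data type, and optional description to the prompt.
schema = [
    {"name": "Invoice Number", "type": "string", "description": "The invoice's identifying number"},
    {"name": "Invoice Date", "type": "datetime", "description": ""},
    {"name": "Total", "type": "decimal", "description": "Grand total due"},
]

def build_prompt(schema, document_text):
    """Craft a chat prompt from Data Element names, types, and descriptions."""
    lines = ["Return a JSON object with these fields:"]
    for el in schema:
        desc = f" -- {el['description']}" if el["description"] else ""
        lines.append(f'- "{el["name"]}" ({el["type"]}){desc}')
    lines.append("Document:")
    lines.append(document_text)
    return "\n".join(lines)

def parse_response(schema, response_text):
    """Parse the chatbot's JSON reply back into Data Model field values."""
    data = json.loads(response_text)
    return {el["name"]: data.get(el["name"]) for el in schema}

prompt = build_prompt(schema, "INVOICE #4471 dated 2024-06-12, total $1,280.00")
# A chat response like the one below would populate the Data Model fields:
reply = '{"Invoice Number": "4471", "Invoice Date": "2024-06-12", "Total": "1280.00"}'
fields = parse_response(schema, reply)
```

Note how the only "configuration" is the schema itself, which is why descriptive Data Element names go such a long way.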
Clause Detection
Detecting contract clauses of a certain type has always been doable in Grooper using Field Classes. However, training the Field Class is a laborious and tedious process. This can be particularly taxing when attempting to locate several different clauses throughout a contract.
Large Language Models make this process so much simpler. LLMs are well suited to find examples of clauses in contracts. Natural language processing, after all, is their bread and butter. Clause Detection is a new Data Section extract method that uses chat responses to locate clauses in a contract. All you have to do is provide one or more written examples of the clause and Clause Detection does the rest. It parses the clause's location from the chatbot's response, which then forms the Data Section's data instance. This can be used to return the full text of a clause, extract information in the clause to Data Fields or both.
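To make the "provide an example, get back a location" idea concrete, here is a small Python sketch. It deliberately substitutes a crude token-overlap score for the LLM call Clause Detection actually makes; the contract text and example clause are invented for illustration.

```python
def detect_clause(example, contract):
    """Return (start, end, text) of the contract paragraph most similar
    to the example clause. A crude token-overlap stand-in for the chat
    response Clause Detection actually parses."""
    example_tokens = set(example.lower().split())
    best = None
    offset = 0
    for para in contract.split("\n\n"):
        tokens = set(para.lower().split())
        score = len(example_tokens & tokens) / max(len(example_tokens), 1)
        if best is None or score > best[0]:
            best = (score, offset, offset + len(para), para)
        offset += len(para) + 2  # account for the "\n\n" separator
    _, start, end, text = best
    return start, end, text

contract = (
    "1. Term. This agreement begins on the Effective Date.\n\n"
    "2. Indemnification. Each party shall indemnify and hold harmless "
    "the other party from any third-party claims.\n\n"
    "3. Governing Law. This agreement is governed by the laws of Oklahoma."
)
example = "The supplier shall indemnify and hold harmless the customer from claims."
start, end, text = detect_clause(example, contract)
```

The returned character range is the important part: in Grooper, that located span forms the Data Section's data instance, from which the full clause text or individual Data Fields can be extracted.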
Ask AI
Ask AI is a new Grooper Extractor Type in Grooper 2024. It was created as a replacement for the "GPT Complete" extractor, which uses a deprecated method to call OpenAI GPT models. Ask AI works much like GPT Complete. It is an extractor configured with a prompt sent to an LLM chatbot, and it returns the chatbot's response.
Ask AI is more robust than its predecessor in that:
- Ask AI has access to more LLM models, including those accessed via the OpenAI API, privately hosted GPT clones, and compatible LLMs from Azure's machine learning model catalog.
- Ask AI can more easily parse JSON responses.
- Ask AI has a mechanism to decompose chat responses into extraction instances (or "sub-elements"). This means Ask AI can potentially be used for a Row Match Data Table extractor.
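The decomposition idea in the last point can be sketched as follows. This is an illustration under stated assumptions: the column names and the JSON reply are invented, and the helper simply shows how a JSON chat response can be split into row instances ("sub-elements") for table extraction.

```python
import json

def rows_from_response(response_text, columns):
    """Decompose a JSON chat response into row instances, one per table
    row -- the kind of result that could feed a Row Match Data Table
    extractor. Column names here are illustrative."""
    items = json.loads(response_text)
    return [[item.get(col, "") for col in columns] for item in items]

# A chatbot asked to list line items might return something like:
reply = json.dumps([
    {"Description": "Widget", "Qty": "2", "Price": "9.99"},
    {"Description": "Gadget", "Qty": "1", "Price": "24.50"},
])
rows = rows_from_response(reply, ["Description", "Qty", "Price"])
```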
Chat with your documents
Publicly accessible LLM chatbots like ChatGPT are limited by the content they were trained on. The documents you're processing are probably not part of their training set. If they were, the LLM would be able to process them more effectively. You could even "chat" with your documents, asking more specific questions and getting more accurate responses.
Now you can do just that! Using OpenAI's Assistants API, we've created a mechanism to quickly generate custom AI chatbot assistants in Grooper that can answer questions directly about one or more selected documents.
Build AI assistants with Grooper AI Analysts
AI Analysts are a new node type in Grooper that facilitate chatting with a document set. Creating an AI Analyst requires an OpenAI API account. AI Analysts create task-specific OpenAI "assistants" that answer questions based on a "knowledge base" of supplied information. Selecting one or more documents, users can chat with the assistant in Grooper about the documents. The text data from these documents form the assistant's knowledge base.
Using this mechanism, users can have a conversation with a single document or a Batch with hundreds of documents. Each conversation is logged as a "Chat Session" and stored as a child of the AI Analyst. These Chat Sessions can be accessed again (either in the Design Page's Node Tree or the Chat Page), allowing users to continue previous conversations.
The process of creating an AI Analyst and starting a Chat Session is fairly straightforward:
1. Add an LLM Connector to the Grooper Repository Options.
2. Create an AI Analyst.
3. Select the documents you want to chat with. This can be done in multiple ways:
   - From a Batch Viewer or Folder Viewer.
   - From a Search Page query (more on the Search Page below).
   - From the Chat Viewer in Review.
4. Start a Chat Session. This can also be done in multiple ways:
   - Using the Discuss command.
   - Using the AI Dialogue activity. This is a way of automating chat questions.
   - Using the Chat Viewer in Review.
Chat in Review
The Chat View is a new Review View that can be added to a Review step in a Batch Process. It gives human operators a mechanism to chat with a document during Review. The Chat View facilitates a chat with an AI Analyst. Users may select one or more documents and enter questions into the chat console. The human reviewer can ask questions to better understand the document or to help locate information needed to complete their review.
Furthermore, if there are "stock questions" every Review user should be asking, the new AI Dialogue activity can automate a scripted set of questions with an AI Analyst. AI Dialogue starts a Chat Session for each document. Any "Predefined Messages" configured for the AI Analyst are asked by the AI Dialogue activity in an automated Chat Session. The responses for the Chat Session are then saved to each Batch Folder. The answers to these questions can then be reviewed by a user during Review with a Chat View. Users can also continue the conversation, with the Predefined Messages getting it started.
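The automation the AI Dialogue activity performs can be sketched like this. Everything here is illustrative: the documents, the predefined question, and the `ask` callable (a stub standing in for the real AI Analyst call) are all assumptions.

```python
def run_ai_dialogue(documents, predefined_messages, ask):
    """Sketch of what the AI Dialogue activity automates: start a chat
    session per document, ask each predefined message, and keep the
    transcript with the document. `ask(text, question)` stands in for
    the real AI Analyst call."""
    sessions = {}
    for doc_id, text in documents.items():
        transcript = []
        for question in predefined_messages:
            transcript.append((question, ask(text, question)))
        sessions[doc_id] = transcript  # in Grooper, saved to each Batch Folder
    return sessions

docs = {"doc-1": "Invoice total is $500.", "doc-2": "Contract term is 12 months."}
questions = ["Summarize this document."]
# Stub assistant: echoes the document text it was asked about.
sessions = run_ai_dialogue(docs, questions, ask=lambda text, q: f"It says: {text}")
```

A reviewer then opens the saved transcript in a Chat View and continues the conversation from where the scripted questions left off.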
Chat Page
The Chat Page is a brand new UI page that allows users to continue previous Chat Sessions. Chat Sessions are archived as children of an AI Analyst. Each Chat Session is organized into subfolders by user name. The Chat Page allows users to access their previous Chat Sessions stored in these folders. Furthermore, since Chat Sessions are archived by user name, users will only have access to Chat Sessions created under their own user name.
Grooper as a document repository
Traditionally, Grooper has been solely a document processing platform. The process has always been: (1) get documents into Grooper, (2) condition them for processing, (3) extract the data you want from them, and (4) get the documents and data out of Grooper, deleting the Batches once they're exported. Grooper was never designed to be a document repository. It was never designed to hold documents and data long-term. All that is changing starting in version 2024!
2024 is our first step toward Grooper being not just a document processing platform but a document repository and content management system. Our goal is to support users who want to keep documents in Grooper long term. There are several advantages to keeping documents in Grooper long term:
- You only need one system to manage your documents. No need to export content to a separate content management system.
- Grooper's hierarchical data modeling allows the documents' full extracted data to be stored in Grooper, including more complex multi-instance data structures like table rows.
- If you need to reprocess a document, you don't have to bring it back into Grooper. It's already present and conditioned for further processing.
However, to make Grooper a usable document repository, two things need to happen:
- Document storage must be more efficient.
- There must be some mechanism to search for and retrieve documents and their data.
To facilitate these points we have redesigned our Batch architecture and implemented a document search feature using Microsoft Azure's AI Search API.
Batch redesign
To make Grooper a usable document repository, we need to move away from a "batch processing" focused design and toward a "document processing" focused design. If documents are going to stay in Grooper permanently, it needs to be easier to process them "in place", wherever they are in the repository.
To be clear, Batches and Batch Processes aren't going anywhere. We've just made some big changes to Batches to make them leaner and more efficient.
Batches no longer store a local copy of a Batch Process as a child.
In the past, whenever a Batch was created, it stored a read-only copy of the Batch Process used to process it as one of its children. This was very inefficient, especially for single-document Batches. Every Batch that came into Grooper carried an extra Batch Process and set of Batch Process steps. These additional nodes clutter up the Grooper database and make querying Batches less efficient than it needs to be.
In 2024, Batches will no longer house a clone of a Batch Process. Instead they will reference a published Batch Process. Each published version of a Batch Process is archived permanently (until a Grooper designer deletes unused processes).
There are new tools to help facilitate testing from the Design page.
The only potential drawback to the Batch redesign comes in testing. In the past, Grooper designers would use the local Batch Process copies to test steps in production Batches. If there is no longer a local copy, how are users going to test production Batches in this way?
There are several new tools that make testing production Batches easier.
- Published versions of Batch Processes can now access the production branch of the Batches tree for Batch testing.
- Production Batches have a "Go To Process" button. Pressing this button navigates to the Batch's referenced process and selects the Batch in the Activity Tester.
- Published versions of Batch Processes now have a "Batches" tab. This will show a list of all Batches currently using the selected process. These Batches can then be managed the same way they would be managed from the Batches Page.
Batches no longer have a "root folder".
The only reason Batches had a root folder was to distinguish the folder and page content from the local copy of a Batch Process. Because there is no longer a Batch Process child, there is no need for a root folder. So, it's gone!
Instead, the Batch object itself is the root of the Batch. Batches now have all the properties of a Batch Folder as well as a Batch. This makes Batches more lightweight, particularly for single-document Batches.
- For single-document Batches, the Batch is not just a container for documents, but in effect, the document itself!
- For Batches with multiple documents, the Batch now acts as the root folder. This gets rid of a now unnecessary (and previously often confusing) level in the Batch hierarchy.
Bonus! New Batch naming options
While not directly related to the Batch redesign, we have a new set of Batch Name Option properties in version 2024. These options can be configured for Batches created by Import Jobs (either ad hoc from the Imports page or procedurally by an Import Watcher). Previously, users could only affix a text prefix to a Batch when importing documents. The Batch would be named using the prefix and a time stamp (e.g. "Batch Prefix 2024-06-12 03:14:15 PM").
Users can now name Batches with a text prefix, one to three "segments", and a text suffix. This gives users a lot more flexibility in what they can name Batches created from imports. The "segments" may be set to one of the following:
- Sequence - A sequence number of the current Batch. The first Batch imported will be "1" then "2" and so on. This sequence may optionally be zero-padded ("00001" then "00002" and so on)
- DateTime - The current date and time.
- Process - The name of the assigned Batch Process.
- ContentType - The name of the Content Type assigned to the Batch.
- Username - The current Windows user's logon name.
- Machine - The name of the current machine.
- BatchId - The integer id number for the batch.
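The naming scheme above can be sketched in Python. The segment names come from the list above; the join format, the context keys, and the example values are assumptions made for illustration.

```python
from datetime import datetime

def build_batch_name(prefix="", segments=(), suffix="", context=None):
    """Compose a Batch name from a prefix, one to three segments, and a
    suffix, mirroring the Batch Name Options described above. Joining
    the parts with spaces is an assumption about the output format."""
    context = context or {}
    rendered = []
    for seg in segments[:3]:  # at most three segments
        if seg == "Sequence":
            # optionally zero-padded sequence number ("00001", "00002", ...)
            rendered.append(str(context.get("sequence", 1)).zfill(context.get("pad", 0)))
        elif seg == "DateTime":
            rendered.append(context.get("now", datetime.now()).strftime("%Y-%m-%d %I:%M:%S %p"))
        elif seg in ("Process", "ContentType", "Username", "Machine", "BatchId"):
            rendered.append(str(context.get(seg.lower(), "")))
    return " ".join(part for part in [prefix, *rendered, suffix] if part)

name = build_batch_name(
    prefix="Invoices",
    segments=("Sequence", "Process"),
    suffix="(imported)",
    context={"sequence": 7, "pad": 5, "process": "AP Flow"},
)
```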
Search Page & AI Search
Any document repository worth its salt should have a document (and data) retrieval mechanism. FINISH ME LATER
- Job oriented processing
- Indexing Behavior (and related indexing object commands to add and update search indexes)
- Indexing Service
Before you can start using the Search page to search for documents, there's some basic setup you need to perform:
1. Create an Azure AI Search Service.
   - The following article from Microsoft instructs users how to create a Search Service: https://learn.microsoft.com/en-us/azure/search/search-create-service-portal
   - FYI: When setting up the AI Search service in Azure, you will need to add a "query key" and copy down the service's URL. Both of these are needed for the next step.
2. Add an "AI Search" Repository Option to the Grooper Repository's root node.
   - You will need the service's URL and query key (API key) from the AI Search service in Azure. You will need to add a query key in Azure if you have not done so already.
3. Add an Indexing Behavior on a Content Model in Grooper.
   - Documents must be classified before they can be indexed. Only Document Types/Content Types inheriting an Indexing Behavior are eligible for indexing.
4. Create the search index by right-clicking the Content Model and selecting "Search > Create Search Index".
   - This only needs to be done once per search index. This creates the search index in Azure. Without the search index created, documents can't be added to an index.
5. Index documents in one of the following ways:
   - Manually, one document at a time, by right-clicking the document and using the "Add to Index" command.
   - Using an Execute activity in a Batch Process to apply the "Add to Index" command to all documents in a Batch.
   - Running the Grooper Indexing Service to index documents automatically in the background.
Repository Options: AI Search
Repository Options are new to Grooper 2024. They add new functionality to the whole Grooper Repository. These optional features are added using the Options property editor on the Grooper Root node.
To search documents in Grooper, we use the Azure AI Search Service. In order to connect to an AI Search Service, the AI Search option must be added to the list of Repository Options. Here, users will enter the Azure AI Search URL endpoint where calls are issued and the query API key. Both of these can be obtained from the Microsoft Azure portal once you have added an Azure AI Search resource.
Once an AI Search option is added, you will be able to add an Indexing Behavior to one or more Content Types, create a search index, index documents and search them using the Search Page.
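For a sense of what the URL endpoint and query key are used for, here is a Python sketch that builds (but does not send) an Azure AI Search query request. The service URL and index name are hypothetical, and the api-version value is an assumption; check Azure's documentation for the version your service expects.

```python
import json
import urllib.request

def build_search_request(endpoint, index_name, query_key, search_text):
    """Assemble an Azure AI Search query request using the URL endpoint
    and query key entered on the AI Search repository option."""
    url = f"{endpoint}/indexes/{index_name}/docs/search?api-version=2023-11-01"
    body = json.dumps({"search": search_text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "api-key": query_key},
        method="POST",
    )

req = build_search_request(
    "https://my-service.search.windows.net",  # hypothetical service URL
    "grooper-documents",                      # hypothetical index name
    "QUERY-KEY-FROM-AZURE",
    "termination clause",
)
```

Grooper issues calls of this general shape on your behalf; the point is simply that both values entered on the repository option end up in every query.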
Indexing documents for search
Before documents can be searched, they must be indexed. The search index is the searchable content for your documents (including each document's full OCR or native text obtained from the Recognize activity). We use the Azure AI Search Service to create search indexes you can search using the Search page in Grooper. Documents are indexed by adding them to a search index in Grooper.
Documents may be added to a search index in one of the following ways:
- Right-clicking the document and using the "Add to Index" command.
  - This is the most manual way of indexing documents.
- Using an Execute activity in a Batch Process to apply the "Add to Index" command to all documents in a Batch.
  - This is a more automated way of indexing documents. It adds documents to a search index at a specific point in a Batch Process.
- Running the Grooper Indexing Service to index documents automatically in the background.
  - This is the most automated way of indexing documents. The Indexing Service periodically polls the Grooper database and adds newly classified documents to the index, updates the index if changes are made (to their extracted data, for example), or removes documents from the index if they've been deleted.
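The add/update/remove logic of a single polling pass can be sketched with plain dictionaries. This is a simplified model, not the Indexing Service itself: the real service talks to the Grooper database and an Azure AI Search index rather than in-memory dicts.

```python
def sync_index(database, index):
    """One polling pass of the kind the Indexing Service performs: add
    newly classified documents, update changed ones, and remove deleted
    ones. Both stores are plain dicts keyed by document id."""
    db_ids = set(database)
    for doc_id in db_ids - set(index):      # newly classified -> add
        index[doc_id] = database[doc_id]
    for doc_id in db_ids & set(index):      # changed extracted data -> update
        if index[doc_id] != database[doc_id]:
            index[doc_id] = database[doc_id]
    for doc_id in set(index) - db_ids:      # deleted from Grooper -> remove
        del index[doc_id]
    return index

database = {"doc-1": {"Total": "500"}, "doc-2": {"Total": "75"}}
index = {"doc-2": {"Total": "70"}, "doc-3": {"Total": "9"}}  # stale entries
sync_index(database, index)
```

After the pass, the index mirrors the database: doc-1 was added, doc-2's stale value was refreshed, and doc-3 (deleted from Grooper) was dropped.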
The Search Page
Document "Generators"
Miscellaneous
Tabular View in Data Review
Azure-based text analysis extractors
Key phrase, named entity, and PII extract