What's New in Grooper 2024

From Grooper Wiki

Revision as of 18:43, 13 September 2024

Grooper version 2024 is here!

Below you will find brief descriptions of new and changed features.
When available, follow any links to extended articles on a topic.

FYI

We have overhauled our "Install and Setup" article for 2024. Information on installing version 2024 can be found in the article below.

Moving Grooper fully into the web

Deploying Grooper over a web server is a more distributable, more secure, and more modern experience. Version 2022 started Grooper's foray into web development with a web client for user operated Review tasks. Versions 2023 and 2023.1 expanded our web client to incorporate all aspects of Grooper in the web client. Version 2024 fully cements our commitment to moving Grooper to a web-based application.

Thick client removal

In 2024, there is no longer a Grooper thick client (aka "Windows client"). There is only the Grooper web client. This opens Grooper up to several advantages for cloud-based app development and cloud-based deployments.

All thick client Grooper applications have an equivalent in the Grooper web client. Most of these are now pages you will navigate to from the web client. For those unfamiliar with the Grooper web client, refer to the table below for the web client equivalent versions of thick client apps in version 2024.

Former thick client application → Current web client equivalent

Grooper Design Studio → The Design page
Grooper Dashboard → The Batches page
Grooper Attended Client → The Tasks page
Grooper Kiosk → The Stats page (displaying stats queries in a browser window)
Grooper Config → Grooper Command Console (GCC). See below for more information.
Grooper Unattended Client → Activity Processing services can be hosted in-process using the "gcc services host" command in GCC

Grooper Command Console

Grooper Command Console (or GCC) is a replacement for the thick client administrative application, Grooper Config. Previous functions performed by Grooper Config can be accomplished in Grooper Command Console. This includes:

  • Connecting to Grooper Repositories
  • Installing and managing Grooper Services
  • Managing licensing for self-hosted licensing installations


Grooper Command Console is a command line utility. All functions are performed using command line commands rather than a "point and click" user interface. Users of previous versions may find the change jarring at first, but the command line interface has several advantages:

  • Most administrative functions are accomplished with a single command or a small number of commands. In Grooper Config, the same function required navigating several screens and clicks. Once you are familiar with the GCC commands, Grooper Command Console ends up saving you time.
  • Commands can be easily scripted. Previously, there was no easy way to procedurally execute Grooper Config functions, such as creating a Grooper Repository or spinning up new Grooper services. GCC commands allow you to do this.
  • Scaling services is much easier. In previous versions of Grooper, we did proof-of-concept tests to ensure Grooper can scale in cloud deployments (such as using auto-scaling Amazon AWS instances). However, in older Grooper versions, scaling Activity Processing services was somewhat clunky. Using GCC commands to spin up services makes this process much simpler. Grooper Command Console also has specific commands to make scaling with Docker containers simpler.


For more information about Grooper Command Console, visit the Grooper Command Console article.

Improved web UI: New icons!

One of the first things Grooper Design page users will notice in version 2024 is that our icons have changed. Our old icons served us well over the last several years, but were starting to look a little outdated. Part of keeping Grooper a modern platform with modern features is keeping its look modern too. Furthermore, all our new icons are scalable with your browser. They will scale larger and smaller as you zoom in and out without losing fidelity.

See below for a list of new icons. Images of the old icons are included for reference.

Improved integrations with Large Language Models (LLMs)

Innovations in Large Language Models (or LLMs) have changed the landscape of artificial intelligence. Companies like OpenAI have developed LLM-based technologies, such as the GPT models behind ChatGPT, that are highly effective at natural language processing. Being fully committed to advancing our capabilities through new AI integrations, Grooper has vastly improved what we can do with LLM providers such as OpenAI.

Repository Options: LLM Connector

Repository Options are new to Grooper 2024. They add new functionality to the whole Grooper Repository. These optional features are added using the Options property editor on the Grooper Root node.

OpenAI and other LLM integrations are made by adding an LLM Connector to the list of Repository Options. The LLM Connector provides connectivity to LLMs like OpenAI's GPT models. This allows access to Grooper features that leverage LLM chatbots (discussed in further detail below).

Currently there are two LLM provider types:

  • OpenAI - Connects Grooper to the OpenAI API or an OpenAI-compatible clone (used for hosting GPT models on local servers)
  • Azure - Connects Grooper to individual chat completion or embeddings endpoints available in Microsoft Azure

New and improved LLM-based extraction techniques

First and foremost, in 2024 you will see new and improved ways to extract data from your documents using LLMs. Because LLMs are so good at processing natural language, setup for these new extraction techniques takes a fraction of the time of traditional extractors in Grooper.

New in 2024 you will find:

  • AI Extract: A "Fill Method" designed to extract a full Data Model with little configuration necessary.
  • Clause Detection: A new Data Section extract method designed to find clauses of a certain type in a contract.
  • Ask AI: This extractor type replaces the deprecated "GPT Complete" extractor, with new functionality that allows table extraction using responses from LLMs.

AI Extract

AI Extract introduces the concept of a Fill Method in Grooper. Fill Methods are configured on "container elements", like Data Models, Data Sections and Data Tables. The Fill Method runs after data extraction. It will fill in the Data Model using whatever method is configured (Fill Methods can be configured to overwrite initial extraction results or only supplement them).

AI Extract is the first Fill Method in Grooper. It uses chat responses from Large Language Models (like OpenAI's GPT models), to fill in a Data Model. We have designed this fill method to be as simple as possible to get data back from a chat response and into fields in the Data Model. In many cases, all you need to do is add Data Elements to a Data Model to return results.

AI Extract uses the Data Elements' names, data types (string, datetime, etc.) and (optionally) descriptions to craft a prompt sent to an LLM chatbot. Then, it parses out the response, populating fields, sections and even table cells. As long as the Data Elements' names are descriptive ("Invoice Number" for an invoice number located on an invoice), that's all you need to locate the value in many cases. With no further configuration necessary, this is the fastest way to set up data extraction in Grooper to date.
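Grooper does not publish the exact prompt it sends, but the idea can be sketched as follows. Everything here is hypothetical — the field list, the prompt wording, and the canned JSON reply standing in for a real chat completion:

```python
import json

# Hypothetical Data Elements: name, data type, and optional description.
data_elements = [
    {"name": "Invoice Number", "type": "string", "description": "The vendor's invoice number"},
    {"name": "Invoice Date", "type": "datetime", "description": None},
    {"name": "Total Due", "type": "decimal", "description": "Total amount due"},
]

def build_prompt(elements, document_text):
    """Craft a prompt asking the LLM to reply with one JSON object keyed by element name."""
    lines = ["Extract the following fields from the document and respond with a JSON object:"]
    for el in elements:
        desc = f" ({el['description']})" if el["description"] else ""
        lines.append(f"- {el['name']} [{el['type']}]{desc}")
    lines.append("Document:")
    lines.append(document_text)
    return "\n".join(lines)

def parse_response(raw):
    """Parse the chatbot's JSON reply into field values."""
    return json.loads(raw)

prompt = build_prompt(data_elements, "INVOICE\nInvoice Number: 12345 ...")
# A canned response standing in for the real chat completion:
reply = '{"Invoice Number": "12345", "Invoice Date": "2024-09-13", "Total Due": "101.50"}'
fields = parse_response(reply)
print(fields["Invoice Number"])  # → 12345
```

The descriptive element names do the heavy lifting: the more self-explanatory the name, the less supplemental description the prompt needs.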

LLM Table Reader

LLM Table Reader is a new Table Extract Method for Data Tables. It uses the same logic as AI Extract. LLM Table Reader uses its child Data Columns' names and value types and any supplemental text entered in their Description properties to form a prompt sent to the LLM. The LLM then gives a response which is parsed into rows and cells for the Data Table.

Additional configuration includes all properties available to AI Extract, including different quoting methods, alignment options, header and footer detection, and multiline row support.
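As a rough illustration of the parsing step (the real prompt and response formats are internal to Grooper), a delimited chat reply could be split into rows and cells like this. The column names and the reply below are made up:

```python
import csv
import io

# Hypothetical Data Columns and a canned LLM reply formatted as CSV.
columns = ["Item", "Qty", "Unit Price"]

reply = """Item,Qty,Unit Price
Widget,2,4.99
Gadget,10,0.25"""

def parse_table(raw, columns):
    """Split the LLM's CSV reply into row dicts keyed by Data Column name."""
    reader = csv.DictReader(io.StringIO(raw))
    return [{c: row.get(c, "") for c in columns} for row in reader]

rows = parse_table(reply, columns)
print(len(rows))        # → 2
print(rows[0]["Item"])  # → Widget
```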

Clause Detection

Detecting contract clauses of a certain type has always been doable in Grooper using Field Classes. However, training the Field Class is a laborious and tedious process. This can be particularly taxing when attempting to locate several different clauses throughout a contract.

Large Language Models make this process so much simpler. LLMs are well suited to find examples of clauses in contracts. Natural language processing, after all, is their bread and butter. Clause Detection is a new Data Section extract method that uses chat responses to locate clauses in a contract. All you have to do is provide one or more written examples of the clause and Clause Detection does the rest. It parses the clause's location from the chatbot's response, which then forms the Data Section's data instance. This can be used to return the full text of a clause, extract information in the clause to Data Fields or both.

Ask AI

Ask AI is a new Grooper Extractor Type in Grooper 2024. It was created as a replacement for the "GPT Complete" extractor, which uses a deprecated method to call OpenAI GPT models. Ask AI works much like GPT Complete: it is an extractor configured with a prompt that is sent to an LLM chatbot, and it returns the chatbot's response.

Ask AI is more robust than its predecessor in that:

  • Ask AI has access to more LLM models, including those accessed via the OpenAI API, privately hosted GPT clones, and compatible LLMs from Azure's model catalog.
  • Ask AI can more easily parse JSON responses.
  • Ask AI has a mechanism to decompose chat responses into extraction instances (or "sub-elements"). This means Ask AI can potentially be used for a Row Match Data Table extractor.

Chat with your documents

Publicly accessible LLM chatbots like ChatGPT are always limited by the content they were trained on. The documents you're processing are probably not part of their training set. If they were, the LLM would be able to process them more effectively. You could even "chat" with your documents, asking more specific questions and getting more accurate responses.

Now you can do just that! Using OpenAI's Assistants API, we've created a mechanism to quickly generate custom AI chatbot assistants in Grooper that can answer questions directly about one or more selected documents.

Build AI assistants with Grooper AI Analysts

AI Analysts are a new node type in Grooper that facilitate chatting with a document set. Creating an AI Analyst requires an OpenAI API account. AI Analysts create task-specific OpenAI "assistants" that answer questions based on a "knowledge base" of supplied information. Selecting one or more documents, users can chat with the assistant in Grooper about the documents. The text data from these documents form the assistant's knowledge base.

Using this mechanism, users can have a conversation with a single document or a Batch with hundreds of documents. Each conversation is logged as a "Chat Session" and stored as a child of the AI Analyst. These Chat Sessions can be accessed again (either in the Design Page's Node Tree or the Chat Page), allowing users to continue previous conversations.

The process of creating an AI Analyst and starting a Chat Session is fairly straightforward:

  1. Add an LLM Connector to the Grooper Repository Options.
  2. Create an AI Analyst.
  3. Select the documents you want to chat with. This can be done in multiple ways.
    • From a Batch Viewer or Folder Viewer.
    • From a Search Page query (more on the Search Page below).
    • From the Chat Viewer in Review
  4. Start a Chat Session. This can also be done in multiple ways.
    • Using the Discuss command
    • Using the AI Dialogue activity. This is a way of automating chat questions.
    • Using the Chat Viewer in Review

Chat in Review

The Chat View is a new Review View that can be added to a Review step in a Batch Process. This allows human operators a mechanism to chat with a document during Review. The Chat View facilitates a chat with an AI Analyst. Users may select one document or multiple documents and enter questions into the chat console. The human reviewer can ask questions to better understand the document or help locate information to complete their review.

Furthermore, if there are "stock questions" any Review user should be asking, the new AI Dialogue activity can automate a scripted set of questions with an AI Analyst. AI Dialogue starts a Chat Session for each document. Any "Predefined Messages" configured for the AI Analyst will be asked by the AI Dialogue activity in an automated Chat Session. The responses for the Chat Session are then saved to each Batch Folder. The answers to these questions can be then reviewed by a user during Review with a Chat View. This also allows users to continue the conversation with Predefined Messages getting the conversation started.

Chat Page

The Chat Page is a brand new UI page that allows users to continue previous Chat Sessions. Chat Sessions are archived as children of an AI Analyst. Each Chat Session is organized into subfolders by user name. The Chat Page allows users to access their previous Chat Sessions stored in these folders. Furthermore, since Chat Sessions are archived by user name, users will only have access to Chat Sessions created by their user session.

AI Search: Document search and retrieval

Traditionally, Grooper has been solely a document processing platform. The process has always been (1) get documents into Grooper, (2) condition them for processing, (3) get the data you want from them, and (4) get the documents and data out of Grooper, deleting the Batches once they are exported. Grooper was never designed to be a document repository. It was never designed to hold documents and data long-term. All that is changing starting in version 2024!

One of our big roadmap goals for Grooper is to evolve its content management capabilities. Our goal is to facilitate users who do want to keep documents in Grooper long term. There will be several advantages to keeping documents in Grooper long term:

  • You only need one system to manage your documents. No need to export content to a separate content management system.
  • Grooper's hierarchical data modeling allows the documents' full extracted data to be stored in Grooper, including more complex multi-instance data structures like table rows.
  • If you need to reprocess a document, you don't have to bring it back into Grooper. It's already present and conditioned for further processing.


In version 2024, we take our first step into realizing Grooper's document repository potential with AI Search. Any content management system worth its salt must have a document (and data) retrieval mechanism. AI Search uses Microsoft Azure's AI Search API to index documents and their data. Indexed documents can be retrieved by searching for them using the new Search Page. This page can be used for something as simple as full text searching or more advanced queries that return documents based on extracted values in their Data Model.

Basic AI Search Setup

Before you can start using the Search page to search for documents, there's some basic setup you need to perform. Some of these steps are performed outside of Grooper. Most are performed inside of Grooper.

Outside of Grooper

  1. Create an AI Search service in Azure.

Inside of Grooper

  1. Add AI Search to the Grooper Root node's Repository Options. Enter the URL and admin key for the Azure AI Search service (copied from Azure).
  2. Add an Indexing Behavior on a Content Model.
    • Documents must be classified in Grooper before they can be indexed. Only Document Types/Content Types inheriting an Indexing Behavior are eligible for indexing.
  3. Create the search index. To do this, right-click the Content Model and select "Search > Create Search Index".
    • This creates the search index in Azure. Without the search index created, documents can't be added to an index. This only needs to be done once per index.
  4. Submit an "Indexing Job" to index any documents classified using the Content Model currently in the Grooper Repository. To do this, right-click the Content Model and select "Search > Submit Indexing Job".
    • BE AWARE: An Activity Processing service must be running to execute the Indexing Job.
    • This is one of many ways to index documents using AI Search. For a full list (including ways to automate document indexing) see below.

Repository Options: AI Search

As described above, Repository Options are new to Grooper 2024. They add new functionality to the whole Grooper Repository and are added using the Options property editor on the Grooper Root node.

To search documents in Grooper, we use Azure's AI Search service. In order to connect to an Azure AI Search service, the AI Search option must be added to the list of Repository Options in Grooper. Here, users will enter the Azure AI Search URL endpoint where calls are issued and an admin's API key. Both of these can be obtained from the Microsoft Azure portal once you have added an Azure AI Search resource.

With AI Search added to your Grooper Repository, you will be able to add an Indexing Behavior to one or more Content Types, create a search index, index documents and search them using the Search Page.

Indexing documents for search

Before documents can be searched, they must be indexed. The search index holds the content you want to search. This includes each document's full OCR or native text obtained from the Recognize activity and can optionally include Data Model results collected from the Extract activity. We use the Azure AI Search Service to create search indexes according to an Indexing Behavior defined for Content Types in Grooper. Documents are made searchable by adding them to a search index. Once indexed, you can search for documents using Grooper's Search page.

The Indexing Behavior: Defines the search index

Before indexing documents, you must add an Indexing Behavior to the Content Types you want to index. Most typically, this will be done on a Content Model. All child Document Types will inherit the Indexing Behavior and its configuration (More complicated Content Models may require Indexing Behaviors configured on multiple Content Types).


The Indexing Behavior defines:

  • The index's name in Azure.
  • Which documents are added to the index.
    • Only documents that are classified as the Indexing Behavior's Content Type OR any of its child Content Types will be indexed.
    • In other words, when set on a Content Model only documents classified as one of its Document Types will be indexed.
  • What fields are added to the search index (including which Data Elements from a Data Model are included, if any).
  • Any options for the search index in the Grooper Search page (including access restrictions to the search index).

BE AWARE: Once an Indexing Behavior is added to a Content Type, you must use the "Create Search Index" command to create the index in Azure. Do this by right-clicking the Content Type and choosing "Search > Create Search Index".

With the Indexing Behavior defined and the search index created, you can start indexing documents.
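To make the Indexing Behavior's role concrete, here is roughly the kind of index definition it implies. Grooper builds and submits this for you; the field names below are hypothetical, though the payload shape follows Azure AI Search's "Create Index" REST API:

```python
# A sketch of the index definition an Indexing Behavior stands in for.
# Hypothetical fields: the full document text plus two extracted Data Elements.
index_definition = {
    "name": "invoices-index",  # the index's name in Azure
    "fields": [
        {"name": "id", "type": "Edm.String", "key": True},
        {"name": "content", "type": "Edm.String", "searchable": True},  # full OCR/native text
        {"name": "Invoice_No", "type": "Edm.String", "searchable": True, "filterable": True},
        {"name": "Invoice_Date", "type": "Edm.DateTimeOffset", "filterable": True, "sortable": True},
    ],
}

# Every Azure AI Search index needs exactly one key field.
key_fields = [f["name"] for f in index_definition["fields"] if f.get("key")]
print(key_fields)  # → ['id']
```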

Adding documents to the search index

Documents may be added to a search index in one of the following ways:

  • Using the "Add to Index" command.
    • This is the most "manual" way of doing things.
    • Select one or more documents, right-click them and select "Search > Add to Index" to add only the selected documents to the search index.
    • Documents may also be manually removed from the search index in this way by using the "Remove From Index" command.
  • Using the "Submit Indexing Job" command.
    • This is a manual way of indexing all existing documents for the Content Model.
    • The Indexing Job will add newly classified documents to the index, update the index if changes are made (to their extracted data for example), and remove documents from the index if they've been deleted.
    • Select the Content Model, right-click it and select "Search > Submit Indexing Job".
    • BE AWARE: An Activity Processing service must be running to execute the Indexing Job.
  • Using an Execute activity in a Batch Process to apply the "Add to Index" command to all documents in a Batch.
    • This is one way to automate document indexing.
    • Bear in mind, if documents or their data change after this step runs, they will still need to be re-indexed.
  • Running the Grooper Indexing Service to index documents automatically in the background.
    • This is the most automated way to index documents.
    • The Grooper Indexing Service periodically polls the Grooper database to determine if the index needs to be updated. If it does, it will submit an "Indexing Job".
    • The Indexing Job will add newly classified documents to the index, update the index if changes are made (to their extracted data for example), and remove documents from the index if they've been deleted.
    • The Indexing Behavior's Auto Index property must also be enabled for the Indexing Service to submit Indexing Jobs.
    • BE AWARE: An Activity Processing service must be running to execute the Indexing Job(s).

The Search Page

Once you've indexed documents, you can start searching for them! The Search page allows you to find documents in your search index.

The Search page allows you to build a search query using four components:

  • Search: This is the only required parameter. Here, you will enter your search terms, using the Lucene query syntax.
  • Filter: An optional filter to set inclusion/restriction criteria for documents returned, using the OData syntax.
  • Select: Optionally selects which fields you want displayed for each document.
  • Order By: Optionally orders the list of documents returned.

Search

The Search configuration searches the full text of each document in the index. This uses the Lucene query syntax to return documents. For a simple search query, just enter a word or phrase (enclosed in quotes "") in the Search editor. Grooper will return a list of any documents with that word or phrase in their text data.


Lucene also supports several advanced querying features, including:

  • Wildcard searches: ? and *
    Use ? for a single wildcard character and * for multiple wildcard characters.
  • Fuzzy matching: searchTerm~
    Fuzzy search can only be applied to single words (not phrases in quotes). Azure's full fuzzy search documentation can be found here: https://learn.microsoft.com/en-us/azure/search/search-query-fuzzy
    • Azure's implementation of "fuzzy matching" is not the same as Grooper's. Terms are matched based on a character edit distance of 0-2.
      • grooper~0 would only match "grooper" exactly.
      • grooper~ or grooper~1 would match any word that was up to one character different. For example, "trooper" "groopr" or "groopers".
      • grooper~2 would match any word that was up to two characters different. For example, "trouper" "looper" "groop" or "grooperey".
  • Boolean operators: AND OR NOT
    Boolean operators can help improve the precision of a search query.
  • Field searching: fieldName:searchExpression
    Search built in fields and extracted Data Model values. For example, Invoice_No:8* would return any document whose extracted "Invoice No" field started with the number "8"
  • Regular expression matching: /regex/
    Enclose a regex pattern in forward slashes to incorporate it into the Lucene query. For example, /[0-9]{3}[a-z]/
    • Lucene regex searches are matched against single words.
    • Lucene regex does not use the Perl Compatible Regular Expressions (PCRE) library. Most notably, this means it does not use single-letter character classes, such as \d to match a single digit. Instead, enter the full character class in brackets, such as [0-9] to match a single digit.

Azure's full documentation of Lucene query syntax can be found here: https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax
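The edit-distance behavior described above can be checked with a quick Levenshtein implementation. This is a standalone illustration of Lucene-style fuzzy matching, not Grooper or Azure code:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# grooper~1 style matching: words within edit distance 1
for word in ["grooper", "trooper", "groopr", "groopers"]:
    assert edit_distance("grooper", word) <= 1

# grooper~2 style matching: words within edit distance 2
for word in ["trouper", "looper", "groop", "grooperey"]:
    assert edit_distance("grooper", word) <= 2

print(edit_distance("grooper", "trooper"))  # → 1
```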

Filter

First you search, then you filter. The Filter parameter specifies criteria for documents to be included in or excluded from the search results. This gives users an excellent mechanism to further fine-tune their search query. Commonly, users will want to filter a search set based on field values. Both built-in index fields and values extracted from a Data Model can be incorporated into the filter criteria.

Azure AI Search uses the OData syntax to define filter expressions. Azure's full OData syntax documentation can be found here: https://learn.microsoft.com/en-us/azure/search/search-query-odata-filter

Select

The Select parameter defines what field data is returned in the result list. You can select any of the built-in fields or Data Elements defined in the Indexing Behavior. This can be exceptionally helpful when navigating indexes with a large number of fields. Multiple fields can be selected using a comma-separated list (e.g. Field1,Field2,Field3).

Order By

Order By is an optional parameter that will define how the search results are sorted.

  • Any field in the index can be used to sort results.
  • The field's value type will determine how items are sorted.
    • String values are sorted alphabetically.
    • Datetime values are sorted by oldest or newest date.
    • Numerical value types are sorted smallest to largest or largest to smallest.
  • Sort order can be ascending or descending.
    • Add asc after the field's name to sort in ascending order. This is the default direction.
    • Add desc after the field's name to sort in descending order.
  • Multiple fields may be used to sort results.
    • Separate each sort expression with a comma (e.g. Field1 desc,Field2)
    • The leftmost field will be used to sort the full result list first, then it's sub-sorted by the next, then sub-sub-sorted by the next, and so on.
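Putting the four components together, a complete query could look like the following sketch. The field names are hypothetical, and in practice the Search page assembles these parameters for you:

```python
# Hypothetical Search page query: Lucene search terms, OData filter,
# a field selection list, and a multi-field sort expression.
query = {
    "search": '"past due" AND Invoice_No:8*',             # Lucene search expression
    "filter": "Invoice_Total gt 1000 and State eq 'OK'",  # OData inclusion criteria
    "select": "Invoice_No,Invoice_Date,Invoice_Total",    # fields to display
    "orderby": "Invoice_Date desc,Invoice_No asc",        # newest first, then by number
}

# The leftmost sort field sorts the whole list; later fields sub-sort ties.
sort_fields = [part.split()[0] for part in query["orderby"].split(",")]
print(sort_fields)  # → ['Invoice_Date', 'Invoice_No']
```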

Search page commands: Create Batches, submit tasks, and download documents

There are several new commands users can execute from the Search page. These commands give users a new way of starting and continuing work in Grooper. They can be divided into two sets: "result set commands" and "document commands".


Result set commands

These commands can be accessed from a dropdown list in the Search page UI. They can be applied to the entire result set or a selection from the result set.

  • Create Batch - Creates a Batch from the result set and submits an Import Job to start processing it.
  • Submit Job - Submits a Processing Job for documents in the result set. This command is intended for "on demand" activity processing.
  • Analyst Chat - Select an AI Analyst to start a chat session with the result set.
  • Download - Download a document, generated from the result set. May be one of the following:
    • Download PDF - Generates a single bookmarked PDF with optional search hit highlights.
    • Download ZIP - Generates a ZIP file containing each document in the result set.
    • Download CSV - Generates a CSV file from the result set's data fields.
    • Download Custom - Generates a custom document using an "AI Generator"


Document commands

These commands can be accessed from the Search page when right-clicking a document in the result list.

  • Go to Document - Navigates users with Design page permissions to that document in the Grooper node tree.
  • Review Document - Opens the document in a Review Viewer with a Data View, Folder View and Thumbnail View.
  • Copy Link - Creates a URL link to the document. When clicking the link users will be taken to a Review Viewer with a Data View, Folder View and Thumbnail View.

Document "Generators"

AI Generators are required to generate custom files using the Search Page's Download command (and its "Download Custom" format). AI Generators allow users to create new documents from the results of a search query. This makes it possible to create and download simple reports from a result set in the Search Page.

  • AI Generators use a large language model (LLM) from an LLM Connector to generate the document.
  • One or more AI Generators can be configured in the Content Type's Indexing Behavior settings.
  • The generator creates a document using a natural language prompt entered in the Instructions property.
    • The instructions define what should make up the generated document. The LLM uses these instructions and the text content of the search results (or their field values when Document Quoting is set to Data Values) to generate a document.
    • As part of the instructions, you must specify a text-based file type (TXT, CSV, HTML, etc.) to be generated.
  • Once configured, the generator's custom document can be downloaded using the Search Page. The generator will be an option using the Search Page's Download command when setting the Format to "Download Custom".


Example instructions to generate a "contact list" from the text content of a search result's document set:

Generate a CSV document listing all entities (people, companies, government) mentioned in the document quotes.  

List the entities in a table with the following columns:

Name - The name of the person or organization.
Address - The street address.
City - The city name.
State - The US state name.
ZIP - The ZIP code.
Phone - Phone number
Email - Email Address

Be sure to include every person name and company name in the list.  
If the value for a field cannot be determined, leave it blank.

Simplified Batch architecture

The traditional Batch architecture in Grooper was unnecessarily bloated. All Batches were created with a copy of a published Batch Process, created and stored as one of their children. This created an exceptional number of duplicated nodes: not only the copied Batch Process but all of its child steps too. To make Batches leaner and processing them more efficient, we have simplified this structure.

In brief:

  • Batches no longer house copies of a Batch Process. Instead, they simply reference a published Batch Process.
  • With the child Batch Process gone, there is no need for a "root folder" node either (see below).
  • Batches are now more truly just a container for document content. This has the following advantages:
    • It makes them easier to process.
    • It makes them easier for AI Search to index.
    • It makes it easier to keep them around in Grooper long term.


Looking towards Grooper's future, this new architecture will help Grooper be a document repository as well as a processing platform. This will allow us to move from a "batch processing" focused design to a "document processing" focused design. If documents are going to hang around in Grooper permanently, it needs to be easier to process them "in place", wherever they are in the repo. The simplified Batch architecture implemented in version 2024 will aid us in this goal.

Good-bye local Batch Process children!

Batches no longer store a local copy of a Batch Process as a child.

In the past, whenever Grooper created a Batch, it stored a read-only copy of the Batch Process used to process it as one of its children. This is inefficient, especially for Batches containing only a single document. Every Batch that comes into Grooper carries an extra Batch Process and its full set of Batch Process steps. These additional nodes clutter the Grooper database and make querying Batches less efficient than it needs to be.

In 2024, Batches will no longer house a clone of a Batch Process. Instead, they will reference a published Batch Process. Each published version of a Batch Process is archived and retained until a Grooper designer deletes unused processes.

Good-bye "root folder" node!

Batches no longer have a "root folder". The result is a simpler, more logical folder hierarchy.

The only reason Batches had a root folder was to distinguish the folder and page content from the local copy of a Batch Process. Because there is no longer a Batch Process child, there is no need for a root folder. So, it's gone!

Instead, the Batch object itself is the root of the Batch. Batches now have all the properties of a Batch Folder as well as a Batch. This makes Batches more lightweight, particularly for single-document Batches.

  • For single-document Batches, the Batch is not just a container for documents, but in effect, the document itself!
  • For Batches with multiple documents, the Batch now acts as the root folder. This gets rid of a now unnecessary (and previously often confusing) level in the Batch hierarchy.

Hello new Batch testing tools!

There are new tools to help facilitate testing from the Design page.

The only potential drawback to the Batch redesign comes in testing. In the past, Grooper designers would use the local Batch Process copies to test steps in production Batches. If there is no longer a local copy, how are users going to test production Batches in this way?

There are several new tools that make testing production Batches easier.

  • Published versions of a Batch Process can now access the production branch of the Batches tree for Batch testing.
  • Production Batches have a "Go To Process" button. Pressing this button navigates to the Batch's referenced process and selects the Batch in the Activity Tester.
  • Published versions of Batch Processes now have a "Batches" tab. It shows a list of all Batches currently using the selected process. These Batches can then be managed the same way they would be from the Batches Page.

Bonus! New Batch naming options

While not directly related to the Batch redesign, we have a new set of Batch Name Option properties in version 2024. These options can be configured for Batches created by Import Jobs (either ad hoc from the Imports page or procedurally by an Import Watcher). Previously, users could only affix a text prefix to a Batch when importing documents. The Batch would be named using the prefix and a time stamp (e.g. "Batch Prefix 2024-06-12 03:14:15 PM").


Users can now name Batches with a text prefix, one to three "segments", and a text suffix. This gives users a lot more flexibility in what they can name Batches created from imports. The "segments" may be set to one of the following:

  • Sequence - A sequence number of the current Batch. The first Batch imported will be "1" then "2" and so on. This sequence may optionally be zero-padded ("00001" then "00002" and so on)
  • DateTime - The current date and time.
  • Process - The name of the assigned Batch Process.
  • ContentType - The name of the Content Type assigned to the Batch.
  • Username - The current Windows user's logon name.
  • Machine - The name of the current machine.
  • BatchId - The integer id number for the batch.
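The segments above can be pictured with a minimal Python sketch. The `build_batch_name` function and its `context` dictionary are hypothetical names for illustration only, not Grooper's API; the sketch simply mirrors the prefix + segments + suffix composition described above.

```python
from datetime import datetime

def build_batch_name(prefix, segments, suffix, context, pad_width=0):
    """Compose a Batch name from a prefix, 1-3 segment values, and a suffix.
    Illustrative only, not Grooper's actual implementation."""
    parts = [prefix]
    for seg in segments:
        value = context[seg]
        if seg == "Sequence" and pad_width:
            value = str(value).zfill(pad_width)  # optional zero-padding
        parts.append(str(value))
    parts.append(suffix)
    return " ".join(p for p in parts if p)  # skip empty prefix/suffix

context = {
    "Sequence": 2,
    "DateTime": datetime(2024, 6, 12, 15, 14, 15).strftime("%Y-%m-%d %I:%M:%S %p"),
    "Process": "Scanned Invoices",
}
name = build_batch_name("Imports", ["Process", "Sequence"], "", context, pad_width=5)
# name == "Imports Scanned Invoices 00002"
```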

Miscellaneous

Improvements to Data Model expression efficiency

In this version, we refactored Data Model expressions (Default Value, Calculated Value, Is Valid, and Is Required) to address some longstanding issues. Specifically, we have improved how Data Model expressions are compiled. This will improve performance, particularly for large Data Models containing many Data Elements with many expressions.

  • In previous versions, a representation of the Data Model was compiled with each expression. This slowed things down and caused expressions to break unnecessarily in certain scenarios.
  • In 2024, we resolve this by compiling each Data Model a single time.
    • The compiled Data Model is called a "Data Model assembly". Expression assemblies reference Data Model assemblies, rather than defining types internally.
    • BE AWARE: When testing expression configurations, the Data Model assemblies must be recompiled when changes to the Data Model, its Data Elements, or their expressions are made.
      • Do this by selecting the Data Model in the Node Tree and clicking the "Refresh Tree Structure" button in the upper right corner of the screen.
      • Example: You change a Data Field's type from "string" to "decimal" so its Calculated Value expression adds numbers together correctly. Before testing this expression, you should select the Data Model and refresh the node to recompile its assembly.

Service operation hours

All Grooper services (such as Activity Processing and Import Watcher services) have a new property: “Hours of Operation”. Adjusting this property will control when the service is allowed to run.

  • May contain a comma-separated list of time ranges.
  • For example: 8am-12pm, 1:30pm-6:30pm
  • Specifies the hours when the service should be active.
    • Ex: Activity Processing will only process tasks during operating hours.
    • Ex: Import Watcher will only process import jobs or perform scheduled jobs during operating hours.
  • Allows Grooper processing to be scheduled around regular maintenance and backup windows that normally interrupt service operations.
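As a rough illustration of how such a setting behaves, the Python sketch below parses a comma-separated list of time ranges and checks whether a given moment falls inside any of them. The `parse_time` and `in_operating_hours` names are assumptions for illustration, not Grooper code.

```python
from datetime import datetime, time

def parse_time(s):
    """Parse times like '8am', '12pm', or '1:30pm' (illustrative parser)."""
    s = s.strip().lower()
    suffix, body = s[-2:], s[:-2]
    hour_str, _, minute_str = body.partition(":")
    hour, minute = int(hour_str), int(minute_str or 0)
    if suffix == "pm" and hour != 12:
        hour += 12
    if suffix == "am" and hour == 12:
        hour = 0  # 12am is midnight
    return time(hour, minute)

def in_operating_hours(now, ranges):
    """True if 'now' falls inside any range like '8am-12pm, 1:30pm-6:30pm'."""
    for rng in ranges.split(","):
        start, _, end = rng.partition("-")
        if parse_time(start) <= now.time() < parse_time(end):
            return True
    return False

in_operating_hours(datetime(2024, 6, 12, 9, 0), "8am-12pm, 1:30pm-6:30pm")   # inside
in_operating_hours(datetime(2024, 6, 12, 12, 30), "8am-12pm, 1:30pm-6:30pm") # in the gap
```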

Conditional CSS class variable

There is now a hidden Data Model Variable called “CSS_Class”. This variable can be used to conditionally apply CSS classes to a Data Model.

A common use case example:

Hide a field based on the value of another field. Only show “Field 2” if “Field 1” is greater than 1,000.

General Steps:

  1. Add the CSS class (and rules) to the Data Model’s Style Sheet.
  2. Use the Data Model’s Variables editor to add a new Variable.
    • It must be named “CSS_Class”
    • It must be a String type.
  3. Set the expression to evaluate to the CSS class’s name. Most commonly, this will be an If() statement.
    • Ex: If(Field_1 < 1000, "custom-class-name", "")

New Variable Definition: Is Array

  • Variable Definition update: New "Is Array" property
  • This is a new Boolean property that can be enabled when adding a new Variable Definition to a Data Model’s Variables editor.
  • When set, indicates the variable returns a sequence of values, rather than a single value
    • More specifically, it means the expression returns an IEnumerable of the variable type.
    • For example, if the variable type is string, the expression would return an IEnumerable of string.
  • Useful for creating helper variables which are used by other expressions

Auto Orient during Recognize

Recognize now has a “Correct Orientation” property. When enabled, page orientation will automatically be corrected during Recognize.

  • Correcting orientation at Recognize has several advantages over the “Auto Orient” IP Command:
    • It is generally more efficient than running Auto Orient in an IP Profile. Auto Orient makes a basic OCR pass to detect orientation. Correct Orientation avoids the need for this unnecessary OCR pass.
    • Where Auto Orient can use either a basic Transym or Tesseract engine to detect orientation, Correct Orientation will use whatever OCR engine you’re using during Recognize.
    • Will work with both image-based pages and native PDF pages.
    • For the first time, Azure OCR results can be used to correct orientation.
    • For the first time, native text can be used to correct orientation.
  • Correct Orientation only functions when Recognize is scoped to the page level. It has no impact when running at the folder level.

Tabular View in Data Review

"Tabular View" is a new Data Viewer UI experience. It allows users to more quickly and easily review a smaller set of fields as opposed to the entire Data Model.

  • When “Tabular View” is enabled, a virtual table is displayed, comprised of selected Data Fields or Data Columns.
    • Like all Review Views, the "Display Name" property allows the Data View tab’s name to be customized.

For example, in an EOB model it could be configured to condense data from claim sections into a simple table. In the screenshot below, one row is created for each "Claim" Data Section instance. The columns in the table are the selected Data Fields in the Tabular View configuration.


Be aware of a few key concerns and limitations to using Tabular View.

  • It is most useful to think of this as a supplementary Data View UI.
    • This won’t be a replacement for the normal Data View in most cases.
    • This is a specialized UI limiting the number of fields a clerk reviews.
    • This is a UI to quickly find errors in a smaller list of fields.
  • Tabular View has some limitations
    • There is not currently a good way to display how many documents are invalid (like traditional Data View does).
    • You will need to review the full document set to know all data errors have been cleared.
    • There is not currently a way to add instances (section records or table rows).
    • There is not currently a way for Tabular View to "remember" what field a user has selected when flipping to a traditional Data View tab.

Batch Archive command

New “Archive” command for Batches

Functionality

  • Archives Batches to another folder, organized into year/month/day subfolders.
  • Optionally clears the job history.
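The year/month/day layout can be pictured with a small Python sketch. The `archive_path` helper and the folder names are hypothetical; the sketch only shows the "YYYY/MM/DD" subfolder structure described above.

```python
from datetime import date
from pathlib import PurePosixPath

def archive_path(root, batch_name, on=None):
    """Illustrative year/month/day archive layout (names are assumptions)."""
    d = on or date.today()
    return PurePosixPath(root) / f"{d:%Y}" / f"{d:%m}" / f"{d:%d}" / batch_name

str(archive_path("Test/Archive", "Batch 0042", date(2024, 6, 12)))
# 'Test/Archive/2024/06/12/Batch 0042'
```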

Purpose

  • Historically, customers have deleted Batches after processing.
    • However, going forward, many customers will continue to store documents in Grooper.
    • For example, to use AI Search.
  • We can’t store them all in the Production branch
    • Doing so will make the Batches UI less responsive.
    • "Archive" resolves this by moving them to the Test branch in an organized fashion.
  • We can’t allow the ProcessingJob and ProcessingTask tables to become overpopulated with historical data
    • Doing so will slow down Grooper when querying those tables.
    • "Archive" resolves this when clearing the job history.

FYI

The "Dispose Batch" activity also supports organizing Batches into a 3 level "YYYY/MM/DD" folder hierarchy with new "Move To" property options.

Improvements to Grooper Desktop and scanning

There are a number of changes and improvements to scanning in Grooper that users should be aware of.

  • Users no longer need to configure device selection from Grooper Desktop
    • Grooper Desktop now truly just runs in the background.
    • Grooper Desktop still must be running to scan in the web client.
  • Device selection and configuration is now done using the Scan Viewer with two new buttons.
    • Use the “Scanner Settings” button to select a device.
    • Use the “Device Settings” button to configure settings local to the scanner (color mode, dpi, scanning side, etc).
  • The Scan Viewer will “remember” the last scanner settings used (stored in the browser cache).
  • Scanner Profile selection remains unchanged from previous versions of the web client.
  • In the absence of the thick client, the only way to create new Scanner Profiles is to upload configuration settings from the Scan Viewer to a new Scanner Profile.

For a more in-depth explanation of how to scan documents in Grooper version 2024, visit the Scan Viewer article.

New URL endpoint for scanning

http://{serverName}/Grooper/Review/Scan?repositoryId={repositoryId}&processName={processName}&batchName={batchName}

This endpoint creates a new Batch using a Batch Process whose first step is a "scan step" (a Review step with a Scan Viewer). Use this endpoint to let users follow a link and immediately start scanning documents into a new Grooper Batch.

Required Parameters: repositoryId (GUID), processName (string), batchName (string)

You must pass the following in the URL:

  • repositoryId - The Grooper Repository's ID
  • processName - The name of the published Batch Process used for scanning
  • batchName - Whatever you would like to name the new Batch

The following URL would create a new Batch, using the Batch Process named "Scanned Invoices" (&processName=Scanned%20Invoices), name it "Scan_Session_001" (&batchName=Scan_Session_001) and take the user to the Review step to start scanning:

http://serverName/Grooper/Review/Scan?repositoryId=00000000-0000-0000-0000-000000000000&processName=Scanned%20Invoices&batchName=Scan_Session_001
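When building this URL programmatically, query values such as the process name should be URL-encoded. Here is a small Python sketch; the `scan_url` helper is illustrative, not a Grooper API. Note that `urlencode` encodes spaces as "+", which is equivalent to "%20" in a query string.

```python
from urllib.parse import urlencode

def scan_url(server, repository_id, process_name, batch_name):
    """Build the Scan endpoint URL with properly escaped query parameters."""
    query = urlencode({
        "repositoryId": repository_id,
        "processName": process_name,
        "batchName": batch_name,
    })
    return f"http://{server}/Grooper/Review/Scan?{query}"

scan_url("serverName", "00000000-0000-0000-0000-000000000000",
         "Scanned Invoices", "Scan_Session_001")
```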

"Chunk Size" for extractors

Version 2024 adds a “Chunk Size” property to the Pattern Match and List Match extractor types.

  • When set, enables “chunked processing” where large documents are broken into chunks of N pages. The extractor is executed on each chunk separately and hits are collected into a final result set.
  • This is implemented to get around processing limitations when performing fuzzy matches against large documents (thousands of pages). This will allow fuzzy matched extractors to execute effectively on documents with thousands of pages.
  • The default is “1000”. For documents under 1000 pages, users will see no difference in their current extractor configurations. Users experiencing slower extraction speeds on documents larger than 1000 pages should increase the “Chunk Size” value. This should be exceptionally rare and should only affect extractors executed on documents larger than 1000 pages with Fuzzy Matching disabled.
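Conceptually, chunked processing works like the minimal Python sketch below. The `run_chunked` function and the toy extractor are illustrative assumptions, not Grooper internals; the key point is that hits from each chunk are re-offset so they remain relative to the whole document.

```python
def run_chunked(pages, extractor, chunk_size=1000):
    """Sketch of chunked processing: split the document's pages into chunks
    of N, run the extractor on each chunk, and merge hits into one result set."""
    hits = []
    for start in range(0, len(pages), chunk_size):
        chunk = pages[start:start + chunk_size]
        # Re-offset page indices so hits stay relative to the whole document.
        hits.extend((start + i, match) for i, match in extractor(chunk))
    return hits

# Toy extractor: report pages containing the word "invoice".
def find_invoice(chunk):
    return [(i, "invoice") for i, text in enumerate(chunk) if "invoice" in text]

pages = ["invoice 001", "terms", "invoice 002", "terms"]
run_chunked(pages, find_invoice, chunk_size=2)  # [(0, 'invoice'), (2, 'invoice')]
```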

Label Set changes

  • The Label Set Editor only matches labels on the first and last 100 pages of large documents
    • This change is intended to speed up UI performance when collecting labels from large documents.
  • Added “Alternate Versions” for Label Set labels.
    • This allows multiple variations of a label to be specified for a single Document Type.
    • This is useful for Document Types with a single or small number of labels that are slightly different from one form to the next. A single Document Type can account for these slight variations without creating a wholly new Document Type or setting up complicated overrides.
    • Please note: There is no way to create these alternate labels using the editor UI at this time. Labels must be typed and configured manually (or copied and pasted).

Labeled Value improvement

The Labeled Value extractor type is designed with special functionality when its Value Extractor is not configured. This functionality is intended to collect common label-value layouts without the need to configure an explicit Value Extractor. This functionality has been expanded in this version.

Version 2024 adds a "Text Wrapping" property to the Labeled Value extractor type. This allows Labeled Value to capture results wrapping to multiple lines, without any Value Extractor configuration, given these conditions:

  • The label must be left of the value or above the value
  • Labeled Value will consume aligned segments until the next label is encountered.
  • When a label is left of the value, wrapped text may not occupy space below the label.

All other updates and improvements

Improved Import ZIP, Export ZIP and Publish operations

  • The ‘Publish to Repository’ command now works with any descendant of the ‘Projects’ folder.
  • Import ZIP, Export ZIP, and Publish now support multiple selection.

Data Model “Variable Definitions” now include a format specifier for numeric and date/time variables.

  • The format will be used when the variable is displayed.

The “Batch Folder – Sort Children” command can now sort by attachment name.

The “Zip File – Unzip” command can now sort in name order when extracting attachments.

The ZIP Viewer now sorts by filename.

Speech recognition has been added to

  • AI Analyst Chat Editors – Click the microphone button to ask your question.
  • Data Grid – Hold F12 while in a field
  • Code Editor – Hold F12 while editing text.

“Display Name” property added to Task Views. This allows users to enter custom names displayed on the tab controller for each Task View added to a Review step.

NEW “Propagation” property for Data Columns.

  • Replicates captured values to adjacent empty cells.
  • Useful when a value is only listed for one row (usually the first) but needs to be copied to every row in the table (or every row until a new value is extracted).
  • Works in up or down direction.
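The propagation behavior can be sketched in a few lines of Python. The `propagate` helper below is hypothetical and only mirrors the fill-down/fill-up behavior described above: each captured value is copied into adjacent empty cells until the next captured value is encountered.

```python
def propagate(cells, direction="down"):
    """Copy captured values into adjacent empty cells (illustrative sketch)."""
    out = list(cells)
    indices = range(len(out)) if direction == "down" else range(len(out) - 1, -1, -1)
    last = None
    for i in indices:
        if out[i]:          # a captured value resets the fill source
            last = out[i]
        elif last is not None:
            out[i] = last   # fill the empty cell from the last captured value
    return out

propagate(["ACME", "", "", "Globex", ""])    # ['ACME', 'ACME', 'ACME', 'Globex', 'Globex']
propagate(["", "", "ACME"], direction="up")  # ['ACME', 'ACME', 'ACME']
```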

NEW “Clean Overrides” command on Content Types.

NEW “Array Editing” feature in the Data Grid

  • Simplifies editing of multi-cardinality fields during data review.
  • Use “Alt + Down Arrow” to drop down the editor.

NEW CMIS Document Link commands:

  • Added “Overwrite” property to the Load command
  • Created new “Move” command.

NEW “Content Type / Documents” tab

  • A new tab has been added for Content Models, Content Categories and Document Types
  • Displays a list of all documents with the type assigned (includes child/derived types)
  • Includes a Document Viewer to display a selected document.
  • Made possible by the new DocIndex table

Jobs Page improvements

  • NEW The jobs list can now be filtered by Status, Process, Step or Activity. Filtering helps admins find processing jobs to diagnose issues like canceling stuck jobs.
  • Improved the refresh rate for all visible jobs. Users will still need to refresh the page to populate the list with new jobs.

NEW Data Field/Data Column command in Data Viewer: "Confirm All"

  • Confirms all instances of a Data Field in a multi-instance Data Section, or all values in a Data Column.

NEW “Go To” button in Tester tabs.

  • Navigates to the selected document/page in the Grooper node tree.
  • Handy way to navigate from the Test Source UI to the document/page’s node location.

Stats reporting enhancements

  • Added stats logging for ad-hoc jobs.
  • Now includes a “Tasks Processed” stat
  • Stats queries now have 3 new “Output” properties: “Scale”, “Number Format”, and “Time Format”.
  • The “Kiosk View” can now display three reports side-by-side when displaying query results in a new browser window. If more than three reports are selected, the Kiosk View will cycle reports in and out from left (oldest) to right (newest).

NEW “Organize By Date” property for Import Providers

  • Organizes Batches created on import into a 3-level “YYYY/MM/DD” folder hierarchy.
  • Similar to how the “Archive” command organizes Batches in the Test branch.
  • Useful for organizing large amounts of imported Batches in the Production branch.

NEW “Extractor Conversion Commands”

  • “Convert to Value Reader”: Converts a Data Type to a Value Reader
  • “Convert to Data Type”: Converts a Value Reader to a Data Type

Azure OCR performance enhancements

  • Azure OCR will not run traditional OCR if orientation detection produced 0 characters
  • Secondary iterations, cell validation, and segment reprocessing are all limited by a maximum amount of OCR time.

NEW “Invert” button when editing checkbox groups in the property grid gives a quick way to invert selection.

NEW “Divider” section extract method property: “Line Offset”

  • Indicates the number of text lines to move up or down from each Divider Extractor hit.
  • Positive values indicate the number to move down.
  • Negative values indicate the number to move up.

Added user time zone (USER TZ) and server time zone (SERVER TZ) information to the User Info button.

Modified CMIS import (Import Descendants and Import Query Results) to support importing multi-cardinality field values.

Improved the Diagnostics Viewer to perform well even when 10,000+ entries are present.


New Data Table Extract Method: LLM Table Reader

  • Runs large language model (LLM) based extraction at the table level.
  • This functionality is based on the same logic as the AI Extract Fill Method.
  • Context can be restricted by a Header Extractor and a Footer Extractor. When using Label Sets, labels can take the place of Header and/or Footer Extractors.

New icons

Object icons

Each object is listed with the name of its new icon.

  • Architecture Objects: Grooper Root (database), File Store (hard_drive), Machine (computer)
  • Batch Objects: Batch (inventory_2), Batch Folder (folder), Batch Page (contract)
  • Organization Objects: Folder (folder_open), Project (package_2), Local Resources Folder (folder_data)
  • Processing Objects: Batch Process (settings), Batch Process Step (edit_document), Processing Queue (memory), Review Queue (person_play)
  • Content Type Objects: Content Model (stacks), Content Category (collections_bookmark), Document Type (description), Form Type (two_pager), Page Type (article)
  • Data Element Objects: Data Model (data_table), Data Field (variables), Data Section (insert_page_break), Data Table (table), Data Column (view_column)
  • Extractor Objects: Value Reader (quick_reference_all), Data Type (pin), Field Class (input)
  • Connection Objects: CMIS Connection (cloud), CMIS Repository (settings_system_daydream), Data Connection (database)
  • Profile Objects: IP Profile (perm_media), IP Group (gallery_thumbnail), IP Step (image), OCR Profile (library_books), Scanner Profile (scanner), Separation Profile (insert_page_break)
  • Miscellaneous Objects: Control Sheet (document_scanner), Data Rule (flowsheet), Lexicon (dictionary), Object Library (extension)

Activity icons

Each activity is listed with the name of its new icon.

  • Review: Review (person_search)
  • Cleanup and Recognition: Correct (abc), Detect Language (travel_explore), Image Processing (wallpaper), Recognize (format_letter_spacing_wide)
  • Document Processing: Apply Rules (flowsheet), Classify (unknown_document), Convert Data (switch_access_2), Export (output), Extract (export_notes), Redact (format_ink_highlighter), Separate (insert_page_break)
  • Transform: Burst Book (auto_stories), Merge (file_save), Render (print), Split Pages (file_copy), Split Text (receipt), Text Transform (insert_text), Translate (translate), XML Transform (code_blocks)
  • Utilities: Batch Transfer, Deduplicate, Dispose Batch (inventory_2), Execute (tv_options_edit_channels), Launch Process, Remove Level, Send Mail (forward_to_inbox), Spawn Batch, Train Lexicon
  • AI: AI Dialogue (network_intelligence_update), a new activity with no previous icon
  • Microform Processing: Clip Frames (view_module), Detect Frames (view_module), Initialize Card (view_module)

Database changes

Changes to existing Grooper database tables:

  • TreeNode
    • Replaces RowVersion with LastModified
    • Will be used for search indexing purposes
  • ProcessingJob
    • Added an "Activity" column.
    • Allows jobs to be submitted from the search page.


New Grooper database tables

  • "DocIndex" and "IndexState" were added to improve the efficiency of the Indexing Service.
    • DocIndex – One row for each Content Type assignment.
      • Columns:
        • NodeId – Batch Folder id.
        • TypeId – Content Type id.
    • IndexState – One row for each indexed document.
      • Columns:
        • FolderId – Batch Folder id.
        • TypeId – Content Type id.
        • LastUpdated – Index date.
        • PropHash – Hash of root properties at last index.
        • DataHash – Hash of index data at last index.
        • TextHash – Hash of text content at last index.