2021:Lexical (Classify Method)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


This glossary seeks to educate readers on various Grooper terms, objects and other entities. Glossary entries will be short paragraphs describing the topic. For each glossary entry, you will find links to a full article about the entry as well as articles on associated terms.

Each entry is organized according to what major Grooper entity they belong to. For example, "Classify" is an "Activity". It is found in the "Activity" section of the Glossary.

Application

Grooper is an intelligent document processing platform that uses an array of sophisticated techniques to automate end-to-end content capture and delivery. From a technical standpoint, Grooper consists of a Grooper Repository and the applications that support the management and execution of configuration assets.

  • A Grooper Repository consists of two things: (1) A series of tables in a SQL database (containing configuration nodes and their properties) and (2) a File Store (containing files associated to nodes in the database).

The Grooper applications are as follows:

  • Grooper - The primary program files for the Grooper platform. This application will need to be installed on any Grooper web server hosting the Grooper UI and processing servers running Activity Processing services to automate task processing.
  • Grooper Command Console - This is an administrative utility that gets installed with the Grooper application.
  • Grooper Web Client - This application installs the Grooper user interface. It will need to be installed on the Grooper web server. The Grooper web server hosts the Grooper web app which is accessed via a URL.
  • Grooper Desktop - This is a lightweight application required to scan documents using the Grooper web app. It runs in the background and helps operate the Scan Viewer in Grooper. It needs to be installed on any workstation connected to a document scanner.

Grooper Command Console

Grooper Command Console is a command-line interface that performs system configuration and administration tasks within Grooper.

Grooper Web Client

The Grooper user interface is accessed using a web browser from a URL. The Grooper Web Client is the application that installs the Grooper website on a web server.

Node Types

Grooper.GrooperNode

Nodes are the main configuration objects in Grooper. They are created and accessed in the Node Tree from the Design page. The different types of nodes ("Node Types") serve different functions in Grooper. For example, "Batch" nodes are the primary container for document content. They contain "Batch Folder" nodes which represent documents and "Batch Page" nodes which represent individual pages of documents.

AI Analyst

BE AWARE: AI Analysts are obsolete as of version 2025. See AI Assistant for the new and improved version of AI Analyst. An AI Analyst facilitates the ability to interact with a document as you might with an AI chatbot.

AI Assistant

Grooper.GPT.AIAssistant

AI Assistants are Grooper's conversational AI personas. They answer questions about resources they can access (including content from documents, databases and/or web services). This greatly increases an AI's ability to answer domain-specific questions that require access to these resources.

Batch Objects

Grooper.Core.BatchObject

Batch Objects are the foundational elements of Grooper's document processing system, providing a unified structure for organizing, processing, and reviewing document content within a inventory_2 Batch. Every item within a Batch—whether a document, folder, or page—is represented as a Batch Object (and Batches themselves are Batch Objects too).

Batch

Grooper.Core.Batch

inventory_2 Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called settings Batch Processes. Documents and their pages are represented in Batches by a hierarchy of folder Batch Folders and contract Batch Pages.

Batch Folder

Grooper.Core.BatchFolder

The folder Batch Folder is an organizational unit within a inventory_2 Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or contract Batch Page nodes as children.

  • Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Batch Page

Grooper.Core.BatchPage

contract Batch Page nodes represent individual pages within a inventory_2 Batch. Batch Pages are created in one of two ways: (1) When images are scanned into a Batch using the Scan Viewer. (2) Or, when split from a PDF or TIFF file using the Split Pages activity.

  • Batch Pages are frequently referred to simply as "pages".

Batch Process

Grooper.Core.BatchProcess

settings Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a inventory_2 Batch. Each step is comprised of a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute edit_document Batch Process Steps which are added as children nodes.
  • A Batch Process is often referred to as simply a "process".

Batch Process Step

Grooper.Core.BatchProcessStep

edit_document Batch Process Steps are specific actions within a settings Batch Process sequence. Each Batch Process Step performs an "Activity" specific to some document processing task. These Activities will either be a "Code Activity" or "Review" activities. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Process Steps are frequently referred to as simply "steps".
  • Because a single Batch Process Step executes a single Activity configuration, they are often referred to by their referenced Activity as well. For example, a "Recognize step".

CMIS Connection

Grooper.CMIS.CmisConnection

cloud CMIS Connections provide a standardized way of connecting to various content management systems (CMS). CMIS Connections allow Grooper to communicate with multiple external storage platforms, enabling access to documents and document metadata that reside outside of Grooper's immediate environment.

  • For content management systems that support the CMIS standard, the CMIS Connection connects to the CMS using the CMIS standard.
  • For those that do not, the CMIS Connection normalizes their connection and transfer protocols as if they were CMIS platforms.

CMIS Repository

Grooper.CMIS.CmisRepository

settings_system_daydream CMIS Repository nodes provide document access in external storage platforms through a cloud CMIS Connection. With a CMIS Repository, users can manage and interact with those documents within Grooper. They are used primarily for import using Import Descendants and Import Query Results and for export using CMIS Export.

  • CMIS Repositories are created as child nodes of a CMIS Connection using the "Import Repository" command.

Content Types

Grooper.Core.ContentType

Content Types are a class of node types used to classify folder Batch Folders. They represent categories of documents (stacks Content Models and collections_bookmark Content Categories) or distinct types of documents (description Document Types). Content Types serve an important role in defining Data Elements and Behaviors that apply to a document.

Content Model

Grooper.Core.ContentType

stacks Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the collections_bookmark Content Categories and description Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Content Category

Grooper.Core.ContentCategory

collections_bookmark A Content Category is a container for other Content Category or description Document Type nodes in a stacks Content Model. Content Categories are often used simply as organizational buckets for Content Models with large numbers of Document Types. However, Content Categories are also necessary to create branches in a Content Model's classification taxonomy, allowing for more complex Data Element inheritance and Behavior inheritance.

Document Type

Grooper.Core.DocumentType

description Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a stacks Content Model or a collections_bookmark Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the folder Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's data_table Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

Form Type

Grooper.Core.FormType

two_pager Form Types represent trained variations of a description Document Type. These nodes store machine learning training data for Lexical and Visual document classification methods.

Page Type

Grooper.Core.PageType

article Page Types represent individual pages of a two_pager Form Type. These nodes store page-level machine learning training data for Lexical and Visual document classification methods. Page Types are used by ESP Auto Separation to make document separation decisions based on page classification.

Control Sheet

Grooper.Capture.ControlSheet

document_scanner Control Sheets are printable pages used to automate document separation at scan time. Control Sheets are placed before each new document before loading pages into the scanner. Then, when pages are scanned using the Scan Viewer and Control Sheet Separation is executed, a new folder Batch Folder is created for every Control Sheet scanned. Control Sheets can also be configured to assign the Batch Folder a description Document Type, thus classifying the document at scan time as well.

Data Connection

Grooper.Core.DataConnection

database Data Connections connect Grooper to Microsoft SQL and supported ODBC databases. Once configured, Data Connections can be used to export data extracted from a document to a database, perform database lookups to validate data Grooper collects, and perform other actions related to database management systems (DBMS).

  • Grooper supports MS SQL Server connectivity with the "SQL Server" connection method.
  • Grooper supports Oracle, PostgreSQL, Db2, and MySQL connectivity with the "ODBC" connection method.

Data Elements

Grooper.Core.DataElement

Data Elements are a class of node types used to collect data from a document. These include: data_table Data Models, insert_page_break Data Sections, variables Data Fields, table Data Tables, and view_column Data Columns.

Data Model

Grooper.Core.DataModel

data_table Data Models are leveraged during the Extract activity to collect data from documents (folder Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) define data extraction logic and settings for how data is reviewed in a Data Viewer.

Data Field

Grooper.Core.DataField

variables Data Fields represent a single value targeted for data extraction on a document. Data Fields are created as child nodes of a data_table Data Model and/or insert_page_break Data Sections.

  • Data Fields are frequently referred to simply as "fields".

Data Section

Grooper.Core.DataSection

A insert_page_break Data Section is a container for Data Elements in a data_table Data Model. They can contain variables Data Fields, table Data Tables, and even other Data Sections as child nodes, adding hierarchy to a Data Model. They serve two main purposes:

  1. They can simply act as organizational buckets for Data Elements in larger Data Models.
  2. By configuring its "Extract Method", a Data Section can subdivide larger and more complex documents into smaller parts to assist in extraction.
    • "Single Instance" sections define a division (or "record") that appears only once on a document.
    • "Multi-Instance" sections define collection of repeating divisions (or "records").

Data Table

Grooper.Core.DataTable

A table Data Table is a Data Element specialized in extracting tabular data from documents (i.e. data formatted in rows and columns).

  • The Data Table itself defines the "Table Extract Method". This is configured to determine the logic used to locate and return the table's rows.
  • The table's columns are defined by adding view_column Data Column nodes to the Data Table (as its children).

Data Column

Grooper.Core.DataColumn

view_column Data Columns represent columns in a table extracted from a document. They are added as child nodes of a table Data Table. They define the type of data each column holds along with its data extraction properties.

  • Data Columns are frequently referred to simply as "columns".
  • In the context of reviewing data in a Data Viewer, a single Data Column instance in a single Data Table row is most frequently called a "cell".

Data Field Container and Data Element Container

Grooper.Core.DataFieldContainer
Grooper.Core.DataElementContainer

Data Field Container and Data Element Container are two base types in Grooper from which "container" Data Elements are derived. Container Data Elements (data_table Data Models, insert_page_break Data Sections, and table Data Tables) serve an important function in organizing and defining behavior and extraction logic for the variables Data Fields and view_column Data Columns they contain.

  • While "Data Field Container" and "Data Element Container" are distinct classes in the Grooper Object Model, they are closely related. While Grooper scripters/experts should know the difference, for most practical purposes, the terms are used interchangeably (or they're just called "containers" or "container elements"). See Object Model info for more.

Data Rule

Grooper.Core.DataRule

flowsheet Data Rules are used to normalize or otherwise prepare data collected in a data_table Data Model for downstream processes. Data Rules define data manipulation logic for data extracted from documents (folder Batch Folders) to ensure data conforms to expected formats or meets certain standards.

  • Each Data Rule executes a "Data Action", which can do things like compute a field's value, parse a field into other fields, perform lookups, and more.
  • Data Actions can be conditionally executed based on a Data Rule's "Trigger" expression.
  • A hierarchy of Data Rules can be created to execute multiple Data Actions and perform complex data transformation tasks.
  • Data Rules can be applied by:
    • The Apply Rules activity (must be done after data is collected by the Extract activity)
    • The Extract activity (will run after the Data Model extraction)
    • The Convert Data activity when converting a document to another Document Type
    • They can be applied manually in a Data Viewer with the "Run Rule" command.

Extractor Nodes

Grooper.Core.ExtractorNode

Data Type

Grooper.Extract.DataType

pin Data Types are nodes used to extract text data from a document. Data Types have more capabilities than quick_reference_all Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

Value Reader

Grooper.Extract.ValueReader

quick_reference_all Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions).

  • Value Readers can be used on their own or in conjunction with pin Data Types for more complex data extraction and collation.
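
Outside of Grooper, the core idea of a Pattern Match extractor can be illustrated with a plain regular expression. The following is a minimal sketch in Python; the pattern, sample text, and target value are hypothetical examples, not part of any actual Grooper configuration.

  import re

  # Hypothetical pattern: an invoice number label followed by 4-10 digits,
  # similar in spirit to what a Pattern Match Value Extractor might use.
  pattern = re.compile(r"Invoice\s*(?:No\.?|Number)[:#]?\s*(\d{4,10})", re.IGNORECASE)

  page_text = "ACME Supply Co.\nInvoice No: 0048153\nDate: 04/22/2024"

  match = pattern.search(page_text)
  if match:
      print("Extracted value:", match.group(1))   # prints 0048153

In Grooper, the equivalent pattern would be entered in the Value Reader's Pattern Match extractor configuration rather than written in code.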

Field Class

Grooper.Extract.FieldClass

input Field Classes are NLP (natural language processing) based extractor nodes. They find values based on some natural language context near that value. Values are positively or negatively associated with nearby text-based "features" by training the extractor. During extraction, the extractor collects values based on these training weightings.

  • Field Classes are most useful when attempting to find values within the flow of natural language.
  • Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler "extractor nodes" like quick_reference_all Value Readers or pin Data Types.
  • Advances in large-language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with far less setup.

File Store

Grooper.FileStore

hard_drive File Store nodes are a key part of Grooper's "database and file store" architecture. They define a storage location where file content associated with Grooper nodes are saved. This allows processing tasks to create, store and manipulate content related to documents, images, and other "files".

  • Not every node in Grooper will have files associated with it, but if it does, those files are stored in the Windows folder location defined by the File Store node.

Folder

Grooper.Folder

Batches Folder

Grooper.Core.BatchesFolder

Projects Folder

Grooper.ProjectsFolder

Machines Folder

Grooper.MachinesFolder

Local Resources Folder

Grooper.Core.LocalResourcesFolder

IP Elements

Grooper.IP.IpElement

IP Group

Grooper.IP.IpGroup

gallery_thumbnail IP Groups are containers of image IP Steps and/or IP Groups that can be added to perm_media IP Profiles. IP Groups add hierarchy to IP Profiles. They serve two primary purposes:

  1. They can be used simply to organize IP Steps for IP Profiles with large numbers of steps.
  2. They are often used with "Should Execute Expressions" and "Next Step Expressions" to conditionally execute a sequence of IP Steps.

IP Profile

Grooper.IP.IpProfile

perm_media IP Profiles are a step-by-step list of image processing operations (IP Commands). They are used for several image processing related operations, but primarily for:

  1. Permanently enhancing an image during the Image Processing activity (usually to get rid of defects in a scanned image, such as skewing or borders).
  2. Cleaning up an image in-memory during the Recognize activity without altering the image to improve OCR accuracy.
  3. Computer vision operations that collect layout data (table line locations, OMR checkboxes, barcode values and more) utilized in data extraction.

IP Step

Grooper.IP.IpStep

image IP Steps are the basic units of an perm_media IP Profile. They define a single image processing operation, called an IP Command in Grooper.

Lexicon

Grooper.Core.Lexicon

dictionary Lexicons are dictionaries used throughout Grooper to store lists of words, phrases, weightings for Fuzzy RegEx, and more. Users can add entries to a Lexicon, Lexicons can import entries from other Lexicons by referencing them, and entries can be dynamically imported from a database using a database Data Connection. Lexicons are commonly used to aid in data extraction, with the "List Match" and "Word Match" extractors utilizing them most commonly.

Machine

Grooper.Machine

computer Machine nodes represent servers that have connected to the Grooper Repository. They are essential for distributing task processing loads across multiple servers. Grooper creates Machine nodes automatically whenever a server makes a new connection to a Grooper Repository's database. Once added, Machine nodes can be used to view server information and to manage Grooper Service instances.

OCR Profile

Grooper.OCR.OcrProfile

library_books OCR Profiles store configuration settings for optical character recognition (OCR). They are used by the Recognize activity to convert images of text on contract Batch Pages into machine-encoded text. OCR Profiles are highly configurable, allowing fine-grained control over how OCR occurs, how pre-OCR image cleanup occurs, and how Grooper's OCR Synthesis occurs. All this works to the end goal of highly accurate OCR text data, which is used to classify documents, extract data and more.

Object Library

Grooper.ObjectLibrary

extension Object Library nodes are .NET libraries that contain code files for customizing Grooper's functionality. These libraries are used for a range of customization and integration tasks, allowing users to extend Grooper's capabilities.

Examples include:
  • Adding custom Activities that execute within Batch Processes
  • Creating custom commands available during the Review activity and in the Design page.
  • Defining custom methods that can be called from code expressions on Data Field and Batch Process Step objects.
  • Creating custom Connection Types for CMIS Connections for import/export operations from/to CMS systems.
  • Establishing custom Grooper Services that perform automated background tasks at regular intervals.

Project

Grooper.Project

package_2 Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects, such as stacks Content Models, settings Batch Processes, and profile objects, are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Resource File

Grooper.ResourceFile

Resource Files are nodes you can add to a package_2 Project to store any kind of file. Each Resource File stores one file. While you can use Resource Files to store any kind of file in a Project, there are several areas in Grooper that can reference Resource Files to one end or another, including XML schema files used for Grooper's XML Schema Integration.

Root

Grooper.GrooperRoot

The Grooper database Root node is the topmost element of the Grooper Repository. All other nodes in a Grooper Repository are its children/descendants. The Grooper Root also stores several settings that apply to the Grooper Repository, including the license serial number or license service URL and Repository Options.

Scanner Profile

Grooper.Capture.ScannerProfile

scanner Scanner Profiles store configuration settings for operating a document scanner. Scanner Profiles provide users operating the Scan Viewer in the Review activity a quick way to select pre-saved scanner configurations.

Separation Profile

Grooper.Capture.SeparationProfile

insert_page_break Separation Profiles store settings that determine how contract Batch Pages are separated into folder Batch Folders. Separation Profiles can be referenced in two ways:

  • In a Review activity's Scan Viewer settings to control how pages are separated in real time during scanning.
  • In a Separate activity as an alternative to configuring separation settings locally.

Work Queue

Grooper.Core.WorkQueue

Processing Queue

Grooper.Core.ThreadPool

memory Processing Queues help automate "machine performed tasks" (i.e., Code Activity tasks performed by computer Machines and their Activity Processing services). Processing Queues are assigned to Batch Process Steps to distribute tasks, control the maximum processing rate, and set the "concurrency mode" (specifying if and how parallelism can occur across one or more servers).

  • Processing Queues are used to dedicate Activity Processing services with a capped number of processing threads to resource intensive activities, such as Recognize. That way, these compute hungry tasks won't gobble up all available system resources.
  • Processing Queues are also used to manage activities, such as Render, which can only have one activity instance running per machine (this is done by changing the queue's Concurrency Mode from "Maximum" to "Per Machine").
  • Processing Queues are also used to throttle Export tasks in scenarios where the export destination can only accept one document at a time.

Review Queue

Grooper.Core.ReviewQueue

person_play Review Queues help organize and filter human-performed Review activity tasks. User groups are assigned to each Review Queue, which is then set either on a settings Batch Process or a Review step. A user's membership in Review Queues then affects how inventory_2 Batches are distributed on the Batches page and how Review tasks are distributed on the Tasks page.

Core Configuration Types

In Grooper, nodes are configured by editing their property settings. The following are configurable items that are considered a "core" part of Grooper. These objects are designed to be part of a larger configuration.

  • These "core configuration types" are found most commonly in the property settings on a node in the Grooper node tree.
  • However, they may also appear when configuring commands or as part of a larger property configuration.

This includes:

  • Scripting/Advanced user info: These objects inherit from a base class called "Embedded Object". This includes a large number of objects that exist as configurable properties.

Activity

Grooper.Core.BatchProcessingActivity

Grooper Activities define specific document processing operations done to a inventory_2 Batch, folder Batch Folder, or contract Batch Page. In a settings Batch Process, each edit_document Batch Process Step executes a single Activity (determined by the step's "Activity" property).

  • Batch Process Steps are frequently referred to by the name of their configured Activity followed by the word "step". For example: "Classify step".

Attended Activities

Grooper.Core.AttendedActivity

Attended Activities are a type of Activity in Grooper that requires direct user interaction within a settings Batch Process workflow. Attended Activities are designed for steps where human review, validation or intervention is necessary (or automated processing is simply insufficient). The only current Attended Activity in Grooper is person_search Review.

Review

Grooper.Activities.Review

person_search Review is an Activity that allows user attended review of Grooper's results. This allows human operators to validate processed contract Batch Page and folder Batch Folder content using specialized user interfaces called "Viewers". Different kinds of Viewers assist users in reviewing Grooper's image processing, document classification, and data extraction results, as well as in operating document scanners.

Code Activities

Grooper.Core.CodeActivity

AI Dialogue

BE AWARE: AI Analysts and AI Dialogue are obsolete as of version 2025. This Activity only exists in version 2024. network_intelligence_update AI Dialogue is an Activity that executes a scripted conversation with an psychology AI Analyst and saves the resulting conversation on the document as a JSON file.

Apply Rules

Grooper.Activities.ApplyRules

flowsheet Apply Rules is an Activity that runs flowsheet Data Rules on data that has previously been extracted from documents (folder Batch Folders).

  • The Apply Rules activity will always need to run after an Extract activity runs (An Extract step must come before an Apply Rules step in the order of edit_document Batch Process Steps in a settings Batch Process).

Attach

Grooper.GPT.Attach

file_present Attach is an Activity that physically moves and nests documents within a folder Batch Folder based on attachment markers set by the attach_file_add Mark Attachments activity. It consolidates related documents—such as addenda or supporting documents—under their host documents, updating the inventory_2 Batch hierarchy for downstream processing.

Batch Transfer

Grooper.Activities.BatchTransfer

Batch Transfer is an Activity that

Burst Book

Grooper.Microform.BurstBook

auto_stories Burst Book is an Activity that

Classify

Grooper.Activities.ClassifyFolders

unknown_document Classify is an Activity that "classifies" folder Batch Folders in a inventory_2 Batch by assigning them a description Document Type.

  • Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
  • Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.

Clip Frames

view_module Clip Frames is a specialized Activity for processing microfiche in Grooper. It extracts defined areas from microfiche card images, creating new image frames or layers for focused analysis or processing.

Convert Data

switch_access_2 Convert Data is an Activity that converts a document (folder Batch Folder) to another description Document Type using Data Actions to copy and convert Data Elements from the source Document Type to those in the target Document Type. Convert Data is a specialized Activity for use cases requiring a great deal of data transformation before export.

Correct

abc Correct is an Activity that performs spell correction. It can correct a folder Batch Folder's text content or specific Data Element values to resolve OCR errors, deidentify data or otherwise enhance text data.

Deduplicate

Deduplicate is an Activity that

Detect Frames

view_module Detect Frames is a specialized Activity for processing microfiche in Grooper. It locates and identifies frame lines on microfiche card images, enabling the isolation of areas within the frames for further data extraction or processing.

Detect Language

Grooper.GPT.DetectLanguage

travel_explore Detect Language is an Activity that uses a large language model (LLM) to determine the primary language (English, Spanish, French, etc.) of a document. Activities executed downstream, such as export_notes Extract, can use this information to apply language specific logic.

Execute

tv_options_edit_channels Execute is an Activity that runs one or more specified object commands. This gives access to a variety of Grooper commands in a settings Batch Process for which there is no Activity, such as the "Sort Children" command for Batch Folders or the "Expand Attachments" command for email attachments.

Export

output Export is an Activity that transfers documents and extracted information to external file systems and content management systems, completing the data processing workflow.

Extract

export_notes Extract is an Activity that retrieves information from folder Batch Folder documents, as defined by Data Elements in a data_table Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Image Processing

wallpaper Image Processing is an Activity that enhances contract Batch Page images and optimizes them for better OCR text recognition and data extraction results.

Initialize Card

view_module Initialize Card is a specialized Activity for processing microfiche in Grooper. It prepares and configures microfiche card images for further processing.

Launch Process

Launch Process is an Activity that

Mark Attachments

Grooper.GPT.MarkAttachments

attach_file_add Mark Attachments is an Activity that analyzes documents (folder Batch Folders) to determine attachment relationships using configurable rules ("Attachment Rules"). It sets attachment markers on documents—indicating whether they should be attached to neighboring Batch Folders. These markers are then used by the Attach activity to group and nest related documents.

Merge

file_save Merge is an Activity that creates a PDF, TIF, XML or ZIP file from the page and data content of a Batch Folder and saves it to that Batch Folder.

Recognize

format_letter_spacing_wide Recognize is an Activity that obtains machine-readable text from contract Batch Pages and folder Batch Folders. When properly configured with an library_books OCR Profile, Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also reference an perm_media IP Profile to collect "layout data" like lines, checkboxes, and barcodes. Other Activities then use this machine-readable text and layout data for document analysis and data extraction.

Redact

format_ink_highlighter Redact is an Activity that visibly obscures (or "redacts") text information on a page based on results returned from an extractor. Be aware, Redact does not alter the text data. It only alters the image.

Remove Level

account_tree Remove Level is an Activity that

Render

print Render is an Activity that converts files of various formats to PDF. It does this by digitally printing the file to PDF using the Grooper Render Printer. This normalizes electronic document content from file formats Grooper cannot read natively to PDF (which it can read natively), allowing Grooper to extract the text via the format_letter_spacing_wide Recognize Activity.

Route

alt_route Route is an Activity that

Send Mail

forward_to_inbox Send Mail is an Activity that automates email notifications from Grooper based on events and conditions set by a settings Batch Process. Optionally, documents in the inventory_2 Batch may be attached to the generated email.

Separate

insert_page_break Separate is an Activity that sorts contract Batch Pages into individual folder Batch Folders. This distinguishes "loose pages" from the documents formed by those pages. Once loose pages are separated into Batch Folder documents, they can be further processed by unknown_document Classify, export_notes Extract, output Export and other Activities that need to run on the folder (i.e. document) level.

Spawn Batch

inventory_2 Spawn Batch is an Activity that

Split Pages

Multi-page PDF and TIF files come into Grooper as files attached to single folder Batch Folders. Split Pages is an Activity that creates child contract Batch Pages for each page in the PDF or TIF. This allows Grooper to process and handle these pages as individual objects.

Split Text

receipt Split Text is an Activity that

Text Transform

insert_text Text Transform is an Activity that

Train Lexicon

book_2 Train Lexicon is an Activity that

Translate

translate Translate is an Activity that

XML Transform

code_blocks XML Transform is an Activity that applies XSLT stylesheets to XML data to modify or reformat the output structure for various purposes.
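
As a rough illustration of what applying a stylesheet looks like (using Python's lxml outside of Grooper, with hypothetical element names, rather than Grooper's own XSLT processing):

  from lxml import etree

  # Hypothetical source XML and a stylesheet that reshapes it. Grooper's
  # XML Transform activity applies XSLT stylesheets to XML data in a
  # comparable way to restructure the output.
  source = etree.XML("<Document><InvoiceNo>0048153</InvoiceNo></Document>")

  stylesheet = etree.XML("""
  <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/Document">
      <Record id="{InvoiceNo}"/>
    </xsl:template>
  </xsl:stylesheet>
  """)

  transform = etree.XSLT(stylesheet)
  print(str(transform(source)))   # prints the reshaped XML, e.g. <Record id="0048153"/>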

Behavior

A "Behavior" is one of several features applied to a Content Type (such as a description Document Type). Behaviors affect how certain Activities and Commands are executed, based how a document (folder Batch Folder) is classified. They behave differently, according to their Document Type. This includes how they are exported (how Export behaves), if and how they are added to a document search index (how the various indexing commands behave), and if and how Label Sets are used (how Classify and Extract behave in the presence of Label Sets).

  • Each Behavior is enabled by adding it to a Content Type. They are configured in the Behaviors editor.
  • Behaviors extend to descendant Content Types, if the descendant Content Type has no Behavior configuration of its own.
    • For example, all Document Types will inherit their parent Content Model's Behaviors.
    • However, if a Document Type has its own Behavior configuration, it will be used instead.

Export Behavior

An Export Behavior defines the parameters for exporting classified folder Batch Folder content from Grooper to other systems. This includes where they are exported to (what content management system, file system, database etc), what content is exported (attached files, images, and/or data), how it is formatted (PDF, CSV, XML etc), folder pathing, file naming and data mappings (for Data Export and CMIS Export).

Import Behavior

An Import Behavior defines how data is mapped from files in an external content management system to Batch Folders created on import when using CMIS Import.

Indexing Behavior

An Indexing Behavior allows documents (folder Batch Folders) to be indexed via AI Search. Once indexed, users can search for and retrieve documents from the Search Page.

Labeling Behavior

A Labeling Behavior extends "label set" functionality to description Document Types. This allows you to collect field labels and other labels present on a document and use them in a variety of ways. This includes functionality for classification, field extraction, table extraction, and section extraction.

PDF Data Mapping

PDF Data Mapping is a Behavior that enhances PDF files generated by the Merge or Export activities with metadata, bookmarks, annotations and/or different kinds of widgets.

Text Rendering

Text Rendering is a Behavior that causes text documents (e.g. TXT files) to be interpreted and displayed as paginated documents rather than a raw text stream.

  • By default, this renders TXT files to an 8.5 by 11 inch page format, but this can be altered in the Text Rendering settings.

Classify Method

"Classify Methods" define classification logic used by stacks Content Models during the unknown_document Classify activity. Classify Methods organize document content in Grooper by assigning folder Batch Folders a description Document Type.

  • Classify Methods analyze documents (Batch Folders) to determine what kind of document each is.
  • Each Classify Method analyzes documents according to a different methodology. These methodologies include text-based pattern matching, computer vision, machine learning models, label sets and more.
  • Classify Methods are configured by setting and configuring a Content Model's "Classification Method" property.

GPT Embeddings

BE AWARE: GPT Embeddings is obsolete as of version 2025. The LLM Classifier and Search Classifier methods are the new and improved AI-enabled classification methods. GPT Embeddings is a Classify Method that uses an OpenAI embeddings model and trained document samples to tell one document from another.

Labelset-Based

"Labelset-Based" is a Classify Method that leverages the labels defined via a Labeling Behavior to classify folder Batch Folders.

Lexical

"Lexical" is a Classify Method that classifies folder Batch Folders based on the text content of trained document examples. This is achieved through the statistical analysis of word frequencies that identify description Document Types.

LLM Classifier

"LLM Classifier" is a Classify Method that classifies documents (folder Batch Folders) by asking a large language model (LLM) to select its description Document Type from a list.

Rules-Based

"Rules-Based" is a Classify Method that employs "rules" defined on each description Document Type to classify folder Batch Folders. Positive Extractor and Negative Extractor properties are configured for each Document Type to positively or negatively associate a Batch Folder based on predefined criteria.

  • While the Positive and Negative Extractors impact all Classify Methods' results, the Rules-Based method classifies using only these properties and nothing else.

Search Classifier

"Search Classifier" is a Classify Method that classifies documents (folder Batch Folders) by finding similar documents in a document search index. The Search Classifier method uses an embeddings model and vector similarity to give an unclassified document the same description Document Type as its closest match in the search index.

Visual

"Visual" is a Classify Method that uses image analysis instead of text data to determine the description Document Type assigned to a folder Batch Folder during classification. Instead of using text-based extractors, an "Extract Features" IP Command in an perm_media IP Profile is used to collect image-based data from a Batch Folder's image(s). This image-based data is compared against that of previously trained document examples of each Document Type to classify the Batch Folder.

IP Command

IP Commands specify an image processing (IP) operation (such as image cleanup, format conversion or feature detection) and are used to construct image IP Steps in an IP Profile. IP Commands are configured using an IP Step's Command property.

Barcode Detection

Barcode Detection is an IP Command that detects and reads barcode data. The detected barcode information is stored as part of the page's layout data.

Barcode Removal

Barcode Removal is an IP Command that detects, reads and digitally removes barcodes from an image. The detected barcode information is stored as part of the page's layout data.

Binarize

Binarize is an IP Command that converts a color or grayscale image to a bi-tonal (black and white) image using various thresholding methods.
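
As a simple point of reference, the most basic thresholding approach (a fixed global threshold) looks like this outside of Grooper, using Pillow. The file name and threshold value are placeholders; Grooper's Binarize command offers more sophisticated thresholding methods than this.

  from PIL import Image

  # Convert a scanned page to grayscale, then to bi-tonal: any pixel brighter
  # than the threshold becomes white, everything else becomes black.
  img = Image.open("page.png").convert("L")
  bitonal = img.point(lambda p: 255 if p > 128 else 0, mode="1")
  bitonal.save("page_bw.tif")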

Box Detection

Box Detection is an IP Command that detects checkboxes and determines their check state (checked or unchecked). The detected checkbox information is stored as part of the page's layout data.

Box Removal

Box Removal is an IP Command that detects checkboxes, determines their check state (checked or unchecked) and digitally removes them from an image. The detected checkbox information is stored as part of the page's layout data.

Extract Page

Extract Page is an IP Command that removes an image from a carrier image while simultaneously removing any image warping or skewing.

Line Detection

Line Detection is an IP Command that locates horizontal and vertical lines on documents. The detected line locations are stored as part of the page's layout data.

Line Removal

Line Removal is an IP Command that locates and removes horizontal and vertical lines from documents. The detected line locations are stored as part of the page's layout data.

Scratch Removal

Scratch Removal is an IP Command that detects and removes or repairs scratches from film-based images.

Shape Detection

Shape Detection is an IP Command that locates shapes on a document that match one or more sample images. Common shapes targeted by this command are stamps, seals, logos or other graphical marks that can serve as triggers for document separation or anchors for data extraction. The detected shapes' locations are stored as part of the page's layout data.

Shape Removal

Shape Removal is an IP Command that detects and removes shapes from documents. Common shapes targeted by this command are stamps, seals, logos or other graphical marks that interfere with OCR and/or can serve as triggers for document separation or anchors for data extraction. The detected shapes' locations are stored as part of the page's layout data.

OCR Engine

An "OCR engine" is the part of OCR software that recognizes text from images. OCR engines analyze the image's pixels to determine where text is on the page and what each character is. In Grooper, OCR engines are selected when configuring an OCR Profile's OCR Engine property.

Azure OCR

Azure OCR is an OCR Engine option for OCR Profiles that utilizes Microsoft Azure's Read API. Azure's Read engine is an AI-based text recognition software that uses a convolutional neural network (CNN) to recognize text. Compared to traditional OCR engines, it yields superior results, especially for handwritten text and poor quality images. Furthermore, Grooper supplements Azure's results with those from a traditional OCR engine in areas where traditional OCR is better than the Read engine.
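
For reference, the underlying Read API is an asynchronous REST service: an image is submitted, then results are polled from the URL returned in the response headers. The sketch below shows that flow outside of Grooper; the endpoint, key, and file name are placeholders, and Grooper makes these calls internally once the OCR Profile is configured.

  import time
  import requests

  endpoint = "https://<your-resource>.cognitiveservices.azure.com"   # placeholder
  key = "<subscription-key>"                                         # placeholder

  # Step 1: submit the image for analysis.
  with open("page.png", "rb") as f:
      resp = requests.post(
          f"{endpoint}/vision/v3.2/read/analyze",
          headers={"Ocp-Apim-Subscription-Key": key,
                   "Content-Type": "application/octet-stream"},
          data=f.read(),
      )
  resp.raise_for_status()

  # Step 2: poll the Operation-Location URL until the analysis finishes.
  result_url = resp.headers["Operation-Location"]
  while True:
      result = requests.get(result_url, headers={"Ocp-Apim-Subscription-Key": key}).json()
      if result.get("status") in ("succeeded", "failed"):
          break
      time.sleep(1)

  # Step 3: read the recognized text lines from the result.
  for page in result["analyzeResult"]["readResults"]:
      for line in page["lines"]:
          print(line["text"])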

Repository Option

Repository Options are optional features that affect the entire repository. These features enable functionality that does not work without first establishing the connections these options provide. Repository Options are added to a Grooper Repository and configured using the database Root node's Options property.

LLM Connector

LLM Connector is a Repository Option that enables large language model (LLM) powered AI features for a Grooper Repository.

AI Search

AI Search is a Repository Option that enables Grooper's document search and retrieval features in the Search page. Once enabled, Indexing Behaviors can be added to Content Types (such as stacks Content Models), which will allow users to submit documents to a search index. Once indexed, documents can be retrieved by full text and metadata searches in the Search Page.

Separation Provider

The Provider property of the Separate Activity defines the type of separation to be performed at the designated Scope.

Change in Value Separation

The Change in Value Separation Separation Provider creates a new folder every time an extracted value changes from one contract Batch Page to the next.
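
The grouping logic amounts to "start a new folder whenever the extracted value changes from the previous page." A minimal sketch with hypothetical page values:

  from itertools import groupby

  # Hypothetical extracted values (e.g., an account number) per Batch Page, in order.
  page_values = ["1001", "1001", "1002", "1002", "1002", "1003"]

  # Consecutive pages with the same value stay together; each change starts a new folder.
  folders = [list(pages) for _, pages in groupby(page_values)]
  print(folders)   # [['1001', '1001'], ['1002', '1002', '1002'], ['1003']]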

Control Sheet Separation

Control Sheet Separation is a Separation Provider that uses Grooper document_scanner Control Sheets to separate documents.

EPI Separation

The EPI Separation Separation Provider uses embedded page information ("EPI") to Separate loose pages into document folders. A Data Extractor is used to find page numbers from the text on a page and Grooper uses this information to separate the pages.

ESP Auto Separation

ESP Auto Separation is a Separation Provider used for document separation. It is unique in that it both separates and classifies documents at the same time. It uses page-level classification training examples (among other things) to determine where to insert document folders in a inventory_2 Batch.

Event-Based Separation

Event-Based Separation is a Separation Provider that Separates documents using one or more "Separation Events". Each Separation Event triggers the creation of a new folder.

Multi Separator

The Multi Separator Separation Provider performs separation using multiple Separation Providers. It allows users to create a list of any of the other Separation Providers. If the first provider on the list fails to separate a page (or, as more often is the case, a series of pages), the next one will be applied. If that fails, the next, and so on.

Pattern-Based Separation

Pattern-Based Separation is a Separation Provider that creates a new document folder every time a value returned by a defined pattern is encountered on a page.

Undo Separation

Undo Separation is a Separation Provider. Instead of putting loose contract Batch Pages into folder Batch Folders, this Separation Provider removes Batch Folders, leaving only loose pages.

Service

Grooper.ServiceInstance

Grooper Services are various executable programs that run as a Windows Service to facilitate Grooper processing. Service instances are installed, configured, started and stopped using Grooper Command Console (or in older Grooper versions, Grooper Config).

Activity Processing

Grooper.Services.ActivityProcessing

Activity Processing is a Grooper Service that executes Activities assigned to edit_document Batch Process Steps in a settings Batch Process. This allows Grooper to automate Batch Steps that do not require a human operator.

API Services

Grooper.Services.ApiServices

You can perform inventory_2 Batch processing via REST API web calls by installing API Services.

  • As of version 2025, the Grooper Web Services (GWS) web app hosts additional API endpoints. Some of these endpoints overlap with the API Services endpoints. Refer to the GWS documentation for more information on its endpoint offerings. You can locate the GWS documentation for your Grooper install at https://{webserver-name-or-domain-name}/GWS

Grooper Licensing

Grooper.Services.LicenseService

Grooper Licensing is a Grooper Service that distributes licenses to multiple workstations running Grooper applications.

Import Watcher

Grooper.Services.ImportWatcher

An Import Watcher is a Grooper Service that schedules and runs Import Jobs. It uses an Import Provider to query files in a file system or content management system that meet specified criteria according to a defined schedule (every minute, every day, only on Sundays, etc.). These files are imported into Grooper as documents (folder Batch Folders) in a new inventory_2 Batch.

  • Afterward, the imported files can be (and should be) moved, deleted, or modified to prevent repeat imports in the next polling cycle.

Indexing Service

Grooper.GPT.IndexingService

An Indexing Service is a Grooper Service that periodically polls the Grooper database to automate AI Search indexing. It checks to see if any documents in a Grooper Repository are classified as a Document Type that inherits from a Content Type configured with an Indexing Behavior. If there are any that need to be added to, updated in, or deleted from the search index, the Indexing Service will submit an "Indexing Job" to be picked up by an Activity Processing service.

Extraction Related Types

These are configuration objects in Grooper that relate to extracting data from documents. These objects include specialized items such as "Table Extract Methods" which pertain only to configuring Data Table nodes. These also include more general items such as Value Extractors which are used by various extractor related properties on a variety of node types in Grooper.

These "extraction related types" are always found when configuring properties of:


This includes:

  • Scripting/Advanced user info: These objects inherit from a base class called "Embedded Object". This includes a large number of objects that exist as configurable properties.

Collation Provider

The Collation property of a pin Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

AND

AND is a Collation Provider option for pin Data Type extractors. AND returns results only when each of its referenced or child extractors gets at least one hit, thus acting as a logical “AND” operator across multiple extractors.

Array

Array is a Collation Provider option for pin Data Type extractors. Array matches a list of values arranged in horizontal, vertical, or text-flow order, combining instances that qualify into a single result.

Combine

Combine is a Collation Provider option for pin Data Type extractors. Combine combines instances from returned results based on a specified grouping, controlling how extractor results are assembled together for output.

Key-Value List

Key-Value List is a Collation Provider option for pin Data Type extractors. Key-Value List matches instances where a key and a list of one or more values appear together on the document, adhering to a specific layout pattern.

Key-Value Pair

Key-Value Pair is a Collation Provider option for pin Data Type extractors. Key-Value Pair matches instances where a key is paired with a value on the document in a specific layout. Note: Key-Value Pair is an older technique in Grooper. In most cases, the Labeled Value extractor is preferable to Key-Value Pair collation.

Multi-Column

Multi-Column is a Collation Provider option for pin Data Type extractors. Multi-Column combines multiple columns on a page into a single column for extraction.

Ordered Array

Ordered Array is a Collation Provider option for pin Data Type extractors. Ordered Array finds sequences of values where one result is present for each extractor, in the order they appear, according to a specified horizontal, vertical or text-flow layout.

Pattern-Based

Pattern-Based is a Collation Provider option for pin Data Type extractors. Pattern-Based uses regular expressions to sequence returned results into a final result set.

Split

Split is a Collation Provider option for pin Data Type extractors. Split separates a data instance at each match returned by the Data Type. The results are used as anchor points to "split" text into one or more smaller parts.

Fill Method

Fill Methods provide various mechanisms for populating child Data Elements of a data_table Data Model, insert_page_break Data Section or table Data Table. Fill Methods can be added to these nodes using their "Fill Methods" property and editor.

  • Fill Methods are secondary extraction operations. They populate descendant Data Elements after normal extraction when the export_notes Extract activity runs.

AI Extract

Grooper.GPT.AIExtract

AI Extract is a Fill Method that leverages a Large Language Model (LLM) to return extraction results to Data Elements in a data_table Data Model or insert_page_break Data Section. This mechanism provides powerful AI-based data extraction with minimal setup.

Fill Descendants

Grooper.GPT.FillDescendants

Fill Descendants is a Fill Method that executes any Fill Methods on child Data Elements in parallel. This has been shown to dramatically increase efficiency on larger data_table Data Models with multiple insert_page_break Data Sections using AI Extract.

Run Child Extractors

Grooper.Core.RunChildExtractors

Run Child Extractors is a Fill Method that executes extraction for a subset of child Data Elements. This allows you to selectively run extraction logic for one or more Data Elements in a data_table Data Model, insert_page_break Data Section, or table Data Table.

Section Extract Method

The Extract Method property of a insert_page_break Data Section defines a "Section Extract Method" which specifies how section instances will be identified and extracted.

Clause Detection

Clause Detection is a insert_page_break Data Section Extract Method. It leverages LLM text embedding models to compare supplied samples of text against the text of a document to return what the AI determines is the "chunk" of text that most closely resembles the supplied samples.

Nested Table

Nested Table is a insert_page_break Data Section Extract Method. This method divides a document into sections by extracting table data within those sections. This gives Grooper users a method for extracting hierarchical tables as well as dividing up a document into sections where each of those sections has the same table (or at least tabular data which can be extracted by a single table Data Table object).

Transaction Detection

Transaction Detection is a insert_page_break Data Section Extract Method. This extraction method produces section instances by detecting repeating patterns of text around the Data Section's child variables Data Fields.

Lookup Specification

A Lookup Specification defines a "lookup operation", where existing Grooper fields (called "lookup fields") are used to query an external data source, such as a database. The results of the lookup can be used to validate or populate field values (called "target fields") in Grooper. Lookup Specifications are created on "container elements" (data_table Data Models, insert_page_break Data Sections and table Data Tables) using their Lookups property. Lookups may query using all single-instance fields relative to the container element (including those defined on parent elements up to the root Data Model), but cannot be used to populate a field value on a parent of the container element.
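
Conceptually, a lookup takes the values already extracted into the lookup fields, queries the external source, and writes the returned columns into the target fields. The following is a minimal sketch of that idea using Python's sqlite3 in place of a real Data Connection; the table, columns and field names are hypothetical.

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.execute("CREATE TABLE vendors (vendor_id TEXT, vendor_name TEXT, remit_address TEXT)")
  conn.execute("INSERT INTO vendors VALUES ('V-117', 'ACME Supply Co.', 'PO Box 450, Tulsa, OK')")

  # Lookup field value already extracted from the document.
  extracted = {"VendorID": "V-117"}

  # Query the external source using the lookup field, then populate target fields.
  row = conn.execute(
      "SELECT vendor_name, remit_address FROM vendors WHERE vendor_id = ?",
      (extracted["VendorID"],),
  ).fetchone()

  if row:
      extracted["VendorName"], extracted["RemitAddress"] = row
  print(extracted)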

CMIS Lookup

CMIS Lookup is a Lookup Specification that performs a lookup against a settings_system_daydream CMIS Repository via a "CMISQL query" (a specialized query language based on SQL database queries).

Database Lookup

Database Lookup is a Lookup Specification that performs a lookup against a database Data Connection via a SQL query.

GPT Lookup

PLEASE NOTE: GPT Lookup is obsolete as of version 2025. Much of its functionality was replaced by newer and better LLM-based extraction methods, such as AI Extract. If absolutely necessary, its functionality could also be replicated with a Web Service Lookup implementation. GPT Lookup is a Lookup Specification that performs a lookup using an OpenAI GPT model.

Lexicon Lookup

Lexicon Lookup is a Lookup Specification that performs a lookup against a dictionary Lexicon.

Web Service Lookup

Web Service Lookup is a Lookup Specification that looks up external data at an API endpoint by calling a web service.

XML Lookup

XML Lookup is a Lookup Specification that performs a lookup against an XML file stored as a draft Resource File in the package_2 Project. XML Lookups use XPath expressions to select XML nodes and map XML attributes or an XML element's text to Grooper fields.
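
For example, against a hypothetical XML structure, an XPath expression such as the following could select an invoice number element, whose text would then be mapped to a Grooper field:

 /Invoice/Header/InvoiceNumber

An expression like /Invoice/Header/@Currency could similarly select an attribute value.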

Table Extract Method

A Table Extract Method defines the settings and logic for a table Data Table to perform extraction. It is set by configuring the Extract Method property of the Data Table.

Delimited Extract

The Delimited Extract Table Extract Method extracts tabular data from a delimiter-separated text file, such as a CSV file.

Fluid Layout

The Fluid Layout Table Extract Method will choose between Tabular Layout and Flow Layout configurations, depending on how labels are collected for a description Document Type.

Grid Layout

The Grid Layout Table Extract Method uses the positional location of row and column headers to infer a tabular grid around each value in a table and extract values from each cell of the inferred grid.

Row Match

The Row Match Table Extract Method uses regular expression pattern matching to determine a table's structure based on the pattern of each row and extract cell data from each column.

Tabular Layout

The Tabular Layout Table Extract Method uses column header values determined by the view_column Data Columns' Header Extractor results (or labels collected for the Data Columns when a Labeling Behavior is enabled), as well as Data Column Value Extractor results, to model a table's structure and return its values.

Value Extractor

Grooper.Core.ValueExtractor

Value Extractors define an operation that reads data from the text (and sometimes visual) content of a page or document. There are over 20 unique Value Extractors, each using specialized logic to return results. Value Extractors are consumed by multiple higher-level objects in Grooper (such as Data Elements, Extractor Nodes, various Activities and more) to perform a diverse set of document processing duties.

  • Value Extractors return a list of one or more "data instances". Data instances contain both the value and its page location, which allows Grooper to highlight results in a Document Viewer.

Ask AI

Grooper.GPT.OpenAI.Chat.AskAI

Ask AI is a Value Extractor that executes a chat completion using a large language model (LLM), such as OpenAI's GPT models. It uses a document's text content and user-defined instructions (a question about the document) in the chat prompt. Ask AI then returns the response as the extractor's result. Ask AI is a powerful, LLM-based extraction method that can be used anywhere in Grooper a Value Extractor is referenced. It can complete a wide array of tasks in Grooper with simple text prompts.

Detect Signature

Grooper.Extract.DetectSignature

Detect Signature is a Value Extractor that can detect whether a handwritten signature is present on a document. It detects signatures within a specified rectangular region on a document page by measuring the "fill percentage" (the percentage of pixels that are filled in the region).
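
In rough terms (a simplified sketch of the idea, not Grooper's exact calculation):

 fill percentage = (filled pixels in the zone / total pixels in the zone) × 100

A signature would then be reported when this value exceeds a configured threshold.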

Field Match

Grooper.Extract.FieldMatch

Field Match is a Value Extractor that matches the value stored in a previously-extracted variables Data Field or view_column Data Column.

Find Barcode

Grooper.Extract.FindBarcode

Find Barcode is a Value Extractor that searches for and returns barcode values previously stored in a folder Batch Folder or contract Batch Page's layout data.

  • Note: Find Barcode differs slightly from Read Barcode. Read Barcode performs barcode recognition when the extractor executes. Find Barcode can only look up barcode data stored in the document or page's layout data. Find Barcode runs quicker than Read Barcode, but barcode values must have previously been collected in the Batch Process by the Image Processing or Recognize activities.

GPT Complete

Removed in version 2025

GPT Complete is a Value Extractor that leverages Open AI's GPT models to generate chat completions for inputs, returning one hit for each result choice provided by the model's response.

PLEASE NOTE: GPT Complete is a deprecated Value Extractor. It uses an outdated method to call the OpenAI API. Please use the Ask AI extractor going forward.

Highlight Zone

Grooper.Extract.HighlightZone

Highlight Zone is a Value Extractor that sets a highlight region on a document without performing any actual data extraction. This "extractor" is used to mark areas of interest or importance for Review users or for uncommon scenarios where a data instance location is needed with no actual value.

Label Match

Grooper.Extract.LabelMatch

Label Match is a Value Extractor that matches a list of one or more values using matching options defined by a Labeling Behavior. It is similar to List Match but uses shared settings defined in a Labeling Behavior for Fuzzy Matching, Vertical Wrap, and Constrained Wrap.

Labeled OMR

Grooper.Extract.LabeledOMR

Labeled OMR is a Value Extractor used to output OMR checkbox labels. It determines whether labeled checkboxes are checked or not. If checked, it outputs the label(s) or a Boolean true/false value as the result.

Labeled Value

Grooper.Extract.LabeledValue

Labeled Value is a Value Extractor that identifies and extracts a value next to a label. This is one of the most commonly used extractors to extract data from structured documents (such as a standardized form) and static values on semi-structured documents (such as the header details on an invoice).

List Match

Grooper.Extract.ListMatch

List Match is a Value Extractor designed to return values matching one or more items in a defined list. By default, the List Match extractor does not use or require regular expressions, but it can be configured to utilize regular expression syntax.

Ordered OMR

Grooper.Extract.OrderedOMR

Ordered OMR is a Value Extractor used to return OMR check box information. Ordered OMR returns information for multiple check boxes within a defined zone based on their order and layout. The zone may be optionally fixed on the page or anchored to a static text value (such as a label).

Pattern Match

Grooper.Extract.PatternMatch

Pattern Match is a Value Extractor that extracts values from a document matching a specified regular expression, making it well suited for collecting data that follows a known format or pattern.
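
For example, a pattern such as the following (purely illustrative) would match US-style dates like 04/22/2024:

 \d{2}/\d{2}/\d{4}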

Query HTML

Grooper.Messaging.QueryHTML

Query HTML is a Value Extractor specialized for HTML documents. It uses either CSS or XPath selectors to return the inner text or an attribute of an HTML element.
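
For example, against a hypothetical HTML document, selectors such as the following could be used to return the inner text of a matching element:

 CSS:   table.line-items td.amount
 XPath: //span[@id='invoice-total']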

Read Barcode

Grooper.Extract.ReadBarcode

Read Barcode is a Value Extractor that uses barcode recognition technology to read and extract values from barcodes found in the document content.

  • Note: Read Barcode differs slightly from Find Barcode. Read Barcode performs barcode recognition when the extractor executes. Find Barcode can only look up barcode data stored in the document or page's layout data. Find Barcode runs quicker than Read Barcode, but barcode values must have previously been collected in the Batch Process by the Image Processing or Recognize activities.

Read Metadata

Grooper.Extract.ReadMetaData

Read Metadata is a Value Extractor that retrieves metadata values associated with a document. Read Metadata can return metadata from a folder Batch Folder's attachment file based on its MIME type, such as PDF, Word and Mail Message ('message/rfc822' or 'application/vnd.ms-outlook'). It can also return data using a Document Link in Grooper, such as a File System Link or a CMIS Document Link.

Read Zone

Grooper.Extract.ReadZone

Read Zone is a Value Extractor that allows you to extract text data in a rectangular region (called an "extraction zone" or just "zone") on a document. This can be a fixed zone, extracting text from the same location on a document, or a zone relative to a text value (such as a label) or a shape location on the document.

Reference

Grooper.Extract.ReferenceExtractor

Reference is a Value Extractor used to reference an Extractor Node. This allows users to create re-usable extractors and use the more complex pin Data Type and input Field Class extractors throughout Grooper.

Word Match

Grooper.Extract.WordMatch

Word Match is a Value Extractor that extracts individual words or phrases from documents. It is used for n-gram extraction. Each gram may be optionally executed against a dictionary Lexicon to ensure words and phrases only match a set vocabulary.

Zonal OMR

Grooper.Extract.ZonalOMR

Zonal OMR is a Value Extractor that reads one or more OMR checkboxes using manually-configured zones. The zone may be optionally fixed on the page or anchored to a static text value (such as a label).

BE AWARE: Zonal OMR is outdated compared to Labeled OMR and Ordered OMR. It requires the most manual setup of any OMR extractor to configure. Use this as a last resort when other OMR extractor options have been exhausted.

Import and Export Related Types

These are configuration objects in Grooper that relate to importing documents into Grooper, exporting processed content (files and data) out of Grooper, and otherwise accessing document content linked in Grooper to external file systems and content management systems.

This includes CMIS Bindings, Content Links, Export Definitions, and Import Providers, each described below.

Please Note: Import Behavior and Export Behavior are obviously import and export related. Because their parent type is "Behavior", they are found in the Core Configuration Types portion of this Glossary.

  • Scripting/Advanced user info: These objects inherit from a base class called "Embedded Object". This includes a large number of objects that exist as configurable properties.

CMIS Binding

CMIS Bindings are the platform connection types for cloud CMIS Connections. The CMIS Binding establishes the communication protocols used to connect Grooper with content management systems (CMS) and file systems.

CMIS Bindings use the CMIS standard as a model to define connectivity. Even when connecting to CMS platforms that are not truly CMIS systems (such as a Windows file system), Grooper normalizes connection to them as if they were. This allows Grooper to use CMIS Import and CMIS Export for all storage platforms.

  • You will commonly hear CMIS Binding referred to as a "CMIS connection type", "connection type", or just "connection", as in an "Exchange connection".

AppXtender

AppXtender is a connection option for cloud CMIS Connections. It allows Grooper to connect to the AppEnhancer (formerly ApplicationXtender) content management system for import and export operations.

Box

Box is a connection option for cloud CMIS Connections. It connects Grooper to the Box content management system for import and export operations.

CMIS

CMIS is a connection option for cloud CMIS Connections. It connects Grooper to a CMIS 1.0 or CMIS 1.1 server for import and export operations. This can be used to connect to CMS platforms that implement the CMIS protocol.

Exchange

Exchange is a connection option for cloud CMIS Connections. It connects Grooper to Microsoft Exchange email servers (including Outlook servers) for import and export operations.

FTP

FTP is a connection option for cloud CMIS Connections. It connects Grooper to FTP directories for import and export operations.

IMAP

IMAP is a connection option for cloud CMIS Connections. It connects Grooper to email messages and folders through an IMAP email server for import and export operations.

NTFS

NTFS is a connection option for cloud CMIS Connections. It connects Grooper to files and folders in the Microsoft Windows NTFS file system for import and export operations.

OneDrive

OneDrive is a connection option for cloud CMIS Connections. It connects Grooper to Microsoft OneDrive cloud services for import and export operations.

SFTP

SFTP is a connection option for cloud CMIS Connections. It connects Grooper to SFTP directories for import and export operations.

SharePoint

SharePoint is a connection option for cloud CMIS Connections. It connects Grooper to Microsoft SharePoint, providing access to content stored in "document libraries" and "picture libraries" for import and export operations.

Content Link

Grooper.Core.ContentLink

Content Links define references to files or folders stored outside of Grooper, such as in a Windows folder or in a CMIS Repository.

  • Content Link has two sub-types: Document Link and Folder Link. There are 9 types of "Document Link" and only 1 type of "Folder Link". Due to this, Document Link is a more common term than "Content Link".

Document Links

Grooper.Core.DocumentLink

CMIS Document Link

Grooper.CMIS.CmisLink

File System Link

Grooper.Core.FileSystemLink

FTP Link

Grooper.Messaging.FtpLink

HTTP Link

Grooper.Messaging.HTTPLink

Mail Link

Grooper.Messaging.MailLink

PST Link

Grooper.Office.PstLink

SFTP Link

Grooper.Messaging.SftpLink

Subfile Link

Grooper.Core.SubfileLink

ZIP Link

Grooper.Messaging.FtpLink

Folder Links

Grooper.Core.FolderLink

CMIS Folder Link

Grooper.CMIS.CmisFolderLink

Export Definition

Export Behaviors are defined by adding and configuring one or more Export Definitions (See Export Definition Types or the Export Definitions section of the Export article). An Export Definition defines export parameters to external systems, such as file systems, content management repositories, databases, or mail servers.

CMIS Export

CMIS Export is an Export Definition available when configuring an Export Behavior. It exports content over a cloud CMIS Connection, allowing users to export documents and their metadata to various on-premise and cloud-based storage platforms.

Data Export

Data Export is an Export Definition available when configuring an Export Behavior. It exports extracted document data over a database Data Connection, allowing users to export data to a Microsoft SQL Server or ODBC compliant database.

Import Provider

Grooper.Core.ImportProvider

Import Providers enable Grooper to import file-based content from numerous sources, including Windows file systems, SFTP file systems, mail servers and various content management systems (CMS). An Import Provider is selected and configured when configuring "Import Jobs". Import Jobs are submitted in one of two ways:

  • By a user from the Imports page: Ad-hoc or "user directed" Import Jobs are submitted from the Imports Page, using the "Submit Import Job" button.
  • From an Import Watcher service: Automated or "scheduled" Import Jobs are submitted by an Import Watcher service according to its Polling Loop or Specific Times specification.

In both cases, an Import Provider is selected and configured using the "Provider" property.

CMIS Import

Grooper.CMIS.CmisImportBase

CMIS Import refers to two Import Providers used to import content from settings_system_daydream CMIS Repositories: Import Descendants and Import Query Results. CMIS Imports allow users to import from various on-premise and cloud-based storage platforms (including Windows folders, Outlook inboxes, Box accounts, AppEnhancer applications and more).

Import Descendants

Grooper.CMIS.ImportDescendants

Import Descendants is one of two Import Providers that use cloud CMIS Connections to import document content into Grooper. Import Descendants imports files from a settings_system_daydream CMIS Repository folder location, including any files in any sub-folders (i.e. all "descendant" files).

Import Query Results

Grooper.CMIS.ImportQueryResults

Import Query Results is one of two Import Providers that use cloud CMIS Connections to import document content into Grooper. Import Query Results imports files from a settings_system_daydream CMIS Repository that match a "CMISQL query" (a specialized query language based on SQL database queries).

File System Import

Grooper.Core.FileSystemImport

File System Import refers to a Legacy Import Provider used to import documents directly from your Windows File System into Grooper.

HTTP Import

Grooper.Messaging.HTTPImport

HTTP Import is an Import Provider used to import web-based content (web pages and files hosted on an HTTP server). HTTP Import can be used to ingest individual web pages, defined portions of a website or entire websites into Grooper.

Test Batch

Grooper.Core.TestBatchImport

"Test Batch" is a specialized Import Provider designed to facilitate the import of content from an existing inventory_2 Batch in the test environment. This provider is most commonly used for testing, development, and validation scenarios, and is not intended for production use.

  • Looking for information on "production" vs "test" Batches in Grooper? See here.

Misc Properties and Other Configuration Types

AI Generator/Generators

AI Generators create custom documents using the results of a Search Page query and a large language model (LLM). Both document content and instructions are fed to the LLM to produce a text-based file.

  • AI Generators are added and configured using an Indexing Behavior's "Generators" property and editor. They are executed from the Search Page using the "Download" command and "Download Custom" format.

CMISQL Query/CMIS Query

Grooper.CMIS.CmisQuery

A CMISQL Query (aka CMIS Query) is Grooper's way of searching for documents in CMIS Repositories. Commonly, CMISQL Queries are used by Import Query Results to import documents from a CMIS Repository. CMISQL Queries are also used by CMIS Lookup to lookup data from a CMIS Repository. CMISQL Queries are based on a subset of the SQL-92 syntax for querying databases, with some specialized extensions added to support querying CMIS sources.

  • CMISQL Queries are configured using the "CMIS Query" property found in "Import Query Results" and "CMIS Lookup".
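
A purely illustrative CMISQL query, selecting documents by standard CMIS properties:

 SELECT cmis:objectId, cmis:name
 FROM cmis:document
 WHERE cmis:creationDate > TIMESTAMP '2024-01-01T00:00:00.000Z'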

Paragraph Marker/Paragraph Marking

Grooper.Core.ParagraphMarker

Paragraph Marking is a component of Grooper's Text Preprocessor. It enables the "Paragraph Marker", which detects paragraph boundaries and marks them by altering the normal carriage return and line feed pairs at the end of each line. Instead of placing line breaks at the end of each line, the Paragraph Marker places them at the end of each paragraph. This produces a normalized text flow, making it easier to extract values that span lines.

  • "Paragraph Marker" is the embedded object that actually performs paragraph detection and marking in Grooper. "Paragraph Marking" is the property that enables the Paragraph Marker and allows users to configure it.

Preprocessing/Text Preprocessor

Grooper.Core.TextPreprocessor

Grooper's "Text Preprocessor" adjusts how raw text is formatted before extraction. It manipulates control characters (such as CR/LF pairs) to allow regular expression patterns to match (or ignore) structural elements, such as line breaks, paragraph boundaries and tab markers. The Text Preprocessor executes the following:

Permission Set/Permission Sets

Grooper.PermissionSet

Permission Sets define security permissions in a Grooper Repository for a user or group. This allows you to restrict user access to specified Grooper pages (such as the Design Page) and Grooper Commands.

  • "Permission Set" is the embedded object that defines security principles. They are added to a Grooper Repository and configured using the "Permission Sets" property found on the database Root node.

Quoting Method/Document Quoting

Grooper.GPT.QuotingMethod

Quoting Methods provide various mechanisms to feed "quotes" from a document to an AI model for Grooper's LLM-based features. Quoting Methods control what text is fed to the AI, allowing users to supply only the context needed to respond and to reduce costs by limiting the number of input tokens sent to the LLM service. Depending on which Quoting Method is selected and configured, the quote may be the entire document text, a portion of a document's text, data extracted from the document, layout data, or a combination of this data.

  • "Quoting Method" is class of embedded objects that feed quotes to an LLM. Quoting Methods are selected and configured by various items (including AI Extract) using a "Document Quoting" property.

Variable Definition

Grooper.Core.VariableDefinition

Variable Definitions define a variable with a computed value that can be called by various code expressions. Variable Definitions are added to Data Models, Data Sections and Data Tables using their "Variables" property.

Used By: Data Model, Data Section, Data Table

Vertical Wrap Detection/Vertical Wrap

Vertical Wrap Detection enables simplified extraction of multi-line text segments that are stacked vertically within a document. Vertical Wrap Detection can be used by Content Types configured with a Labeling Behavior and by the List Match and Label Match Value Extractors.

  • "Vertical Wrap Detection" is the embedded object that actually performs wrap detection in Grooper. Vertical Wrap Detection is enabled and configured with the "Vertical Wrap" property found in configuration items that support it.

Properties

A property is a configurable setting on a Grooper object that affects how the object performs its function.

Alignment

"Alignment" refers to how Grooper highlights text from an AI response on a document in a Document Viewer. Alignment properties can be configured to alter how Grooper highlights results when using LLM-based extraction methods, such as AI Extract.

Confidence Multiplier and Output Confidence

Some results carry more weight than others. The Confidence Multiplier and Output Confidence properties allow you to manually adjust an extraction result's confidence.

Constrained Wrap

The Constrained Wrap property allows certain Value Extractors and the Labeling Behavior to match values which wrap from one line to the next inside a box (such as a table cell).

Content Type Filter

The Content Type Filter property restricts Activities to specific collections_bookmark Content Categories and/or description Document Types.

Import Mode

Import Mode is a configurable property for CMIS Import providers. This controls how file content is loaded into a Grooper Repository during an Import Job. This property is key to setting up a "Sparse" import in Grooper.

Output Extractor Key

The Output Extractor Key property is a powerful Grooper classification technique. It allows pin Data Types to return results normalized in a way that is more beneficial to document classification.

Parameters

Parameters is a collection of properties used in the configuration of LLM constructs. Temperature, TopP, Presence Penalty, and Frequency Penalty are parameters that influence text generation in models. Temperature and TopP control the diversity and probability distribution of generated text, while Presence Penalty and Frequency Penalty help manage repetition by discouraging the reuse of words or phrases.

Scope

The Scope property of a edit_document Batch Process Step, as it relates to an Activity, determines at which level in a inventory_2 Batch hierarchy the Activity runs.

Secondary Types

Secondary Types allow the application of multiple Content Types to a single folder Batch Folder.

Tab Marking

Tab Marking allows you to insert tab characters into a document's text data.

Misc Features and Functionality

CSS Data Viewer Styling

CSS Data Viewer Styling refers to using CSS to custom style the Review activity's Data Viewer interface. This gives you a great deal of control over a data_table Data Model's appearance and layout during document review.

EDI Integration

EDI Integration refers to Grooper's ability to process EDI files.

Fine-Tuning for AI Extract

Fine-tuning is the process of further training a large language model (LLM) on a specific dataset to make it more specialized for a particular task or domain. This allows the model to adapt its general language understanding to better handle the unique vocabulary, style, and structure of the domain it's fine-tuned on.
In Grooper, you can easily start fine-tuning a model based on a data_table Data Model that will facilitate better extraction when using AI Extract.

Footer Rows and Footer Modes

A "Footer Row" is a row at the bottom of a table Data Table that displays sum totals for numerical view_column Data Columns. This can help Data Viewer users validate data Grooper extracts for one or more Data Columns. The Data Column's "Footer Mode" controls if a sum calculation is performed or not (and if Tabular Layout's "Capture Footer Row" creates the Footer Row if and how document data is used to capture and validate the footer value).

Label Sets

Label Sets are collections of label definitions used in Grooper to identify and extract information from documents. A label set maps document text—such as field names, headers, or column titles—to corresponding Data Field, Data Section, or Data Table elements in the Data Model. Label sets are essential for automating extraction and classification, especially in environments where document layouts and terminology may vary.

URL Endpoints for Review

Three different URL endpoints can be used to open Review tasks in the Grooper Web Client, given certain information like the Grooper Repository ID, settings Batch Process name, inventory_2 Batch Id and more. This allows Grooper users to link directly to a Batch in Review with a URL.

XML Schema Integration

XML Schema Integration refers to Grooper's ability to use XML schemas to build Data Models, extract XML documents, and more.

UI Element

A UI Element is a portion of the Grooper interface that allows users to interact with or otherwise receive information about the application.

Data Inspector

The Grooper Data Inspector is a UI Element that can be found anywhere there is a Document Viewer showing extraction results. This UI Element allows a user to inspect the Data Instance hierarchies of an extracted result.

Design Page

GrooperReview.Pages.Design.DesignPage

The Design Page is the primary user interface for Grooper configuration. It is the central workplace for Grooper designers and administrators. From the Design page, users create, test and administer nodes in a Grooper Repository.

Document Viewer

The Grooper Document Viewer is the portal to your documents. It is the UI that allows you to see a folder Batch Folder's (or a contract Batch Page's) image, text content, and more.

Node Tree

The Node Tree is the hierarchical list of Grooper node objects found in the left panel in the Design Page. It is the basis for navigation and creation in the Design Page.

Overrides

Overrides is a tab that allows overriding of default properties set on a Data Element.

Search Page

The Search Page allows users to leverage AI Search indexes to query indexed documents. Both full text and metadata searches are supported, with feature rich querying and filtering capabilities. Users can interact with search results in several ways. They can view documents in the Document Viewer, review documents' extracted data, create new inventory_2 Batches from the result set, submit processing jobs, start a conversation with an psychology AI Assistant and more.

Scan Viewer

The Scan Viewer is a user interface that can be added to the user-attended person_search Review step in a settings Batch Process. It is used to scan documents into inventory_2 Batches from one or more scanning workstations.

Summary Tabs

stacks Content Models and collections_bookmark Content Categories have a Summary tab where you can view "Descendant Node Types", description Document Types, and Expressions.

Other

Concepts

There are many objects and properties a user can configure in Grooper; however, gaining an understanding of how, why, and when to use these objects and properties is powered by one's understanding of the underlying concepts that define what these objects and properties are doing and why.

Activity Processing

Activity Processing is the execution of a sequence of configured tasks performed within a settings Batch Process to transform raw data from documents into structured and actionable information. Tasks are defined by Grooper Activities, configured to perform document classification, extraction, or data enhancement.

CMIS+

CMIS+ is a conceptual term that refers to Grooper's connectivity architecture to external storage platforms. CMIS+ standardizes connections to a variety of content management systems based on the CMIS standard. This provides a standardized setup that allows Grooper to interoperate with both CMIS-compliant systems and non-CMIS systems. It further provides normalized access to document content and metadata for import (CMIS Import) and export (CMIS Export) operations.

CMIS

CMIS (Content Management Interoperability Services) is an open standard that allows different content management systems to "interoperate", sharing files, folders and their metadata as well as programmatic control of the platform over the internet.

Classification

Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a folder Batch Folder.

Code Expressions

Code Expressions (not to be confused with regular expressions) are snippets of VB.NET code that expand Grooper's core functionality.

Data Context

Data Context refers to contextual information used to extract data, such as a label that identifies the value you want to collect.

Data Extraction

Data Extraction involves identifying and capturing specific information from documents (represented by folder Batch Folders in Grooper). Extraction is performed by configurable Data Extractors, which transform unstructured or semi-structured data into a structured, usable format for processing and analysis.

Data Extractor

Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Data Instance

A Data Instance is an encapsulation of text data within a document returned by Grooper's extractors. Extraction results are organized as a hierarchy of data instances created by Grooper's extractors.

Expressions

Expressions (not to be confused with regular expressions) are snippets of VB.NET code that expand Grooper's core functionality.

Expressions Cookbook

The "Expressions Cookbook" is a reference list for commonly used Code Expressions in Grooper.

Field Mapping

Field Mapping refers to how logical connections are made between metadata content in Grooper and an external storage platform.

Five Phases of Grooper

The "Five Phases of Grooper" is a conceptual term that seeks to build understanding of how documents are processed through Grooper.

Flow Collation

"Flow Collation" refers to the text-flow based layout option used by various Collation Providers forpin Data Type extractors.

Fuzzy RegEx

Fuzzy RegEx is Grooper's use of fuzzy logic within Value Extractors that leverage regular expressions to match patterns. Fuzzy RegEx allows extractors to overcome defects in a document's OCR results to accurately return results. Fuzzy RegEx is turned on by enabling the Fuzzy Matching property.

GPT Integration

Grooper's GPT Integration refers to the usage of OpenAI's GPT models within Grooper to enhance the capabilities of data extractors, classification, and lookups.

Grooper Infrastructure

Grooper Infrastructure refers to the computing underpinnings of a Grooper Repository and the software that allows the Grooper platform to automate tasks and users to interface with it.

Grooper Repository

A Grooper Repository is the environment used to create, configure and execute objects in Grooper. It provides the framework to "do work" in Grooper. Fundamentally, a Grooper Repository is a connection to a database and file store location, which store the node configurations and their associated file content. The Grooper application interacts with the Grooper Repository to automate tasks and provide the Grooper user interface.

Image Processing

"Image processing", as a general term, refers to software techniques that manipulate and enhance images. Image processing removes imperfections and adjusts images to improve OCR accuracy. In Grooper, images are processed primarily by two Activities:

  • Image Processing - This Activity permanently adjusts the image. It is primarily used to compensate for defects produced by a document scanner (like border artifacts and skewed images). It does so by applying IP Commands in an perm_media IP Profile.
  • Recognize - This Activity performs OCR. When an library_books OCR Profile references an perm_media IP Profile, the image will be processed temporarily. A temporary image is handed to the OCR engine and discarded once characters are recognized.
  • Grooper also has "computer vision" capabilities that analyze and interpret images. These capabilities are also executed during Grooper's image processing. For example, Grooper's "Line Removal" command will locate lines on an image (computer vision), remove those artifacts to improve OCR results during Recognize (image processing) and store that data for later use in Grooper (computer vision).

LINQ to Grooper Objects

LINQ is a Microsoft .NET component that provides data querying capabilities to the .NET framework. In Grooper, you can use LINQ syntax in Code Expressions to "LINQ to Grooper Objects". This allows expressions to access information from collections of data, such as from multi-instance Data Sections or Data Tables.
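
A minimal VB.NET sketch of LINQ method syntax, using a plain list rather than actual Grooper objects (the values and names below are hypothetical and for illustration only):

 ' Requires Imports System.Linq and Imports System.Collections.Generic
 Dim amounts As New List(Of Decimal) From {10.5D, 20D, 3.25D}
 ' Sum only the values greater than 5
 Dim total As Decimal = amounts.Where(Function(a) a > 5D).Sum()

In a Code Expression, the same style of query can be applied to Grooper collections, such as the rows of a multi-instance Data Section or Data Table.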

Layout Data

Layout Data refers to visual information that certain Grooper IP Commands collect, such as lines, checkboxes, barcodes, and detected shapes. This data is stored in a "Grooper.Layout.json" file attached to contract Batch Pages. Layout data is used by certain extractors and other features that rely on the presence of that data to function.

Microfiche Processing

Microfiche Processing refers to Grooper's suite of specialized Activities and IP Commands that process microfiche documents.

Microsoft Office Integration

Grooper's Microsoft Office Integration allows the platform to easily convert Microsoft Word and Microsoft Excel files into formats that Grooper can read natively (PDF and CSV).

Mixed Classification

"Mixed Classification" refers to leveraging a Classify Method and "rules" defined on a description Document Type to overcome the shortcomings of an individual method.

OCR

OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized so it can be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.

OCR Synthesis

OCR Synthesis refers to a suite of OCR related functionality unique to Grooper. The OCR Synthesis suite will pre-process and re-process raw results from the OCR Engine and synthesize its results into a single, more accurate OCR result.

Object Nomenclature

The Grooper Wiki's Object Nomenclature defines how Grooper users categorize and refer to different types of Node Objects in a Grooper Repository. Knowing what objects can be added to the Grooper Node Tree and how they are related is a critical part of understanding Grooper itself.

PDF Page Types

PDF pages can be one of several PDF Page Types. "Page types" describe the kind of content in a PDF page. This informs Grooper how certain Activities should process the page. For example, "single image" pages are OCR'd by the Recognize activity, where "text only" pages have their native text extracted by Recognize.

Prompt Engineering

"Prompt Engineering" is the process of designing and refining prompts to interact more effectively with large language models (LLMs) like GPT-4. The goal is to guide the model to produce desired outputs by carefully crafting the input queries.

Regular Expression

Regular Expression (or regex) is a standard syntax designed to parse text strings, providing a way to find information in text. It is the primary method by which Grooper extracts and returns data from documents.
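
For example, a simple (illustrative) pattern for matching US currency amounts such as $1,234.56:

 \$\d{1,3}(,\d{3})*\.\d{2}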

Separation

Separation is the process of taking an unorganized inventory_2 Batch of loose contract Batch Pages and organizing them into documents represented by folder Batch Folders in Grooper. This is done so Grooper can later assign a description Document Type to each document folder in a process known as "classification".

TF-IDF

TF-IDF stands for term frequency-inverse document frequency. It is a statistical calculation intended to reflect how important a word is to a document within a document set (or "corpus"). It is how Grooper uses machine learning for training-based document classification (via the Lexical method) and data extraction (via the input Field Class extractor).

Table Extraction

"Table Extraction" refers to Grooper's ability to extract data from cells in tables on documents. This is accomplished by configuring the table Data Table and its child view_column Data Column elements in a data_table Data Model.

Thread

A Thread is the smallest unit of processing that can be performed within an operating system. In Grooper, threads are allocated for processing by Activity Processing services.

Training-Based Approaches to Document Classification

"Training-Based Approaches to Document Classification" refers to Grooper Classify Methods that classify folder Batch Folders using document examples for each description Document Type. The Classify activity then assigns unclassified Batch Folders a Document Type based on how similar it is to the Document Type's training data.

Training Batch

The Training Batch is a special inventory_2 Batch created when training document examples using the Lexical classification method. The Training Batch serves two purposes: (1) It is a Batch that holds all previously trained folder Batch Folders. Designers can go to this Batch to view these documents and copy and paste them into other Batches if needed. (2) Batch Folders in the Training Batch are used to re-train the Content Model's classification data when the Rebuild Training command is executed.

UNC Path

UNC Path is a conceptual term that refers to UNC (Universal Naming Convention), a standard used in Microsoft Windows for accessing shared network folders (for example, \\server\share\folder).

Waterfall Classification

Waterfall Classification is a classification technique in Grooper that prioritizes training similarity over classification "rules" set by a description Document Type's Positive Extractor. This can be helpful in scenarios where folder Batch Folders get misclassified and simply retraining won't help.

Disambiguation

Repository

A "repository" is a general term in computer science referring to where files and/or data is stored and managed. In Grooper, the term "repository" may refer to:

Base Types

Grooper Object

Grooper.GrooperObject

Connected Object

Grooper.ConnectedObject

Database Row

Grooper.DatabaseRow

Embedded Object

Grooper.EmbeddedObject

With the Lexical method, classification is performed using the Classify activity, using the trained examples and the Lexical property configuration on a Content Model.

About

Lexical classification can be enabled and configured on any Content Model object. To do so, set the Content Model's Classification Method property to Lexical.



What are you classifying? - Document Types

As mentioned before, Lexical classification is a training-based approach. Generally speaking, a training-based approach is one where examples of a document type are used to classify other documents as that type. Essentially, the whole point is to distinguish one type of document from another.

This may be obvious, but before you can give examples of what one type of document looks like, you have to give a name to that type of document you're wanting to classify. In Grooper, we do this by adding Document Type objects to a Content Model.

For example, imagine you have a collection of human resources documents. For each employee, you'll have a variety of different kinds of documents in their HR file, such as a federal W-4 form, their employment application, various documents pertaining to their health insurance enrollment, and more. In order to distinguish those documents from one another (in other words, classify them), you will need to add a Document Type for each kind of document.

Take the four kinds of documents seen here: a federal W-4, an employee data sheet, an FSA enrollment form, and a pension enrollment form.



If we want to classify a Batch of these documents and assign the W-4 documents a "W-4" classification and so on, we would need to create a Content Model and add one Document Type for each kind of document.

A Content Model is how we determine the taxonomy of our document set. Taxonomy is just a fancy word for a classification scheme. Zoological taxonomy organizes organisms into a classification scheme, from domain all the way down to species. We do much the same thing with documents and a Content Model.

The whole set of HR documents belongs to the top level in the hierarchy, the Content Model itself. Each individual kind of document is represented by a Document Type, which is the next level down in that hierarchy. Each one is distinct from the others, but still part of the Content Model's scope. Just like insects, spiders, and lobsters are distinct from each other but are all part of the "arthropod" group in zoological taxonomy.

How are documents classified? - Trained Examples

The Lexical method uses trained examples for each Document Type in order to classify Batches. During the Classify activity, unclassified documents are compared to trained examples of the Document Types in a Content Model. The document will be assigned the Document Type it is most similar to.

You can train documents using the "Classification Testing" tab of a Content Model (We will go into this more in depth in the How To section of this article).

You then train a document using the "Train Document" button. After you press this button, you select which Document Type corresponds to the document you're training.

So for this example, we've selected a W-4 form and chosen the corresponding "Federal W-4" Document Type.

This will create two new levels of hierarchy in your Content Model. Training a document creates a Form Type for that document as a child of the assigned Document Type. The Form Type will have its own Page Type children corresponding to each page of the trained document.

You will create multiple Form Types when you train multiple examples of documents of varying lengths: a 2-Page Form Type for documents two pages in length (with two Page Type child objects), a 1-Page Form Type for single-page documents (with a single Page Type object), a 10-Page Form Type for ten-page documents (with ten Page Type children), and so on.

What is being trained? - Text Features

When it comes time to compare unclassified documents to trained examples, specifically what is compared is the lexical content of the documents. In other words, words. Documents use language to convey information. Words and phrases are features of what makes one document distinct from another. Words used in the documents of one Document Type will share some meaningful similarities, which will be different from the language of another Document Type.

In order to find this lexical content, you first need to set a Text Feature Extractor. A Text Feature Extractor is set to extract text-based values from document samples to be used as identifiable features of the document.

Commonly, the extractor used here locates unigrams (single words), bigrams (two word phrases) or trigrams (three word phrases) as the features. However, a Text Feature Extractor is highly configurable, allowing you to use lexicons specific to your document set, exclude text from portions of a document from training, even use tokenized features of non-lexical text, and more.

This is the first thing you will do when configuring Lexical classification. If you're training the words in a document, you need to tell Grooper how to find those words first! After Lexical is chosen as the Classification Method of a Content Model, the Text Feature Extractor can be set in the Lexical sub-properties. This can be a Reference to a Data Type or an Internal regular expression pattern.

FYI Any Data Type can be a Text Feature Extractor. You can customize this extractor however best suits your document classification needs. However, there are a few pre-built feature extractors that ship with every Grooper install. You can find them in the Data Extraction folder, at the following folder path: Data Types > Downloads > Features.

How are features trained? - TF-IDF

TF-IDF stands for "Term Frequency-Inverse Document Frequency". It is a numerical statistic intended to reflect how important a word is to a document within a collection (or document set or "corpus"). This "importance" is assigned a numerical weighting value. The higher the word's weighting value, the more relevant it is to that document in the corpus (or, for our purposes, how similar it is to a Document Type).

Text features (extracted from the Text Feature Extractor) are given weightings according to the TF-IDF algorithm. Features are given a higher weighting the more they appear on a document (Term Frequency), mitigated by if that feature is common to multiple Document Types (Inverse Document Frequency). Some words appear more generally across documents and hence are less unique identifiers. So, their weighting is lessened.
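
As a point of reference, the standard TF-IDF weighting for a term t in a document d (Grooper's internal implementation may differ in its exact details) can be written as:

 tfidf(t, d) = tf(t, d) × log( N / df(t) )

where tf(t, d) is the number of times t appears in d, N is the number of trained documents, and df(t) is the number of those documents containing t.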

During a Classify activity, the features of an unclassified document are compared to the weighted features of the trained Document Types. The document is assigned the Document Type it is most similar to.

For a more in depth explanation of TF-IDF, visit the TF-IDF article.

Mixed Classification: Combining Training-Based and Rules-Based Approaches

Furthermore, a rules-based approach can be taken in combination with the training-based approach when using the Lexical Classification Method. This can be done by setting a Positive Extractor on a Document Type object of a Content Model. If the extractor yields a result, the document will be classified as that Document Type. Generally, this will "win out" over the training weightings, because the Positive Extractor's confidence result (as a percentage value) will be higher than the document's similarity to the trained examples (as a percentage value) for a Document Type.

This way, if you have a value that you know can be extracted from a Document Type's documents, you can take advantage of setting a Positive Extractor on the Document Type to classify them. For example, document titles are often used as "rules". If you can extract text to match a title to a corresponding Document Type, this is often a quick and easy way to classify a document. But if that extractor fails for whatever reason (because of bad OCR or a new title not matching the extractor's regex), you have training data which can act as a backup classification method.

Many of the best classification strategies involve combining the training-based Lexical method with a rules-based approach.

How To

Enable Lexical Classification

Train Documents

View Weighting Details