2023.1:Classify (Activity)

From Grooper Wiki
Revision as of 08:27, 28 August 2024 by Randallkinard (talk | contribs) (Randallkinard moved page Classify (Activity) to 2023.1:Classify (Activity))

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

20252023.1

unknown_document Classify is an Activity that "classifies" folder Batch Folders in a inventory_2 Batch by assigning them a description Document Type.

  • Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
  • Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.

All the logic and setup to take an unorganized document set and categorize documents within them as Document Types is created in the Content Model. The Classify activity uses that information to automatically classify documents in a Batch as Document Types in the model.

You may download the ZIP(s) below and upload it into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

How To

Classification can be performed on a document in two different ways:

  1. Through the Classify step on an automated Batch Process.
  2. Manually, by right-clicking the document and using the "Apply Document Type" command.

You can also classify documents through ESP Auto Separation, but that is more of a seperation method performing both separation and classification upon the separation step.

Let's go over each method:

Automated Classification

To see how classification works during the activity, let's take a look at classification on the Activity Tester. This is basically what's happening when Grooper runs classification during the Batch Process.

  1. Naturally, you will need to have a Content Model with the appropriate Document Type created.
  2. Once that's been done, create a Batch Process and set up the Classify step.

FYI

  • Pay attention to the Folder Level on which classification is taking place. Grooper will default classification to Folder Level 1, which is the document level, just beneath the Batch itself (the Batch being Level 0).
  • Since we want to classify the folders as Document Types, we running classification at Level 2.


  1. Once the classification step has been properly configured, switch over to the Activity Tester tab.
  2. Press the Play button.
  3. And now all of the documents have been classified.


This is the function that Grooper will perform automatically when the Batch Process is run. It is important to test your Classify step first to make sure nothing goes awry during actual classification.

Manual Classification

  1. Select the desired Batch.
  2. Go to the Viewer tab.
    • You should already have your documents separated into individual folders at this point. If not, you will need to do so before classifying.
  3. Right-click the Document Folder and select the "Assign Document Type" command.
  4. Select a Document Type from the drop-down menu in the Assign Document Type window that populates.
    • You will need to have a Document Type created before you attempt classification, be it manual or automatic through a Batch Process.

For More Information

For more information on Classification as a whole, please see the following articles:

Glossary

Activity: Grooper Activities define specific document processing operations done to a inventory_2 Batch, folder Batch Folder, or contract Batch Page. In a settings Batch Process, each edit_document Batch Process Step executes a single Activity (determined by the step's "Activity" property).

  • Batch Process Steps are frequently referred by the name of their configured Activity followed by the word "step". For example: "Classify step".

Batch Folder: The folder Batch Folder is an organizational unit within a inventory_2 Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or contract Batch Page nodes as children.

  • Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Batch Process: settings Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a inventory_2 Batch. Each step is comprised of a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute edit_document Batch Process Steps which are added as children nodes.
  • A Batch Process is often referred to as simply a "process".

Batch: inventory_2 Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called settings Batch Processes. Documents and their pages are represented in Batches by a hierarchy of folder Batch Folders and contract Batch Pages.

Classification Method:

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a folder Batch Folder.

Classify: unknown_document Classify is an Activity that "classifies" folder Batch Folders in a inventory_2 Batch by assigning them a description Document Type.

  • Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
  • Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.

Content Model: stacks Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the collections_bookmark Content Categories and description Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Content Type: Content Types are a class of node types used used to classify folder Batch Folders. They represent categories of documents (stacks Content Models and collections_bookmark Content Categories) or distinct types of documents (description Document Types). Content Types serve an important role in defining Data Elements and Behaviors that apply to a document.

Document Type: description Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a stacks Content Model or a collections_bookmark Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the folder Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's data_table Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

ESP Auto Separation: ESP Auto Separation is a Separation Provider used for document separation. It is unique in that it both separates and classifies documents at the same time. It uses page-level classification training examples (among other things) to determine where to insert document folders in a inventory_2 Batch.

Labeling Behavior: A Labeling Behavior extends "label set" functionality to description Document Types. This allows you to collect field labels and other labels present on a document and use them in a variety of ways. This includes functionality for classification, field extraction, table extraction, and section extraction.

Lexical: "Lexical" is a Classify Method that classifies folder Batch Folders based on the text content of trained document examples. This is achieved through the statistical analysis of word frequencies that identify description Document Types.

Project: package_2 Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as stacks Content Models, settings Batch Processes, profile objects are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Rules-Based: "Rules-Based" is a Classify Method that employs "rules" defined on each description Document Type to classify folder Batch Folders. Positive Extractor and Negative Extractor properties are configured for each Document Type to positively or negatively associate a Batch Folder based on predefined criteria.

  • Where the Positive and Negative Extractors will impact all Classify Method results, the Rules-Based method classifies using only these properties and nothing else.

Separation: Separation is the process of taking an unorganized inventory_2 Batch of loose contract Batch Pages and organizing them into documents represented by folder Batch Folders in Grooper. This is done so Grooper can later assign a description Document Type to each document folder in a process known as "classification".

Visual: "Visual" is a Classify Method that uses image analysis instead of text data to determine the description Document Type assigned to a folder Batch Folder during classification. Instead of using text-based extractors, an "Extract Features" IP Command in an perm_media IP Profile is used to collect image-based data from a Batch Folder's image(s). This image-based data is compared against that of previously trained document examples of each Document Type to classify the Batch Folder.