2023.1:Classification (Concept)

Revision as of 08:28, 28 August 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

About

As far as Grooper is concerned, a document is a Batch Folder object with Batch Page objects as its children. Before classification, the document (Batch Folder) is unclassified, or "blank". Grooper doesn't know what kind of document it is yet. For example, you can have both an invoice and a purchase order within your Batch, and Grooper won't be able to tell the two apart, never mind know their classification, unless you perform Classification. Documents are classified by:

  1. Most often, the Classify activity using training data or rules set on a Content Model
    • A Classify step will automate document classification in a Batch Process. During the Classify activity, Grooper will use information from the document and its pages and configurations from a Content Model (such as the Classification Method used) to assign the document a Document Type from a Content Model.
  2. In some cases, the Separate activity by assigning a Document Type to each new folder created
    • For example, the ESP Auto Separation Separation Provider is a classification-based method of separation. It will both separate pages into document folders and classify the documents during the Separate activity.
  3. Manually assigning a Document Type by right-clicking a Batch Folder and using the "Apply Document Type" command.

Classification and Data Extraction

Why is Classification so important? What does it have to do with data and the Batch Process? Classification is performed before data extraction and is a critical prerequisite for it. Data extraction executes using configured Data Elements in a Data Model. A Data Model is part of a Content Model's hierarchy. Therefore, a document must be assigned a Content Type (specifically a Document Type of a Content Model) in order for the Extract activity to see the Data Model's specifications for data extraction.

Until a document is classified, it has no Content Type assigned to it. Grooper doesn't know which Content Model and corresponding Document Types and Data Models you're using to extract data. Without this information, Grooper will not understand which Data Elements to look for on which Document Types. Nor will it know the extractors used to return values to the Data Elements in a Data Model.

In other words, the document must be classified (having a Document Type assigned to it) before performing the Extract activity.
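The dependency described above can be sketched with a toy model. This is only an illustration, not Grooper's object model or API: the class names mirror Grooper terms, but `BatchFolder`, `DocumentType`, `extract_targets`, and the example field names are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DocumentType:
    """Hypothetical stand-in: a Document Type and the Data Elements of its Data Model."""
    name: str
    data_elements: List[str] = field(default_factory=list)


@dataclass
class BatchFolder:
    """Hypothetical stand-in for a document; document_type is None until classified."""
    name: str
    document_type: Optional[DocumentType] = None


def extract_targets(folder: BatchFolder) -> List[str]:
    """Return the Data Elements an extract step would look for on this document."""
    if folder.document_type is None:
        # No Document Type means no Data Model, so there is nothing to extract against.
        raise ValueError(f"{folder.name} is unclassified; classify it before extracting")
    return list(folder.document_type.data_elements)


# Assumed example data, for illustration only.
invoice_type = DocumentType("Invoice", ["Invoice Number", "Invoice Date"])
```

Calling `extract_targets` on a freshly created `BatchFolder` raises an error; once `document_type` is set to `invoice_type`, the same call returns that type's Data Elements.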

Classification Methods

A document can be classified in a variety of ways: by training examples of a Document Type and matching similarity to the training data, by creating extractor-based rules using keywords, phrases, or other text data, or by a combination of the two. The method used is determined by the Classification Method property of a Content Model. There are four Classification Methods available in Grooper:

  1. Lexical
  2. Rules-Based
  3. Visual
  4. Labelset-Based

In our documentation you may read about "rules based" or "training based" classification approaches.

  • A "rules based" approach refers not only to the Rules-Based method but to using Positive and Negative Extractors in general to set up "classification rules".
  • A "training based" approach refers to using either the Lexical or Visual methods to classify documents using trained document samples.
  • A "mixed classification" approach would use both training and rules together to classify documents.

Each of the four methods is described below. For further details, please click the links above to their respective articles, which discuss each method at length.

Lexical

Lexical Classification is a particular Classification Method that relies upon a document's text. Naturally, OCR must be run beforehand, so that Grooper can read the text.

  1. To choose this method, simply go to your Content Model.
  2. Expand the Classification Method property.
  3. Select Lexical.


The Lexical Classification method is ideal for documents that can be classified by the type of verbiage they use. It is best suited to documents that do not have labels or obvious titles to help a user classify them. For example, a cover letter would make good use of Lexical Classification. While it doesn't have anything the other three methods can make use of, it does contain several words that relate to employment. These words act as a kind of key that tells Grooper what to look for: if they appear at a certain frequency, the document is classified as a cover letter.
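The word-frequency idea can be illustrated with a small sketch. This is a simplification under stated assumptions, not Grooper's actual Lexical algorithm: it builds a word-count profile per Document Type from trained samples and picks the type whose profile is most similar (by cosine similarity) to the new document's text. The function names and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, Iterable, Tuple


def word_counts(text: str) -> Counter:
    """Count lowercase words in the (already OCR'd) text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(samples: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Merge the word counts of all samples belonging to each Document Type."""
    profiles: Dict[str, Counter] = {}
    for doc_type, text in samples:
        profiles.setdefault(doc_type, Counter()).update(word_counts(text))
    return profiles


def classify(text: str, profiles: Dict[str, Counter]) -> str:
    """Return the Document Type whose trained profile best matches the text."""
    counts = word_counts(text)
    return max(profiles, key=lambda doc_type: cosine(counts, profiles[doc_type]))


# Assumed training samples, for illustration only.
PROFILES = train([
    ("Cover Letter", "I am writing to apply for the position. My experience and "
                     "skills make me a strong candidate for employment."),
    ("Invoice", "Invoice number and invoice date. Amount due, total, payment terms, remit to."),
])
```

With these samples, text that talks about a position, skills, and employment scores closer to the "Cover Letter" profile, even though it carries no title or labels.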

Rules-Based

Some documents have ways of identifying themselves, such as certain labels or a title. For example, an invoice will be titled as such and have helpful labels such as "Invoice Date" or "Invoice Number".

Grooper can be configured to look for these details and use them as rules by which to classify documents. To do this, we set up what's called a Positive Extractor on the Document Type. As long as the Positive Extractor returns at least one result, the document will be classified as that Document Type. For more information on the Positive Extractor, see here: [1]
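The "at least one result" rule can be sketched as follows. This is an illustration only, not Grooper's extractor engine: each Document Type is given a regular expression standing in for its Positive Extractor, and the patterns, the function name, and the first-match tie-breaking are all assumptions.

```python
import re
from typing import Dict, Optional

# Hypothetical patterns standing in for each Document Type's Positive Extractor.
POSITIVE_EXTRACTORS: Dict[str, str] = {
    "Invoice": r"\bInvoice\s+(?:Date|Number)\b",
    "Purchase Order": r"\bPurchase\s+Order\b",
}


def classify_by_rules(text: str, rules: Dict[str, str]) -> Optional[str]:
    """Return the first Document Type whose pattern matches the text at least once."""
    for doc_type, pattern in rules.items():
        if re.search(pattern, text, flags=re.IGNORECASE) is not None:
            return doc_type
    # No rule matched, so the document stays unclassified.
    return None
```

For example, `classify_by_rules("Invoice Number: 1001", POSITIVE_EXTRACTORS)` returns `"Invoice"`, while text that matches no pattern returns `None`.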

Visual

Visual Classification is different. Instead of relying on textual information, Visual Classification relies on how a document looks: more specifically, the arrangement of its pixels.

This can be useful for classifying documents that have a table-like structure. For example:



Unfortunately, the weakness of Visual Classification comes into play when you have two documents that are similar in their pixel arrangement. If two documents are similar in appearance, then they could be classified as the same document, regardless of whether or not they actually are.
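A toy sketch can make both the idea and the weakness concrete. It is not how Grooper extracts or compares image features: here each page is assumed to be reduced to a tiny grid of 0 (light) and 1 (dark) cells, and similarity is simply the fraction of cells that agree. The grids and function names are hypothetical.

```python
from typing import Dict, List

Grid = List[List[int]]  # assumed representation: a small grid of 0 (light) / 1 (dark) cells


def similarity(a: Grid, b: Grid) -> float:
    """Fraction of cells that match between two equally sized grids."""
    cells = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(x == y for x, y in cells) / len(cells)


def classify_visual(page: Grid, trained: Dict[str, Grid]) -> str:
    """Return the Document Type whose trained grid looks most like the page."""
    return max(trained, key=lambda doc_type: similarity(page, trained[doc_type]))


# Assumed trained examples, for illustration only.
TRAINED: Dict[str, Grid] = {
    "Table Form": [[1, 1, 1, 1], [1, 0, 1, 0], [1, 1, 1, 1], [1, 0, 1, 0]],
    "Letter":     [[1, 1, 0, 0], [0, 0, 0, 0], [1, 1, 1, 1], [1, 1, 1, 0]],
}

# A new page that differs from the trained table form in only one cell.
PAGE: Grid = [[1, 1, 1, 1], [1, 0, 1, 0], [1, 1, 1, 1], [1, 0, 0, 0]]
```

Because only the pixel layout is compared, any two Document Types with near-identical grids would score almost the same, which is the weakness described above.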

Labelset-Based

Labels are one way documents organize their information. Of course, not all documents are exactly alike. Two health insurance forms from two different companies may carry the same information, but their labels might be different. One insurance company may refer to a client's identification as an "Insurance Number", while another may refer to it as "Information No."

For example, take a look at these invoices and how they label their invoice numbers.

The information falls under the same category, but the labels are different. If we were looking to classify these documents by their company, we would use Labelset-Based Classification to identify documents using "Invoice Number" as NormanDog and those using "Invoice #" as Alchemical.
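The NormanDog and Alchemical example can be sketched as a simple label lookup. This is an illustration under assumptions, not Grooper's Labeling Behavior: each Document Type is given a list of labels, and a document is assigned to the type with the most labels found in its text. The function name and the counting scheme are hypothetical.

```python
from typing import Dict, List, Optional

# Label sets taken from the example above; real label sets would hold many more labels.
LABEL_SETS: Dict[str, List[str]] = {
    "NormanDog": ["Invoice Number"],
    "Alchemical": ["Invoice #"],
}


def classify_by_labels(text: str, label_sets: Dict[str, List[str]]) -> Optional[str]:
    """Return the Document Type with the most labels present in the text, if any."""
    lowered = text.lower()
    hits = {
        doc_type: sum(label.lower() in lowered for label in labels)
        for doc_type, labels in label_sets.items()
    }
    best = max(hits, key=hits.get)
    # A document that contains none of the known labels stays unclassified.
    return best if hits[best] > 0 else None
```

Here a page containing "Invoice Number: 4471" is assigned to NormanDog, and one containing "Invoice # 88-20" is assigned to Alchemical.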

For more detailed information on Labelset-Based Classification, please see the Labelset section of our Labeling Behavior article.

Glossary

Activity: Grooper Activities define specific document processing operations done to a Batch, Batch Folder, or Batch Page. In a Batch Process, each Batch Process Step executes a single Activity (determined by the step's "Activity" property).

  • Batch Process Steps are frequently referred to by the name of their configured Activity followed by the word "step". For example: "Classify step".

Batch Folder: The Batch Folder is an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or Batch Page nodes as children.

  • Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Batch Page: Batch Page nodes represent individual pages within a Batch. Batch Pages are created in one of two ways: (1) When images are scanned into a Batch using the Scan Viewer. (2) Or, when split from a PDF or TIFF file using the Split Pages activity.

  • Batch Pages are frequently referred to simply as "pages".

Batch Process: Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a Batch. Each step is comprised of a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps, which are added as child nodes.
  • A Batch Process is often referred to as simply a "process".

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Classification Method:

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

Classify: Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Document Type.

  • Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
  • Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Content Type: Content Types are a class of node types used to classify Batch Folders. They represent categories of documents (Content Models and Content Categories) or distinct types of documents (Document Types). Content Types serve an important role in defining Data Elements and Behaviors that apply to a document.

Data Element: Data Elements are a class of node types used to collect data from a document. These include: Data Models, Data Sections, Data Fields, Data Tables, and Data Columns.

Data Model: Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) define data extraction logic and settings for how data is reviewed in a Data Viewer.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

ESP Auto Separation: ESP Auto Separation is a Separation Provider used for document separation. It is unique in that it both separates and classifies documents at the same time. It uses page-level classification training examples (among other things) to determine where to insert document folders in a Batch.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Labeling Behavior: A Labeling Behavior extends "label set" functionality to Document Types. This allows you to collect field labels and other labels present on a document and use them in a variety of ways. This includes functionality for classification, field extraction, table extraction, and section extraction.

Labelset-Based: "Labelset-Based" is a Classify Method that leverages the labels defined via a Labeling Behavior to classify Batch Folders.

Lexical: "Lexical" is a Classify Method that classifies Batch Folders based on the text content of trained document examples. This is achieved through the statistical analysis of word frequencies that identify Document Types.

Machine: Machine nodes represent servers that have connected to the Grooper Repository. They are essential for distributing task processing loads across multiple servers. Grooper creates Machine nodes automatically whenever a server makes a new connection to a Grooper Repository's database. Once added, Machine nodes can be used to view server information and to manage Grooper Service instances.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized, in order to be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.

Rules-Based: "Rules-Based" is a Classify Method that employs "rules" defined on each Document Type to classify Batch Folders. Positive Extractor and Negative Extractor properties are configured for each Document Type to positively or negatively associate a Batch Folder based on predefined criteria.

  • While the Positive and Negative Extractors affect the results of every Classify Method, the Rules-Based method classifies using only these properties and nothing else.

Separate: Separate is an Activity that sorts Batch Pages into individual Batch Folders. This distinguishes "loose pages" from the documents formed by those pages. Once loose pages are separated into Batch Folder documents, they can be further processed by Classify, Extract, Export and other Activities that need to run on the folder (i.e. document) level.

Separation Provider: The Provider property of the Separate Activity defines the type of separation to be performed at the designated Scope.

Separation: Separation is the process of taking an unorganized Batch of loose Batch Pages and organizing them into documents represented by Batch Folders in Grooper. This is done so Grooper can later assign a Document Type to each document folder in a process known as "classification".

Visual: "Visual" is a Classify Method that uses image analysis instead of text data to determine the Document Type assigned to a Batch Folder during classification. Instead of using text-based extractors, an "Extract Features" IP Command in an IP Profile is used to collect image-based data from a Batch Folder's image(s). This image-based data is compared against that of previously trained document examples of each Document Type to classify the Batch Folder.