2023.1:Visual (Classify Method): Difference between revisions

From Grooper Wiki
* [[Media:2023.1_Wiki_Visual-(Classification-Method)_Project.zip]]
Revision as of 13:47, 10 May 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

"Visual" is a Classify Method that uses image analysis instead of text data to determine the Document Type assigned to a Batch Folder during classification. Instead of using text-based extractors, an "Extract Features" IP Command in an IP Profile is used to collect image-based data from a Batch Folder's image(s). This image-based data is compared against that of previously trained document examples of each Document Type to classify the Batch Folder.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

Glossary

Activity: Grooper Activities define specific document processing operations done to a Batch, Batch Folder, or Batch Page. In a Batch Process, each Batch Process Step executes a single Activity (determined by the step's "Activity" property).

  • Batch Process Steps are frequently referred to by the name of their configured Activity followed by the word "step". For example: "Classify step".

Batch Folder: The Batch Folder is an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or Batch Page nodes as children.

  • Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Batch Process: Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the set of step-by-step processing instructions given to a Batch. Each step consists of a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps, which are added as child nodes.
  • A Batch Process is often referred to as simply a "process".

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Classification Method:

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

Classify: Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Document Type.

  • Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
  • Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

Execute: Execute is an Activity that runs one or more specified object commands. This gives access to a variety of Grooper commands in a Batch Process for which there is no Activity, such as the "Sort Children" command for Batch Folders or the "Expand Attachments" command for email attachments.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

IP Command: IP Commands specify an image processing (IP) operation (such as image cleanup, format conversion, or feature detection) and are used to construct IP Steps in an IP Profile. IP Commands are configured using an IP Step's Command property.

IP Profile: An IP Profile is a step-by-step list of image processing operations (IP Commands). IP Profiles are used for several image processing related operations, but primarily for:

  1. Permanently enhancing an image during the Image Processing activity (usually to get rid of defects in a scanned image, such as skewing or borders).
  2. Cleaning up an image in memory during the Recognize activity to improve OCR accuracy, without altering the stored image.
  3. Computer vision operations that collect layout data (table line locations, OMR checkboxes, barcode values, and more) utilized in data extraction.

IP Step: IP Steps are the basic units of an IP Profile. They define a single image processing operation, called an IP Command in Grooper.

Lexical: "Lexical" is a Classify Method that classifies Batch Folders based on the text content of trained document examples. This is achieved through the statistical analysis of word frequencies that identify Document Types.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects, such as Content Models, Batch Processes, and profile objects, are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Scope: The Scope property of a Batch Process Step, as it relates to an Activity, determines at which level in a Batch hierarchy the Activity runs.

Visual: "Visual" is a Classify Method that uses image analysis instead of text data to determine the Document Type assigned to a Batch Folder during classification. Instead of using text-based extractors, an "Extract Features" IP Command in an IP Profile is used to collect image-based data from a Batch Folder's image(s). This image-based data is compared against that of previously trained document examples of each Document Type to classify the Batch Folder.

About

Similar to Lexical Classification, Visual Classification relies on a training-based approach: Grooper is trained to classify documents from the examples it is given. The difference is that no text is involved. Instead, Visual Classification relies on how the document looks. Specifically, it measures pixel intensity (how dark a pixel is) across the document and classifies accordingly.

How To: Set Up Visual Classification

Before You Begin

Before you set up your Classification Method, you will need an IP Profile that contains an Extract Features IP Step. Extract Features creates an MxN series of matrices for each document page's image.
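To picture what an MxN layout over a page means, here is a minimal Python sketch that splits a page into a grid of rectangular cells. It is only an illustration: the function name `cell_bounds`, the page size, and the 4x2 grid are assumptions made for this example, and the article does not describe how Grooper builds its matrices internally.

```python
def cell_bounds(width, height, rows, cols):
    """Split a width x height page into a rows x cols grid.

    Returns a list of rows; each entry is a (left, top, right, bottom)
    pixel box. Integer division keeps the boxes adjacent with no gaps.
    """
    if rows <= 0 or cols <= 0:
        raise ValueError("rows and cols must be positive")
    grid = []
    for r in range(rows):
        top = r * height // rows
        bottom = (r + 1) * height // rows
        row = []
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            row.append((left, top, right, bottom))
        grid.append(row)
    return grid


# Hypothetical page of 850 x 1100 pixels divided into 4 rows and 2 columns.
grid = cell_bounds(width=850, height=1100, rows=4, cols=2)
```

Each box in `grid` marks the region of the page that one matrix entry would summarize.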


As Visual Classification is training-based, you will need Document Types as well.

  1. For an example later in this article, we'll be using four Document Types.


What Is 'Intensity'?

Intensity is a feature Grooper uses to measure how dark a cell is. When Grooper runs the Extract Features IP Command, it divides each page of a document into cells and measures how dark, or "intense", each one is. The document is then given a percentage similarity score against each Document Type, and the Document Type with the highest percentage similarity is assigned to the document. For the intensity feature, each cell's intensity is compared with that of the training examples, using the ratio of black to white pixels to determine similarity.
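The idea of measuring how dark each cell is can be sketched in a few lines of Python. This is an assumption-laden illustration, not Grooper's code: the page is modeled as a tiny black-and-white bitmap (1 = black pixel), `intensity_grid` is a hypothetical helper name, and intensity is taken to be the fraction of black pixels in each cell.

```python
def intensity_grid(page, rows, cols):
    """Return a rows x cols grid of black-pixel fractions.

    `page` is a list of pixel rows, where 1 is a black pixel and 0 is white.
    """
    height = len(page)
    width = len(page[0])
    grid = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        row = []
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            cell = [page[y][x]
                    for y in range(top, bottom)
                    for x in range(left, right)]
            # An empty cell (grid finer than the image) counts as fully white.
            row.append(sum(cell) / len(cell) if cell else 0.0)
        grid.append(row)
    return grid


# Hypothetical 4 x 4 page: a solid block top-left, one stray pixel bottom-right.
page = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
grid = intensity_grid(page, rows=2, cols=2)
print(grid)  # [[1.0, 0.0], [0.0, 0.25]]
```

The top-left cell is entirely black, so its intensity is 1.0, while the bottom-right cell has one black pixel out of four, giving 0.25.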

Think of a structured form, where the lines and text change very little from one copy to the next. If the document is divided into cells, the percentage of black pixels in each cell will be very similar from document to document.
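As a rough illustration of why two copies of the same form score as similar, the sketch below compares two intensity grids cell by cell. The scoring formula (100 minus the mean absolute difference, as a percentage) is an assumption chosen for simplicity; the article does not state the formula Grooper uses, and `similarity_percent` is a hypothetical name.

```python
def similarity_percent(grid_a, grid_b):
    """Score two equally sized intensity grids from 0 to 100.

    Assumed formula: 100 * (1 - mean absolute difference between cells).
    """
    a = [value for row in grid_a for value in row]
    b = [value for row in grid_b for value in row]
    if not a or len(a) != len(b):
        raise ValueError("grids must be non-empty and the same size")
    mean_diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 100.0 * (1.0 - mean_diff)


# Two copies of the same hypothetical form, differing slightly in one cell,
# and a blank page for contrast.
form_a = [[0.5, 0.0], [0.25, 0.75]]
form_b = [[0.5, 0.0], [0.25, 0.5]]
blank = [[0.0, 0.0], [0.0, 0.0]]

same_form_score = similarity_percent(form_a, form_b)
blank_score = similarity_percent(form_a, blank)
```

Under this assumed formula, the two copies of the form score much higher against each other than the form does against a blank page.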

Below is one of the documents we'll be using to illustrate how Visual Classification works. For the moment, we'll use it to briefly showcase intensity and how it works during the process of Visual Classification.

  1. Visual Classification is best put to use with highly structured documents. Since structured documents rely on their visual layout more than their text data, said text data could be considered irrelevant for classification. It doesn't really matter what name is on the document if it's going to be in the same place every time, for example.



  1. Once Extract Features has been run, the intensity is available for viewing. Once the document is trained and classification begins, Grooper will use this intensity grid as a basis, classifying any documents that look like this as the document type it was assigned.

The Classification Process

Now that everything has been set up, it's time to classify a set of Documents using the Visual Classification Method.

  1. With the IP Profile set up and the Document Types created, the Visual Classification Method can be set up on the Content Model.
  2. Select the hamburger icon to the far right of the Classification Method property, and from the drop-down menu, select Visual.



  1. After selecting the Visual Classification Method, expand the property to select the IP Profile the Classification Method will reference.
  2. Select the IP Profile from the drop-down menu.



  1. With all the prep work taken care of, there are just two things left to do: train the documents, then classify. This is done within the Classify step within a Batch Process.
  2. Ensure that the activity and scope are correct before proceeding.
  3. Expand the drop-down menu for the Content Model Scope, and select the desired Content Model.



  1. Navigate to the Classification Tester tab to train the documents.
  2. Right-click the Folder you wish to train.
  3. Select "Classification", followed by "Train As..."



  1. In the Train As window that populates, expand the drop-down menu for the Document Type you wish to select.
  2. Choose the desired Document Type. For this example, the first document we've selected is an Addendum and is being classified as such.
  3. Once the desired Document Type is selected, click "Execute".



  1. Repeat the process until all Folders have been trained as the desired Document Type.



  1. With the document training completed, navigate to the Activity Tester tab.
  2. We're now ready to test Visual Classification on an as-yet-unclassified Batch.



  1. All the folders in this Batch have been selected.
  2. The goal is to classify them according to the Visual training Grooper has been given.



  1. And voila! With Grooper trained to recognize certain Document Types visually, through features such as intensity alone, all of these previously unclassified Documents have been classified.
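The train-then-classify workflow above can be summarized with a small Python sketch. It is a simplified stand-in, not Grooper's implementation: the feature vectors are made-up cell intensities, "Invoice" is a hypothetical Document Type added for contrast with the Addendum example, and both the per-type averaging and the similarity formula are assumptions.

```python
def train(examples):
    """Average the feature vectors of the trained examples per Document Type."""
    sums, counts = {}, {}
    for doc_type, features in examples:
        if doc_type not in sums:
            sums[doc_type] = [0.0] * len(features)
            counts[doc_type] = 0
        sums[doc_type] = [s + f for s, f in zip(sums[doc_type], features)]
        counts[doc_type] += 1
    return {t: [s / counts[t] for s in sums[t]] for t in sums}


def similarity_percent(a, b):
    """Assumed score: 100 * (1 - mean absolute difference between cells)."""
    return 100.0 * (1.0 - sum(abs(x - y) for x, y in zip(a, b)) / len(a))


def classify(features, model):
    """Return the best-scoring Document Type and all similarity scores."""
    scores = {t: similarity_percent(features, proto) for t, proto in model.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical training set: flattened cell intensities per trained folder.
examples = [
    ("Addendum", [0.8, 0.1, 0.1, 0.0]),
    ("Addendum", [0.6, 0.1, 0.3, 0.0]),
    ("Invoice", [0.0, 0.5, 0.0, 0.5]),
]
model = train(examples)

# An unclassified folder whose layout resembles the Addendum examples.
doc_type, scores = classify([0.7, 0.0, 0.2, 0.0], model)
print(doc_type)  # Addendum
```

As in the article, the document receives a similarity score against every Document Type and is assigned the one with the highest score.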