2023.1:Visual (Classify Method)

From Grooper Wiki

Revision as of 11:49, 1 April 2024

STUB

This article is a stub. It contains minimal information on the topic and should be expanded.

The Visual classification method uses image data instead of text data to determine the Document Type. Instead of using text-based extractors, an IP Profile is used with an Extract Features command to obtain data pertaining to a document's image. Document samples are trained as examples of a Document Type.

About

Similar to Lexical Classification, Visual Classification relies on a training-based approach, in which Grooper is trained to classify specific documents based on the examples given. The difference is that no text is involved. Instead, Visual Classification relies on how the document looks. Specifically, Visual Classification looks at pixel intensity across the document and classifies accordingly. Pixel intensity refers to how dark a pixel is.
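
As a rough illustration of "how dark a pixel is", the Python sketch below converts 8-bit grayscale values into a darkness score and then into black or white pixels. It is not Grooper code: the function names, the 0 = black / 255 = white convention, and the 0.5 threshold are all assumptions made for this example.

```python
def darkness(gray):
    """Map an 8-bit grayscale value to darkness in [0, 1].

    Assumes the common convention 0 = black, 255 = white.
    """
    if not 0 <= gray <= 255:
        raise ValueError("expected an 8-bit grayscale value")
    return 1.0 - gray / 255


def to_black_and_white(row, threshold=0.5):
    """Mark a pixel 1 (black) when its darkness reaches the threshold, else 0 (white).

    The 0.5 threshold is an arbitrary choice for this illustration.
    """
    return [1 if darkness(g) >= threshold else 0 for g in row]


scanline = [0, 64, 200, 255]
print([round(darkness(g), 2) for g in scanline])  # [1.0, 0.75, 0.22, 0.0]
print(to_black_and_white(scanline))               # [1, 1, 0, 0]
```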

How To: Set Up Visual Classification

Before You Begin

Before you set up your Classification Method, you will need two things: an IP Profile and an Extract Features IP Step. Extract Features creates an NxN series of matrices for each document page's image.
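
To picture what a per-page grid of values might look like, here is a minimal Python sketch that splits a black-and-white page into an N x N grid and records the fraction of black pixels in each cell. This is one plausible reading of the description above, not the actual Extract Features implementation; `cell_intensities` and the tiny 4 x 4 "page" are hypothetical.

```python
def cell_intensities(page, n):
    """Split a binary page image into an n x n grid of cells.

    `page` is a list of pixel rows (1 = black, 0 = white). Returns an
    n x n list holding the fraction of black pixels in each cell.
    """
    height, width = len(page), len(page[0])
    grid = []
    for r in range(n):
        # Integer boundaries so every pixel lands in exactly one cell.
        top, bottom = r * height // n, (r + 1) * height // n
        row = []
        for c in range(n):
            left, right = c * width // n, (c + 1) * width // n
            pixels = [page[y][x] for y in range(top, bottom) for x in range(left, right)]
            # An empty cell (n larger than the image) counts as fully white.
            row.append(sum(pixels) / len(pixels) if pixels else 0.0)
        grid.append(row)
    return grid


# Hypothetical 4 x 4 page, divided into a 2 x 2 grid.
page = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(cell_intensities(page, 2))  # [[1.0, 0.0], [0.0, 0.25]]
```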


As Visual Classification is training-based, you will need Document Types as well.

  1. For an example later in this article, we'll be using four Document Types.


What Is 'Intensity'?

Intensity is a feature Grooper uses to measure how dark a cell is. When Grooper runs the Extract Features IP Command, it divides each page of a document into cells and measures how dark, or "intense", each one is. During classification, each cell's intensity, expressed as the ratio of black to white pixels, is compared with the trained examples to determine similarity. The document is then given a percentage similarity score for each Document Type, and whichever Document Type has the highest percentage similarity is assigned to the document.

Think of a structured form, where the lines and text change very little from one copy to the next. If such a document is divided into cells, the percentage of black pixels in each cell will be very similar from document to document.
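
The idea that two copies of the same structured form produce nearly identical cell values can be sketched in Python as follows. The scoring formula (100 minus the mean absolute difference between cells, as a percentage) is an assumption chosen for simplicity; the source does not say how Grooper computes its similarity score, and the grids are made-up values.

```python
def similarity(grid_a, grid_b):
    """Percentage similarity between two equally sized intensity grids.

    Assumed formula: 100 * (1 - mean absolute difference of the cells).
    """
    cells_a = [v for row in grid_a for v in row]
    cells_b = [v for row in grid_b for v in row]
    if len(cells_a) != len(cells_b):
        raise ValueError("grids must have the same dimensions")
    mean_diff = sum(abs(a - b) for a, b in zip(cells_a, cells_b)) / len(cells_a)
    return 100.0 * (1.0 - mean_diff)


# Two copies of the same form, filled in by different people (hypothetical values).
form_a = [[0.30, 0.10], [0.05, 0.20]]
form_b = [[0.32, 0.10], [0.05, 0.18]]
# A document with a very different layout.
letter = [[0.02, 0.02], [0.40, 0.45]]

print(round(similarity(form_a, form_b), 1))  # 99.0
print(round(similarity(form_a, letter), 1))  # 76.0
```

The two copies of the form score far higher against each other than against the unrelated layout, which is the property Visual Classification depends on.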

Below is one of the documents we'll be using to illustrate how Visual Classification works. For the moment, we'll use it to briefly showcase intensity and how it works during the process of Visual Classification.

  1. Visual Classification is best put to use with highly structured documents. Since structured documents rely on their visual layout more than their text data, said text data could be considered irrelevant for classification. It doesn't really matter what name is on the document if it's going to be in the same place every time, for example.



  1. Once Extract Features has been run, the intensity is available for viewing. Once the document is trained and classification begins, Grooper will use this intensity grid as a basis, classifying any documents that look like this as the document type it was assigned.


The Classification Process

Now that everything has been set up, it's time to classify a set of Documents using the Visual Classification Method.

  1. With the IP Profile set up and the Document Types created, the Visual Classification Method can be set up on the Content Model.
  2. Select the hamburger icon to the far right of the Classification Method property, and from the drop-down menu, select Visual.



  3. After selecting the Visual Classification Method, expand the property to select the IP Profile the Classification Method will reference.
  4. Select the IP Profile from the drop-down menu.



  5. With all the prep work taken care of, there are just two things left to do: train the documents, then classify. This is done in the Classify step of a Batch Process.
  6. Ensure that the activity and scope are correct before proceeding.
  7. Expand the drop-down menu for the Content Model Scope, and select the desired Content Model.



  8. Navigate to the Classification Tester tab to train the documents.
  9. Right-click the Folder you wish to train.
  10. Select "Classification", followed by "Train As..."



  11. In the Train As window that opens, expand the drop-down menu for the Document Type you wish to select.
  12. Choose the desired Document Type. For this example, the first document we've selected is an Addendum and is being classified as such.
  13. Once the desired Document Type is selected, click "Execute".



  14. Repeat the process until all Folders have been trained as the desired Document Type.



  15. With the document training completed, navigate to the Activity Tester tab.
  16. We're now ready to test Visual Classification on an as-yet unclassified Batch.
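
Conceptually, the train-then-classify workflow above can be modeled with the Python sketch below. It is a toy stand-in, not Grooper's implementation: the class, its methods, the similarity formula, and the "Cover Sheet" Document Type are all hypothetical, and it assumes a document takes the type of its single closest trained example.

```python
from collections import defaultdict


class VisualClassifierSketch:
    """Toy model of training samples as Document Types, then classifying."""

    def __init__(self):
        # Document Type name -> list of trained intensity grids.
        self.samples = defaultdict(list)

    def train_as(self, doc_type, grid):
        """Store a sample's intensity grid as an example of doc_type."""
        self.samples[doc_type].append(grid)

    def classify(self, grid):
        """Return (doc_type, score) of the most similar trained example."""
        best_type, best_score = None, float("-inf")
        for doc_type, examples in self.samples.items():
            for example in examples:
                score = self._similarity(grid, example)
                if score > best_score:
                    best_type, best_score = doc_type, score
        return best_type, best_score

    @staticmethod
    def _similarity(grid_a, grid_b):
        # Assumed scoring: 100 * (1 - mean absolute cell difference).
        cells_a = [v for row in grid_a for v in row]
        cells_b = [v for row in grid_b for v in row]
        mean_diff = sum(abs(a - b) for a, b in zip(cells_a, cells_b)) / len(cells_a)
        return 100.0 * (1.0 - mean_diff)


classifier = VisualClassifierSketch()
classifier.train_as("Addendum", [[1.0, 0.0], [0.0, 0.25]])
classifier.train_as("Addendum", [[0.75, 0.0], [0.0, 0.5]])
classifier.train_as("Cover Sheet", [[0.0, 0.5], [0.5, 0.0]])

result = classifier.classify([[0.75, 0.0], [0.0, 0.25]])
print(result)  # ('Addendum', 93.75)
```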





For example, a common feature used is "intensity". The document is divided into cells and the ratio of black to white pixels in each cell is measured. During classification, Grooper compares the values the IP Profile obtained from the trained examples with those of the document to be classified.

Visual classification is unique in that it does not require OCR. It can be performed in real time during scanning.