2023.1:Classification (Concept)

Revision as of 12:13, 23 January 2024

STUB

This article is a stub. It contains minimal information on the topic and should be expanded.

Classification, in Grooper, is the process of assigning a Content Type (specifically a Document Type of a Content Model) to a Batch Folder.

About

As far as Grooper is concerned, a document is a Batch Folder object with Batch Page objects as its children. Before classification, the document (Batch Folder) is unclassified, or "blank". Grooper doesn't know what kind of document it is yet. For example, you can have both an invoice and a purchase order within your Batch, and Grooper won't be able to tell the two apart, let alone know which Document Type each one is, unless you perform Classification. Documents are classified by:

Most often, the Classify activity using training data or rules set on a Content Model
- A Classify step will automate document classification in a Batch Process. During the Classify activity, Grooper will use information from the document and its pages and configurations from a Content Model (such as the Classification Method used) to assign the document a Document Type from a Content Model.
In some cases, the Separate activity by assigning a Document Type to each new folder created
- For example, the ESP Auto Separation Separation Provider is a classification-based method of separation. It will both separate pages into document folders and classify the documents during the Separate activity.
Manually assigning a Document Type by right-clicking a Batch Folder and using the "Apply Document Type" command.

Classification and Data Extraction

Classification is performed before data extraction, and is a critical prerequisite for it. Data extraction executes using configured Data Elements in a Data Model. A Data Model is part of a Content Model's hierarchy. Therefore, a document must be assigned a Content Type (specifically a Document Type of a Content Model) in order for the Extract activity to see the Data Model's specifications for data extraction.

Until a document is classified, it has no Content Type assigned to it. Grooper doesn't know which Content Model, and which corresponding Document Types and Data Models, you're using to extract data. Without this information, Grooper will not understand which Data Elements to look for on which Document Types. Nor will it know the extractors used to return values to the Data Elements in a Data Model.

In other words, the document must be classified (having a Document Type assigned to it) before performing the Extract activity.
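The dependency described above can be pictured with a small Python sketch. This is only a conceptual illustration, not Grooper's API: the `CONTENT_MODEL` dictionary, the `BatchFolder` class, and the `data_elements_for` function are all hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical stand-in for a Content Model: each Document Type name maps to
# the Data Element names its Data Model defines.
CONTENT_MODEL = {
    "Invoice": ["Invoice Number", "Invoice Date"],
    "Purchase Order": ["PO Number", "Order Date"],
}


@dataclass
class BatchFolder:
    """Toy document: a folder of page texts with an optional Document Type."""
    pages: list = field(default_factory=list)
    document_type: Optional[str] = None  # None means unclassified ("blank")


def data_elements_for(folder: BatchFolder) -> list:
    """Return the Data Elements to extract, or fail if the folder is unclassified."""
    if folder.document_type is None:
        raise ValueError("Unclassified document: no Data Model to extract with")
    return CONTENT_MODEL[folder.document_type]


folder = BatchFolder(pages=["Invoice Number: 1001"])
try:
    data_elements_for(folder)  # fails: the folder has no Document Type yet
except ValueError as err:
    print(err)

folder.document_type = "Invoice"  # "classification" in this toy model
print(data_elements_for(folder))  # prints ['Invoice Number', 'Invoice Date']
```

In this toy model, the lookup only succeeds once a Document Type is present, which mirrors why the Extract activity needs a classified document.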

Classification Methods

A document can be classified in a variety of ways: by training examples of a Document Type and matching new documents by similarity to that training data, by creating extractor-based rules using keywords, phrases, or other text data, or by a combination of the two. The method used is determined by the Classification Method property of a Content Model. There are four Classification Methods available in Grooper.

⚠

In our documentation, you may read about "rules based" or "training based" classification approaches.

A "rules based" approach refers not only to the Rules-Based method but to using Positive and Negative Extractors in general to set up "classification rules".
A "training based" approach refers to using either the Lexical or Visual methods to classify documents using trained document samples.
A "mixed classification" approach would use both training and rules together to classify documents.

Each of the four methods is described below. For further details, see the respective articles that discuss each method at length.

Lexical

Lexical Classification is a particular Classification Method that relies upon a document's text. Naturally, OCR must be run beforehand, so that Grooper can read the text. To choose this method, simply go to your Content Model, expand the Classification Method property, and select Lexical.

insert image here

The Lexical Classification method is ideal for documents that can be classified by the type of verbiage they use. It is best suited to documents that do not have labels or obvious titles to help a user classify them. For example, a cover letter would make good use of Lexical Classification. While it doesn't have anything the other three methods can make use of, it does contain several words that relate to employment. These words act as a kind of key that tells Grooper what to look for. If they appear at a certain frequency, the document will be classified as a cover letter.
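To make the idea of classifying by verbiage concrete, here is a minimal Python sketch that compares word counts against one trained sample per Document Type. It is an illustration only and is not Grooper's actual algorithm: the cosine-similarity scoring, the `TRAINING` samples, the `minimum_similarity` threshold, and every function name are assumptions made for this example.

```python
import math
import re
from collections import Counter


def word_counts(text: str) -> Counter:
    """Count lowercase words in a piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical training data: one sample text per Document Type.
TRAINING = {
    "Cover Letter": (
        "I am writing to apply for the position. My experience and skills "
        "make me a strong candidate for employment."
    ),
    "Invoice": "Invoice number invoice date amount due payment terms total remit to",
}


def classify_lexical(text, training=TRAINING, minimum_similarity=0.2):
    """Return the best-matching Document Type, or None if nothing is similar enough."""
    doc = word_counts(text)
    scores = {
        name: cosine_similarity(doc, word_counts(sample))
        for name, sample in training.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] >= minimum_similarity else None


letter = (
    "I would like to apply for this position; my skills and experience "
    "fit the employment listing."
)
print(classify_lexical(letter))  # prints Cover Letter
```

The letter has no title or labels, yet it shares enough employment-related vocabulary with the trained sample to be matched, while text that resembles none of the samples falls below the threshold and stays unclassified.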

Rules-Based

Some documents have ways of identifying themselves — this could be through certain labels, or a title. For example, an invoice will be titled as such and have helpful labels such as "Invoice Date", or "Invoice Number."

Grooper can be configured to look for these details and use them as rules by which to classify documents. To do this, we set up what's called a Positive Extractor on the Document Type. As long as the Positive Extractor returns at least one result, the document will be classified as that Document Type. For more information on the Positive Extractor, see here: [1]
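The "at least one result" rule can be sketched in Python with regular expressions standing in for Positive Extractors. This is a simplified illustration, not Grooper's implementation: the `POSITIVE_EXTRACTORS` patterns and the `classify_by_rules` function are hypothetical, and returning the first Document Type whose pattern matches is an assumption made to keep the example short.

```python
import re
from typing import Optional

# Hypothetical rules: one "positive extractor" pattern per Document Type.
POSITIVE_EXTRACTORS = {
    "Invoice": re.compile(r"\bInvoice (Date|Number)\b", re.IGNORECASE),
    "Purchase Order": re.compile(r"\bPurchase Order\b|\bPO Number\b", re.IGNORECASE),
}


def classify_by_rules(text: str) -> Optional[str]:
    """Return the first Document Type whose pattern finds at least one match."""
    for document_type, pattern in POSITIVE_EXTRACTORS.items():
        if pattern.search(text):
            return document_type
    return None  # no rule matched, so the document stays unclassified


print(classify_by_rules("INVOICE\nInvoice Number: 1001"))  # prints Invoice
print(classify_by_rules("Dear hiring manager"))            # prints None
```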

Visual

Visual Classification is different. Instead of relying on textual information, it relies on how a document looks or, more specifically, the arrangement of its pixels.

insert image here

This can be useful for classifying documents that have a table-like structure. For example,

insert image here

Unfortunately, the weakness of Visual Classification comes into play when two different kinds of documents have similar pixel arrangements. If two documents are similar in appearance, they could be classified as the same Document Type, regardless of whether or not they actually are.
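The weakness described above can be seen in a toy Python example that compares tiny black-and-white "images" pixel by pixel. This is an illustration under stated assumptions, not how Grooper measures visual similarity: the 4x4 grids, the `pixel_similarity` function, and the agreement-fraction score are all invented for this sketch.

```python
def pixel_similarity(a, b):
    """Fraction of positions where two same-sized binary images agree."""
    total = sum(len(row) for row in a)
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


# 1 = dark pixel, 0 = light pixel.
# FORM_A and FORM_B are different forms that share a table-like layout.
FORM_A = [
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 1, 0],
]
FORM_B = [
    [1, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 1, 1],
    [1, 0, 0, 0],
]
# LETTER has a visibly different layout.
LETTER = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]

print(pixel_similarity(FORM_A, FORM_B))  # prints 0.9375
print(pixel_similarity(FORM_A, LETTER))  # prints 0.4375
```

The two forms score almost identically even though they are different documents, so a purely visual comparison would struggle to tell them apart, while the letter is easy to distinguish from both.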

Labelset-Based
