Lexical (Classify Method): Difference between revisions

Revision as of 12:24, 6 October 2020

The Lexical Classification Method is one of three methods of classifying documents available to Grooper. This method classifies documents according to their text content, obtained from OCR or extracted native PDF text (via the Recognize activity). It uses a Training-Based Approach to "teach" Grooper to classify a document from trained examples of the Document Type.

Classification is then performed using the Classify activity, using the trained examples and Lexical property configuration on a Content Model.

About

Lexical classification can be enabled and configured on any Content Model object. To do so, select the Classification Method property and set it to Lexical.

What are you classifying? - Document Types

As mentioned before, Lexical classification is a training-based approach. Generally speaking, a training-based approach is one where examples of a document are used to classify more documents as one type or another. Essentially, the whole point is to distinguish one type of document from another.

This may be obvious, but before you can give examples of what one type of document looks like, you have to give a name to the type of document you want to classify. In Grooper, we do this by adding Document Type objects to a Content Model.

For example, imagine you have a collection of human resources documents. For each employee, you'll have a variety of different kinds of documents in their HR file, such as a federal W-4 form, their employment application, various documents pertaining to their health insurance enrollment, and more. In order to distinguish those documents from one another (in other words, classify them), you will need to add a Document Type for each kind of document.

Take the four kinds of documents seen here: a federal W-4, an employee data sheet, an FSA enrollment form, and a pension enrollment form.

Federal W-4	Employee Data Sheet	FSA Enrollment Form	Pension Enrollment Form

If we want to classify a Batch of these documents and assign the W-4 documents a "W-4" classification and so on, we would need to create a Content Model and add one Document Type for each kind of document.

A Content Model is how we determine the taxonomy of our document set. Taxonomy is just a fancy word for a classification scheme. Zoological taxonomy organizes organisms into a classification scheme, from domain all the way down to species. We do much the same thing with documents and a Content Model.

The whole set of HR documents belongs to the top level in the hierarchy, the Content Model itself. Each individual kind of document is represented by a Document Type, which sits at the next level down in that hierarchy. Each one is distinct from the others, but still part of the Content Model's scope, just as insects, spiders, and lobsters are distinct from each other but are all part of the "arthropod" zoological phylum.

How are documents classified? - Trained Examples

The Lexical method uses trained examples for each Document Type in order to classify Batches. During the Classify activity, unclassified documents are compared to trained examples of the Document Types in a Content Model. The document will be assigned the Document Type it is most similar to.

You can train documents using the "Classification Testing" tab of a Content Model (We will go into this more in depth in the How To section of this article).

What is being trained? - Text Features

A Text Feature Extractor is configured to extract values from document samples, such as words or phrases, to be used as identifying features of the document. These features are weighted according to the TF-IDF algorithm: a feature is weighted higher the more often it appears on a document (Term Frequency), and that weight is reduced when the feature is common to multiple Document Types (Inverse Document Frequency). During a Classify activity, the features of an unclassified document are compared to the weighted features of the trained Document Types. The document is assigned the Document Type it is most similar to.

How are documents trained? - TF-IDF

TF-IDF stands for "Term Frequency-Inverse Document Frequency".
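
The weighting idea can be illustrated with a minimal Python sketch. This is not Grooper's implementation: the whitespace tokenizer, the plain log(N / df) formula, the summed-weight similarity score, and the sample texts are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter

# Hypothetical training samples: Document Type name -> list of sample texts.
SAMPLES = {
    "W-4": ["employee withholding certificate form w-4 federal tax"],
    "FSA Enrollment": ["flexible spending account enrollment form employee"],
    "Pension Enrollment": ["pension plan enrollment form employee beneficiary"],
}

def tokenize(text):
    # Assumption: features are lowercase whitespace-separated words.
    return text.lower().split()

def train(samples):
    """Return TF-IDF weights per Document Type: {type: {term: weight}}."""
    # Term counts for each Document Type, pooled over its samples.
    counts = {
        name: Counter(tok for text in texts for tok in tokenize(text))
        for name, texts in samples.items()
    }
    n_types = len(counts)
    # Number of Document Types in which each term appears.
    type_freq = Counter(term for c in counts.values() for term in c)
    weights = {}
    for name, c in counts.items():
        total = sum(c.values())
        weights[name] = {
            # Term Frequency times Inverse Document Frequency.
            term: (n / total) * math.log(n_types / type_freq[term])
            for term, n in c.items()
        }
    return weights

def classify(text, weights):
    """Return the Document Type whose weighted features best match the text."""
    tokens = tokenize(text)
    scores = {
        name: sum(w.get(tok, 0.0) for tok in tokens)
        for name, w in weights.items()
    }
    return max(scores, key=scores.get)

WEIGHTS = train(SAMPLES)
print(classify("employee withholding form federal", WEIGHTS))
```

In this sketch, words such as "form" and "employee" appear in every Document Type, so their weight drops to zero and they contribute nothing to the score, while a word unique to one type, such as "withholding", carries the most weight.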

Mixed Classification: Combining Training-Based and Rules-Based Approaches

Furthermore, a Rules-Based Approach can be taken in combination with the training-based approach. This is done by setting a positive extractor on the Document Type. If the extractor yields a result, the document is classified as that type without being compared to trained examples. This way, if you know a value will appear on a Document Type (such as a header title), you can set a positive extractor on that Document Type to classify those documents. If the extractor fails for whatever reason, the training data acts as a backup classification.
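
The rules-first, training-as-backup order described above can be sketched in Python. This is an illustration of the control flow only, not Grooper's API: the regular expression patterns stand in for positive extractors, and `trained_fallback` is a hypothetical placeholder for comparison against trained examples.

```python
import re

# Hypothetical positive extractors: Document Type name -> header pattern.
POSITIVE_PATTERNS = {
    "W-4": r"Employee's Withholding Certificate",
}

def trained_fallback(text):
    # Placeholder for the training-based comparison; a real classifier
    # would score the text against trained examples of each Document Type.
    return "Employee Data Sheet"

def classify_mixed(text, positive_patterns, fallback):
    """Apply rules first; use the training-based fallback only if none match."""
    for doc_type, pattern in positive_patterns.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            # A positive match classifies the document immediately,
            # skipping the comparison against trained examples.
            return doc_type
    return fallback(text)

print(classify_mixed("Form W-4 Employee's Withholding Certificate",
                     POSITIVE_PATTERNS, trained_fallback))
```

Passing the fallback in as a function keeps the rule check independent of whichever training-based classifier is used behind it.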

@@ Line 46: / Line 46: @@
 The ''Lexical'' method uses trained examples for each '''Document Type''' in order to classify '''Batches'''.  During the '''[[Classify]]''' activity, unclassified documents are compared to trained examples of the '''Document Types''' in a '''Content Model'''.  The document will be assigned the '''Document Type''' it is ''most'' similar to.
+{|cellpadding=10 cellspacing=5
+|style="width:40%" valign=top|
+You can train documents using the "Classification Testing" tab of a '''Content Model''' (We will go into this more in depth in the [[#How To|How To]] section of this article).
+|
+[[File:Lexical-classification-02.png]]
+|}
 === What is being trained? - Text Features ===