Lexical (Classify Method)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


The Lexical Classification Method is one of three methods of classifying documents available to Grooper. This method classifies documents according to their text content, obtained from OCR or extracted native PDF text (via the Recognize activity). It uses a training-based approach to "teach" Grooper to classify a document from trained examples of the Document Type.

Classification is then performed using the Classify activity, using the trained examples and Lexical property configuration on a Content Model.

About

Lexical classification can be enabled and configured on any Content Model object. To do so, set the Classification Method property to Lexical.



What are you classifying? - Document Types

As mentioned before, Lexical classification is a training-based approach. Generally speaking, a training-based approach is one where examples of a document type are used to teach Grooper how to classify more documents as one type or another. Essentially, the whole point is to distinguish one type of document from another.

This may be obvious, but before you can give examples of what one type of document looks like, you have to give a name to that type of document you're wanting to classify. In Grooper, we do this by adding Document Type objects to a Content Model.

For example, imagine you have a collection of human resources documents. For each employee, you'll have a variety of different kinds of documents in their HR file, such as a federal W-4 form, their employment application, various documents pertaining to their health insurance enrollment, and more. In order to distinguish those documents from one another (in other words, classify them), you will need to add a Document Type for each kind of document.

Take the four kinds of documents seen here: a federal W-4, an employee data sheet, an FSA enrollment form, and a pension enrollment form.

(Sample documents: Federal W-4, Employee Data Sheet, FSA Enrollment Form, Pension Enrollment Form)


If we want to classify a Batch of these documents and assign the W-4 documents a "W-4" classification and so on, we would need to create a Content Model and add one Document Type for each kind of document.

A Content Model is how we determine the taxonomy of our document set. Taxonomy is just a fancy word for a classification scheme. Zoological taxonomy organizes organisms into a classification scheme, from domain all the way down to species. We do much the same thing with documents and a Content Model.

The whole set of HR documents belongs to the top level in the hierarchy, the Content Model itself. Each individual kind of document is represented by a Document Type, which is the next level down in that hierarchy. Each one is distinct from the others, but still part of the Content Model's scope. Just like insects, spiders, and lobsters are distinct from each other but are all part of the "arthropod" phylum.

How are documents classified? - Trained Examples

The Lexical method uses trained examples for each Document Type in order to classify Batches. During the Classify activity, unclassified documents are compared to trained examples of the Document Types in a Content Model. The document will be assigned the Document Type it is most similar to.

You can train documents using the "Classification Testing" tab of a "Classify" Batch Process Step (We will go into this more in depth in the How To section of this article).

You then train a document by right-clicking on the document you wish to train, hovering over "Classify", and clicking "Train As..." when it pops up.

So, for this example, we've selected a W-4 form.

When the new "Train As" window pops up, you can click the hamburger icon to the right of the Content Type property and select the Content Model and then the Document Type from the drop down menu.

Here we classified this document as a "Federal W-4".

This will create two new levels of hierarchy in your Content Model. Training a document will create a Form Type of that document as a child of the Document Type assigned. The Form Type will have its own Page Type children corresponding to each page of the trained document.

You will create multiple Form Types when you train multiple examples of documents of varying lengths. You will create a 2-Page Form Type for documents two pages in length (with two Page Type child objects), a 1-Page Form Type for single-page documents (with a single Page Type child), and a 10-Page Form Type for ten-page documents (with ten Page Type children).

What is being trained? - Text Features

When it comes time to compare unclassified documents to trained examples, specifically what is compared is the lexical content of the documents. In other words, words. Documents use language to convey information. Words and phrases are features of what makes one document distinct from another. Words used in the documents of one Document Type will share some meaningful similarities, which will be different from the language of another Document Type.

In order to find this lexical content, you first need to set a Text Feature Extractor. A Text Feature Extractor is set to extract text-based values from document samples to be used as identifiable features of the document.

Commonly, the extractor used here locates unigrams (single words), bigrams (two-word phrases), or trigrams (three-word phrases) as the features. However, a Text Feature Extractor is highly configurable, allowing you to use lexicons specific to your document set, exclude portions of a document's text from training, even use tokenized features of non-lexical text, and more.

This is the first thing you will do when configuring Lexical classification. If you're training the words in a document, you need to tell Grooper how to find those words first! After Lexical is chosen as the Classification Method of a Content Model, the Text Feature Extractor can be set in the Lexical sub-properties. This can be a Reference to a Data Type or an Internal regular expression pattern.

FYI

Any Data Type can be a Text Feature Extractor. You can customize this extractor however best suits your document classification needs. However, there are a few pre-built feature extractors that ship with every Grooper install. You can find them in the Data Extraction folder, at the following folder path: Data Types > Downloads > Features.

Feature Extractor

But how do you set up a Text Feature Extractor? There are several methods one can use. Let's take a look at a quick, simple example.
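One simple approach is an internal regular expression pattern that treats every whole word as a feature. As a hedged sketch (Grooper's internal patterns are .NET-flavored regex, and this pattern is illustrative rather than one of the shipped feature extractors), the same idea can be shown with Python's re module:

import re

# Illustrative only: treat every whole word of three or more letters as a unigram feature.
sample_text = "Employee's Withholding Certificate  Department of the Treasury"
feature_pattern = re.compile(r"\b[A-Za-z]{3,}\b")
features = [m.group(0).lower() for m in feature_pattern.finditer(sample_text)]

print(features)
# ['employee', 'withholding', 'certificate', 'department', 'the', 'treasury']

A pattern this broad will also pull in stop words like "the"; the Lexicon and stop-word advice later in this article is one way to trim those out.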

How are features trained? - TF-IDF

TF-IDF stands for "Term Frequency-Inverse Document Frequency". It is a numerical statistic intended to reflect how important a word is to a document within a collection (or document set or "corpus"). This "importance" is assigned a numerical weighting value. The higher the word's weighting value, the more relevant it is to that document in the corpus (or, for our purposes, the more similar it is to a Document Type).

Text features (extracted by the Text Feature Extractor) are given weightings according to the TF-IDF algorithm. Features are given a higher weighting the more they appear on a document (Term Frequency), mitigated by whether that feature is common to multiple Document Types (Inverse Document Frequency). Some words appear more generally across documents and hence are less unique identifiers. So, their weighting is lessened.

During a Classify activity, the features of an unclassified document are compared to the weighted features of the trained Document Types. The document is assigned the Document Type it is most similar to.
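As a minimal sketch of the idea (not Grooper's internal implementation; the Document Types, feature lists, and cosine-similarity comparison below are illustrative assumptions), the weighting and comparison can be outlined like this:

import math
from collections import Counter

# Toy "trained examples": one bag of extracted features per Document Type (made-up text).
trained = {
    "Federal W-4":    ["employee", "withholding", "certificate", "allowances"],
    "FSA Enrollment": ["flexible", "spending", "account", "enrollment", "employee"],
}

def tf_idf_weights(features, corpus):
    # Weight each feature by term frequency x inverse document frequency.
    n_docs = len(corpus)
    counts = Counter(features)
    weights = {}
    for term, count in counts.items():
        tf = count / len(features)
        df = sum(1 for feats in corpus.values() if term in feats)
        idf = math.log(n_docs / df) if df else 0.0
        weights[term] = tf * idf
    return weights

def cosine_similarity(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Weight the trained examples once, then classify an unknown document by best similarity.
weighted = {dt: tf_idf_weights(feats, trained) for dt, feats in trained.items()}
unknown = tf_idf_weights(["employee", "withholding", "allowances"], trained)
best = max(weighted, key=lambda dt: cosine_similarity(unknown, weighted[dt]))
print(best)  # Federal W-4

Notice how "employee", which appears in both Document Types, ends up with zero weight, while distinctive terms like "withholding" carry the comparison.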

For a more in depth explanation of TF-IDF, visit the TF-IDF article.

Logarithmic Term Frequency

Even amongst the same Document Type, not every document will have the same number of pages. One document may have ten pages, while another may have one hundred. Naturally, this could cause problems with term frequency. A longer document will see more frequent uses of the terms used for classification. Higher frequency causes terms to have less weight. When terms have less weight, this results in documents potentially being classified incorrectly.

So, what to do? Simple: Switch the Term Frequency Mode from Normal to Logarithmic. In Normal TF-IDF, Grooper calculates the term weightings independently of document page count. With Logarithmic, varying page count between documents is taken into consideration, and the term frequency is scaled logarithmically. After all, when it comes to Classification, it's sometimes more important that the term even appeared on the document, as opposed to how many times it appeared on the document.
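As a hedged illustration (the exact scaling Grooper applies is covered in the linked article; the common textbook form of logarithmic term frequency is shown here), log scaling means a term that appears 100 times on a long document does not weigh 100 times more than a term that appears once:

import math

def tf_logarithmic(count):
    # Log scaling: appearing at all matters most; repeat appearances add only a little.
    return 1 + math.log(count) if count > 0 else 0.0

for count in (1, 10, 100):
    print(count, round(tf_logarithmic(count), 2))
# 1 1.0
# 10 3.3
# 100 5.61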

For more information on the math behind Logarithmic Term Frequency Mode, and the mode in general, see here: [1]

Document Frequency Mode

Different document types can share terminology. Such is the nature of language. So, how can we help Grooper tell one Document Type from another? The frequency of the terms. That's where Document Frequency comes in. Document Frequency comes in two modes: Normal and Smooth. In Normal mode, words that show up multiple times throughout different Document Types, like stop words, are given a lower weighting, often of zero, so that they will not interfere with Classification. In Smooth mode, those commonly occurring words do have weight, and stand to affect classification.

These weightings are calculated through a document frequency formula. For more information on the math behind said calculation, see here: [2]

Suffice it to say, the document frequency weighting is calculated using a logarithmic function. The main difference between Normal and Smooth mode is that Normal converges to zero at a quicker rate than Smooth. Which mode is better? Whichever is preferred. However, one might need to keep in mind what kind of documents they're dealing with: structured or unstructured? In some cases, omitting frequently occurring terms (like stop words) by using Normal Frequency Mode might do more harm than good for classification.
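As a hedged sketch of the distinction (the common textbook forms of the two variants are shown here; the exact formulas Grooper uses are in the linked article), a term that appears in every Document Type drops to zero weight under the normal form but keeps a small positive weight under the smoothed form:

import math

def idf_normal(n_doc_types, doc_freq):
    # A term found in every Document Type gets log(N/N) = 0 weight.
    return math.log(n_doc_types / doc_freq)

def idf_smooth(n_doc_types, doc_freq):
    # Smoothing keeps even ubiquitous terms at a small positive weight.
    return math.log(1 + n_doc_types / doc_freq)

n = 4  # e.g. four Document Types in the Content Model
for doc_freq in (1, 2, 4):
    print(doc_freq, round(idf_normal(n, doc_freq), 2), round(idf_smooth(n, doc_freq), 2))
# 1 1.39 1.61
# 2 0.69 1.1
# 4 0.0 0.69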

Training Advice

Documents come in two flavors: structured and unstructured. The former is less intimidating to deal with. Cleanly organized and labeled information, simple for both Grooper and the user. Unstructured documents are a little more complex. While a human can still make sense of an unstructured document and recognize it, Grooper is going to have more trouble. Below, we offer a brief bit of advice on what to do when each kind of document is encountered.

Structured Documents

Structured Documents are just that - documents that are structured in such a way that the information is easily readable and identifiable, both by humans and by Grooper. Think of something like a W-4 or a Data Info Sheet. All of the information is labeled and organized. This allows Classification tools, such as Lexicons and N-Grams, to do minimal (if any) lifting in regards to Classification. Lexicons are objects in the Node Tree that act as dictionaries, storing lists of words, phrases, etc. that are commonly found throughout a language. With structured documents, omitting stop words ("a", "an", "the") can aid Classification even further. You can do this through a Lexicon.

N-Grams are Data Types that can be used to extract terms of varying length. "N" denotes the number of words making up your desired extracted data - Unigrams are single-word terms, Bigrams are two-word terms, and Trigrams are three-word terms. Sometimes it helps to extract a multi-word term as opposed to a single-word term when classifying documents. For example, the words "federal" and "tax" can appear by themselves on a variety of tax documents, but if you're looking for a document titled "Federal Tax Form", then searching for "federal tax" using a bigram extractor (or, even better, extracting the title as a whole with a trigram extractor) will get you better classification results.
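As an illustrative sketch (outside of Grooper, using the "Federal Tax Form" title from the example above), generating unigram, bigram, and trigram features from that title shows why the longer n-gram pins the document down more precisely:

def ngrams(words, n):
    # Return every run of n consecutive words as a single feature.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

title = "federal tax form".split()
print(ngrams(title, 1))  # ['federal', 'tax', 'form']
print(ngrams(title, 2))  # ['federal tax', 'tax form']
print(ngrams(title, 3))  # ['federal tax form']

The unigrams "federal" and "tax" will match many tax documents, but the trigram "federal tax form" only matches documents whose title contains that exact phrase.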

Unstructured Documents

Not all documents are nice and easy. Sometimes, you have to classify documents that are more complicated for Grooper. Walls of text, OCR errors - these are the things that can make for difficult classification. Even if they're the same Document Type, documents can vary in appearance and text placement from document to document. A good way to start is to look for the best possible example of the document that you wish to train as a particular Document Type and use it as your training example. From there, you can make use of feature extractors, and Document Frequency Smoothing if need be. While Lexicons and N-Grams could help, omitting stop words, or searching for single-word terms as opposed to multi-word terms, could lessen confusion on Grooper's end.

Mixed Classification: Combining Training-Based and Rules-Based Approaches

Furthermore, a rules-based approach can be taken in combination with the training-based approach when using the Lexical Classification Method. This can be done by setting a Positive extractor on the Document Type object of a Content Model. If the extractor yields a result, the document will be classified as that Document Type. Generally, this will "win out" over the training weightings, because the Positive Extractor's confidence result (as a percentage value) will be higher than the document's similarity to the trained examples (as a percentage value) for a Document Type.

This way, if you have a value that can be extracted that you know is going to be on a Document Type, you can take advantage of setting a Positive Extractor on the Document Type to classify those documents. For example, document titles are often used as "rules". If you can extract text to match a title to a corresponding Document Type, this is often a quick and easy way to classify a document. But, if that extractor fails for whatever reason (because of bad OCR or a new title not matching the extractor's regex), you have training data which can act as a backup classification method.
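As a hedged example of such a title rule (the pattern is illustrative and would be configured as the Document Type's Positive Extractor; Grooper's extractors use .NET-flavored regex, sketched here with Python's re module), a W-4 could be caught by matching its title text:

import re

# Illustrative rule: if the W-4 title appears, classify the document as "Federal W-4".
w4_title = re.compile(r"Employee'?s\s+Withholding\s+(Allowance\s+)?Certificate", re.IGNORECASE)

ocr_text = "Form W-4  Employee's Withholding Certificate  Department of the Treasury"
if w4_title.search(ocr_text):
    print("Rule matched: classify as Federal W-4")
else:
    print("No rule match - fall back to the trained examples")

If the OCR mangles the title so the rule misses, the training-based weightings described above still get a chance to classify the document.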

Many of the best classification strategies involve combining the training-based Lexical method with a rules-based approach.