2023.1:Waterfall Classification (Concept): Difference between revisions

Revision as of 13:06, 29 April 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Waterfall Classification is a classification technique in Grooper that prioritizes training similarity over the classification "rules" set by a Document Type's Positive Extractor. This can be helpful in scenarios where Batch Folders get misclassified and simply retraining won't help.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

ABOUT

Normally with classification, you can train a Document, set up a Positive Extractor for maximum accuracy, classify, get good results, and call it done. But what happens when high accuracy and specificity do more harm than good? For example, what if, because of the type of Positive Extractor being used, a Document gets classified as the wrong Document Type? You could rework the extractor, but that could take a long time and could create other classification problems. Instead, you can have your Extractor act as a safety net that classifies the extracted data in a more general manner. This is the concept known as Waterfall Classification: moving the Positive Extractor down the curve of the waterfall, away from high specificity and accuracy and towards something somewhat more generic. Not completely generic, just loose enough that training similarity can produce the classification results we want.

Creating the Waterfall

How does one perform Waterfall Classification? It isn't a method you can select and let Grooper do all the work. Waterfall Classification is more of a concept, a name given to a technique you can use when high accuracy alone cannot do a good job of classifying documents.

Here, we have a set of Documents that have been classified incorrectly. They were supposed to have been classified as Title Opinions, as per the folder containing them. However, Grooper has classified them as Generic Letter Document Types.
Looking down at the Similarity Scores, we see that Generic Letter scored higher than Title Opinion: Generic Letter came in at 100% and Title Opinion at 68%. Because of the higher score and the significant gap, Generic Letter won, and it is the Document Type these documents were erroneously classified as.
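
To make the comparison concrete, here is a minimal Python sketch of the "highest score wins" outcome described above. It is not Grooper code. The function name `pick_winner` is hypothetical, and the only data used are the two scores reported in this example.

```python
# Similarity scores reported for one of the misclassified documents.
scores = {"Generic Letter": 1.00, "Title Opinion": 0.68}


def pick_winner(scores):
    """Return (winning Document Type, margin over the runner-up).

    Hypothetical helper for illustration only: it ranks the candidate
    Document Types by score and reports how far ahead the winner is.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (winner, top_score), (_, second_score) = ranked[0], ranked[1]
    return winner, top_score - second_score


winner, margin = pick_winner(scores)
print(f"{winner} wins with a margin of {margin:.0%}")
```

The margin matters as much as the winner: Title Opinion trails by a wide gap, which is why retraining alone is unlikely to close it.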

Normally, when anything goes wrong in classification, it's best to check the Document Type.
Looking at the Document Type, we see that our Positive Extractor is using a Data Type as a reference, specifically the "CLAS-POS AND - Letter" Data Type.
So, let's follow the trail and look at the Data Type to see what's going wrong with classification and what we can do about Generic Letter beating out Title Opinion.

Nothing's wrong with our 'Collation Type', so what to do? Let's see if we can control the output of our similarity scores and force the correct classification for Title Opinions. To do this, open the Result Options property.
Once the Result Options window pops up, select the Confidence Override property. By default, it's set to 0%, since we normally don't need to control the output for classification in Grooper. However, since the wrong Document Type is being scored higher, we can fix the value at 60%. This means that when Grooper runs classification, Generic Letter will not be scored higher than 60% for similarity.

With the Confidence Override set, select the Classify Activity Step within the Batch Process.
Go to the Activity Tester tab.
Select the misclassified Documents.
Click the play button.

Ta-da! Our Documents are now being classified correctly!
As a cautionary measure, we've reclassified the Generic Letters as well. All is well: they're still being classified as the correct Document Type.

Back on the Classification Tester tab, we can see how adjusting the Confidence Override has affected the scoring for similarity. Generic Letter now comes in at the 60% to which it was fixed, allowing the 68% score for Title Opinion to surpass it and thus classify the Title Opinions as Title Opinions.
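
The before-and-after scoring can be sketched in Python as a simplified model. This is not Grooper's actual scoring code, and the function names `extractor_confidence` and `classify` are hypothetical. It rests on two assumptions drawn from the behavior seen in this walkthrough: a matching Positive Extractor reports 100% when the Confidence Override is left at 0%, and reports the override value otherwise.

```python
def extractor_confidence(confidence_override=0.0):
    """Score reported for a Document Type whose Positive Extractor matched.

    Assumption: an override of 0.0 (the default) leaves the match at
    full confidence; any other value fixes the score at that value.
    """
    return confidence_override if confidence_override > 0 else 1.0


def classify(training_scores, extractor_hits):
    """Pick the Document Type with the highest final score.

    training_scores: Document Type -> training similarity (0.0 to 1.0)
    extractor_hits:  Document Type -> Confidence Override, for each type
                     whose Positive Extractor matched the document
    """
    final = dict(training_scores)
    for doc_type, override in extractor_hits.items():
        final[doc_type] = extractor_confidence(override)
    return max(final, key=final.get), final


# Title Opinion scores 68% from training; Generic Letter's extractor matches.
before = classify({"Title Opinion": 0.68}, {"Generic Letter": 0.0})
after = classify({"Title Opinion": 0.68}, {"Generic Letter": 0.60})

print("Override at 0%: ", before[0])
print("Override at 60%:", after[0])
```

In this model the extractor still matches after the change, but it no longer outranks a reasonably strong training similarity. That is the safety-net behavior the technique relies on.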

 {|class="download-box"