Waterfall Classification (Concept)


This article was migrated from an older version and has not been updated for the current version of Grooper.

This tag will be removed upon article review and update.

This article is about the current version of Grooper.

Note that some content may still need to be updated.


Waterfall Classification is a classification technique in Grooper that prioritizes training similarity over the classification "rules" set by a Document Type's Positive Extractor. It can help in scenarios where Batch Folders get misclassified and simply retraining won't fix the problem.


You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

ABOUT

Normally with classification, you can train a Document, set up a Positive Extractor for maximum accuracy, classify, get good results, and call it done. But what happens when high accuracy and specificity do more harm than good? For example, what if, due to the type of Positive Extractor being used, a Document gets classified as the wrong Document Type? You could always change the extractor, but that could take a long time and could create other classification problems. Instead, you can have your extractor act as a safety net that classifies the extracted data in a more general manner. This is the concept known as Waterfall Classification: moving the Positive Extractor down the curve of the waterfall, away from high specificity and accuracy and towards something somewhat more generic. Not completely generic, just generic enough that training similarity can produce the classification results we want.

The Problem

In the following example, we'll take a look at a series of documents that were classified incorrectly. Three documents that should have been classified as Title Opinions were instead classified as Generic Letters. Nothing was wrong with the chosen method (Lexical); the Positive Extractor was simply configured a little too exactly. So, let's see what can be done to classify the documents correctly.

  1. Here, we have a set of Documents that have been classified incorrectly. They were supposed to have been classified as Title Opinions, as per the folder containing them. However, Grooper has classified them as Generic Letter Document Types.
  2. Looking down at the Similarity Scores, we see that Generic Letter scored higher than Title Opinion: Generic Letter came in at 100% and Title Opinion at 68%. Because of its higher score and the significant gap, Generic Letter won, and it is the Document Type these documents were erroneously classified as.
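
The comparison in step 2 can be sketched in a few lines of Python. This is only an illustration of "highest Similarity Score wins" using the two scores reported above; it is not Grooper code, and the function name `pick_document_type` is hypothetical.

```python
# Illustrative sketch only: not Grooper code or its API.
# Similarity Scores for one misclassified document, taken from the example above.
scores = {"Generic Letter": 1.00, "Title Opinion": 0.68}


def pick_document_type(scores):
    """Return the Document Type name with the highest similarity score.

    Hypothetical helper: it models only 'highest score wins'.
    """
    return max(scores, key=scores.get)


winner = pick_document_type(scores)
print(winner)  # Generic Letter
```

With Generic Letter at 100%, no amount of extra training on Title Opinion can close the gap, which is why retraining alone does not fix this misclassification.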



  1. Normally, when anything goes wrong in classification, it's best to check the Document Type.
  2. Looking at the Document Type, we see that our Positive Extractor is using a Data Type as a reference. Specifically, it references the "CLAS-POS AND - Letter" Data Type.
  3. So, let's follow the trail and look at the Data Type to see what's going wrong with classification and what we can do about Generic Letter beating out Title Opinion.

The Solution: Confidence Override & Creating the Waterfall

How does one perform Waterfall Classification? It isn't exactly a method you can select and let Grooper do all the work. Waterfall Classification is more of a concept, a name given to a technique you can use when high accuracy alone cannot do a good job of classifying documents. What does that mean? Recalling the graphic from the beginning of the article, our goal is to strike a middle ground between the ultimate accuracy and specificity of the configured Positive Extractor and something more generic.

  1. Nothing's wrong with our 'Collation Type', so what to do? Let's see if we can't control the output of our similarity scores and force the correct classification for Title Opinions. To do this, you'll want to open up the Result Options property.
  2. Once the Result Options window pops up, select the Confidence Override property. By default, it's set to 0%, since normally we don't need to control the output for classification in Grooper. However, because the wrong Document Type is scoring higher, we can fix the value at 60%, meaning that when Grooper runs classification, Generic Letter will not be scored higher than 60% for similarity.
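
The effect of the override can be modeled with a small Python sketch. It is not Grooper's implementation: the function names are hypothetical, the override is modeled as a cap (following the description that Generic Letter "will not be scored higher than 60%"), and treating the default of 0% as "no override" is an assumption made for this illustration.

```python
# Illustrative sketch only: not Grooper code or its API.

def apply_confidence_override(score, override):
    """Cap a score at the override value.

    Assumption: an override of 0 (the default) means 'leave the score alone'.
    """
    if override <= 0:
        return score
    return min(score, override)


def classify(raw_scores, overrides):
    """Apply per-type overrides, then pick the highest adjusted score."""
    adjusted = {
        name: apply_confidence_override(score, overrides.get(name, 0.0))
        for name, score in raw_scores.items()
    }
    winner = max(adjusted, key=adjusted.get)
    return winner, adjusted


# Scores for a misclassified Title Opinion, from the example in this article.
raw = {"Generic Letter": 1.00, "Title Opinion": 0.68}

print(classify(raw, {})[0])                        # Generic Letter
print(classify(raw, {"Generic Letter": 0.60})[0])  # Title Opinion
```

With the cap in place, the extractor still contributes a score for Generic Letter, but it can no longer outvote a reasonably strong training similarity for another Document Type.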

The Result

With the Confidence Override set, it's now time to reclassify the documents and correct the misclassification.

  1. And look at that! What were once documents being misclassified as Generic Letters are now being classified correctly as Title Opinions.
  2. As you can see, adjusting the Confidence Override on the Generic Letter Document Type has allowed the Similarity Score of the Title Opinion Document Type to surpass that of the Generic Letter, resulting in the correct classification.



  1. The Generic Letters, meanwhile, are still being classified correctly, even with the Confidence Override configured.
  2. This is because, even fixed at 60%, Generic Letter is still the highest Similarity Score for these documents. Thus, their classification remains correct and unchanged.
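
Why 60% works for both sets of documents can be checked with a short Python sketch. It is an illustration, not Grooper code: the function name is hypothetical, and the 35% competitor score for a true Generic Letter is a made-up value, since the article does not report the runner-up score for those documents.

```python
# Illustrative sketch only: not Grooper code or its API.

def still_wins_after_cap(raw_score, cap, competitor_scores):
    """Return True if a capped score still beats every competing score."""
    capped = min(raw_score, cap)
    return all(capped > other for other in competitor_scores)


# Misclassified Title Opinion: Title Opinion's 68% (from the article)
# now beats Generic Letter's capped 60%, so Generic Letter loses.
print(still_wins_after_cap(1.00, 0.60, [0.68]))  # False

# True Generic Letter: the 35% competitor score is hypothetical.
# Anything below 60% leaves Generic Letter on top.
print(still_wins_after_cap(1.00, 0.60, [0.35]))  # True
```

In other words, the override value has to sit below the training similarity of the documents you want to rescue (68% here) and above the runner-up score on the documents you want to leave alone.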



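To summarize, we had an issue where the high accuracy of the Positive Extractor configured on the Document Type was working to our disadvantage. In this case, utmost accuracy was not our friend. Setting the Confidence Override to 60% let the Title Opinions be classified correctly while the Generic Letters kept their correct classification.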