2023.1:Waterfall Classification (Concept): Difference between revisions

From Grooper Wiki

Revision as of 10:24, 12 April 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Waterfall Classification is a classification concept in Grooper that manipulates the Positive Extractor property to prioritize training similarity, aiming for a middle ground between high specificity with high accuracy and broad generality with minimal accuracy. This is helpful when Documents are misclassified and simply retraining won't help.

ABOUT

Normally with classification, you can train a Document, set up a Positive Extractor for maximum accuracy, classify, get good results, and call it done. But what happens when high accuracy and specificity do more harm than good? For example, what if, because of the type of Positive Extractor being used, a Document gets classified as the wrong Document Type? You could always change the extractor, but there is no telling how long that would take or what other classification problems it could create. Instead, you can have your Extractor act as a safety net that classifies the extracted data in a more general manner. This is the concept known as Waterfall Classification: moving the Positive Extractor along the downward curve of the waterfall, away from high specificity and accuracy and toward something somewhat more generic. Not completely generic, just generic enough that training similarity produces the classification results we want.

STARTING THE WATERFALL

How do you set up Waterfall Classification? It isn't a method you can select and let Grooper do all the work. Waterfall Classification is more of a concept, a name given to a technique you can use when high accuracy alone cannot do a good job of classifying documents.

  1. Here, we have a set of Documents that have been classified incorrectly. They were supposed to have been classified as Title Opinions, as per the folder containing them. However, Grooper has classified them as Generic Letter Document Types.
  2. Looking down at the Similarity Scores, we see that Generic Letter scored higher than Title Opinion: Generic Letter came in at 100% and Title Opinion at 68%. Because of the higher score and the significant gap, Generic Letter won, and it is the Document Type these documents were erroneously classified as.
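
The comparison in step 2 can be pictured as picking the highest-scoring candidate. The Python sketch below only illustrates that idea and is not Grooper's implementation: the function name `pick_document_type` and the dictionary layout are hypothetical, and the scores are the ones quoted in the step above.

```python
def pick_document_type(similarity_scores):
    """Return the Document Type name with the highest similarity score.

    Hypothetical helper: it models "highest score wins" and nothing else.
    """
    return max(similarity_scores, key=similarity_scores.get)


# Similarity scores from the example, written as fractions of 1.
scores = {"Generic Letter": 1.00, "Title Opinion": 0.68}

winner = pick_document_type(scores)
print(winner)
```

With these numbers the sketch selects Generic Letter, which matches the misclassification described above.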



  1. Normally, when anything goes wrong in Classification, it's best to check the Document Type.
  2. Looking at the Document Type, we see that our Positive Extractor is using a Data Type as a reference, specifically the CLAS-POS AND - Letter Data Type.
  3. So, let's follow the trail and look at the Data Type to see what's going wrong with Classification and what we can do about Generic Letter beating out Title Opinion.



  1. Nothing's wrong with our Collation Type, so what can we do? Let's see if we can control the output of our similarity scores and force the correct classification for Title Opinions. To do this, open the Result Options property.
  2. Once the Result Options window pops up, select the Confidence Override property. By default, it's set to 0% because we normally don't need to control the output for classification in Grooper. However, since the wrong Document Type is being scored higher, we can set it to 60%, meaning that when Grooper runs classification, Generic Letter will not be scored higher than 60% for similarity. That is below the 68% that Title Opinion scored, so Title Opinion should now come out ahead.
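
To see why 60% is enough, here is a toy Python model of the arithmetic. It assumes, following the description in step 2, that the Confidence Override acts as an upper limit on the extractor's score and that 0% means no override. The names `apply_confidence_override`, `raw_scores`, and `overrides` are hypothetical and do not correspond to Grooper's internals.

```python
def apply_confidence_override(score, override):
    """Limit a score to the override value.

    Assumption: an override of 0 means "no override", mirroring the
    default of 0% described in the step above.
    """
    if override <= 0:
        return score
    return min(score, override)


# Scores before the override, taken from the example in this article.
raw_scores = {"Generic Letter": 1.00, "Title Opinion": 0.68}

# Only the extractor behind Generic Letter gets an override (60%).
overrides = {"Generic Letter": 0.60}

adjusted = {
    name: apply_confidence_override(score, overrides.get(name, 0.0))
    for name, score in raw_scores.items()
}

# Highest adjusted score wins.
new_winner = max(adjusted, key=adjusted.get)
print(adjusted, new_winner)
```

In this model, Generic Letter drops to 0.60 while Title Opinion keeps its 0.68, so Title Opinion becomes the top candidate. Any override below 68% would have the same effect on these two scores.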



  1. With the Confidence Override set, select the Classify Activity Step within the Batch Process.
  2. Go to the Activity Tester tab.
  3. Select the misclassified Documents.
  4. Click the play button.