Mixed Classification (Concept)

From Grooper Wiki

This article was migrated from an older version and has not been updated for the current version of Grooper.

This tag will be removed upon article review and update.

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025 2023.1

"Mixed Classification" refers to combining a Classification Method with "rules" defined on a Document Type to overcome the shortcomings of an individual method.


You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

ABOUT

When classifying a Batch with varying types of Documents, you may need to rely on more than just the Classification Method selected on the Content Model. This is common with training-based Classification Methods (like the Lexical method). Grooper classifies Documents based on a confidence value called a "Similarity Score". If "Document Type A" scores higher than "Document Type B", then "Document Type A" wins, regardless of whether that classification is actually correct.
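The "highest score wins" behavior described above can be sketched in a few lines of Python. This is only a simplified model, not Grooper's implementation: the function name `classify_by_similarity` and the score values are hypothetical, chosen purely for illustration.

```python
from typing import Dict


def classify_by_similarity(scores: Dict[str, float]) -> str:
    """Return the Document Type name with the highest Similarity Score."""
    if not scores:
        raise ValueError("no Document Types were scored")
    # The winner is simply the top score; nothing checks whether it is correct.
    return max(scores, key=scores.get)


# Hypothetical Similarity Scores for a single document (1.0 = 100%).
example_scores = {"Document Type A": 0.72, "Document Type B": 0.68}
print(classify_by_similarity(example_scores))
```

Note that the function has no notion of a "right" answer. If the wrong Document Type happens to score even slightly higher, it is the one returned.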

Thankfully, you can configure a rules-based extractor on a particular Document Type to work in tandem with the training-based Classification Method. This "Mixed Classification" approach lets the trained method classify the Batch as a whole while avoiding false positives: Documents that would have been misclassified are assigned the proper Document Type via that Document Type's Positive Extractor.

The Problem

In this example, we have a Batch of documents that were classified using the Lexical Classification Method. Everything was classified properly, except for one document in the "Assignment" folder.

  1. Here, we have a document that should have been classified as the "Assignment" Document Type but was misclassified as the "Memo" Document Type.
  2. This is because the Similarity Score for Memo came in at 61%, higher than the Assignment score of 55%. Since the incorrect Document Type scored higher, the document was classified as a Memo.
  3. We could train the document as the "Assignment" Document Type and reclassify everything, but the only real "Assignment" part of the document is the small, highlighted portion, so training on it could cause issues for classification overall.
    • In fact, since this particular document bears a striking resemblance to a memo, training on it might also degrade classification of the memos. So training and reclassifying is not a good option here.

The Solution: Mixing Classification

So, we have an Assignment that was misclassified as a Memo. Training on the document and reclassifying could cause more issues in the long run, so what's to be done? This is where the concept of Mixed Classification comes in. We'll configure the Positive Extractor on the Document Type and have it work in tandem with the Classification Method so that every Document is properly classified.

  1. Select the Document Type which the problem Document was supposed to be classified as. In our case, that would be the "Assignment" Document Type.
  2. Remain on the Document Type tab.
  3. In addition to the Classification Method chosen on the Content Model, the Document Type has its own Classification section. Here, you can configure both Positive and Negative Extractors. A Positive Extractor identifies data that, when found, is used to classify a document as this Document Type. A Negative Extractor excludes text data from classification. Here, we'll be configuring the Positive Extractor. Select the hamburger icon at the far right of the property to expand the drop-down menu.
  4. Select List Match.




  1. In Local Entries, enter the title of the Document, "ASSIGNMENT OF OIL AND GAS LEASE"
  2. Since the title wraps onto a second line, we have Vertical Wrap enabled on the Properties tab. This is what allows the full title to be extracted. Without it, the title would not be picked up.
  3. With that done, click OK.
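To see why a wrapped title needs special handling, consider the rough analogy below. It is not how Grooper's List Match or its Vertical Wrap property is implemented. The helper names (`normalize`, `matches_line_by_line`, `matches_across_lines`) and the sample page text are hypothetical. The sketch only shows that a list entry compared against one line at a time never sees the whole title, while a comparison that lets the text continue onto the next line does.

```python
import re


def normalize(text: str) -> str:
    """Collapse every run of whitespace, including line breaks, to one space."""
    return re.sub(r"\s+", " ", text).strip()


def matches_line_by_line(page_text: str, entry: str) -> bool:
    """Look for the entry inside each individual line (no wrapping allowed)."""
    target = normalize(entry)
    return any(target in normalize(line) for line in page_text.splitlines())


def matches_across_lines(page_text: str, entry: str) -> bool:
    """Look for the entry while letting it continue onto following lines."""
    return normalize(entry) in normalize(page_text)


# Hypothetical page text in which the title is split over two lines.
page = "ASSIGNMENT OF OIL\nAND GAS LEASE\n\nBody text of the document."
entry = "ASSIGNMENT OF OIL AND GAS LEASE"

print(matches_line_by_line(page, entry))
print(matches_across_lines(page, entry))
```

In this sketch the line-by-line check fails because neither line contains the complete title, which mirrors the situation where the title is not picked up without Vertical Wrap.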



With the Positive Extractor configured, save and go back to the Classify Batch Process Step.

The Result

Now, we'll re-test classification on the contents of the Assignment folder, and our misclassified document will be corrected.

  1. Voila! Thanks to the configurations on the Positive Extractor, our Document has now been properly classified as an "Assignment" Document Type.



As you can see, by combining the chosen Classification Method with the Positive Extractor configured on the Document Type, we can properly classify problematic documents that neither the Classification Method nor the Positive Extractor could handle alone. By mixing the rules-based Positive Extractor with the training-based Lexical Classification Method, we have ensured that each document is assigned the correct Document Type.
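The overall idea can be summarized in a short Python sketch using the scores from this example (Memo 61%, Assignment 55%). It rests on a loud assumption: that a Positive Extractor match simply takes precedence over the trained Similarity Scores. Grooper's actual scoring logic may differ, and the names `mixed_classify` and `assignment_title_rule`, as well as the sample texts, are hypothetical.

```python
from typing import Callable, Dict


def assignment_title_rule(text: str) -> bool:
    """Hypothetical rule: the Assignment title appears, even if it wraps."""
    return "ASSIGNMENT OF OIL AND GAS LEASE" in " ".join(text.split())


def mixed_classify(
    text: str,
    scores: Dict[str, float],
    positive_rules: Dict[str, Callable[[str], bool]],
) -> str:
    """Apply rules first, then fall back on the trained scores.

    Assumption for illustration: the first Document Type whose rule matches
    wins outright. If no rule matches, the highest Similarity Score wins.
    """
    for doc_type, rule in positive_rules.items():
        if rule(text):
            return doc_type
    return max(scores, key=scores.get)


# Similarity Scores from the example: the wrong type scores higher.
scores = {"Memo": 0.61, "Assignment": 0.55}
rules = {"Assignment": assignment_title_rule}

problem_doc = "ASSIGNMENT OF OIL\nAND GAS LEASE\n\nBody text of the document."
ordinary_memo = "MEMORANDUM\nTo: all staff"

print(mixed_classify(problem_doc, scores, rules))
print(mixed_classify(ordinary_memo, scores, rules))
```

Documents that no rule matches still go through the trained method unchanged, so the rule only affects the problem document and leaves the rest of the Batch alone.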