2023.1:Mixed Classification (Concept)

Latest revision as of 16:33, 12 May 2025

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


"Mixed Classification" refers to leveraging a Classify Method together with "rules" defined on a Document Type to overcome the shortcomings of either method used alone.


You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

ABOUT

When classifying a Batch with varying types of Documents, you may need to rely on more than just the Classify Method selected for the Content Model. This is common with training-based Classify Methods (like the Lexical method). Grooper classifies Documents based on a confidence variable called a "Similarity Score". If "Document Type A" scores higher than "Document Type B", then "Document Type A" wins, regardless of whether that classification is actually correct.

Thankfully, you can configure a "rules-based" extractor on a particular Document Type to work in tandem with the training-based Classify Method. This "Mixed Classification" approach ensures that the Batch as a whole is still classified while false positives are avoided. Documents that would have been misclassified are assigned the proper Document Type via that Document Type's Positive Extractor.

The Problem

In this example, we have a Batch of documents that were classified using the Lexical method. Everything was classified properly, except for one document in the "Assignment" folder.

  1. Here, we have a document that was supposed to be classified as an 'Assignment' Document Type, but has been misclassified as a 'Memo' Document Type.
  2. This is because the Similarity Score for Memo came in at 61%, which is higher than the Assignment score of only 55%. Since the incorrect Document Type scored higher, this document was classified as a Memo.
  3. We could train the Document as the "Assignment" Document Type and reclassify everything once more. However, the only real "Assignment" part of the document is the small, highlighted portion, so training on it could cause issues for classification overall.
    • In fact, since this particular document bears a striking resemblance to a memo, training on it might interfere with classification of the memos as well. So, training and reclassification are out of the question.
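
The "highest score wins" behavior described above can be illustrated with a minimal Python sketch. This is not Grooper's implementation; the function name classify_by_score and the dictionary layout are assumptions made purely for illustration. Only the two scores (61% and 55%) come from the example.

```python
def classify_by_score(scores: dict[str, float]) -> str:
    """Return the Document Type with the highest similarity score.

    Hypothetical helper: it models only the 'highest score wins' rule,
    not how Grooper actually computes similarity.
    """
    return max(scores, key=scores.get)


# Scores from the example: Memo 61%, Assignment 55%.
scores = {"Memo": 0.61, "Assignment": 0.55}
print(classify_by_score(scores))  # prints "Memo", even though the document is an Assignment
```

Nothing in this rule checks whether the winner is correct, which is why a score-only approach misclassifies this document.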

The Solution: Mixing Classification

So, we have an Assignment that was misclassified as a Memo. Training on the document and reclassifying could cause more issues in the long run, so what's to be done? This is where the concept of Mixing Classification comes in. We'll configure the Positive Extractor on the Document Type and have it work in tandem with the Classify Method so that every Document is properly classified.

  1. Select the Document Type which the problem Document was supposed to be classified as. In our case, that would be the "Assignment" Document Type.
  2. Remain on the Document Type tab.
  3. In addition to the Classify Method chosen for the Content Model, the Document Type has its own Classification section, where you can configure both a Positive Extractor and a Negative Extractor. A Positive Extractor identifies data that, when found, classifies a document as this Document Type. A Negative Extractor excludes text data from classification. Here, we'll be configuring the Positive Extractor. Select the hamburger icon at the far right of the property to expand the drop-down menu.
  4. Select List Match.




  1. In Local Entries, enter the title of the Document, "ASSIGNMENT OF OIL AND GAS LEASE".
  2. Since the title wraps onto more than one line, we have Vertical Wrap enabled on the Properties tab. This is what guarantees the title will be extracted. Otherwise, it would not be picked up.
  3. With that done, click OK.
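
To see why a wrapped title needs special handling, here is a rough Python analogy. It is not how Grooper's List Match or Vertical Wrap is implemented; the functions, the sample page text, and the whitespace-normalization approach are all assumptions chosen only to show the idea that a line-by-line search misses an entry split across lines.

```python
import re

ENTRY = "ASSIGNMENT OF OIL AND GAS LEASE"

# Hypothetical page text in which the title breaks across two lines.
PAGE = "ASSIGNMENT OF OIL\nAND GAS LEASE\n\n(body text of the document)"


def matches_line_by_line(page: str, entry: str) -> bool:
    """Look for the entry within each individual line only."""
    return any(entry in line for line in page.splitlines())


def matches_with_wrap(page: str, entry: str) -> bool:
    """Treat line breaks as ordinary spaces before searching."""
    flattened = re.sub(r"\s+", " ", page)
    return entry in flattened


print(matches_line_by_line(PAGE, ENTRY))  # False: the title is split across lines
print(matches_with_wrap(PAGE, ENTRY))     # True: the wrapped title is found
```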



With the Positive Extractor configured, save and go back to the Classify Batch Process Step.

The Result

Now, we'll re-test classification on the contents of the Assignment folder, and our misclassified document will be corrected.

  1. Voila! Thanks to the configurations on the Positive Extractor, our Document has now been properly classified as an "Assignment" Document Type.



As you can see, by combining the chosen Classify Method with the Positive Extractor configured on the Document Type, we can properly classify problematic documents that neither the Classify Method nor the Positive Extractor could handle alone. By mixing the rules-based Positive Extractor with the training-based Lexical method, we have ensured that each document is assigned the correct Document Type.
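
The combined behavior can be summarized in a short Python sketch. It assumes that a Positive Extractor match simply takes precedence over the trained scores; that precedence rule, the function names, and the sample text are illustrative assumptions, not a description of Grooper's internals.

```python
import re
from typing import Callable


def title_rule(text: str) -> bool:
    """Hypothetical stand-in for the 'Assignment' Positive Extractor."""
    flattened = re.sub(r"\s+", " ", text)
    return "ASSIGNMENT OF OIL AND GAS LEASE" in flattened


def classify_mixed(
    text: str,
    scores: dict[str, float],
    positive_rules: dict[str, Callable[[str], bool]],
) -> str:
    """Rules first, trained scores second (assumed precedence)."""
    for doc_type, rule in positive_rules.items():
        if rule(text):
            return doc_type  # a rule hit decides the Document Type
    return max(scores, key=scores.get)  # otherwise the highest score wins


scores = {"Memo": 0.61, "Assignment": 0.55}  # scores from the example
rules = {"Assignment": title_rule}

wrapped_title = "ASSIGNMENT OF OIL\nAND GAS LEASE\n(body text)"
print(classify_mixed(wrapped_title, scores, rules))     # prints "Assignment"
print(classify_mixed("(an ordinary memo)", scores, rules))  # prints "Memo"
```

Documents without the rule's title still fall through to the trained scores, so the rest of the Batch is classified exactly as before.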