2023:Confidence Multiplier and Output Confidence (Property): Difference between revisions

From Grooper Wiki
 
(17 intermediate revisions by 3 users not shown)
{{AutoVersion}}
[[File:weighted_rules_00.png|right|thumb|250px|Graphic depicting the notion of Weighted Rules.]]


<blockquote style="font-size:14pt">
<blockquote>{{#lst:Glossary|Confidence Multiplier and Output Confidence}}</blockquote>
Some results carry more weight than others.
 
</blockquote>
{|class="download-box"
|
[[File:Asset 22@4x.png]]
|
You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023).  The first contains a '''Project''' with resources used in examples throughout this article.  The second contains one or more '''Batches''' of sample documents.
* [[Media:2023 Wiki Confidence-Multiplier-and-Output Confidence Project.zip]]
* [[Media:2023 Wiki Confidence-Multiplier-and-Output Confidence Batch.zip]]
|}


==About==
For example, a value of ''75%'' will change the confidence of output results to 75%.  If the output confidence was 100%, now it will be 75%.  If the output confidence was 50%, now it will be 75%.  If it was 75%, it will now be (you guessed it) 75%.  It doesn't matter what the original confidence was, it will be transformed to the '''''Output Confidence''''' value.


==How To==
==Waterfall Classification==
 
===Waterfall Classification===
When the '''''Classification Method''''' property on a '''Content Model''' is set to ''Lexical'' or ''Rules-Based'', you can set up ''Positive Extractors'' on '''Document Types'''.  If a ''Positive Extractor'' returns a result above the '''''Minimum Similarity''''' set on the '''Content Model''', the document is assigned that '''Document Type''' during classification. By default, an extractor returns its result at 100% confidence (unless '''[[Fuzzy RegEx]]''' is used to return the result, in which case the confidence is affected by the fuzzy algorithm). Because of this, positive extractors are almost certain to score above the '''''Minimum Similarity'''''.




====Example====
A base '''Content Model''', '''[[Batch]]''', and '''[[Batch Process]]''' for use with this section can be found '''[[Media:Waterfall_Classification.zip|here]]'''. It is not required to download to understand this section, but can be helpful because it can be used to follow along with the content of this section. ''This file was exported from and meant for use in Grooper 2.9''
In the example below, we are going to use the '''Project''' and '''Batch''' that accompany the Document Classification 2023 course on Grooper University.  
{| class="wikitable"
{|cellpadding=10 cellspacing=5
| style="padding:25px; width:35%" |
| style="vertical-align:top; width:40%" |
# Here selected is the '''Content Model''' for this setup.
 
# The '''''Classification Method''''' property is set to ''Lexical'', which allows for TF-IDF training as well as a ''Rules-Based'' approach.
# Right now we're looking at how documents are currently being classified by working in a Classify '''Batch Process Step'''.  
#* Notice, too, its '''''Minimum Similarity''''' property is at its default ''60%''.
# We see that this Title Opinion is being misclassified as a Generic Letter.  
|| [[Image:Waterfall_classification_01.png|center]]
# Notice that the document has a similarity score of 100% for the Generic Letter '''Document Type''' and a 68% score for the Title Opinion '''Document Type'''.  
|
[[File:2023 Confidence Multiplier and Output Confidence - 2023 01 How To 01 Waterfall Classification 01.png|center]]
|-
|-
| style="padding:25px; width:35%" |
|valign=top|
# <li value=3> This '''Data Type''' is the extractor supplying the '''Content Model's''' '''''Feature Extractor''''' property, which is what is used to train the '''Form Types''' of the '''Title Opinion''' '''Document Type'''.</li>
# If we go to the '''Content Model'''...
|| [[Image:Waterfall_classification_02.png|center]]
# We can see that our '''''Minimum Similarity''''' property is set to 55%.
#* Both the Generic Letter and the Title Opinion '''Document Types''' came in above the '''''Minimum Similarity''''' percentage, but the Generic Letter won out with the higher percentage.  
|
[[File:2023 Confidence Multiplier and Output Confidence - 2023 01 How To 01 Waterfall Classification 02.png|center]]
|-
|-
| style="padding:25px; width:35%" |
|valign=top|
# <li value=4> This '''Data Type''' is the extractor supplying the '''Generic Letter''' '''Document Type's''' positive Extractor property.</li>
# Let's look at the Generic Letter '''Document Type'''.  
# It is using the (new to Grooper 2.9) ''AND'' '''[[Collation Method]]'''.
# The '''''Positive Extractor''''' is set to a reference.  
#* Think of this as somewhat between an ''Array'' and an ''Ordered Array''. All extraction results need to be present, but not in a specific order.
|
# In the '''Result Options''' window opened from the '''''Result Options''''' property, the '''''Output Confidence''''' property is set to 60%, so all results returned from this extractor will be returned with a confidence of 60%.
[[File:2023 Confidence Multiplier and Output Confidence - 2023 01 How To 01 Waterfall Classification 03.png|center]]
|| [[Image:Waterfall_classification_03.png|center]]
|-
|-
| style="padding:25px; width:35%" |
|valign=top|
# <li value=7> This '''Batch Process Step''' is configured for ''ESP Auto Separation'' at the ''Batch'' '''''Scope''''' and pointed at the aforementioned '''Content Model'''. The '''Batch''' supplied with the zip file is the one being observed.</li>
# Let's look at the extractor that is being referenced.
# In the highlighted example, with the '''Pages''' classified and the '''Preview''' button enabled, you can see that the '''Title Opinion''' '''Document Type''' similarity for page one is 89%, and the '''Generic Letter''' is coming in at the enforced 60%.
# We're going to scroll down to the "OUTPUT" section in the '''Data Type''' "Properties" tab, and click the ellipsis button next to '''''Result Options'''''.
#* Were this a previous version of '''Grooper''' without this functionality, the "Rule" would have come in at the default 100%, which is obviously higher than the 89%, because confidences could not previously be manipulated. This would have resulted in the false classification of the '''Title Opinion''' as a '''Generic Letter'''.
|
# The '''Positive Extractor''' of the '''Generic Letter''' allows the accurate classification of the letters given the "rule" is at or above the '''Content Model's''' '''''Minimum Similarity''''' property.
[[File:2023 Confidence Multiplier and Output Confidence - 2023 01 How To 01 Waterfall Classification 04.png|center]]
|| [[Image:Waterfall_classification_04.png|center]]
|-
|valign=top|
# When the "Result Options" window pops up, we see that by default the '''''Confidence Override''''' is set to 0%.  
# If we set this property to anything other than 0%, when a document is classified, whatever '''Document Type''' is using this extractor will have a similarity score no higher than that number.
|
[[File:2023 Confidence Multiplier and Output Confidence - 2023 01 How To 01 Waterfall Classification 05.png|center]]
|-
|valign=top|
# We're going to set the '''''Confidence Override''''' to 60%.
# Click "OK" to apply the new settings.
|
[[File:2023 Confidence Multiplier and Output Confidence - 2023 01 How To 01 Waterfall Classification 06.png|center]]
|-
|valign=top|
# With our settings updated, let's go back to the Classify '''Batch Process Step'''.
# On the "Classification Tester" tab we have reclassified the documents.
# Notice that the Title Opinion document is now being classified appropriately.
# The Title Opinion '''Document Type''' is still coming in at 68%. However, the Generic Letter '''Document Type''' is returning with a 60% similarity score due to the '''''Confidence Override''''' property we set.  
|
[[File:2023 Confidence Multiplier and Output Confidence - 2023 01 How To 01 Waterfall Classification 07.png|center]]
|-
|valign=top|
# This Generic Letter is still being classified as a Generic Letter '''Document Type'''.
# We see that although the Generic Letter '''Document Type''' has a 60% similarity score, it is still higher than the '''''Minimum Similarity''''' score of 55%, and it is also higher than the score of any other '''Document Type'''.
|
[[File:2023 Confidence Multiplier and Output Confidence - 2023 01 How To 01 Waterfall Classification 08.png|center]]
|}
|}
==Version Differences==
Prior to '''Grooper''' 2.9 the '''''Confidence Multiplier''''' property did not exist.
[[Category:Articles]]
[[Category:Version 2.90]]

Latest revision as of 12:49, 21 November 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

Graphic depicting the notion of Weighted Rules.

Some results carry more weight than others. The Confidence Multiplier and Output Confidence properties allow you to manually adjust an extraction result's confidence.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains a Project with resources used in examples throughout this article. The second contains one or more Batches of sample documents.

About

The Confidence Multiplier and Output Confidence properties of Data Type and Data Format extractors allow you to manually alter the confidence score of returned values.

Use of these properties is sometimes referred to as weighted rules. Its practical application allows a user to increase or decrease the confidence score of an extractor's result (or set its confidence to an assigned value). This changes the confidence of the extractor's results, making them appear more (or less) favorable. When used in combination with the Order By property set to Confidence on a parent Data Type, you can manipulate which child extractor's result the parent prioritizes.
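The prioritization idea can be illustrated with a minimal Python sketch. This is not Grooper code: the Result class, the order_by_confidence function, and the child extractor names are hypothetical, and the sketch simply assumes that ordering by confidence means the highest-confidence result comes first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    source: str        # hypothetical name of the child extractor
    value: str         # the text the extractor returned
    confidence: float  # 1.0 means 100%


def order_by_confidence(results: List[Result]) -> List[Result]:
    """Return results sorted so the highest confidence comes first."""
    return sorted(results, key=lambda r: r.confidence, reverse=True)


# Child A matched at 100% but carries a multiplier of 0.5 (assumed),
# so its adjusted confidence drops below Child B's unadjusted 80%.
results = [
    Result("Child A", "12/01/2023", 1.0 * 0.5),
    Result("Child B", "2023-12-01", 0.8),
]

ranked = order_by_confidence(results)
print([r.source for r in ranked])  # ['Child B', 'Child A']
```

Lowering one child's confidence is enough to change which result sorts to the top, which is the "weighting" effect described above.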

General Usage - Confidence Multiplier

To modify the Confidence Multiplier property of a Data Type or Data Format, click the ellipsis button on the Result Options property, which opens the Result Options submenu.

The Confidence Multiplier property defaults to 1 and can be changed in this submenu. The field is a double and takes floating point values.

For example, a value of 0.5 will multiply the confidence of output results by 0.5. If the output confidence was 100%, now it will be 50%. Similarly, you can increase the confidence, even above 100%. If the Confidence Multiplier property is set to 3, and an output result had a 50% confidence, it will now display as 150% confidence.
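The multiplication itself can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Grooper's implementation: the function name is hypothetical, confidences are written as fractions (0.5 for 50%), and how Grooper displays a product above 100% is not modeled here.

```python
def apply_confidence_multiplier(confidence: float, multiplier: float = 1.0) -> float:
    """Scale a confidence score by a multiplier.

    The default multiplier of 1 leaves the confidence unchanged,
    mirroring the property's default value.
    """
    return confidence * multiplier


print(apply_confidence_multiplier(1.0, 0.5))  # 0.5 (100% becomes 50%)
print(apply_confidence_multiplier(0.5, 3))    # 1.5 (50% times 3)
print(apply_confidence_multiplier(0.8))       # 0.8 (default multiplier of 1)
```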

General Usage - Output Confidence

To modify the Output Confidence property of a Data Type or Data Format, likewise click the ellipsis button on the Result Options property, which opens the Result Options submenu.

The Output Confidence property defaults to 0% and can be changed in this submenu. The default of 0% will not alter the results' confidence scores. Changing this number will override the result's original confidence, whatever it is, and replace it with this value.

For example, a value of 75% will change the confidence of output results to 75%. If the output confidence was 100%, now it will be 75%. If the output confidence was 50%, now it will be 75%. If it was 75%, it will now be (you guessed it) 75%. It doesn't matter what the original confidence was, it will be transformed to the Output Confidence value.
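The override behavior can be sketched the same way. Again, this is an illustrative model with a hypothetical function name, not Grooper's code. It assumes confidences are fractions (0.75 for 75%) and that a setting of 0 means "no override", as described above.

```python
def apply_output_confidence(confidence: float, output_confidence: float = 0.0) -> float:
    """Replace a result's confidence with a fixed value.

    A setting of 0 (the default) leaves the original confidence alone;
    any other setting replaces the original confidence outright.
    """
    if output_confidence == 0.0:
        return confidence
    return output_confidence


# Every original confidence ends up at 75% once the override is set.
for original in (1.0, 0.5, 0.75):
    print(original, "->", apply_output_confidence(original, 0.75))

# With the default of 0, the original confidence passes through.
print(apply_output_confidence(0.5))  # 0.5
```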

Waterfall Classification

When the Classification Method property on a Content Model is set to Lexical or Rules-Based, you can set up Positive Extractors on Document Types. If a Positive Extractor returns a result above the Minimum Similarity set on the Content Model, the document is assigned that Document Type during classification. By default, an extractor returns its result at 100% confidence (unless Fuzzy RegEx is used to return the result, in which case the confidence is affected by the fuzzy algorithm). Because of this, positive extractors are almost certain to score above the Minimum Similarity.

This extractor could be a "Waterfall Extractor", taking advantage of the Waterfall Extraction technique. However, for classification, the system is just looking for some result to be returned above the Minimum Similarity confidence threshold.

In the Waterfall Classification method, the Minimum Confidence property can be set in the Result Filter property window of a Data Type, which will eliminate any results below that confidence. This may eliminate the results of some referenced extractors that technically matched, but at a low confidence.

If we happen to know that those lower confidence hits are valid and should count for classifying the document, then the Confidence Multipliers on those referenced Data Types can be set to a higher value in order to make them hit the Minimum Confidence required.

Similarly, if higher confidence hits are inappropriately classifying documents and shouldn't be returned, the Confidence Multiplier property can be reduced so that those Data Types only exceed the Minimum Confidence when they are very high confidence.
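The interaction between a multiplier and the Minimum Confidence filter can be sketched as follows. The function name and the sample numbers are hypothetical, and the sketch assumes the multiplier is applied before the filter and that a result exactly at the threshold is kept.

```python
def passes_minimum(confidence: float, multiplier: float, minimum_confidence: float) -> bool:
    """Return True if the adjusted confidence survives the filter."""
    return confidence * multiplier >= minimum_confidence


MINIMUM = 0.7  # assumed Minimum Confidence of 70%

# A valid but low-confidence hit: dropped as is, kept once boosted.
print(passes_minimum(0.5, 1.0, MINIMUM))   # False
print(passes_minimum(0.5, 1.5, MINIMUM))   # True

# With a reduced multiplier, only a very high-confidence hit survives.
print(passes_minimum(0.75, 0.8, MINIMUM))  # False
print(passes_minimum(1.0, 0.8, MINIMUM))   # True
```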

Example

In the example below, we are going to use the Project and Batch that accompany the Document Classification 2023 course on Grooper University.

  1. Right now we're looking at how documents are currently being classified by working in a Classify Batch Process Step.
  2. We see that this Title Opinion is being misclassified as a Generic Letter.
  3. Notice that the document has a similarity score of 100% for the Generic Letter Document Type and a 68% score for the Title Opinion Document Type.
  1. If we go to the Content Model...
  2. We can see that our Minimum Similarity property is set to 55%.
    • Both the Generic Letter and the Title Opinion Document Types came in above the Minimum Similarity percentage, but the Generic Letter won out with the higher percentage.
  1. Let's look at the Generic Letter Document Type.
  2. The Positive Extractor is set to a reference.
  1. Let's look at the extractor that is being referenced.
  2. We're going to scroll down to the "OUTPUT" section in the Data Type "Properties" tab, and click the ellipsis button next to Result Options.
  1. When the "Result Options" window pops up, we see that by default the Confidence Override is set to 0%.
  2. If we set this property to anything other than 0%, when a document is classified, whatever Document Type is using this extractor will have a similarity score no higher than that number.
  1. We're going to set the Confidence Override to 60%.
  2. Click "OK" to apply the new settings.
  1. With our settings updated, let's go back to the Classify Batch Process Step.
  2. On the "Classification Tester" tab we have reclassified the documents.
  3. Notice that the Title Opinion document is now being classified appropriately.
  4. The Title Opinion Document Type is still coming in at 68%. However, the Generic Letter Document Type is returning with a 60% similarity score due to the Confidence Override property we set.
  1. This Generic Letter is still being classified as a Generic Letter Document Type.
  2. We see that although the Generic Letter Document Type has a 60% similarity score, it is still higher than the Minimum Similarity score of 55%, and it is also higher than the score of any other Document Type.
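
The outcome of this walkthrough can be reproduced with a small Python sketch. It is a simplified model rather than Grooper's classifier: the classify function is hypothetical, and it assumes the winning Document Type is the one with the highest similarity score at or above the Minimum Similarity. The scores are the ones shown in the steps above.

```python
from typing import Dict, Optional


def classify(scores: Dict[str, float], minimum_similarity: float) -> Optional[str]:
    """Pick the Document Type with the highest qualifying similarity score."""
    candidates = {name: s for name, s in scores.items() if s >= minimum_similarity}
    if not candidates:
        return None  # nothing reached the threshold
    return max(candidates, key=candidates.get)


MINIMUM_SIMILARITY = 0.55

# Title Opinion document before the change: the rule scores 100% and wins.
before = {"Generic Letter": 1.00, "Title Opinion": 0.68}
# Same document after the rule's confidence is fixed at 60%.
after = {"Generic Letter": 0.60, "Title Opinion": 0.68}
# A genuine Generic Letter, where only the rule produces a score.
letter = {"Generic Letter": 0.60}

print(classify(before, MINIMUM_SIMILARITY))  # Generic Letter
print(classify(after, MINIMUM_SIMILARITY))   # Title Opinion
print(classify(letter, MINIMUM_SIMILARITY))  # Generic Letter
```

Fixing the rule at 60% keeps it above the 55% threshold, so real letters still classify, while letting the trained 68% score win for the Title Opinion.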