2.90:Confidence Multiplier and Output Confidence (Property): Difference between revisions

From Grooper Wiki
{{AutoVersion}}
[[File:weighted_rules_00.png|right|thumb|250px|Graphic depicting the notion of Weighted Rules.]]
[[File:weighted_rules_00.png|right|thumb|250px|Graphic depicting the notion of Weighted Rules.]]


<blockquote style="font-size:14pt">
<blockquote>{{#lst:Glossary|Confidence Multiplier and Output Confidence}}</blockquote>
Some results carry more weight than others.
</blockquote>


==About==
==About==
''Weighted rules'' is an informal title given to the practical application of the '''''Confidence Multiplier''''' property of a '''Data Type'''. Its practical application allows a user to arbitrarily set the confidence of a result of a particular '''Data Type''' in order to allow that '''Data Type''' to appear more (or less) favorable to a parent '''Data Type''' that is leveraging the '''''Order By''''' property configured to the ''Confidence'' setting.
The '''''Confidence Multiplier''''' and '''''Output Confidence''''' properties of '''[[Data Type]]''' and '''[[Data Format]]''' extractors allow you to manually alter the confidence score of returned values.
 
Use of these properties is sometimes referred to as ''weighted rules''.  Its practical application allows a user to increase or decrease the confidence score of an extractor's result (or set its confidence to an assigned value).  This changes the confidence of the extractor's results, making them appear more (or less) favorable.  When used in combination with the '''''Order By''''' property set to ''Confidence'' on a parent '''Data Type''', you can manipulate which child extractor's result the parent prioritizes.
 
===General Usage - Confidence Multiplier===
Modifying the '''''Confidence Multiplier''''' property of a '''Data Type''' or '''Data Format''' is done by clicking on the ellipses in the '''''Result Options''''' property which opens the '''''Result Options''''' submenu.  


<br clear = all>
The '''''Confidence Multiplier''''' property defaults to ''1'' and can be changed in this submenu. The field is a double and takes floating point values. 


==Use Cases==
For example, a value of ''0.5'' will multiply the confidence of output results by 0.5. If the output confidence was 100%, now it will be 50%. Similarly, you can increase the confidence, even above 100%. If the '''''Confidence Multiplier''''' property is set to ''3'', and an output result had a 50% confidence, it will now display as 150% confidence.
''Weighted Rules'' can be used in cases where one is trying to find an element of data which can appear on many similar types of forms that do not have a consistent method to identify where the data is.<br/>
 
For Example, on different forms, the best method to pick up a piece of data may be a ''Key-Value Pair'', a '''Field Class''', a simple pattern match, a pattern match leveraging ''FuzzyRegEx'', or some other method.<br/>
===General Usage - Output Confidence===
One of the more recent methodologies for incorporating multiple extractors to be used by a single field has been coloquially referred to as ''Waterfall Extraction''. This is done by organizing myriad extractors (and their numerous configurations) under a parent '''Data Type'''. The '''''Order By''''' property of the parent '''Data Type''' can then be set to the following: ''Position'', ''Frequency'', ''Confidence'', ''Extractor'', ''Length'', ''Value''.<br/>
Modifying the '''''Output Confidence''''' property of a '''Data Type''' or '''Data Format''' is also done by clicking on the ellipses in the '''''Result Options''''' property which opens the '''''Result Options''''' submenu.
Setting '''Order By''' to ''Confidence'' may be an interesting way to organize results, but typically, properly configured extractors always return their results at 100%. The confidence of a returned result has, historically, only been affected in one of two ways:<br/>
 
# '''Data Type''''s (or a child '''Data Format''''s) regular expression pattern leverages ''FuzzyRegEx'' and it, as a result, had to insert, delete, or swap a character to match the pattern, thus generating a result less than 100% confident
The '''''Output Confidence''''' property defaults to ''0%'' and can be changed in this submenu. The default of ''0%'' will not alter a result's confidence score.  Changing this number will override the result's original confidence and replace it with this value.
# ''Field Class''es, by design leverage trained/weighted features and should not return results at 100% confidence<br/>
 
Considering this, a properly configured extractor can, and does, return results below 100%, and thus breaks the logical approach of organizing results by confidence. To elaborate, a result returned at 90% confidence could be more desirable than one returned at 100%.<br/>
For example, a value of ''75%'' will change the confidence of output results to 75%.  If the output confidence was 100%, now it will be 75%.  If the output confidence was 50%, now it will be 75%. If it was 75%, it will now be (you guessed it) 75%.  It doesn't matter what the original confidence was, it will be transformed to the '''''Output Confidence''''' value.
Let's explore how and why.


==How To==
==How To==
Here we'll explore a use case using a mortgage document.
===Waterfall Extraction===
''Weighted Rules'' can be used in cases where one is trying to find a data element appearing on many similar types of forms but multiple extraction approaches are required to identify the element.
 
For example, on different forms, the best method to pick up a piece of data may be a ''[[Key-Value Pair (Collation Provider)|Key-Value Pair]]'', a '''[[Field Class]]''', a simple pattern match, a pattern match leveraging ''[[FuzzyRegEx]]'', or some other method.

One technique for incorporating multiple extractors to return a single field value is referred to as "Waterfall Extraction".  The Waterfall Extraction technique is a method to select a single result from multiple extractor results, according to some specific criteria.  First, multiple extractors (and their numerous configurations) are organized under a parent '''Data Type'''. The extractors' results are prioritized according to the '''''Order By''''' property on the parent '''Data Type'''.  The '''''Order By''''' property of the parent '''Data Type''' can be set to the following: ''Position'', ''Frequency'', ''Confidence'', ''Extractor'', ''Length'', ''Value''.
 
Setting '''''Order By''''' to ''Confidence'' can prioritize the most confident extractor result.  This can prioritize "non-Fuzzy" results over "Fuzzy" results, or the most confident result among the child extractors leveraging ''FuzzyRegEx''.  However, extractors using traditional, non-Fuzzy regex always return their results at 100%. The confidence of a returned result has, historically, ''only'' been affected in one of two ways:
# A '''Data Type''''s (or a child '''Data Format''''s) regular expression pattern leverages ''FuzzyRegEx''. Characters are mutated to match the pattern: inserted, deleted, or swapped.  Each mutation comes at the cost of the result's overall confidence, generating a result less than 100% confident.
# '''Field Classes''', by design, leverage trained/weighted features and should not return results at 100% confidence.
Considering this, a properly configured extractor can, and does, return results below 100%, and can break the logical approach of organizing results by confidence. A result returned at 90% confidence ''could'' be more desirable than one returned at 100%.


<tabs style="margin:20px">
<tabs style="margin:20px">
<tab name="OCR Misread" style="margin:25px">
<tab name="OCR Misread" style="margin:25px">
====OCR Misread====
In this example, an OCR error misread the words “final loan” by not recognizing the space between them.
In this example, an OCR error misread the words “final loan” by not recognizing the space between them.
[[File:weighted_rules_01.png|center]]<br/>
[[File:weighted_rules_01.png|center]]<br/>
</tab>
</tab>
<tab name="Child Data Type Setup" style="margin:25px">
<tab name="Child Data Type Setup" style="margin:25px">
====Child Data Type Setup====
Three '''Data Types''' were established to find variations of a result.
Three '''Data Types''' were established to find variations of a result.
{| class="wikitable" style="text-align:center"
{| class="wikitable"
! style: "padding: 25px" | FinalLoan
| style="padding:25px" |
! Final Loan
'''FinalLoan'''
! Fuzzy: Final Loan
 
A '''Data Type''' which uses a regular expression looking for the expression “finalloan” with no spaces.
|| [[Image:weighted_rules_03.png]]
|-
|-
| A '''Data Type''' which uses a regular expression looking for the expression “finalloan” with no spaces. || A '''Data Type''' which uses a regular expression looking for the expression “final loan” with the space. || A '''Data Type''' which uses a fuzzy regular expression looking for the expression “final loan” with the space.
| style="padding:25px" |
'''Final Loan'''
 
A '''Data Type''' which uses a regular expression looking for the expression “final loan” with the space.
|| [[Image:weighted_rules_04.png]]
|-
|-
| style="padding: 10px" | [[File:weighted_rules_03.png|center]] || [[File:weighted_rules_04.png|center]] || [[File:weighted_rules_05.png|center]]
| style="padding:25px" |
'''Fuzzy: Final Loan'''
 
A '''Data Type''' which uses a fuzzy regular expression looking for the expression “final loan” with the space.
|| [[Image:weighted_rules_05.png]]
|}
|}
</tab>
</tab>
<tab name="Waterfall Extractor" style="margin:25px">
<tab name="Waterfall Extractor" style="margin:25px">
====Waterfall Extractor====
The ''Waterfall Extractor'' is a '''Data Type''' that is a parent or references all of the unique extractors for a piece of data and then determines which one should be given as a final output to a '''Data Field'''.
The ''Waterfall Extractor'' is a '''Data Type''' that is a parent or references all of the unique extractors for a piece of data and then determines which one should be given as a final output to a '''Data Field'''.
[[File:weighted_rules_06.png|center]]
[[File:weighted_rules_06.png|center]]
</tab>
</tab>
<tab name="Default Output" style="margin:25px">
<tab name="Default Output" style="margin:25px">
====Default Output====
Using '''Order By''' set to ''Confidence'' and '''Direction''' set to ''Descending'' as the sort criteria, two extractors match with the highest confidence result given first. The ''FinalLoan'' extractor matched because it found “finalloan” with no spaces and it is not leveraging ''FuzzyRegEx'', so it matched at 100%. The ''Final Loan'' extractor did not match, because it is not using ''FuzzyRegEx'' and it did not find a space between the two words so it did not consider it a match. The ''Fuzzy: Final Loan'', leveraging ''FuzzyRegEx'', matched because it was able to make the word “finalloan” into “final loan” by inserting a space and so it was a 90% match.
Using '''Order By''' set to ''Confidence'' and '''Direction''' set to ''Descending'' as the sort criteria, two extractors match with the highest confidence result given first. The ''FinalLoan'' extractor matched because it found “finalloan” with no spaces and it is not leveraging ''FuzzyRegEx'', so it matched at 100%. The ''Final Loan'' extractor did not match, because it is not using ''FuzzyRegEx'' and it did not find a space between the two words so it did not consider it a match. The ''Fuzzy: Final Loan'', leveraging ''FuzzyRegEx'', matched because it was able to make the word “finalloan” into “final loan” by inserting a space and so it was a 90% match.
[[File:weighted_rules_07.png|center]]
[[File:weighted_rules_07.png|center]]
</tab>
</tab>
<tab name="Getting the Desired Result" style="margin:25px">
<tab name="Getting the Desired Result" style="margin:25px">
====Getting the Desired Result====
Let's change some settings so this extractor returns results in the desired way, with the most correct result ''weighted'' the highest.
Let's change some settings so this extractor returns results in the desired way, with the most correct result ''weighted'' the highest.
{| class="wikitable" style="text-align:center"
{| class="wikitable"
| Reset the '''''Confidence Multiplier''''' property in the '''Result Options''' property window for the '''Fuzzy: Final Loan''' '''Data Type'''. || Set the '''''Confidence Multiplier''''' property in the '''Result Options''' property window for the '''FinalLoan''' '''Data Type''' to 0.75. The results on the parent '''Data Type''' will now show the ''un-weighted'' '''Data Type''' Fuzzy: Final Loan''' at a confidence of 90% (again, because a space was inserted), and the '''FinalLoan''' '''Data Type''' will show 75%.
| style="padding:25px" |
<br/>
Reset the '''''Confidence Multiplier''''' property in the '''Result Options''' property window for the '''Fuzzy: Final Loan''' '''Data Type'''.
|| [[Image:weighted_rules_09.png]]
|-
| style="padding:25px; width:50%" |
Set the '''''Confidence Multiplier''''' property in the '''Result Options''' property window for the '''FinalLoan''' '''Data Type''' to 0.75. The results on the parent '''Data Type''' will now show the ''un-weighted'' '''Data Type''' '''Fuzzy: Final Loan''' at a confidence of 90% (again, because a space was inserted), and the '''FinalLoan''' '''Data Type''' will show 75%.
 
In the event another document is OCRed correctly with a space between the words, the '''Final Loan''' '''Data Type'''  would return the exact match at 100%. The '''Fuzzy: Final Loan''' '''Data Type''' would also return 100% because the expression matched 100% with no substitutions.
In the event another document is OCRed correctly with a space between the words, the '''Final Loan''' '''Data Type'''  would return the exact match at 100%. The '''Fuzzy: Final Loan''' '''Data Type''' would also return 100% because the expression matched 100% with no substitutions.
<br/>
 
In order to make the exact match always preferred, it would also be possible to set the '''Fuzzy: Final Loan''' '''Data Type''' '''Confidence Multiplier''' property to 0.99. But since both the fuzzy and the exact non-fuzzy '''Data Type''' matched 100%, it doesn’t really matter which one returns the result.
In order to make the exact match always preferred, it would also be possible to set the '''Fuzzy: Final Loan''' '''Data Type''' '''Confidence Multiplier''' property to 0.99. But since both the fuzzy and the exact non-fuzzy '''Data Type''' matched 100%, it doesn’t really matter which one returns the result.
|-
|| [[Image:weighted_rules_10.png]]
| style="padding: 10px" | [[File:weighted_rules_09.png|center]] || style="padding: 10px" | [[File:weighted_rules_10.png|center]]
|}
|}
</tab>
</tab>
</tabs>
</tabs>


===General Usage===
===Waterfall Classification===
Modifying the '''Confidence Multiplier''' property of a '''Data Type'''  is done by clicking on the ellipses in the '''Result Options''' property which opens the '''Result Options''' submenu. The '''''Confidence Multiplier''''' property defaults to 1 and can be changed in this submenu. The field is a double and takes floating point values, so you can use a value of, for example 0.5 to multiply the confidence of output results by 0.5. If the output confidence was 100%, now it will be 50%. Similarly, you can increase the confidence, even above 100%. If the '''''Confidence Multiplier''''' property is set to 3, and an output result had a 50% confidence, it will now display as 150% confidence.
When the '''''Classification Method''''' property on a '''Content Model''' is set to ''Lexical'' or ''Rules-Based'', one can set up ''Positive Extractors'' on '''Document Types'''. If this extractor returns a result above the '''''Minimum Similarity''''' set on the '''Content Model''', the document will be assigned that '''Document Type''' during classification. By default, a result from an extractor is returned at 100% confidence (unless '''[[Fuzzy RegEx]]''' is leveraged to return a result, in which case the confidence will be affected by the fuzzy algorithm). Given this fact, positive extractors are almost certain to be above the '''''Minimum Similarity'''''.
 
This extractor could be a "Waterfall Extractor", taking advantage of the Waterfall Extraction technique. However, for classification, the system is just looking for some result to be returned above the '''''Minimum Similarity''''' confidence threshold.
 
In the ''Waterfall Classification'' method, the '''''Minimum Confidence''''' property can be set in the '''Result Filter''' property window of a '''Data Type''' which will eliminate any results less than that confidence. This may eliminate the results of some referenced extractors which technically matched, but at a low percent.  
 
If we happen to know that those lower confidence hits are valid and ''should'' count for classifying the document, then the '''Confidence Multipliers''' on those referenced '''Data Types''' can be set to a higher value in order to make them hit the '''Minimum Confidence''' required.  
 
Similarly, if higher confidence hits are inappropriately classifying documents and ''shouldn't'' be returned, the '''''Confidence Multiplier''''' property can be reduced so that those '''Data Types''' only exceed the '''''Minimum Confidence''''' when they are very high confidence.


===In Context - Waterfall Classification===
====Example====
Setting the '''''Classification Method''''' property on a '''Content Model''' to ''Rules-Based'', one can set up '''Data Types''' as ''Positive Extractors'' and ''Negative Extractors'', either of which can be ''Waterfall Extractors'' just by having child '''Data Types''' or referencing other '''Data Types''' or '''Field Classes'''. In this case, the system is looking for some result to be returned. In the ''Waterfall Classification'' method, the '''''Minimum Confidence''''' property can be set in the '''Result Filter''' property window of a '''Data Type''' which will eliminate any results less than that confidence. This may eliminate the results of some referenced extractors which technically matched, but at a low percent. If we happen to know that those low percentage match hits are valid, then the '''Confidence Multipliers''' on those referenced '''Data Types''' can be set to a higher value in order to make them hit the '''Minimum Confidence''' required. Similarly, if it is desired to discount a high confidence of some extractors which are hitting on the wrong '''Document Type''', the '''''Confidence Multiplier''''' property can be reduced so that those '''Data Types''' only exceed the Minimum Confidence when they are very high confidence.
A base '''Content Model''', '''[[Batch]]''', and '''[[Batch Process]]''' for use with this section can be found '''[[Media:Waterfall_Classification.zip|here]]'''. It is not required to download to understand this section, but can be helpful because it can be used to follow along with the content of this section. ''This file was exported from and meant for use in Grooper 2.9''
{| class="wikitable"
| style="padding:25px; width:35%" |
# The '''Content Model''' for this setup is selected here.
# The '''''Classification Method''''' property is set to ''Lexical'', which allows for TF-IDF training as well as a ''Rules-Based'' approach.
#* Notice, too, its '''''Minimum Similarity''''' property is at its default ''60%''.
|| [[Image:Waterfall_classification_01.png|center]]
|-
| style="padding:25px; width:35%" |
# <li value=3> This '''Data Type''' is the extractor supplying the '''Content Model's''' '''''Feature Extractor''''' property, which is what is used to train the '''Form Types''' of the '''Title Opinion''' '''Document Type'''.</li>
|| [[Image:Waterfall_classification_02.png|center]]
|-
| style="padding:25px; width:35%" |
# <li value=4> This '''Data Type''' is the extractor supplying the '''Generic Letter''' '''Document Type's''' positive Extractor property.</li>
# It is using the (new to Grooper 2.9) ''AND'' '''[[Collation Method]]'''.
#* Think of this as somewhat between an ''Array'' and an ''Ordered Array''. All extraction results need to be present, but not in a specific order.
# The '''''Result Options''''' property&rsquo;s sub '''Result Options''' window has the '''''Output Confidence''''' property set to 60%, therefore all results returned from this extractor will be returned with a confidence of 60%.
|| [[Image:Waterfall_classification_03.png|center]]
|-
| style="padding:25px; width:35%" |
# <li value=7> This '''Batch Process Step''' is configured for ''ESP Auto Separation'' at the ''Batch'' '''''Scope''''' and pointed at the aforementioned '''Content Model'''. The '''Batch''' supplied with the zip file is the one being observed.</li>
# In the highlighted example, with the '''Pages''' classified and the '''Preview''' button enabled, you can see that the '''Title Opinion''' '''Document Type''' similarity for page one is 89%, and the '''Generic Letter''' is coming in at the enforced 60%.
#* Were this a previous version of '''Grooper''' without this functionality, the "Rule" would have come in at a default 100%, which is obviously higher than the 89%, because confidences could not be manipulated previously. This would have resulted in the false classification of the '''Title Opinion''' as a '''Generic Letter'''.
# The '''Positive Extractor''' of the '''Generic Letter''' allows the accurate classification of the letters given the "rule" is at or above the '''Content Model's''' '''''Minimum Similarity''''' property.
|| [[Image:Waterfall_classification_04.png|center]]
|}


==Version Differences==
==Version Differences==
Prior to '''Grooper''' 2.9 the '''''Confidence Multiplier''''' property did not exist.
Prior to '''Grooper''' 2.9 the '''''Confidence Multiplier''''' property did not exist.

Latest revision as of 11:37, 29 April 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

Graphic depicting the notion of Weighted Rules.

Some results carry more weight than others. The Confidence Multiplier and Output Confidence properties allow you to manually adjust an extraction result's confidence.

About

The Confidence Multiplier and Output Confidence properties of Data Type and Data Format extractors allow you to manually alter the confidence score of returned values.

Use of these properties is sometimes referred to as weighted rules. In practice, they let a user increase or decrease the confidence score of an extractor's result (or set its confidence to an assigned value), making that result appear more (or less) favorable. When used in combination with the Order By property set to Confidence on a parent Data Type, you can manipulate which child extractor's result the parent prioritizes.

General Usage - Confidence Multiplier

Modifying the Confidence Multiplier property of a Data Type or Data Format is done by clicking on the ellipses in the Result Options property which opens the Result Options submenu.

The Confidence Multiplier property defaults to 1 and can be changed in this submenu. The field is a double and takes floating point values.

For example, a value of 0.5 will multiply the confidence of output results by 0.5. If the output confidence was 100%, now it will be 50%. Similarly, you can increase the confidence, even above 100%. If the Confidence Multiplier property is set to 3, and an output result had a 50% confidence, it will now display as 150% confidence.
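
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the multiplication described here, not Grooper's implementation; the function name `apply_confidence_multiplier` is hypothetical, and confidences are written as fractions (1.0 means 100%).

```python
def apply_confidence_multiplier(confidence: float, multiplier: float = 1.0) -> float:
    """Scale a result's confidence by the multiplier (hypothetical helper).

    The default of 1.0 mirrors the property's documented default and leaves
    the confidence unchanged. No upper cap is applied, since the article
    states that confidence can exceed 100%.
    """
    return confidence * multiplier


# The two cases from the paragraph above.
print(f"{apply_confidence_multiplier(1.0, 0.5):.0%}")  # 100% scaled by 0.5
print(f"{apply_confidence_multiplier(0.5, 3):.0%}")    # 50% scaled by 3
```

The first call yields 50% and the second 150%, matching the values in the text.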

General Usage - Output Confidence

Modifying the Output Confidence property of a Data Type or Data Format is also done by clicking on the ellipses in the Result Options property which opens the Result Options submenu.

The Output Confidence property defaults to 0% and can be changed in this submenu. The default of 0% will not alter a result's confidence score. Changing this number will override the result's original confidence and replace it with this value.

For example, a value of 75% will change the confidence of output results to 75%. If the output confidence was 100%, now it will be 75%. If the output confidence was 50%, now it will be 75%. If it was 75%, it will now be (you guessed it) 75%. It doesn't matter what the original confidence was, it will be transformed to the Output Confidence value.
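
As a rough sketch of the override behavior described above (again with a hypothetical function name, and confidences as fractions), the only special case is the default of 0%, which leaves the original confidence alone:

```python
def apply_output_confidence(confidence: float, output_confidence: float = 0.0) -> float:
    """Return the confidence a result would carry after Output Confidence.

    Hypothetical helper: 0.0 (the documented default) means "do not alter";
    any other value replaces the original confidence outright.
    """
    if output_confidence == 0.0:
        return confidence
    return output_confidence


# Whatever the original confidence was, the result becomes 75%.
for original in (1.0, 0.5, 0.75):
    print(f"{original:.0%} -> {apply_output_confidence(original, 0.75):.0%}")
```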

How To

Waterfall Extraction

Weighted Rules can be used in cases where one is trying to find a data element appearing on many similar types of forms but multiple extraction approaches are required to identify the element.

For example, on different forms, the best method to pick up a piece of data may be a Key-Value Pair, a Field Class, a simple pattern match, a pattern match leveraging FuzzyRegEx, or some other method.

One technique for incorporating multiple extractors to return a single field value is referred to as "Waterfall Extraction". The Waterfall Extraction technique is a method to select a single result from multiple extractor results, according to some specific criteria. First, multiple extractors (and their numerous configurations) are organized under a parent Data Type. The extractors' results are prioritized according to the Order By property on the parent Data Type. The Order By property of the parent Data Type can be set to the following: Position, Frequency, Confidence, Extractor, Length, Value.

Setting Order By to Confidence can prioritize the most confident extractor result. This can prioritize "non-Fuzzy" results over "Fuzzy" results, or the most confident result among the child extractors leveraging FuzzyRegEx. However, extractors using traditional, non-Fuzzy regex always return their results at 100%. The confidence of a returned result has, historically, only been affected in one of two ways:

  1. A Data Type's (or a child Data Format's) regular expression pattern leverages FuzzyRegEx. Characters are mutated to match the pattern: inserted, deleted, or swapped. Each mutation comes at the cost of the result's overall confidence, generating a result less than 100% confident.
  2. Field Classes, by design, leverage trained/weighted features and should not return results at 100% confidence.
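
To make the first point concrete, here is a small Python sketch of how character edits could lower a fuzzy match's confidence. The scoring formula is an assumption made purely for illustration (one minus the edit distance divided by the pattern length); the article does not state how Grooper's FuzzyRegEx actually computes confidence, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]


def assumed_fuzzy_confidence(pattern: str, text: str) -> float:
    """ASSUMED scoring for illustration only: each edit costs 1/len(pattern)."""
    if not pattern:
        return 0.0
    return max(0.0, 1.0 - edit_distance(pattern, text) / len(pattern))


# One missing space in a ten-character pattern costs one edit.
print(f"{assumed_fuzzy_confidence('final loan', 'finalloan'):.0%}")
```

Under this assumed formula, an exact match scores 100% and a single missing space in "final loan" scores 90%.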

Considering this, a properly configured extractor can, and does, return results below 100%, and can break the logical approach of organizing results by confidence. A result returned at 90% confidence could be more desirable than one returned at 100%.

OCR Misread

In this example, an OCR error misread the words “final loan” by not recognizing the space between them.


Child Data Type Setup

Three Data Types were established to find variations of a result.

FinalLoan

A Data Type which uses a regular expression looking for the expression “finalloan” with no spaces.

Final Loan

A Data Type which uses a regular expression looking for the expression “final loan” with the space.

Fuzzy: Final Loan

A Data Type which uses a fuzzy regular expression looking for the expression “final loan” with the space.

Waterfall Extractor

The Waterfall Extractor is a Data Type that is the parent of, or references, all of the unique extractors for a piece of data and then determines which one should be given as a final output to a Data Field.

Default Output

Using Order By set to Confidence and Direction set to Descending as the sort criteria, two extractors match with the highest confidence result given first. The FinalLoan extractor matched because it found “finalloan” with no spaces and it is not leveraging FuzzyRegEx, so it matched at 100%. The Final Loan extractor did not match, because it is not using FuzzyRegEx and it did not find a space between the two words so it did not consider it a match. The Fuzzy: Final Loan, leveraging FuzzyRegEx, matched because it was able to make the word “finalloan” into “final loan” by inserting a space and so it was a 90% match.
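
The default ordering just described can be modeled with a short Python sketch. The `Result` class and `order_by_confidence` function are hypothetical stand-ins for the parent Data Type's behavior with Order By set to Confidence and Direction set to Descending; the confidences are the ones given in the text.

```python
from dataclasses import dataclass


@dataclass
class Result:
    extractor: str     # name of the child Data Type that produced the result
    value: str         # matched text
    confidence: float  # 1.0 means 100%


def order_by_confidence(results: list[Result]) -> list[Result]:
    """Sort results from most to least confident (Descending)."""
    return sorted(results, key=lambda r: r.confidence, reverse=True)


# "Final Loan" (exact, with space) found nothing on the misread page,
# so only two child extractors contribute results.
results = [
    Result("Fuzzy: Final Loan", "final loan", 0.90),
    Result("FinalLoan", "finalloan", 1.00),
]

for r in order_by_confidence(results):
    print(f"{r.extractor}: {r.confidence:.0%}")
```

With no weighting applied, the FinalLoan result sorts first at 100%, ahead of the fuzzy match at 90%.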


We would like the actual correct result of final loan to win. There are two ways to do this. One way would be to bump up the confidence of the fuzzy regular expression Data Type Fuzzy: Final Loan. This is done by modifying the Confidence Multiplier property in the Result Options of the Fuzzy: Final Loan Data Type.


That works for this case, but what if there was another document where the OCR read the space between the two words correctly? In that case, the result from the Final Loan Data Type would match at 100%, and the Fuzzy: Final Loan Data Type, with the Confidence Multiplier property set to 1.2, would match at 120%. While this would technically yield the correct result, it is generally best practice to have the exact match return the highest percentage. There are a couple of ways to tackle this situation. One way would be to bump up the Confidence Multiplier property on the Final Loan Data Type to something like 1.3. Another way would be to reduce the Confidence Multiplier property on the FinalLoan Data Type so that it returns less than 90%.

Getting the Desired Result

Let's change some settings so this extractor returns results in the desired way, with the most correct result weighted the highest.

Reset the Confidence Multiplier property in the Result Options property window for the Fuzzy: Final Loan Data Type.

Set the Confidence Multiplier property in the Result Options property window for the FinalLoan Data Type to 0.75. The results on the parent Data Type will now show the un-weighted Data Type Fuzzy: Final Loan at a confidence of 90% (again, because a space was inserted), and the FinalLoan Data Type will show 75%.

In the event another document is OCRed correctly with a space between the words, the Final Loan Data Type would return the exact match at 100%. The Fuzzy: Final Loan Data Type would also return 100% because the expression matched 100% with no substitutions.

In order to make the exact match always preferred, it would also be possible to set the Fuzzy: Final Loan Data Type Confidence Multiplier property to 0.99. But since both the fuzzy and the exact non-fuzzy Data Type matched 100%, it doesn’t really matter which one returns the result.
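
The weighting in this walkthrough can be checked with a small Python sketch. The function `weighted_winner` and the dictionaries are hypothetical; they only model "multiply each child's confidence, then take the highest", using the multipliers and confidences quoted in the text.

```python
def weighted_winner(raw_confidences: dict[str, float],
                    multipliers: dict[str, float]) -> tuple[str, dict[str, float]]:
    """Apply each extractor's multiplier, then return (winner, weighted scores)."""
    weighted = {
        name: confidence * multipliers.get(name, 1.0)
        for name, confidence in raw_confidences.items()
    }
    winner = max(weighted, key=weighted.get)
    return winner, weighted


# Multipliers from the walkthrough: only FinalLoan is down-weighted.
multipliers = {"FinalLoan": 0.75, "Final Loan": 1.0, "Fuzzy: Final Loan": 1.0}

# Misread page ("finalloan"): the exact "Final Loan" extractor returns nothing.
misread = {"FinalLoan": 1.00, "Fuzzy: Final Loan": 0.90}
winner, scores = weighted_winner(misread, multipliers)
print(winner, scores)

# Optional tie-breaker from the text: 0.99 on the fuzzy extractor, so the
# exact match is preferred on a correctly OCRed page.
tie_break = {**multipliers, "Fuzzy: Final Loan": 0.99}
clean = {"Final Loan": 1.00, "Fuzzy: Final Loan": 1.00}
winner_clean, _ = weighted_winner(clean, tie_break)
print(winner_clean)
```

On the misread page the fuzzy result (90%) now outranks FinalLoan (75%); on a clean page with the 0.99 tie-breaker, the exact Final Loan match is preferred.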

Waterfall Classification

When the Classification Method property on a Content Model is set to Lexical or Rules-Based, one can set up Positive Extractors on Document Types. If this extractor returns a result above the Minimum Similarity set on the Content Model, the document will be assigned that Document Type during classification. By default, a result from an extractor is returned at 100% confidence (unless Fuzzy RegEx is leveraged to return a result, in which case the confidence will be affected by the fuzzy algorithm). Given this fact, positive extractors are almost certain to be above the Minimum Similarity.

This extractor could be a "Waterfall Extractor", taking advantage of the Waterfall Extraction technique. However, for classification, the system is just looking for some result to be returned above the Minimum Similarity confidence threshold.

In the Waterfall Classification method, the Minimum Confidence property can be set in the Result Filter property window of a Data Type which will eliminate any results less than that confidence. This may eliminate the results of some referenced extractors which technically matched, but at a low percent.

If we happen to know that those lower confidence hits are valid and should count for classifying the document, then the Confidence Multipliers on those referenced Data Types can be set to a higher value in order to make them hit the Minimum Confidence required.

Similarly, if higher confidence hits are inappropriately classifying documents and shouldn't be returned, the Confidence Multiplier property can be reduced so that those Data Types only exceed the Minimum Confidence when they are very high confidence.
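
The interaction between Confidence Multiplier and Minimum Confidence can be sketched as follows. The rule names, confidences, and multipliers here are invented for illustration, and `filter_by_minimum_confidence` is a hypothetical helper that models "scale, then drop anything below the threshold".

```python
def filter_by_minimum_confidence(hits: dict[str, float],
                                 multipliers: dict[str, float],
                                 minimum_confidence: float) -> dict[str, float]:
    """Scale each hit by its multiplier, then keep hits at or above the minimum."""
    kept = {}
    for name, confidence in hits.items():
        adjusted = confidence * multipliers.get(name, 1.0)
        if adjusted >= minimum_confidence:
            kept[name] = adjusted
    return kept


# Hypothetical referenced extractors and their raw confidences.
hits = {"Letterhead Rule": 0.55, "Salutation Rule": 0.80}

# Unweighted: the 55% hit is filtered out by a 60% Minimum Confidence.
print(filter_by_minimum_confidence(hits, {}, 0.60))

# Boost a hit known to be valid so it clears the threshold.
print(filter_by_minimum_confidence(hits, {"Letterhead Rule": 1.2}, 0.60))

# Discount a hit that misclassifies so it only passes at very high confidence.
print(filter_by_minimum_confidence(hits, {"Salutation Rule": 0.7}, 0.60))
```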

Example

A base Content Model, Batch, and Batch Process for use with this section can be found here. It is not required to download to understand this section, but can be helpful because it can be used to follow along with the content of this section. This file was exported from and meant for use in Grooper 2.9

  1. The Content Model for this setup is selected here.
  2. The Classification Method property is set to Lexical, which allows for TF-IDF training as well as a Rules-Based approach.
    • Notice, too, its Minimum Similarity property is at its default 60%.
  3. This Data Type is the extractor supplying the Content Model's Feature Extractor property, which is what is used to train the Form Types of the Title Opinion Document Type.
  4. This Data Type is the extractor supplying the Generic Letter Document Type's Positive Extractor property.
  5. It is using the (new to Grooper 2.9) AND Collation Method.
    • Think of this as somewhat between an Array and an Ordered Array. All extraction results need to be present, but not in a specific order.
  6. The Result Options property’s sub Result Options window has the Output Confidence property set to 60%, therefore all results returned from this extractor will be returned with a confidence of 60%.
  7. This Batch Process Step is configured for ESP Auto Separation at the Batch Scope and pointed at the aforementioned Content Model. The Batch supplied with the zip file is the one being observed.
  8. In the highlighted example, with the Pages classified and the Preview button enabled, you can see that the Title Opinion Document Type similarity for page one is 89%, and the Generic Letter is coming in at the enforced 60%.
    • Were this a previous version of Grooper without this functionality, the "Rule" would have come in at a default 100%, which is obviously higher than the 89%, because confidences could not be manipulated previously. This would have resulted in the false classification of the Title Opinion as a Generic Letter.
  9. The Positive Extractor of the Generic Letter allows the accurate classification of the letters given the "rule" is at or above the Content Model's Minimum Similarity property.
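
The outcome in this example can be reproduced with a simplified Python model. It assumes that the Document Type with the highest similarity at or above Minimum Similarity wins, which is what the walkthrough implies; the `classify` function is hypothetical, and the 20% similarity used for the letter page is an invented value.

```python
from typing import Optional


def classify(similarities: dict[str, float],
             minimum_similarity: float = 0.60) -> Optional[str]:
    """Pick the highest-scoring Document Type at or above the minimum (assumed model)."""
    candidates = {name: score for name, score in similarities.items()
                  if score >= minimum_similarity}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Page one without Output Confidence: the rule returns its default 100%.
print(classify({"Title Opinion": 0.89, "Generic Letter": 1.00}))

# Page one with Output Confidence set to 60% on the rule's extractor.
print(classify({"Title Opinion": 0.89, "Generic Letter": 0.60}))

# A genuine letter page (hypothetical 20% lexical similarity to Title Opinion).
print(classify({"Title Opinion": 0.20, "Generic Letter": 0.60}))
```

The first call shows the false classification as Generic Letter, the second shows Title Opinion winning at 89%, and the third shows the rule still classifying real letters because 60% meets the Minimum Similarity.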

Version Differences

Prior to Grooper 2.9 the Confidence Multiplier property did not exist.