2.90:Confidence Multiplier and Output Confidence (Property): Difference between revisions

Revision as of 14:20, 16 April 2020

About

Weighted rules is an informal title given to the practical application of the Confidence Multiplier property of a Data Type. Its practical application allows a user to arbitrarily set the confidence of a result of a particular Data Type in order to allow that Data Type to appear more (or less) favorable to a parent Data Type that is leveraging the Order By property configured to the Confidence setting.

Use Cases

Weighted Rules can be used in cases where one is trying to find an element of data which can appear on many similar types of forms that do not have a consistent method to identify where the data is.
For example, on different forms, the best method to pick up a piece of data may be a Key-Value Pair, a Field Class, a simple pattern match, a pattern match leveraging FuzzyRegEx, or some other method.
One of the more recent methodologies for letting a single field use multiple extractors has been colloquially referred to as Waterfall Extraction. This is done by organizing myriad extractors (and their numerous configurations) under a parent Data Type. The Order By property of the parent Data Type can then be set to one of the following: Position, Frequency, Confidence, Extractor, Length, Value.
Setting Order By to Confidence may be an interesting way to organize results, but typically, properly configured extractors always return their results at 100%. The confidence of a returned result has, historically, only been affected in one of two ways:

A Data Type's (or a child Data Format's) regular expression pattern leverages FuzzyRegEx and, as a result, had to insert, delete, or swap a character to match the pattern, thus generating a result at less than 100% confidence
Field Classes, by design, leverage trained/weighted features and should not return results at 100% confidence

Considering this, a properly configured extractor can, and does, return results below 100%, and thus breaks the logical approach of organizing results by confidence. To elaborate, a result returned at 90% confidence could be more desirable than one returned at 100%.
Let's explore how and why.

How To

Here we'll explore a use case using a mortgage document.

OCR Misread | Child Data Type Setup | Waterfall Extractor | Default Output | Getting the Desired Result

In this example, an OCR error caused the words “final loan” to be misread as “finalloan” because the space between them was not recognized.

Three Data Types were established to find variations of a result.

FinalLoan	Final Loan	Fuzzy: Final Loan
A Data Type which uses a regular expression looking for the expression “finalloan” with no spaces.	A Data Type which uses a regular expression looking for the expression “final loan” with the space.	A Data Type which uses a fuzzy regular expression looking for the expression “final loan” with the space.

The Waterfall Extractor is a Data Type that is the parent of, or references, all of the unique extractors for a piece of data and then determines which result should be given as the final output to a Data Field.

With Order By set to Confidence and Direction set to Descending as the sort criteria, two extractors match, and the highest-confidence result is listed first. The FinalLoan extractor matched because it found “finalloan” with no spaces; it is not leveraging FuzzyRegEx, so it matched at 100%. The Final Loan extractor did not match because it is not using FuzzyRegEx and did not find a space between the two words. The Fuzzy: Final Loan extractor, leveraging FuzzyRegEx, matched because it was able to turn “finalloan” into “final loan” by inserting a space, so it was a 90% match.
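
The default ordering described above can be sketched in a few lines of Python. This is only an illustration of sorting by confidence in descending order, not Grooper's implementation; the Result class and its field names are hypothetical, and only the extractor names and confidences come from the example.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Hypothetical stand-in for one hit returned by a child Data Type."""
    extractor: str     # name of the child Data Type that produced the hit
    confidence: float  # 1.0 means 100%


# Hits on the misread document. The exact "Final Loan" extractor found
# nothing, so it contributes no result.
hits = [
    Result("Fuzzy: Final Loan", 0.90),  # one inserted space -> 90%
    Result("FinalLoan", 1.00),          # exact match on "finalloan"
]

# Order By = Confidence, Direction = Descending
ranked = sorted(hits, key=lambda r: r.confidence, reverse=True)

for r in ranked:
    print(f"{r.extractor}: {r.confidence:.0%}")
```

With no weighting applied, FinalLoan is listed first, which is the undesired default outcome the rest of this walkthrough corrects.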

We would like the actual correct result of final loan to win. There are two ways to do this. One way would be to bump up the confidence of the fuzzy regular expression Data Type, Fuzzy: Final Loan. This is done by modifying the Confidence Multiplier property in the Result Options of the Fuzzy: Final Loan Data Type.

That works for this case, but what if there was another document where the OCR read the space between the two words correctly? In that case, the result from the Final Loan Data Type would match at 100%, and the Fuzzy: Final Loan Data Type, with the Confidence Multiplier property set to 1.2, would match at 120%. While this would technically yield the correct result, it is generally best practice to have the exact match return the highest percentage. There are a couple of ways to tackle this situation. One way would be to bump up the Confidence Multiplier property on the Final Loan Data Type to something like 1.3. Another way would be to reduce the Confidence Multiplier property on the FinalLoan Data Type so that it returns less than 90%.

Let's change some settings so that this extractor returns the results in the desired way: with the most correct result weighted the highest.

Reset the Confidence Multiplier property in the Result Options property window for the Fuzzy: Final Loan Data Type.	Set the Confidence Multiplier property in the Result Options property window for the FinalLoan Data Type to 0.75. The results on the parent Data Type will now show the un-weighted Data Type Fuzzy: Final Loan at a confidence of 90% (again, because a space was inserted), and the FinalLoan Data Type will show 75%. In the event another document is OCRed correctly with a space between the words, the Final Loan Data Type would return the exact match at 100%. The Fuzzy: Final Loan Data Type would also return 100% because the expression matched 100% with no substitutions. In order to make the exact match always preferred, it would also be possible to set the Fuzzy: Final Loan Data Type Confidence Multiplier property to 0.99. But since both the fuzzy and the exact non-fuzzy Data Type matched 100%, it doesn’t really matter which one returns the result.
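
The effect of these settings on both documents can be checked with a small sketch. It assumes the weighted confidence is simply the raw confidence times the multiplier, with a default of 1; the rank function and the dictionary layout are hypothetical and only model the behavior described here.

```python
def rank(raw, multipliers):
    """Scale each raw confidence by its multiplier (default 1.0),
    then sort the results by weighted confidence, highest first."""
    weighted = {
        name: conf * multipliers.get(name, 1.0)
        for name, conf in raw.items()
    }
    return sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)


# Only FinalLoan is weighted; Fuzzy: Final Loan was reset to the default.
multipliers = {"FinalLoan": 0.75}

# Raw confidences on the misread document ("finalloan").
misread = {"FinalLoan": 1.0, "Fuzzy: Final Loan": 0.9}
# Raw confidences on a correctly OCRed document ("final loan").
clean = {"Final Loan": 1.0, "Fuzzy: Final Loan": 1.0}

print(rank(misread, multipliers))  # fuzzy hit (0.9) now outranks FinalLoan (0.75)
print(rank(clean, multipliers))    # both hits tie at 1.0

# Optional tie-breaker from the text: weight the fuzzy extractor at 0.99
# so the exact match is always preferred.
print(rank(clean, {"FinalLoan": 0.75, "Fuzzy: Final Loan": 0.99}))
```

Lowering one multiplier keeps every displayed confidence at or below 100%, which is why this approach reads more naturally than raising the fuzzy extractor to 1.2.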

General Usage

Modifying the Confidence Multiplier property of a Data Type is done by clicking on the ellipsis in the Result Options property, which opens the Result Options submenu. The Confidence Multiplier property defaults to 1 and can be changed in this submenu. The field is a double and takes floating point values, so you can use a value of, for example, 0.5 to multiply the confidence of output results by 0.5. If the output confidence was 100%, now it will be 50%. Similarly, you can increase the confidence, even above 100%. If the Confidence Multiplier property is set to 3, and an output result had a 50% confidence, it will now display as 150% confidence.
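
As a quick arithmetic check, the multiplication can be written out as follows. The function name is hypothetical, and the sketch assumes the result is not capped at 100%, in line with the 120% figure from the walkthrough above.

```python
def weighted_confidence(raw: float, multiplier: float = 1.0) -> float:
    """Return the raw confidence scaled by the Confidence Multiplier."""
    return raw * multiplier


# (raw confidence, multiplier) pairs taken from the examples in this article.
for raw, multiplier in [(1.0, 0.5), (1.0, 1.2)]:
    result = weighted_confidence(raw, multiplier)
    print(f"{raw:.0%} x {multiplier} -> {result:.0%}")
```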

In Context - Waterfall Classification

When the Classification Method property on a Content Model is set to Rules-Based, Data Types can be set up as Positive Extractors and Negative Extractors, either of which can be Waterfall Extractors simply by having child Data Types or by referencing other Data Types or Field Classes. In this case, the system is looking for some result to be returned. In the Waterfall Classification method, the Minimum Confidence property can be set in the Result Filter property window of a Data Type, which eliminates any results below that confidence. This may eliminate the results of some referenced extractors which technically matched, but at a low percentage. If those low-percentage hits are known to be valid, the Confidence Multiplier on each of those referenced Data Types can be set to a higher value so that the hits reach the required Minimum Confidence. Similarly, to discount high-confidence hits from extractors that are matching on the wrong Document Type, the Confidence Multiplier property can be reduced so that those Data Types only meet the Minimum Confidence when their confidence is very high.
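
The interaction between the multiplier and the Minimum Confidence filter can be sketched as below. The threshold, raw confidences, and multipliers are made-up illustration values, and the passes_filter function is hypothetical. It assumes a result is kept when its weighted confidence is at least the minimum, since the filter is described as eliminating results less than that confidence.

```python
def passes_filter(raw: float, multiplier: float, minimum: float) -> bool:
    """Keep a result only if its weighted confidence reaches the minimum."""
    return raw * multiplier >= minimum


MINIMUM = 0.80  # hypothetical Minimum Confidence setting

# A valid but low-confidence hit is dropped at the default multiplier...
print(passes_filter(0.70, 1.0, MINIMUM))   # False
# ...and kept once its Data Type is weighted up.
print(passes_filter(0.70, 1.2, MINIMUM))   # True

# An extractor that tends to hit on the wrong Document Type is weighted
# down, so only its strongest hits survive the filter.
print(passes_filter(0.90, 0.85, MINIMUM))  # False
print(passes_filter(1.00, 0.85, MINIMUM))  # True
```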

Version Differences

Prior to Grooper 2.9 the Confidence Multiplier property did not exist.

@@ Line 1: / Line 1: @@
 ==About==
-''Weighted rules'' is an informal title given to the practical application of the '''Confidence Multiplier''' property of a '''Data Type'''. Its practical application allows a user to arbitrarily set the confidence of a result of a particular '''Data Type''' in order to allow that '''Data Type''' to appear more (or less) favorable to a parent '''Data Type''' that is leveraging the '''Order By''' property configured to the ''Confidence'' setting.
+''Weighted rules'' is an informal title given to the practical application of the '''''Confidence Multiplier''''' property of a '''Data Type'''. Its practical application allows a user to arbitrarily set the confidence of a result of a particular '''Data Type''' in order to allow that '''Data Type''' to appear more (or less) favorable to a parent '''Data Type''' that is leveraging the '''''Order By''''' property configured to the ''Confidence'' setting.
 ==Use Cases==
 ''Weighted Rules'' can be used in cases where one is trying to find an element of data which can appear on many similar types of forms that do not have a consistent method to identify where the data is.<br/>
 For Example, on different forms, the best method to pick up a piece of data may be a ''Key-Value Pair'', a '''Field Class''', a simple pattern match, a pattern match leveraging ''FuzzyRegEx'', or some other method.<br/>
-One of the more recent methodologies for incorporating multiple extractors to be used by a single field has been coloquially referred to as ''Waterfall Extraction''. This is done by organizing myriad extractors (and their numerous configurations) under a parent '''Data Type'''. The '''Order By''' property of the parent '''Data Type''' can then be set to the following: ''Position'', ''Frequency'', ''Confidence'', ''Extractor'', ''Length'', ''Value''.<br/>
+One of the more recent methodologies for incorporating multiple extractors to be used by a single field has been coloquially referred to as ''Waterfall Extraction''. This is done by organizing myriad extractors (and their numerous configurations) under a parent '''Data Type'''. The '''''Order By''''' property of the parent '''Data Type''' can then be set to the following: ''Position'', ''Frequency'', ''Confidence'', ''Extractor'', ''Length'', ''Value''.<br/>
 Setting '''Order By''' to ''Confidence'' may be an interesting way to organize results, but typically, properly configured extractors always return their results at 100%. The confidence of a returned result has, historically, only been affected in one of two ways:<br/>
 # '''Data Type''''s (or a child '''Data Format''''s) regular expression pattern leverages ''FuzzyRegEx'' and it, as a result, had to insert, delete, or swap a character to match the pattern, thus generating a result less than 100% confident
@@ Line 28: / Line 28: @@
 ! Fuzzy: Final Loan
 |-
-| A Data Type which uses a regular expression looking for the expression “finalloan” with no spaces. || A Data Type which uses a regular expression looking for the expression “final loan” with the space. || A Data Type which uses a fuzzy regular expression looking for the expression “final loan” with the space.
+| A '''Data Type''' which uses a regular expression looking for the expression “finalloan” with no spaces. || A '''Data Type''' which uses a regular expression looking for the expression “final loan” with the space. || A '''Data Type''' which uses a fuzzy regular expression looking for the expression “final loan” with the space.
 |-
 | style="padding: 10px" | [[File:weighted_rules_03.png|center]] || [[File:weighted_rules_04.png|center]] || [[File:weighted_rules_05.png|center]]
@@ Line 41: / Line 41: @@
 [[File:weighted_rules_07.png|center]]
 <br/>
-We would like the actual correct result of ''final loan'' to win. There are two ways to do this. One way would be to bump up the confidence of the fuzzy regular expression '''Data Type'' ''Fuzzy: Final Loan''. This is done by modifying the Confidence Multiplier property in the Result Options of the '''Data Type''' ''Fuzzy: Final Loan''.
+We would like the actual correct result of ''final loan'' to win. There are two ways to do this. One way would be to bump up the confidence of the fuzzy regular expression '''Data Type'' ''Fuzzy: Final Loan''. This is done by modifying the '''''Confidence Multiplier''''' property in the '''Result Options''' of the '''Fuzzy: Final Loan''' '''Data Type''' .
 [[File:weighted_rules_08.png|center]]
 <br/>
-That works for this case, but what if there was another document where the OCR read the space between the two words correctly. In that case, the result from the '''Data Type''' ''Final Loan'' would match at 100%, and the '''Data Type''' ''Fuzzy: Final Loan'', with the '''Confidence Multiplier''' property set to ''1.2'' would match at 120%. While this would technically yield the correct result, it is generally best practice to have the exact match return the highest percentage. There are a couple of ways to tackle this situation. One way would be to bump up the '''Confidence Multiplier''' property on the '''Data Type''' ''Final Loan'' to something like ''1.3'' But another way, would be to reduce the '''Confidence Multiplier''' property on the FinalLoan Data Type so that it returns less than 90%.
+That works for this case, but what if there was another document where the OCR read the space between the two words correctly. In that case, the result from the '''Final Loan''' '''Data Type'''  would match at 100%, and the '''Fuzzy: Final Loan''' '''Data Type''', with the '''''Confidence Multiplier''''' property set to ''1.2'' would match at 120%. While this would technically yield the correct result, it is generally best practice to have the exact match return the highest percentage. There are a couple of ways to tackle this situation. One way would be to bump up the '''''Confidence Multiplier''''' property on the '''Final Loan''' '''Data Type''' to something like ''1.3'' But another way, would be to reduce the '''''Confidence Multiplier''''' property on the '''FinalLoan''' '''Data Type''' so that it returns less than 90%.
 </tab>
 <tab name="Getting the Desired Result" style="margin:25px">
 Let's change some settings to set this extractor up to return the results in the desired way; that being with the most right result ''weighted'' the highest.
 {| class="wikitable" style="text-align:center"
-| Reset the '''Confidence Multiplier''' property in the '''Result Options''' property window for the '''Data Type''' ''Fuzzy: Final Loan''. || Set the '''Confidence Multiplier''' property in the '''Result Options''' property window for the '''Data Type''' ''FinalLoan'' to 0.75. The results on the parent '''Data Type''' will now show the ''un-weighted'' '''Data Type''' Fuzzy: Final Loan''' at a confidence of 90% (again, because a space was inserted), and the '''Data Type''' ''FinalLoan'l will show 75%.
+| Reset the '''''Confidence Multiplier''''' property in the '''Result Options''' property window for the '''Fuzzy: Final Loan''' '''Data Type'''. || Set the '''''Confidence Multiplier''''' property in the '''Result Options''' property window for the '''FinalLoan''' '''Data Type''' to 0.75. The results on the parent '''Data Type''' will now show the ''un-weighted'' '''Data Type''' Fuzzy: Final Loan''' at a confidence of 90% (again, because a space was inserted), and the '''FinalLoan''' '''Data Type''' will show 75%.
 <br/>
-In the event another document is OCRed correctly with a space between the words, the '''Data Type''' ''Final Loan'' would return the exact match at 100%. The '''Data Type''' ''Fuzzy: Final Loan'' would also return 100% because the expression matched 100% with no substitutions.
+In the event another document is OCRed correctly with a space between the words, the '''Final Loan''' '''Data Type'''  would return the exact match at 100%. The '''Fuzzy: Final Loan''' '''Data Type''' would also return 100% because the expression matched 100% with no substitutions.
 <br/>
-In order to make the exact match always preferred, it would also be possible to set the '''Data Type''' ''Fuzzy: Final Loan'' '''Confidence Multiplier''' property to 0.99. But since both the fuzzy and the exact non-fuzzy '''Data Type''' matched 100%, it doesn’t really matter which one returns the result.
+In order to make the exact match always preferred, it would also be possible to set the '''Fuzzy: Final Loan''' '''Data Type''' '''Confidence Multiplier''' property to 0.99. But since both the fuzzy and the exact non-fuzzy '''Data Type''' matched 100%, it doesn’t really matter which one returns the result.
 |-
 | style="padding: 10px" | [[File:weighted_rules_09.png|center]] || style="padding: 10px" | [[File:weighted_rules_10.png|center]]
@@ Line 61: / Line 61: @@
 ===General Usage===
-Modifying the '''Confidence Multiplier''' property of a '''Data Type'''  is done by clicking on the ellipses in the '''Result Options''' property which opens the '''Result Options''' submenu. The '''Confidence Multiplier''' property defaults to 1 and can be changed in this submenu. The field is a double and takes floating point values, so you can use a value of, for example 0.5 to multiply the confidence of output results by 0.5. If the output confidence was 100%, now it will be 50%. Similarly, you can increase the confidence, even above 100%. If the '''Confidence Multiplier''' property is set to 3, and an output result had a 50% confidence, it will not display as 150% confidence.
+Modifying the '''Confidence Multiplier''' property of a '''Data Type'''  is done by clicking on the ellipses in the '''Result Options''' property which opens the '''Result Options''' submenu. The '''''Confidence Multiplier''''' property defaults to 1 and can be changed in this submenu. The field is a double and takes floating point values, so you can use a value of, for example 0.5 to multiply the confidence of output results by 0.5. If the output confidence was 100%, now it will be 50%. Similarly, you can increase the confidence, even above 100%. If the '''''Confidence Multiplier''''' property is set to 3, and an output result had a 50% confidence, it will not display as 150% confidence.
 ===In Context - Waterfall Classification===
-Setting the '''Classification Method''' property on a '''Content Model''' to ''Rules-Based'', one can set up '''Data Types''' as ''Positive Extractors'' and ''Negative Extractors'', either of which can be ''Waterfall Extractors'' just by having child '''Data Types''' or referencing other '''Data Types''' or '''Field Classes'''. In this case, the system is looking for some result to be returned. In the ''Waterfall Classification'' method, the '''Minimum Confidence''' property can be set in the '''Result Filter''' property window of a '''Data Type''' which will eliminate any results less than that confidence. This may eliminate the results of some referenced extractors which technically matched, but at a low percent. If we happen to know that those low percentage match hits are valid, then the '''Confidence Multipliers''' on those referenced '''Data Types''' can be set to a higher value in order to make them hit the '''Minimum Confidence''' required. Similarly, if it is desired to discount a high confidence of some extractors which are hitting on the wrong '''Document Type''', the '''Confidence Multiplier''' property can be reduced so that those '''Data Types''' only exceed the Minimum Confidence when they are very high confidence.
+Setting the '''''Classification Method''''' property on a '''Content Model''' to ''Rules-Based'', one can set up '''Data Types''' as ''Positive Extractors'' and ''Negative Extractors'', either of which can be ''Waterfall Extractors'' just by having child '''Data Types''' or referencing other '''Data Types''' or '''Field Classes'''. In this case, the system is looking for some result to be returned. In the ''Waterfall Classification'' method, the '''''Minimum Confidence''''' property can be set in the '''Result Filter''' property window of a '''Data Type''' which will eliminate any results less than that confidence. This may eliminate the results of some referenced extractors which technically matched, but at a low percent. If we happen to know that those low percentage match hits are valid, then the '''Confidence Multipliers''' on those referenced '''Data Types''' can be set to a higher value in order to make them hit the '''Minimum Confidence''' required. Similarly, if it is desired to discount a high confidence of some extractors which are hitting on the wrong '''Document Type''', the '''''Confidence Multiplier''''' property can be reduced so that those '''Data Types''' only exceed the Minimum Confidence when they are very high confidence.
 ==Version Differences==
 Prior to '''Grooper''' 2.9 the '''''Confidence Multiplier''''' property did not exist.