2.90:Confidence Multiplier and Output Confidence (Property): Difference between revisions

From Grooper Wiki
==Introduction==
''Weighted rules'' is an informal name for a practical use of the '''Confidence Multiplier''' property of a '''Data Type'''. It lets a user arbitrarily adjust the confidence of a '''Data Type''''s results so that the '''Data Type''' appears more (or less) favorable to a parent '''Data Type''' whose '''Order By''' property is set to ''Confidence''.


==Use Cases==
''Weighted Rules'' can be used in cases where one is trying to find an element of data which can appear on many similar types of forms that do not have a consistent method to identify where the data is.<br/>
For example, on different forms, the best method to pick up a piece of data may be a ''Key-Value Pair'', a '''Field Class''', a simple pattern match, a pattern match leveraging ''FuzzyRegEx'', or some other method.<br/>
One of the more recent methodologies for incorporating multiple extractors to be used by a single field has been colloquially referred to as ''Waterfall Extraction''. This is done by organizing myriad extractors (and their numerous configurations) under a parent '''Data Type'''. The '''Order By''' property of the parent '''Data Type''' can then be set to one of the following: ''Position'', ''Frequency'', ''Confidence'', ''Extractor'', ''Length'', ''Value''.<br/>
Setting '''Order By''' to ''Confidence'' may be an interesting way to organize results, but typically, properly configured extractors always return their results at 100%. The confidence of a returned result has, historically, only been affected in one of two ways:<br/>
# A '''Data Type''''s (or a child '''Data Format''''s) regular expression pattern leverages ''FuzzyRegEx'' and, as a result, had to insert, delete, or swap a character to match the pattern, generating a result with less than 100% confidence.
# ''Field Class''es, by design, leverage trained, weighted features and should not return results at 100% confidence.<br/>
<br/>
Considering this, a properly configured extractor can, and does, return results below 100%, and thus breaks the logical approach of organizing results by confidence. To elaborate, a result returned at 90% confidence could be more desirable than one returned at 100%.<br/>
Let's explore how and why.
==Configuration==
Here we'll explore a use case using a mortgage document.
<tabs style="margin:20px">
<tab name="OCR Misread" style="margin:25px">
In this example, an OCR error caused the words “final loan” to be misread because the space between them was not recognized.
[[File:weighted_rules_01.png|center]]
</tab>
</tabs>

Revision as of 16:48, 27 March 2020

Introduction

Weighted rules is an informal name for a practical use of the Confidence Multiplier property of a Data Type. It lets a user arbitrarily adjust the confidence of a Data Type's results so that the Data Type appears more (or less) favorable to a parent Data Type whose Order By property is set to Confidence.

Use Cases

Weighted Rules can be used in cases where one is trying to find an element of data which can appear on many similar types of forms that do not have a consistent method to identify where the data is.
For example, on different forms, the best method to pick up a piece of data may be a Key-Value Pair, a Field Class, a simple pattern match, a pattern match leveraging FuzzyRegEx, or some other method.
One of the more recent methodologies for incorporating multiple extractors to be used by a single field has been colloquially referred to as Waterfall Extraction. This is done by organizing myriad extractors (and their numerous configurations) under a parent Data Type. The Order By property of the parent Data Type can then be set to one of the following: Position, Frequency, Confidence, Extractor, Length, Value.
Setting Order By to Confidence may be an interesting way to organize results, but typically, properly configured extractors always return their results at 100%. The confidence of a returned result has, historically, only been affected in one of two ways:

  1. A Data Type's (or a child Data Format's) regular expression pattern leverages FuzzyRegEx and, as a result, had to insert, delete, or swap a character to match the pattern, generating a result with less than 100% confidence.
  2. Field Classes, by design, leverage trained, weighted features and should not return results at 100% confidence.

Considering this, a properly configured extractor can, and does, return results below 100%, and thus breaks the logical approach of organizing results by confidence. To elaborate, a result returned at 90% confidence could be more desirable than one returned at 100%.
Let's explore how and why.
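
As a rough illustration of the idea, the Python sketch below models how weighting could change which result a confidence-ordered parent prefers. It is not Grooper code: the Result class, the extractor names, the sample values, and the 0.8 multiplier are all hypothetical, and it assumes the multiplier simply scales a result's confidence before sorting.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One candidate value returned by a child extractor (hypothetical model)."""
    extractor: str
    value: str
    confidence: float  # 0.0 to 1.0


def apply_multiplier(results, multipliers):
    """Scale each result's confidence by its extractor's multiplier (default 1.0)."""
    return [
        Result(r.extractor, r.value, r.confidence * multipliers.get(r.extractor, 1.0))
        for r in results
    ]


def order_by_confidence(results):
    """Sort results from most to least confident."""
    return sorted(results, key=lambda r: r.confidence, reverse=True)


# Made-up sample data: an exact but loosely targeted match at 100%,
# and a fuzzy match at 90% that is actually the preferred value.
results = [
    Result("simple_pattern", "12/01/2019", 1.00),
    Result("fuzzy_key_value", "01/15/2020", 0.90),
]

# Without weighting, the 100% result sorts first.
unweighted = order_by_confidence(results)

# Weighting the less trusted extractor down lets the 90% result outrank it.
weighted = order_by_confidence(apply_multiplier(results, {"simple_pattern": 0.8}))

print(unweighted[0].extractor)
print(weighted[0].extractor)
```

In this sketch, the less trusted extractor is weighted down rather than the trusted one weighted up, which keeps every confidence at or below 100%. Whether that is the better choice in a real configuration depends on the extractors involved.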

Configuration

Here we'll explore a use case using a mortgage document.

In this example, an OCR error caused the words “final loan” to be misread because the space between them was not recognized.