2.90:Fuzzy RegEx (Concept): Difference between revisions

Revision as of 12:56, 17 November 2020

Fuzzy RegEx (also referred to as "fuzzy matching" or "fuzzy mode" or even just "fuzzy") allows regular expression patterns to match text within a set percentage of similarity. This can allow Grooper users to overcome unpredictable OCR errors when extracting data from documents.

Typically, regular expression will either match a string of text or it won't. If you're trying to match a word and the regex pattern is even a single character off from the text data, you will not return a result.

Fuzzy RegEx uses a Levenshtein distance equation to measure the difference between the regular expression and potential text matches. The percentage difference between the regex pattern and the matched text is expressed as a "confidence score" (also as a percentage). If the confidence is above a set threshold, the result is returned. If it is below the threshold, it is discarded.

For example, a text string that is 95% similar to the regex pattern may be off by just a single character. If the Minimum Similarity threshold is set to 90% the result would be returned, even thought the pattern doesn't match the text exactly.

About

Fuzzy RegEx is a Match Mode option for an extractor's regular expression pattern. Any time you can get to a "Pattern Editor" in Grooper you can take advantage of Fuzzy RegEx including:

When configuring a Data Format
When configuring the Pattern property of a Data Type
When choosing the Internal option for various extractors in a property panel and configuring its Pattern property
When choosing the Text Pattern option for various extractors in a property panel and configuring its Pattern property

In the "Pattern Editor" window, you can enable Fuzzy RegEx in two steps:

Navigate to the "Properties" tab.
Select the Mode property and choose Fuzzy RegEx

The Problem

Standard regular expression is binary. Either a result will match the pattern or it won't. The only "wiggle room" is baked into the standard syntax, such as the wildcard "dot" (.) metacharacter or the question mark (?) metacharacter optionally matching a character or character group. Even then, it's all or nothing. The text will 100% match the pattern as written, or it 0% will not.

What's the difference between these two strings of text?	One starts with a "G". The other starts with the number six.	If we used a simple regular expression to literally match `Grooper`, the first would match, and the second would not.

However, OCR results are imperfect. Characters are recognized as different from what is on the page. Spaces get inserted or removed between characters. Even with Image Processing and OCR Synthesis software as good as Grooper has, you can't assume 100% accurate OCR results on every document. And, even if the OCR results are one tiny character off, the pattern will fail to produce a match.

For example, it's not outside of the realm of possibility an OCR engine would see a cluster of pixels representing a "G" and erroneously recognize it as a "6", depending on the font used or quality of the scanned image.

How Fuzzy RegEx Solves the Problem

If you break up the string "Grooper" into characters, it is composed of seven characters. For the regex pattern Grooper, each character in the pattern matches each character in the string starting with "G" through the last "r". The only difference between "Grooper" and "6rooper" is the "G" is replaced by a "6".

But otherwise, six out of the seven characters match the regex pattern Grooper.

Being so close, it seems like the regex should match (Or at least we surely might like it to!). Fuzzy RegEx gives you the ability to return the results of these "near matches". As long as they are similar enough to the regular expression pattern, the result is still returned, allowing you to extract data from imperfect OCR results!

So, how do you say what should match and what shouldn't? After all, "6200932" is also seven characters long but certainly shouldn't match "Grooper". Should "Gopher" match? Should "Grapple"? Probably not. What is "similar enough"? How do you measure it? How do you control it?

Fuzzy RegEx measures the similarity between a regular expression pattern and a potential match in terms of percentage similarity. The regular expression pattern Grooper matches the text string "Grooper" at a 100% similarity. It is an exact match. Being seven characters long, a single character in the string "Grooper" makes up roughly 14% of the word.

Fuzzy RegEx matches the string "6rooper" at the cost of the percentage of the different character. 100% -14% = 86%. The string "6rooper" is therefore 86% similar to the regex pattern Grooper.

The result is then assigned a "Confidence Score" of 86%. The extractor is 86% "sure" or confident, the result matches the regex pattern.

Whether or not this result is actually returned by the extractor is determined by the Minimum Similarity property of Fuzzy RegEx. If this is set below 86%, the result will be returned. If it is set above 86%, it is not.

If you've ever used auto correct on your cell phone, you're engaging in something similar. Your cell phone's software has a dictionary of common English words. If you type something that doesn't match that dictionary, it will automatically swap the word you type with the most similar word in the dictionary. While the exact algorithm may vary from the one we used, it is based on some variant of a Levenshtein distance equation.

Character Swaps and Swap Cost

Another way to define the Levenshtein distance between two words is the "minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other" [1]. We often informally refer to these edits as "swaps" and the effect these edits have on a match's confidence the "swap cost". The most similar fuzzy matched text to the regular expression will have the least "swap cost" and therefore highest Confidence score.

This also means Fuzzy RegEx not only matches imperfect matches, but it mutates them as well. In our example where our regex Grooper fuzzy matches "6rooper", the "6" would "swapped" for a "G", resulting in an 86% confident match.

Furthermore, the final output result would be altered to match the regex. In this case, "Grooper"

In this way, Fuzzy RegEx can be used to cleanse imperfect (or "dirty") OCR data, manipulating it into what you want.

Levenshtein distance will also measure the distance between two words where characters are missing or extra characters are present. The difference between "Grooper" and "Groooper" is just one extra "o" character. The difference between "Grooper" and "Groper" is one less "o" character.

As well as swapping characters to match the string, Fuzzy RegEx will insert or delete characters as well. Whether a true swap, insertion or deletion, the "swap cost" is the same. In the examples below, a single character is deleted or inserted. Just as changing a single character in "6rooper" to match "Grooper" resulted in an 86% confidence score, inserting or deleting a single character to match the regex Grooper would also result in a confidence score of 86%.

Fuzzy Character Deletion		Fuzzy Character Insertion

Adjusting the Swap Cost: Fuzzy Match Weightings

How To

Fuzzy RegEx Basics

Note from the Author: Fuzzy RegEx is intended to resolve issues around OCR errors. However, for the purposes of this tutorial, the text is simply misspelled on the "document". Either way, whether misspelled or mis-OCR'd Fuzzy RegEx works to swap characters between the regular expression pattern and whatever is in the text data obtained via the Recognize activity.

Getting to a Pattern EditorEnable Fuzzy RegExAdjust Minimum SimilarityAdjust Fuzzy Match Weightings

Getting to a Pattern Editor

You can enable Fuzzy RegEx any time you are writing a regular expression pattern in a "Pattern Editor". The "Pattern Editor" can be accessed at several points in Grooper:

When configuring a Data Format Select a Data Format Note: Data Formats are always and only children of Data Type objects. That's it! That's all you need to do. Data Formats are very simple extractors. Their configuration panel is itself a "Pattern Editor" window.
When configuring the Pattern property of a Data Type Select a Data Type. Select the Pattern property. Press the ellipsis button at the end to bring up the "Pattern Editor" window.
When choosing the Text Pattern option for various extractors in a property panel and configuring its Pattern property Including (but not limited to) configuring a Data Field's Value Extractor property. Select a Data Field Select the Value Extractor property and choose Text Pattern. Expand the Text Pattern sub properties, select Pattern Press the ellipsis button at the end to bring up the "Pattern Editor" window. For any Grooper object whose property panel has an extractor property with an Text Pattern option, you can get to the Pattern Editor.
When choosing the Internal option for various extractors in a property panel and configuring its Pattern property Including (but not limited to) choosing the Internal option for the Pattern-Based Separation provider's Value Extractor. Select a Separation Profile Select the Provider property and choose Pattern-Based Separation. Expand the Pattern-Based Separation sub properties. Select Value Pattern, and expand it's sub properties. Select Type and choose Internal. Expand the Internal sub properties and select the Pattern property. Press the ellipsis button at the end to bring up the "Pattern Editor" window. For any Grooper object whose property panel has an extractor property with an Internal option, you can get to the Pattern Editor.

Enable Fuzzy RegEx

Once you are in a "Pattern Editor" window, enabling Fuzzy RegEx is very simple.

The first thing you'll want to do is type your regular expression pattern like normal in the Value Pattern editor.
- For this tutorial we are simply matching the word Grooper.
With standard regex matching, we only produce a single result, the string "Grooper" that matches the regex pattern perfectly.

Fuzzy RegEx is a Mode property option, found in the "Properties" tab.

Navigate to the "Properties" tab.
Select the Mode property and choose Fuzzy RegEx

Adjust Minimum Similarity

As configured currently, we still only get a single result, the perfect match for the string "Grooper". From our information earlier in the article, the strings "6rooper", "Groper", and "Groooper" are just one character different from the regex pattern `Grooper`. These equate to roughly 86% similar to `Grooper`. The "swap cost" to match these strings to the pattern is 14% confidence. The reason these fuzzy matches are not showing up in our results list is due to the Minimum Similarity property. Since these strings only match with a confidence score of 86%, they fall below the default Minimum Similarity threshold of 90% and are thrown out.
Change the Minimum Similarity to 80%. Now, those three results are returned with a value in the "Confidence" column of 86%, indicating the "swap cost" it took to mutate the string in the text data to match the regex pattern `Grooper` Notice the string "Gr00per" is not returned. This string is not one, but two characters different from the regex pattern `Grooper`. It would cost more confidence to swap the second character, resulting in an even lower confidence score.
Change the Minimum Similarity to 70% Now, this last string is returned with a confidence score of 71%.

Adjust Fuzzy Match Weightings

By default, it costs the same to swap, insert, or delete a character regardless of the character. However, certain OCR mistakes are more common than others. For example, an "O" might get misrecognized as a zero (or vice versa).

You can manually change the swap cost associated certain characters using the Fuzzy Match Weightings properties.

In the Fuzzy Matching Options section, expand the Fuzzy Match Weightings properties.
- You can alter the match weightings using either the Local Entries or the Included Lexicons properties (or both). We will focus on Local Entries for the time being.
Select the Local Entries property and press the ellipsis button at the end.
This will bring up a "List Editor" window.

Depending on what you want to do, what you will enter here will change somewhat. Character swaps, inserts and deletions all have different syntax to alter the swap cost of certain characters.

Character Swaps

Imagine a "6" is often mis-OCR'd as a "G", resulting in "6rooper" in the text data instead of "Grooper". Knowing this is a common error, we might want to say the swap should not cost the full amount in this case.

This is exactly the reason why Fuzzy Match Weightings exist! In the list editor, you will adjust the swap cost using the following syntax:

(The mis-OCR'd character on the document)(The correct character in the regex pattern)=(Percentage change from 100% as decimal)

or, put another way:

{DocumentChar}{PatternChar}={Cost}

So, if we wanted to say whenever Fuzzy RegEx swaps a "6" for a "G" it should only cost half the amount, you'd use the following syntax:

6G=0.5

In this case, the Minimum Similarity is bumped back up to the default of 90%.
The string "6rooper" now returns with a confidence score of 93% instead of 86%.

Normally, the swap would cost a confidence of 14%, but the weighting 6G=0.5 multiplied that by "0.5", resulting in a swap cost of only 7%.

Fuzzy Match Weightings work by first adjusting the normal baseline swap cost of a character.

Base Swap Cost({DocumentChar}{PatternChar}) x Weighting Value = Weighted Swap Cost
14%({6}{G}) x 0.5 = 7%

Then the weighted swap cost is subtracted, instead of the baseline swap cost to determine the confidence score.

100% - Weighted Swap Cost = Confidence Score
100% - 7% = 93%

@@ Line 224: / Line 224: @@
 <tab name="Adjust Fuzzy Match Weightings" style="margin:20px">
 === Adjust Fuzzy Match Weightings ===
+By default, it costs the same to swap, insert, or delete a character regardless of the character.  However, certain OCR mistakes are more common than others.  For example, an "O" might get misrecognized as a zero (or vice versa).
+{|cellpadding=10 cellspacing=5
+|style="width:40%" valign=top|
+You can manually change the swap cost associated certain characters using the '''''Fuzzy Match Weightings''''' properties.
+# In the '''''Fuzzy Matching Options''''' section, expand the '''''Fuzzy Match Weightings''''' properties.
+#* You can alter the match weightings using either the '''''Local Entries''''' or the '''''Included Lexicons''''' properties (or both).  We will focus on '''''Local Entries''''' for the time being.
+# Select the '''''Local Entries''''' property and press the ellipsis button at the end.
+# This will bring up a "List Editor" window.
+Depending on what you want to do, what you will enter here will change somewhat.  Character swaps, inserts and deletions all have different syntax to alter the swap cost of certain characters.
+|
+[[File:Fuzzy-regex-how-to-09.png]]
+|}
+=== Character Swaps ===
+{|cellpadding=10 cellspacing=5
+|style="width:40%" valign=top|
+Imagine a "6" is often mis-OCR'd as a "G", resulting in "6rooper" in the text data instead of "Grooper".  Knowing this is a common error, we might want to say the swap should ''not'' cost the full amount in this case.
+This is exactly the reason why Fuzzy Match Weightings exist!  In the list editor, you will adjust the swap cost using the following syntax:
+<code>(The mis-OCR'd character on the document)(The correct character in the regex pattern)=(Percentage change from 100% as decimal)</code>
+or, put another way:
+<code>{DocumentChar}{PatternChar}={Cost}</code>
+So, if we wanted to say whenever ''Fuzzy RegEx'' swaps a "6" for a "G" it should only cost ''half'' the amount, you'd use the following syntax:
+<code>6G=0.5</code>
+|
+[[File:Fuzzy-regex-how-to-10.png]]
+|-
+|valign=top|
+# In this case, the '''''Minimum Similarity''''' is bumped back up to the default of ''90%''.
+# The string "6rooper" now returns with a confidence score of 93% instead of 86%.
+Normally, the swap would cost a confidence of 14%, but the weighting <code>6G=0.5</code> multiplied that by "0.5", resulting in a swap cost of only 7%.
+Fuzzy Match Weightings work by first adjusting the normal baseline swap cost of a character.
+* Base Swap Cost({DocumentChar}{PatternChar}) x Weighting Value = Weighted Swap Cost
+* 14%({6}{G}) x 0.5 = 7%
+Then the weighted swap cost is subtracted, instead of the baseline swap cost to determine the confidence score.
+* 100% - Weighted Swap Cost = Confidence Score
+* 100% - 7% = 93%
+|
+[[File:Fuzzy-regex-how-to-11.png]]
+|}
 </tab>
 </tabs>