2.90:Fuzzy RegEx (Concept): Difference between revisions
Dgreenwood (talk | contribs) |
Line 187: | Line 187:
|-
|valign=top|
''Fuzzy RegEx'' is a '''''Mode''''' property option, found in the "Properties" tab.
# Navigate to the "Properties" tab.
# Select the '''''Mode''''' property and choose ''Fuzzy RegEx''.
|
[[File:Fuzzy-regex-about-01.png]]
|}
Line 194: | Line 198:
<tab name="Adjust Minimum Similarity" style="margin:20px">
=== Adjust Minimum Similarity ===
{|cellpadding=10 cellspacing=5
|style="width:40%" valign=top|
# As currently configured, we still get only a single result: the exact match for the string "Grooper".
#* As described earlier in the article, the strings "6rooper", "Groper", and "Groooper" each differ from the regex pattern <code>Grooper</code> by just one character, which makes each roughly 86% similar to <code>Grooper</code>. The "swap cost" to match each of these strings to the pattern is 14% of the confidence.
# These fuzzy matches do not show up in our results list because of the '''''Minimum Similarity''''' property.
#* Since these strings only match with a confidence score of 86%, they fall ''below'' the default '''''Minimum Similarity''''' threshold of ''90%'' and are thrown out.
|
[[File:Fuzzy-regex-how-to-06.png]]
|-
|valign=top|
# Change the '''''Minimum Similarity''''' to ''80%''.
# Now, those three results are returned with a value of 86% in the "Confidence" column, reflecting the "swap cost" it took to mutate the string in the text data to match the regex pattern <code>Grooper</code>.
# Notice the string "Gr00per" is not returned. This string is not one, but two characters different from the regex pattern <code>Grooper</code>.
#* It would cost more confidence to swap the second character, resulting in an even lower confidence score.
|
[[File:Fuzzy-regex-how-to-07.png]]
|-
|valign=top|
# Change the '''''Minimum Similarity''''' to ''70%''.
# Now, this last string ''is'' returned with a confidence score of 71%.
|
[[File:Fuzzy-regex-how-to-08.png]]
|}
</tab>
<tab name="Adjust Fuzzy Match Weightings" style="margin:20px">
Revision as of 13:10, 17 November 2020
Fuzzy RegEx (also referred to as "fuzzy matching" or "fuzzy mode" or even just "fuzzy") allows regular expression patterns to match text within a set percentage of similarity. This can allow Grooper users to overcome unpredictable OCR errors when extracting data from documents.
Typically, a regular expression will either match a string of text or it won't. If you're trying to match a word and the regex pattern is even a single character off from the text data, you will not get a result.
Fuzzy RegEx uses a Levenshtein distance equation to measure the difference between the regular expression and potential text matches. The percentage similarity between the regex pattern and the matched text is expressed as a "confidence score". If the confidence is above a set threshold, the result is returned. If it is below the threshold, it is discarded.
For example, a text string that is 95% similar to the regex pattern may be off by just a single character. If the Minimum Similarity threshold is set to 90%, the result would be returned, even though the pattern doesn't match the text exactly.
About
Fuzzy RegEx is a Match Mode option for an extractor's regular expression pattern. Any time you can get to a "Pattern Editor" in Grooper, you can take advantage of Fuzzy RegEx.
The Problem
Standard regular expression is binary. Either a result will match the pattern or it won't. The only "wiggle room" is baked into the standard syntax, such as the wildcard "dot" (.) metacharacter or the question mark (?) metacharacter optionally matching a character or character group. Even then, it's all or nothing. The text either matches the pattern as written 100%, or it does not match at all.
However, OCR results are imperfect. Characters are recognized as different from what is on the page. Spaces get inserted or removed between characters. Even with Image Processing and OCR Synthesis software as good as Grooper has, you can't assume 100% accurate OCR results on every document. And, even if the OCR results are one tiny character off, the pattern will fail to produce a match.
For example, it's entirely possible for an OCR engine to see a cluster of pixels representing a "G" and erroneously recognize it as a "6", depending on the font used or the quality of the scanned image.
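To make the all-or-nothing behavior concrete, here is a minimal sketch using Python's standard `re` module. It is not Grooper's own regex engine, and the two sample strings are illustrative stand-ins for a clean OCR result and one where "G" was misread as "6".

```python
import re

# Literal pattern from the article's running example.
pattern = re.compile(r"Grooper")

# Illustrative OCR outputs: one clean, one with "G" misread as "6".
samples = ["Grooper", "6rooper"]

# A standard regex either matches in full or not at all.
results = {text: bool(pattern.fullmatch(text)) for text in samples}

for text, matched in results.items():
    print(f"{text!r}: {'match' if matched else 'no match'}")
```

A single wrong character is enough to turn a match into no match, which is the gap Fuzzy RegEx is meant to close.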
How Fuzzy RegEx Solves the Problem
If you break up the string "Grooper" into characters, it is composed of seven characters. For the regex pattern Grooper, the string "6rooper" fails to match because of its first character. But otherwise, six out of the seven characters match the regex pattern Grooper.
Being so close, it seems like the regex should match (or at least we would like it to!). Fuzzy RegEx gives you the ability to return the results of these "near matches". As long as they are similar enough to the regular expression pattern, the result is still returned, allowing you to extract data from imperfect OCR results!
So, how do you say what should match and what shouldn't? After all, "6200932" is also seven characters long but certainly shouldn't match "Grooper". Should "Gopher" match? Should "Grapple"? Probably not. What is "similar enough"? How do you measure it? How do you control it?
Fuzzy RegEx measures the similarity between a regular expression pattern and a potential match in terms of percentage similarity. The regular expression pattern Grooper is seven characters long, so each character accounts for roughly 14% of the pattern (1/7 ≈ 14%).
Fuzzy RegEx matches the string "6rooper" at the cost of the percentage of the different character. 100% - 14% = 86%. The string "6rooper" is therefore 86% similar to the regex pattern Grooper. The result is then assigned a "Confidence Score" of 86%. The extractor is 86% "sure", or confident, that the result matches the regex pattern. Whether or not this result is actually returned by the extractor is determined by the Minimum Similarity property of Fuzzy RegEx. If this is set below 86%, the result will be returned. If it is set above 86%, it will not.
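The arithmetic above, and the effect of the Minimum Similarity threshold, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scoring rule (each edit costs 1/length of the pattern) is inferred from the 86% and 71% figures in this article, the edit counts are taken from the article's examples rather than computed, and treating a score exactly equal to the threshold as a match is a guess. The names `confidence` and `returned` are hypothetical, not Grooper properties.

```python
PATTERN = "Grooper"

# Edit counts taken from the article's examples (edits needed to reach PATTERN).
EDITS = {"Grooper": 0, "6rooper": 1, "Groper": 1, "Groooper": 1, "Gr00per": 2}


def confidence(edits: int, pattern: str = PATTERN) -> float:
    """Assumed scoring: each edit costs 1/len(pattern) of the confidence."""
    return 1.0 - edits / len(pattern)


def returned(min_similarity: float) -> list[str]:
    """Keep candidates whose confidence meets the Minimum Similarity threshold."""
    return [text for text, n in EDITS.items() if confidence(n) >= min_similarity]


for threshold in (0.90, 0.80, 0.70):
    hits = returned(threshold)
    scores = [f"{text} ({confidence(EDITS[text]):.0%})" for text in hits]
    print(f"Minimum Similarity {threshold:.0%}: {', '.join(scores)}")
```

Lowering the threshold from 90% to 80% admits the three one-edit strings, and lowering it again to 70% admits the two-edit string "Gr00per".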
If you've ever used autocorrect on your cell phone, you've used something similar. Your cell phone's software has a dictionary of common English words. If you type something that doesn't match that dictionary, it will automatically swap the word you type with the most similar word in the dictionary. While the exact algorithm may vary from the one we use, it is based on some variant of a Levenshtein distance equation.
Character Swaps and Swap Cost
Another way to define the Levenshtein distance between two words is the "minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other" [1]. We often informally refer to these edits as "swaps" and to the effect these edits have on a match's confidence as the "swap cost". The fuzzy matched text most similar to the regular expression will have the lowest "swap cost" and therefore the highest Confidence score.
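For readers who want to see the definition in action, the following is a textbook dynamic-programming implementation of Levenshtein distance in Python. It is a generic sketch of the quoted definition, not Grooper's internal code, and the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    # previous[j] is the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


pattern = "Grooper"
for text in ["Grooper", "6rooper", "Groper", "Groooper", "Gr00per"]:
    print(f"{text}: {levenshtein(text, pattern)} edit(s)")
```

The one-character variants each come out at a distance of 1 and "Gr00per" at a distance of 2, which lines up with the swap counts used throughout this article.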
This also means Fuzzy RegEx not only returns imperfect matches, but mutates them as well. In our example where our regex Grooper fuzzy matched the string "6rooper", the "6" was swapped for a "G". Furthermore, the final output result would be altered to match the regex. In this case, the output would be "Grooper". In this way, Fuzzy RegEx can be used to cleanse imperfect (or "dirty") OCR data, manipulating it into what you want.
Levenshtein distance will also measure the distance between two words where characters are missing or extra characters are present. The difference between "Grooper" and "Groooper" is just one extra "o" character. The difference between "Grooper" and "Groper" is one less "o" character.
As well as swapping characters to match the string, Fuzzy RegEx will insert or delete characters. Whether a true swap, an insertion or a deletion, the "swap cost" is the same. In the examples below, a single character is deleted or inserted. Just as changing a single character in "6rooper" to match "Grooper" resulted in an 86% confidence score, inserting or deleting a single character to match the regex Grooper would also result in a confidence score of 86%.
[Figures: Fuzzy Character Deletion | Fuzzy Character Insertion]
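To see which kind of edit each near match needs, the sketch below labels the edits with Python's standard `difflib.SequenceMatcher`. Note the assumption: `SequenceMatcher` is a general sequence-alignment helper, not a Levenshtein implementation and not what Grooper uses, so it is borrowed here only to name the edit type for these simple one-character cases. The helper name `edit_kinds` is hypothetical.

```python
from difflib import SequenceMatcher


def edit_kinds(text: str, pattern: str) -> list[str]:
    """List the non-equal opcodes needed to turn text into pattern."""
    opcodes = SequenceMatcher(None, text, pattern, autojunk=False).get_opcodes()
    return [tag for tag, *_ in opcodes if tag != "equal"]


pattern = "Grooper"
for text in ["6rooper", "Groper", "Groooper"]:
    print(f"{text} -> {pattern}: {edit_kinds(text, pattern)}")
```

Each string needs exactly one edit (a replacement, an insertion, and a deletion respectively), which is why all three carry the same swap cost.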
Adjusting the Swap Cost: Fuzzy Match Weightings
How To
Fuzzy RegEx Basics
Note from the Author: Fuzzy RegEx is intended to resolve issues around OCR errors. However, for the purposes of this tutorial, the text is simply misspelled on the "document". Either way, whether misspelled or mis-OCR'd, Fuzzy RegEx works to swap characters between the regular expression pattern and whatever is in the text data obtained via the Recognize activity.
Getting to a Pattern Editor
You can enable Fuzzy RegEx any time you are writing a regular expression pattern in a "Pattern Editor". The "Pattern Editor" can be accessed at several points in Grooper:
When configuring a Data Format
When configuring the Pattern property of a Data Type
When choosing the Text Pattern option for various extractors in a property panel and configuring its Pattern property, including (but not limited to) configuring a Data Field's Value Extractor property. For any Grooper object whose property panel has an extractor property with a Text Pattern option, you can get to the Pattern Editor.
When choosing the Internal option for various extractors in a property panel and configuring its Pattern property, including (but not limited to) choosing the Internal option for the Pattern-Based Separation provider's Value Extractor. For any Grooper object whose property panel has an extractor property with an Internal option, you can get to the Pattern Editor.
Enable Fuzzy RegEx
Once you are in a "Pattern Editor" window, enabling Fuzzy RegEx is very simple.
Fuzzy RegEx is a Mode property option, found in the "Properties" tab.
Adjust Minimum Similarity