2023:Fuzzy RegEx (Concept): Difference between revisions
| Line 711: | Line 711: | ||
# However, due to this table's format, we're getting some false positives where the "TOTAL" in "TOTAL REVENUE" spans to the "SALES" in "SALES TAX", producing a match for our regex, <code>total sales</code> | # However, due to this table's format, we're getting some false positives where the "TOTAL" in "TOTAL REVENUE" spans to the "SALES" in "SALES TAX", producing a match for our regex, <code>total sales</code> | ||
| | | | ||
[[File: | [[File:2023_Fuzzy_Regex_04_How_To_04_4_Immutable_Characters_Introduction_01.png]] | ||
|- | |- | ||
|valign=top| | |valign=top| | ||
| Line 723: | Line 723: | ||
# However, the bottom table's "TOTAL SALES" header does indeed have some OCR errors. | # However, the bottom table's "TOTAL SALES" header does indeed have some OCR errors. | ||
| | | | ||
[[File: | [[File:2023_Fuzzy_Regex_04_How_To_04_4_Immutable_Characters_Introduction_02.png]] | ||
|} | |} | ||
</tab> | </tab> | ||
Revision as of 16:04, 15 November 2023
Fuzzy RegEx (also referred to as "fuzzy matching" or "fuzzy mode" or even just "fuzzy") allows regular expression patterns to match text within a set percentage of similarity. This can allow Grooper users to overcome unpredictable OCR errors when extracting data from documents.
Typically, regular expression will either match a string of text or it won't. If you're trying to match a word and the regex pattern is even a single character off from the text data, you will not return a result.
Fuzzy RegEx uses a Levenshtein distance equation to measure the difference between the regular expression and potential text matches. The percentage difference between the regex pattern and the matched text is expressed as a "confidence score" (also as a percentage). If the confidence is above a set threshold, the result is returned. If it is below the threshold, it is discarded.
For example, a text string that is 95% similar to the regex pattern may be off by just a single character. If the Minimum Similarity threshold is set to 90% the result would be returned, even though the pattern doesn't match the text exactly.
About
|
Fuzzy RegEx is a Match Mode option for an extractor's regular expression pattern. Any time you can get to a "Pattern Editor" in Grooper, you can take advantage of Fuzzy RegEx including:
|
The Problem
Standard regular expression is binary. Either a result will match the pattern or it won't. The only "wiggle room" is baked into the standard syntax, such as the wildcard "dot" (.) metacharacter or the question mark (?) metacharacter optionally matching a character or character group. Even then, it's all or nothing. The text will 100% match the pattern as written, or it 0% will not.
However, OCR results are imperfect. Characters are recognized as different from what is on the page. Spaces get inserted or removed between characters. Even with Image Processing and OCR Synthesis software as good as Grooper has, you can't assume 100% accurate OCR results on every document. And, even if the OCR results are one tiny character off, the pattern will fail to produce a match.
For example, it's not outside of the realm of possibility an OCR engine would see a cluster of pixels representing a "G" and erroneously recognize it as a "6", depending on the font used or quality of the scanned image.
How Fuzzy RegEx Solves the Problem
|
If you break up the string "Grooper" into characters, it is composed of seven characters. For the regex pattern But otherwise, six out of the seven characters match the regex pattern |
Being so close, it seems like the regex should match (Or at least we surely might like it to!). Fuzzy RegEx gives you the ability to return the results of these "near matches". As long as they are similar enough to the regular expression pattern, the result is still returned, allowing you to extract data from imperfect OCR results!
So, how do you say what should match and what shouldn't? After all, "6200932" is also seven characters long but certainly shouldn't match "Grooper". Should "Gopher" match? Should "Grapple"? Probably not. What is "similar enough"? How do you measure it? How do you control it?
|
Fuzzy RegEx measures the similarity between a regular expression pattern and a potential match in terms of percentage similarity. The regular expression pattern |
|
|
Fuzzy RegEx matches the string "6rooper" at the cost of the percentage of the different character. 100% -14% = 86%. The string "6rooper" is therefore 86% similar to the regex pattern The result is then assigned a "Confidence Score" of 86%. The extractor is 86% "sure" or confident, the result matches the regex pattern. Whether or not this result is actually returned by the extractor is determined by the Minimum Similarity property of Fuzzy RegEx. If this is set below 86%, the result will be returned. If it is set above 86%, it is not. |
If you've ever used auto correct on your cell phone, you're engaging in something similar. Your cell phone's software has a dictionary of common English words. If you type something that doesn't match that dictionary, it will automatically swap the word you type with the most similar word in the dictionary. While the exact algorithm may vary from the one we used, it is based on some variant of a Levenshtein distance equation.
Character Swaps and Swap Cost
Another way to define the Levenshtein distance between two words is the "minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other" [1]. We often informally refer to these edits as "swaps" and the effect these edits have on a match's confidence the "swap cost". The most similar fuzzy matched text to the regular expression will have the least "swap cost" and therefore highest confidence score.
|
This also means Fuzzy RegEx not only matches imperfect matches, but it mutates them as well. In our example where our regex Furthermore, the final output result would be altered to match the regex. In this case, "Grooper" In this way, Fuzzy RegEx can be used to cleanse imperfect (or "dirty") OCR data, manipulating it into what you want. |
Levenshtein distance will also measure the distance between two words where characters are missing or extra characters are present. The difference between "Grooper" and "Groooper" is just one extra "o" character. The difference between "Grooper" and "Groper" is one less "o" character.
As well as swapping characters to match the string, Fuzzy RegEx will insert or delete characters as well. Whether a true swap, insertion or deletion, the "swap cost" of a character is the same. In the examples below, a single character is deleted or inserted. Just as changing a single character in "6rooper" to match "Grooper" resulted in an 86% confidence score, inserting or deleting a single character to match the regex Grooper would also result in a confidence score of 86%.
| Fuzzy Character Deletion | Fuzzy Character Insertion | |
![]() |
|
Adjusting the Swap Cost: Fuzzy Match Weightings
Certain OCR errors are more common than others. Even a human being can have difficulty distinguishing between certain characters. For example, a capital "O" and a zero look extremely similar, even indistinguishable depending on the font. Especially if the data is a mix of numbers and letters, it may be difficult to say whether or not the character should be an "O" or a "0".
However, sometimes it's exceptionally easy to determine if the character should be an "O" or a "0". If you're looking for some kind of label or phrase on a document, it's highly unlikely a zero is going to be in the words you're looking for. English language doesn't use numerical characters as part of its alphabet. It doesn't make sense to have a "0" in the middle of a word. The context just doesn't provide for it.
Knowing this, it seems like some character swaps should not be as severe, depending on the circumstances. Fuzzy Match Weightings allow you to capitalize on this idea. This functionality allows you to manually adjust the base swap cost for swapping one character for another (or inserting or deleting a character). By adjusting the swap cost, you adjust the confidence score, resulting in a more confident match with the same number of character edits to match your regular expression pattern.
|
For example, the letter "G" can get misrecognized as a "6" during OCR, depending on the document, the fonts used, the OCR engine, or any of the other things that affect character recognition. As we've seen in our example above, the cost to swap a single character between the pattern What if we only want to return fuzzy matches with a confidence above 90%? We'd be out of luck here. The result would be discarded. |
|
|
However, we know a "G" can get misrecognized for a "6" during OCR, and the pattern we're trying to match shouldn't even have numerical characters. We might say it should not cost the full 14% confidence to swap the characters. Perhaps swapping a "6" in the document text data for the |
|
|
With a Fuzzy Match Weighting, we can do just that. This weighting would adjust the base swap cost for a "6" for a Now, the fuzzy match's confidence score is 93%. If we're only returning fuzzy matches with a confidence above 90%, the result would be returned. |
For more information on Fuzzy Match Weigthings, visit the How To section of this article.
Words of Caution
Fuzzy RegEx is an exceptional way to resolve unforeseen and unpreventable OCR errors when extracting data. However, it is not a "magic bullet". There are a few potential drawbacks to keep in mind when using Fuzzy RegEx.
False Positive Potential
The basic idea behind Fuzzy RegEx is it performs character edits (swaps, insertions, and deletions) to allow a string of text to match a regular expression pattern where otherwise it would not. It looks for matches that are "close enough" and returns them as valid. Often "close enough" is just fine, but there is potential that you return something you do not want as well.
How different are the phrases "consecutive years" and "consecutive days"? Semantically, they are very different. You'll find this type of language in legal agreements. If you agree to do something every day for a number of consecutive days, that's very different from doing something every year for a number of consecutive years.
However, Fuzzy RegEx doesn't know what "years" and "days" mean, only how similar they are based on how few character edits it takes. What the Levenshtein distance between them is. So, what's the difference? Just three characters.
| Fuzzy Pattern: | consecutive days
|
| Text Data: | consecutive years |
- Swap the "y" in "years" for a
d- Gives you "consecutive dears"
- Next, delete the "e" in "dears"
- Gives you "consecutive dars"
- Last, swap the "r" in "dars" for a
y- Gives you "consecutive days"
There's a total 16 characters in the pattern. Therefore, each swap costs 6.25% confidence. 3 swaps for a total swap cost of 18.75%. Fuzzy RegEx would match "consecutive years" for consecutive days with an 81.25% confidence.
It's important to recognize this potential to produce false positive results. Depending on your situation this can effect the accuracy of your results. You may be able to eliminate them by adjusting your Minimum Similarity threshold, only allowing higher confident results. You may need to exempt these results through other methods, such as with an exclusion extractor. You may need human review of Grooper's automated results to ensure total accuracy. Or, you may be fine with the margin of error this produces.
Whatever your take on the situation, be aware Fuzzy RegEx's ability to capture more data comes at the potential cost of capturing inaccurate data.
Incompatibility With Infinite Quantifiers
In regular expression, single characters can be pattern matched, as can ranges of characters or sets of characters. For example, \d will match a single digit character (0 through 9) and \d{2,5} will match digit characters two to five digits in length.
There are two infinite length quantifiers:
- The plus quantifier
+, matching a character or character set one to any number in length. - The star quantifier
*, matching a character or character set zero to any number in length.
Fuzzy RegEx looks at each character in the pattern, one after the other, compared to the text data to see if a character swap is required. If there are 20 characters in the regex pattern, Fuzzy RegEx is looking for 20 potential character swaps for every string of characters in the text data. With the + and * quantifiers, the length of the regex pattern is potentially infinite characters in length. Fuzzy RegEx would never stop looking to see if characters could be swapped.
Fuzzy RegEx is therefore incompatible with infinite quantifiers. Grooper will give you an error if the Mode is set to Fuzzy RegEx and these quantifiers are used in your pattern.
It's Slower Than Standard RegEx
While this may or may not ultimately matter, it's worth pointing out. Because Fuzzy RegEx is considering much more of the text data for potential matches than the binary "match" or "doesn't match" approach to standard regex, it requires more processing power to perform. It is therefore, slower.
This may be the difference between a Data Type taking 2 ms to extract a value from a document compared to 20 ms. This may not seem like much, but when you're processing thousands of documents and/or using hundreds of extractors, the processing time can add up. Furthermore, depending on the complexity of the extractor, this speed for a single extractor to return a single value may be even greater.
How To
Fuzzy RegEx Basics
Note from the Author: Fuzzy RegEx is intended to resolve issues around OCR errors. However, for the purposes of this tutorial, the text is simply misspelled on the "document". Either way, whether misspelled or mis-OCR'd Fuzzy RegEx works to swap characters between the regular expression pattern and whatever is in the text data obtained via the Recognize activity.
Getting to a Pattern Editor
You can enable Fuzzy RegEx any time you are writing a regular expression pattern in a "Pattern Editor". The "Pattern Editor" can be accessed at several points in Grooper:
|
When configuring a Data Format
|
|
|
|
|
When configuring the Pattern property of a Data Type
|
|
|
When choosing the Text Pattern option for various extractors in a property panel and configuring its Pattern property Including (but not limited to) configuring a Data Field's Value Extractor property.
For any Grooper object whose property panel has an extractor property with an Text Pattern option, you can get to the Pattern Editor. |
|
|
When choosing the Internal option for various extractors in a property panel and configuring its Pattern property Including (but not limited to) choosing the Internal option for the Pattern-Based Separation provider's Value Extractor.
For any Grooper object whose property panel has an extractor property with an Internal option, you can get to the Pattern Editor. |
Fuzzy Match Weightings Basics
Note from the Author: Fuzzy RegEx is intended to resolve issues around OCR errors. However, for the purposes of this tutorial, the text is simply misspelled on the "document". Either way, whether misspelled or mis-OCR'd Fuzzy RegEx works to swap characters between the regular expression pattern and whatever is in the text data obtained via the Recognize activity.
Adjust Fuzzy Match Weightings
By default, it costs the same to swap, insert, or delete a character regardless of the character. However, certain OCR mistakes are more common than others. For example, an "O" might get misrecognized as a zero (or vice versa).
You can manually change the swap cost associated certain characters using the Fuzzy Match Weightings properties.Character Swaps
Imagine a "6" is often mis-OCR'd as a "G", resulting in "6rooper" in the text data instead of "Grooper". Knowing this is a common error, we might want to say the swap should not cost the full amount in this case.
This is exactly the reason why Fuzzy Match Weightings exist! In the list editor, you will adjust the swap cost using the following syntax:
(The mis-OCR'd character on the document) (The correct character in the regex pattern) = (Percentage change as decimal)
or, put another way:
{DocumentChar}{PatternChar}=Cost
So, if we wanted to say whenever Fuzzy RegEx swaps a "6" for a "G" it should only cost half the amount, you'd use the following syntax:
6G=0.5
Fuzzy Match Weightings work by first adjusting the normal baseline swap cost of a character.
- Base Swap Cost(DocumentChar, PatternChar) x Weighting Value = Weighted Swap Cost
Then the weighted swap cost is subtracted, instead of the baseline swap cost to determine the confidence score.
- 100% - Weighted Swap Cost = Confidence Score

Normally, the swap would cost a confidence of 14%, but the weighting
6G=0.5 multiplied that by "0.5", resulting in a swap cost of only 7%.
| ⚠ |
A message concerning case sensitivity: For the character pair, {DocumentChar}{PatternChar}, one of the characters is case sensitive where the other is not.
This means However, |
| FYI | As well as decreasing the cost to perform a swap, you can also increase the cost.
|
Character Inserts
|
As well as adjusting the cost to make a true "swap" between two characters, you can also adjust the cost to insert a character. For example, we could cause "Groper" to fuzzy match In the list editor, you will adjust the insert swap cost using the following syntax:
or, put another way:
So, if we want to tell Fuzzy RegEx the cost to insert an "o" into a text string should be half of what it normally is, you'd use the following syntax:
|
|||
|
Normally, inserting this "o" to match the regex
Then the weighted swap cost is subtracted, instead of the baseline swap cost to determine the confidence score.
|
| FYI | One common real-world use of the insert fuzzy match weighting is to adjust the cost to insert spaces. Space characters can often get inappropriately inserted or removed during OCR.
To adjust the insert cost of a space character, simply type a literal space between the parentheses, as seen below:
|
Character Deletions
|
You can also adjust the cost to remove or delete a character. For example, we could cause "Groooper" to fuzzy match In the list editor, you will adjust the delete swap cost using the following syntax:
or, put another way:
So, if we want to tell Fuzzy RegEx the cost to remove an "o" from a text string should be half of what it normally is, you'd use the following syntax:
|
|||
|
Normally, removing an "o" to match the regex
Fuzzy Match Weightings work by first adjusting the normal baseline swap cost of a character (in this case, a delete cost).
Then the weighted swap cost is subtracted, instead of the baseline swap cost to determine the confidence score.
|
| FYI | One common real-world use of the delete fuzzy match weighting is to adjust the cost to delete extra spaces. Space characters can often get inappropriately inserted or removed during OCR.
To adjust the delete cost of a space character, simply type a literal space between the parentheses, as seen below:
|
Multiple Match Weightings

You can see next to the Local Entries property, 6 entries is listed. Each of the fuzzy match weightings we've entered into the Local Entries "List Editor" will be used to adjust confidence scores. You're not just limited to a single match weighting. You can create a fuzzy match weighting list as large or small as you would like.
Fuzzy Match Lexicons
Furthermore, you may find you have a single list of fuzzy match weightings that works well across many different documents. You probably don't want to re-enter that list over and over again every time you write a new regex pattern. This is what the Included Lexicons property under Fuzzy Match Weightings is for.
Rather than manually inputting the weightings into the Local Entries property's list editor, you can just point to a list of these weightings already entered into a Grooper Lexicon.
Grooper ships with a few pre-built "Fuzzy Match Lexicons", or you can build your own. You can find the pre-built Lexicons navigating the following path in the Grooper Node Tree:(root node) > Global Resources > Lexicons > Downloads > Weightings
Required Mode
Now that we know the basics lets look at a more advanced problem and how to solve it. The solution will involve "Required Mode", which will throw out fuzzy match results if portions of the regular expression pattern are swapped.
The Goal
|
We have a simple goal. We want to extract the label "Taxable Value" from each of the five repeating sections of this document. At the end of this, we should have an extractor that returns five instances of the result "Taxable Value", one for each section.
|
|
|
Much like any other extraction problem, we will need to look for ways to throw out false positive results. For example we won't want to match "Taxable Value" in the label at the bottom of the page, "Total Taxable Value".
|
The Fuzzy Pattern
|
With a standard regex pattern of We definitely need to use Fuzzy RegEx mode in order to match our other labels. |
|
|
|
|
Let's deal with throwing out the false positive result. We should be able to do this easily with Tab Marking. Tab Marking can assist in regular expression pattern matching by providing a text character (
|
|
Tabs characters are markers of large gaps of horizontal space. What distinguishes the label "Taxable Value" and the words "Taxable Value" in the label "Total Taxable Value" is the fact the word "Total" is right before the words. The gaps of space on the left and right side of "Taxable Value" denotes that it is its own label on the document. Knowing this, we can add a
|
The Problem
|
Knowing the label not being returned is just two characters different from the regex pattern, let's try dropping the minimum similarity to allow for the extra swap.
|
|
|
Why is that the case? What about our tab characters in the Prefix and Suffix Patterns? Why aren't they throwing out this result? Here's the rub. Fuzzy RegEx matches the entire regular expression pattern, not just what's in the Value Pattern but what's in the Prefix and Suffix Patterns as well. In this case the You can see this clearly using the "Fuzzy Extraction Visualizer" tab. This is a visualization of the math behind the Levenshtein distance equation used to calculate the minimum number of character swaps to produce a match.
|
|
|
The Solution - Required Mode
However, we don't want this swap to occur. We want to use these tab characters as hard anchors for returning a result. The necessarily should be there to return a result.
We can accomplish this through "Required Mode". Required Mode allows you to place portions of the regular expression in a "required group". If any character inside this group is swapped by Fuzzy RegEx the result is discarded from the results list.
|
|
|
|
|
Immutable Characters
You may run into a situation where you have a character or set of characters you don't want a pattern to mutate into. There is a special fuzzy match weighting called an "Immutable" that allows you to do just this. These weightings can prevent results from returning if a certain character is swapped for another in your regex pattern.
Introduction
|
Imagine you have a document like the one here and you want to match the column header "TOTAL SALES". The "TOTAL SALES" header in the top table looks very clean and will likely OCR with no issues. However, the "A" in "TOTAL" in the bottom table is less likely to be OCR'd correctly. We'll probably need to use Fuzzy RegEx to match it. |
|
|
Let's see what happens in a Pattern Editor. With this document imported and OCR'd in Grooper, this bears out, but with an additional problem.
|
|
|
As we've seen in the previous example, large whitespace characters, such as The typical method for throwing out these false positives is to enable Tab Marking. The large amount of whitespace will be recognized tab character instead of a single space.
|
The Problem - The Fuzzy Result
Let's enable Fuzzy RegEx and see if we get a match.
- Select the Mode property, and select Fuzzy RegEx.
- We are indeed getting a match for the "TOTAL SALES" header in the bottom table.
- However we are also getting a match where the labels span across columns again.
In this case, the tab character in the text data is swapped for a space character to match the space character in the regular expression total sales
The Solution - Immutable Fuzzy Weightings
The "Immutable" fuzzy match weighting allows you to list a character set you do not want to mutate. This will allow us to tell Grooper if you see a\t character in the text data, do not swap it for any character in the regex pattern. It makes the swap cost infinite, throwing out the fuzzy match.

|
The syntax for the Immutable character set is as follows:
In this case, we just have a single character we do not want to mutate. Our Immutable weighting is as follows:
| ||
|
With this fuzzy match weighting in place, the two results that mutated a
|
| ⚠ | It can be easy to confuse Required Mode with Immutable Characters and vice versa. While they both throw out fuzzy matches, they do it in very different ways.
Required Mode prevents a pattern character in the regex from matching a different character in the text data.
Immutable Characters prevents a document character in the text data from matching a different character in the regex.
|













































