Fuzzy RegEx

From Grooper Wiki

Fuzzy RegEx (also referred to as "fuzzy matching" or "fuzzy mode" or even just "fuzzy") allows regular expression patterns to match text within a set percentage of similarity. This can allow Grooper users to overcome unpredictable OCR errors when extracting data from documents.

Typically, a regular expression will either match a string of text or it won't. If you're trying to match a word and the regex pattern is even a single character off from the text data, no result is returned.

Fuzzy RegEx uses a Levenshtein distance equation to measure the difference between the regular expression and potential text matches. The percentage similarity between the regex pattern and the matched text is expressed as a "confidence score". If the confidence is above a set threshold, the result is returned. If it is below the threshold, it is discarded.

For example, a text string that is 95% similar to the regex pattern may be off by just a single character. If the Minimum Similarity threshold is set to 90%, the result would be returned, even though the pattern doesn't match the text exactly.


About

Fuzzy RegEx is a Match Mode option for an extractor's regular expression pattern. Any time you can get to a "Pattern Editor" in Grooper, you can take advantage of Fuzzy RegEx, including:

  • When configuring a Data Format
  • When configuring the Pattern property of a Data Type
  • When choosing the Internal option for various extractors in a property panel and configuring its Pattern property
  • When choosing the Text Pattern option for various extractors in a property panel and configuring its Pattern property


In the "Pattern Editor" window, you can enable Fuzzy RegEx in two steps:

  1. Navigate to the "Properties" tab.
  2. Select the Mode property and choose Fuzzy RegEx

Fuzzy-regex-about-01.png


The Problem

Standard regular expression matching is binary. Either a result will match the pattern or it won't. The only "wiggle room" is baked into the standard syntax, such as the wildcard "dot" (.) metacharacter or the question mark (?) metacharacter optionally matching a character or character group. Even then, it's all or nothing. The text either matches the pattern as written 100%, or it does not match at all.

What's the difference between these two strings of text?

One starts with a "G". The other starts with the number six.

If we used a simple regular expression to literally match Grooper, the first would match, and the second would not.

Fuzzy-regex-about-02.png
Fuzzy-regex-about-03.png
Fuzzy-regex-about-04.png


However, OCR results are imperfect. Characters are recognized as different from what is on the page. Spaces get inserted or removed between characters. Even with Image Processing and OCR Synthesis software as good as Grooper has, you can't assume 100% accurate OCR results on every document. And, even if the OCR results are one tiny character off, the pattern will fail to produce a match.

For example, it's not outside of the realm of possibility an OCR engine would see a cluster of pixels representing a "G" and erroneously recognize it as a "6", depending on the font used or quality of the scanned image.

How Fuzzy RegEx Solves the Problem

If you break up the string "Grooper" into characters, it is composed of seven characters. For the regex pattern Grooper, each character in the pattern matches each character in the string starting with "G" through the last "r". The only difference between "Grooper" and "6rooper" is the "G" is replaced by a "6".

But otherwise, six out of the seven characters match the regex pattern Grooper.

Fuzzy-regex-about-05.png

Being so close, it seems like the regex should match (or at least we might like it to). Fuzzy RegEx gives you the ability to return the results of these "near matches". As long as they are similar enough to the regular expression pattern, the result is still returned, allowing you to extract data from imperfect OCR results!

So, how do you say what should match and what shouldn't? After all, "6200932" is also seven characters long but certainly shouldn't match "Grooper". Should "Gopher" match? Should "Grapple"? Probably not. What is "similar enough"? How do you measure it? How do you control it?

Fuzzy RegEx measures the similarity between a regular expression pattern and a potential match in terms of percentage similarity. The regular expression pattern Grooper matches the text string "Grooper" at a 100% similarity. It is an exact match. Being seven characters long, a single character in the string "Grooper" makes up roughly 14% of the word.

Fuzzy-regex-about-06.png

Fuzzy RegEx matches the string "6rooper" at the cost of the percentage of the different character. 100% - 14% = 86%. The string "6rooper" is therefore 86% similar to the regex pattern Grooper.

The result is then assigned a "Confidence Score" of 86%. The extractor is 86% "sure", or confident, that the result matches the regex pattern.

Whether or not this result is actually returned by the extractor is determined by the Minimum Similarity property of Fuzzy RegEx. If this is set below 86%, the result will be returned. If it is set above 86%, it is not.

Fuzzy-regex-about-07.png
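
The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the percentages described in this article, not Grooper's implementation. The function names are hypothetical, and treating a score exactly at the threshold as passing is an assumption (the article only describes thresholds set above or below the score).

```python
def confidence(pattern_length, edits):
    """Similarity as a fraction: each edit costs 1 / pattern_length."""
    return 1.0 - edits / pattern_length


def is_returned(score, minimum_similarity):
    # Assumption: a score exactly equal to the threshold passes.
    return score >= minimum_similarity


score = confidence(len("Grooper"), edits=1)   # "6rooper" is one edit away
print(f"{score:.0%}")                         # 86%
print(is_returned(score, 0.80))               # True
print(is_returned(score, 0.90))               # False
```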


If you've ever used autocorrect on your cell phone, you've engaged in something similar. Your cell phone's software has a dictionary of common English words. If you type something that doesn't match that dictionary, it will automatically swap the word you typed with the most similar word in the dictionary. While the exact algorithm may vary from the one Grooper uses, it is based on some variant of a Levenshtein distance equation.


Character Swaps and Swap Cost

Another way to define the Levenshtein distance between two words is the "minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other" [1]. We often informally refer to these edits as "swaps" and the effect these edits have on a match's confidence the "swap cost". The most similar fuzzy matched text to the regular expression will have the least "swap cost" and therefore highest confidence score.

This also means Fuzzy RegEx not only returns imperfect matches, but it mutates them as well. In our example where our regex Grooper fuzzy matches "6rooper", the "6" would be "swapped" for a "G", resulting in an 86% confident match.

Furthermore, the final output result would be altered to match the regex. In this case, the output is "Grooper".

In this way, Fuzzy RegEx can be used to cleanse imperfect (or "dirty") OCR data, manipulating it into what you want.

Fuzzy-regex-about-08.png

Levenshtein distance will also measure the distance between two words where characters are missing or extra characters are present. The difference between "Grooper" and "Groooper" is just one extra "o" character. The difference between "Grooper" and "Groper" is one less "o" character.

As well as swapping characters to match the string, Fuzzy RegEx will insert or delete characters as well. Whether a true swap, insertion or deletion, the "swap cost" of a character is the same. In the examples below, a single character is deleted or inserted. Just as changing a single character in "6rooper" to match "Grooper" resulted in an 86% confidence score, inserting or deleting a single character to match the regex Grooper would also result in a confidence score of 86%.

Fuzzy Character Deletion: Fuzzy-regex-about-09.png
Fuzzy Character Insertion: Fuzzy-regex-about-10.png
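
For readers who want to see the distance calculation itself, here is a minimal sketch of the textbook Levenshtein algorithm in Python. It treats the pattern as a literal string, which is an assumption made for illustration. Grooper's Fuzzy RegEx works against regular expression patterns and its internal algorithm is not shown in this article.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn `source` into `target`."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitute = previous[j - 1] + (s_char != t_char)
            delete = previous[j] + 1       # drop s_char from the source
            insert = current[j - 1] + 1    # add t_char to the source
            current.append(min(substitute, delete, insert))
        previous = current
    return previous[-1]


pattern = "Grooper"
for text in ("Grooper", "6rooper", "Groper", "Groooper"):
    edits = levenshtein(text, pattern)
    print(text, edits, f"{1 - edits / len(pattern):.0%}")
```

The swap, the insertion, and the deletion each come out as a single edit, so all three near matches score the same 86%.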

Adjusting the Swap Cost: Fuzzy Match Weightings

Certain OCR errors are more common than others. Even a human being can have difficulty distinguishing between certain characters. For example, a capital "O" and a zero look extremely similar, even indistinguishable depending on the font. Especially if the data is a mix of numbers and letters, it may be difficult to say whether or not the character should be an "O" or a "0".

However, sometimes it's exceptionally easy to determine if the character should be an "O" or a "0". If you're looking for some kind of label or phrase on a document, it's highly unlikely a zero is going to be in the words you're looking for. The English language doesn't use numerical characters as part of its alphabet. It doesn't make sense to have a "0" in the middle of a word. The context just doesn't provide for it.

Knowing this, it seems like some character swaps should not be as severe, depending on the circumstances. Fuzzy Match Weightings allow you to capitalize on this idea. This functionality allows you to manually adjust the base swap cost for swapping one character for another (or inserting or deleting a character). By adjusting the swap cost, you adjust the confidence score, resulting in a more confident match with the same number of character edits to match your regular expression pattern.

For example, the letter "G" can get misrecognized as a "6" during OCR, depending on the document, the fonts used, the OCR engine, or any of the other things that affect character recognition. As we've seen in our example above, the cost to swap a single character between the pattern Grooper and the string "6rooper" is roughly 14%.

What if we only want to return fuzzy matches with a confidence above 90%? We'd be out of luck here. The result would be discarded.

Fuzzy-regex-about-07.png

However, we know a "G" can get misrecognized as a "6" during OCR, and the pattern we're trying to match shouldn't even have numerical characters. We might say it should not cost the full 14% confidence to swap the characters.

Perhaps swapping a "6" in the document text data for the G in the pattern should only cost half the swap cost.

Fuzzy-regex-about-11.png

With a Fuzzy Match Weighting, we can do just that. This weighting would adjust the base swap cost of swapping a "6" for a G. The base swap cost of 14% would be halved, down to a weighted swap cost of 7%.

Now, the fuzzy match's confidence score is 93%. If we're only returning fuzzy matches with a confidence above 90%, the result would be returned.

Fuzzy-regex-about-12.png
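
As a rough sketch of that calculation, the Python below scales each edit's base cost by a weighting value before subtracting it. The function name and the list-of-weights representation are hypothetical and only mirror the percentages in this article.

```python
def weighted_confidence(pattern_length, edit_weights):
    """Each edit costs (1 / pattern_length) scaled by its weighting value.
    `edit_weights` holds one weighting value per edit (1.0 = full cost)."""
    base_cost = 1.0 / pattern_length
    return 1.0 - sum(base_cost * weight for weight in edit_weights)


unweighted = weighted_confidence(7, [1.0])    # one full-cost swap
weighted = weighted_confidence(7, [0.5])      # the same swap at half cost
print(f"{unweighted:.0%} -> {weighted:.0%}")  # 86% -> 93%
```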

For more information on Fuzzy Match Weightings, visit the How To section of this article.

Words of Caution

Fuzzy RegEx is an exceptional way to resolve unforeseen and unpreventable OCR errors when extracting data. However, it is not a "magic bullet". There are a few potential drawbacks to keep in mind when using Fuzzy RegEx.

False Positive Potential

The basic idea behind Fuzzy RegEx is it performs character edits (swaps, insertions, and deletions) to allow a string of text to match a regular expression pattern where otherwise it would not. It looks for matches that are "close enough" and returns them as valid. Often "close enough" is just fine, but there is potential that you return something you do not want as well.

How different are the phrases "consecutive years" and "consecutive days"? Semantically, they are very different. You'll find this type of language in legal agreements. If you agree to do something every day for a number of consecutive days, that's very different from doing something every year for a number of consecutive years.

However, Fuzzy RegEx doesn't know what "years" and "days" mean. It only knows how similar they are based on how few character edits it takes to turn one into the other, that is, the Levenshtein distance between them. So, what's the difference? Just three characters.

Fuzzy Pattern: consecutive days
Text Data: consecutive years
  1. Swap the "y" in "years" for a d
    • Gives you "consecutive dears"
  2. Next, delete the "e" in "dears"
    • Gives you "consecutive dars"
  3. Last, swap the "r" in "dars" for a y
    • Gives you "consecutive days"

There are a total of 16 characters in the pattern. Therefore, each swap costs 6.25% confidence. Three swaps give a total swap cost of 18.75%. Fuzzy RegEx would match "consecutive years" for consecutive days with an 81.25% confidence.
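
The three edits can be replayed step by step. The sketch below uses a hypothetical `apply_edit` helper written for this illustration (it is not a Grooper function) and then repeats the confidence arithmetic.

```python
def apply_edit(text, operation, index, char=""):
    """Apply one single-character edit at `index` and return the new string."""
    if operation == "swap":
        return text[:index] + char + text[index + 1:]
    if operation == "delete":
        return text[:index] + text[index + 1:]
    if operation == "insert":
        return text[:index] + char + text[index:]
    raise ValueError(f"unknown operation: {operation}")


pattern = "consecutive days"
text = "consecutive years"

edits = [
    ("swap", 12, "d"),    # "years" -> "dears"
    ("delete", 13, ""),   # "dears" -> "dars"
    ("swap", 14, "y"),    # "dars"  -> "days"
]
for operation, index, char in edits:
    text = apply_edit(text, operation, index, char)
    print(text)

score = 1 - len(edits) / len(pattern)
print(f"{score:.2%}")     # 81.25%
```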

It's important to recognize this potential to produce false positive results. Depending on your situation, this can affect the accuracy of your results. You may be able to eliminate them by adjusting your Minimum Similarity threshold, only allowing higher-confidence results. You may need to exempt these results through other methods, such as with an exclusion extractor. You may need human review of Grooper's automated results to ensure total accuracy. Or, you may be fine with the margin of error this produces.

Whatever your take on the situation, be aware Fuzzy RegEx's ability to capture more data comes at the potential cost of capturing inaccurate data.

Incompatibility With Infinite Quantifiers

In regular expressions, single characters can be pattern matched, as can ranges of characters or sets of characters. For example, \d will match a single digit character (0 through 9) and \d{2,5} will match a run of two to five digit characters.

There are two infinite length quantifiers:

  • The plus quantifier +, matching a character or character set one to any number in length.
  • The star quantifier *, matching a character or character set zero to any number in length.

Fuzzy RegEx looks at each character in the pattern, one after the other, compared to the text data to see if a character swap is required. If there are 20 characters in the regex pattern, Fuzzy RegEx is looking for 20 potential character swaps for every string of characters in the text data. With the + and * quantifiers, the length of the regex pattern is potentially infinite characters in length. Fuzzy RegEx would never stop looking to see if characters could be swapped.

Fuzzy RegEx is therefore incompatible with infinite quantifiers. Grooper will give you an error if the Mode is set to Fuzzy RegEx and these quantifiers are used in your pattern.
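
If you want to screen patterns for these quantifiers before pasting them into a Pattern Editor, a simple check might look like the sketch below. This is a hypothetical helper, not Grooper's validation logic. It only looks for the two quantifiers named above, and its handling of escapes and character classes is deliberately simplified.

```python
def has_infinite_quantifier(pattern):
    """Return True if `pattern` contains an unescaped + or * outside
    a [...] character class."""
    in_class = False
    i = 0
    while i < len(pattern):
        char = pattern[i]
        if char == "\\":
            i += 2                  # skip the escaped character
            continue
        if in_class:
            if char == "]":
                in_class = False
        elif char == "[":
            in_class = True
        elif char in "+*":
            return True
        i += 1
    return False


print(has_infinite_quantifier(r"\d+"))       # True
print(has_infinite_quantifier(r"\d{2,5}"))   # False
```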

It's Slower Than Standard RegEx

While this may or may not ultimately matter, it's worth pointing out. Because Fuzzy RegEx considers much more of the text data for potential matches than the binary "match" or "doesn't match" approach of standard regex, it requires more processing power to perform. It is therefore slower.

This may be the difference between a Data Type taking 2 ms to extract a value from a document compared to 20 ms. This may not seem like much, but when you're processing thousands of documents and/or using hundreds of extractors, the processing time can add up. Furthermore, depending on the complexity of the extractor, the time for a single extractor to return a single value may be even longer.

How To

Fuzzy RegEx Basics

Note from the Author: Fuzzy RegEx is intended to resolve issues around OCR errors. However, for the purposes of this tutorial, the text is simply misspelled on the "document". Either way, whether misspelled or mis-OCR'd, Fuzzy RegEx works to swap characters between the regular expression pattern and whatever is in the text data obtained via the Recognize activity.

Getting to a Pattern Editor

You can enable Fuzzy RegEx any time you are writing a regular expression pattern in a "Pattern Editor". The "Pattern Editor" can be accessed at several points in Grooper:

When configuring a Data Format

  1. Select a Data Format
    • Note: Data Formats are always and only children of Data Type objects.
  2. That's it! That's all you need to do. Data Formats are very simple extractors. Their configuration panel is itself a "Pattern Editor" window.

Fuzzy-regex-how-to-01.png

When configuring the Pattern property of a Data Type

  1. Select a Data Type.
  2. Select the Pattern property.
  3. Press the ellipsis button at the end to bring up the "Pattern Editor" window.

Fuzzy-regex-how-to-02.png

When choosing the Text Pattern option for various extractors in a property panel and configuring its Pattern property

Including (but not limited to) configuring a Data Field's Value Extractor property.

  1. Select a Data Field
  2. Select the Value Extractor property and choose Text Pattern.
  3. Expand the Text Pattern sub properties, select Pattern
  4. Press the ellipsis button at the end to bring up the "Pattern Editor" window.

For any Grooper object whose property panel has an extractor property with a Text Pattern option, you can get to the Pattern Editor.

Fuzzy-regex-how-to-03.png

When choosing the Internal option for various extractors in a property panel and configuring its Pattern property

Including (but not limited to) choosing the Internal option for the Pattern-Based Separation provider's Value Extractor.

  1. Select a Separation Profile
  2. Select the Provider property and choose Pattern-Based Separation.
  3. Expand the Pattern-Based Separation sub properties. Select Value Pattern, and expand its sub properties. Select Type and choose Internal.
  4. Expand the Internal sub properties and select the Pattern property.
  5. Press the ellipsis button at the end to bring up the "Pattern Editor" window.

For any Grooper object whose property panel has an extractor property with an Internal option, you can get to the Pattern Editor.

Fuzzy-regex-how-to-04.png

Enable Fuzzy RegEx

Once you are in a "Pattern Editor" window, enabling Fuzzy RegEx is very simple.

  1. The first thing you'll want to do is type your regular expression pattern like normal in the Value Pattern editor.
    • For this tutorial we are simply matching the word Grooper.
  2. With standard regex matching, we only produce a single result, the string "Grooper" that matches the regex pattern perfectly.

Fuzzy-regex-how-to-05.png

Fuzzy RegEx is a Mode property option, found in the "Properties" tab.

  1. Navigate to the "Properties" tab.
  2. Select the Mode property and choose Fuzzy RegEx

Fuzzy-regex-about-01.png

Adjust Minimum Similarity

  1. As configured currently, we still only get a single result, the perfect match for the string "Grooper".
    • From our information earlier in the article, the strings "6rooper", "Groper", and "Groooper" are just one character different from the regex pattern Grooper. These equate to roughly 86% similar to Grooper. The "swap cost" to match these strings to the pattern is 14% confidence.
  2. The reason these fuzzy matches are not showing up in our results list is due to the Minimum Similarity property.
    • Since these strings only match with a confidence score of 86%, they fall below the default Minimum Similarity threshold of 90% and are thrown out.

Fuzzy-regex-how-to-06.png

  1. Change the Minimum Similarity to 80%.
  2. Now, those three results are returned with a value of 86% in the "Confidence" column, reflecting the 14% "swap cost" it took to mutate the string in the text data to match the regex pattern Grooper.
  3. Notice the string "Gr00per" is not returned. This string is not one, but two characters different from the regex pattern Grooper.
    • It would cost more confidence to swap the second character, resulting in an even lower confidence score.

Fuzzy-regex-how-to-07.png

  1. Change the Minimum Similarity to 70%
  2. Now, this last string is returned with a confidence score of 71%.

Fuzzy-regex-how-to-08.png
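
The effect of each threshold in this walkthrough can be reproduced with a short Python sketch. The edit counts are taken from the steps above. The helper name is hypothetical, and treating a score equal to the threshold as passing is an assumption.

```python
PATTERN_LENGTH = len("Grooper")

# Number of character edits each string needs to match the pattern Grooper.
edits_needed = {
    "Grooper": 0,
    "6rooper": 1,
    "Groper": 1,
    "Groooper": 1,
    "Gr00per": 2,
}


def matches_at(minimum_similarity):
    """Strings whose similarity meets the Minimum Similarity threshold."""
    return [
        text
        for text, edits in edits_needed.items()
        if 1 - edits / PATTERN_LENGTH >= minimum_similarity
    ]


for threshold in (0.90, 0.80, 0.70):
    print(f"{threshold:.0%}: {matches_at(threshold)}")
```

At 90% only the exact match survives, at 80% the three one-edit strings join it, and at 70% "Gr00per" is returned as well.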

Fuzzy Match Weightings Basics

Adjust Fuzzy Match Weightings

By default, it costs the same to swap, insert, or delete a character regardless of the character. However, certain OCR mistakes are more common than others. For example, an "O" might get misrecognized as a zero (or vice versa).

You can manually change the swap cost associated with certain characters using the Fuzzy Match Weightings properties.

  1. In the Fuzzy Matching Options section, expand the Fuzzy Match Weightings properties.
    • You can alter the match weightings using either the Local Entries or the Included Lexicons properties (or both). We will focus on Local Entries for the time being.
  2. Select the Local Entries property and press the ellipsis button at the end.
  3. This will bring up a "List Editor" window.

Depending on what you want to do, what you will enter here will change somewhat. Character swaps, inserts and deletions all have different syntax to alter the swap cost of certain characters.

Fuzzy-regex-how-to-09.png

Character Swaps

Imagine a "G" is often mis-OCR'd as a "6", resulting in "6rooper" in the text data instead of "Grooper". Knowing this is a common error, we might want to say the swap should not cost the full amount in this case.

This is exactly the reason why Fuzzy Match Weightings exist! In the list editor, you will adjust the swap cost using the following syntax:

(The mis-OCR'd character on the document) (The correct character in the regex pattern) = (Percentage change as decimal)

or, put another way:

{DocumentChar}{PatternChar}=Cost

So, if we wanted to say whenever Fuzzy RegEx swaps a "6" for a "G" it should only cost half the amount, you'd use the following syntax:

6G=0.5

Fuzzy Match Weightings work by first adjusting the normal baseline swap cost of a character.

  • Base Swap Cost(DocumentChar, PatternChar) x Weighting Value = Weighted Swap Cost

Then the weighted swap cost, instead of the baseline swap cost, is subtracted from 100% to determine the confidence score.

  • 100% - Weighted Swap Cost = Confidence Score

Fuzzy-regex-how-to-10.png

Normally, the swap would cost a confidence of 14%, but the weighting 6G=0.5 multiplied that by "0.5", resulting in a swap cost of only 7%.


  1. In this case, the Minimum Similarity is bumped back up to the default of 90%.
  2. The string "6rooper" now returns (mutated to "Grooper") with a confidence score of 93% instead of 86%.


Fuzzy Match Weightings work by first adjusting the normal baseline swap cost of a character.

  • Base Swap Cost({DocumentChar}{PatternChar}) x Weighting Value = Weighted Swap Cost
  • 14%(6, G) x 0.5 = 7%

Then the weighted swap cost, instead of the baseline swap cost, is subtracted from 100% to determine the confidence score.

  • 100% - Weighted Swap Cost = Confidence Score
  • 100% - 7% = 93%

Fuzzy-regex-how-to-11.png

A message concerning case sensitivity:

For the character pair, {DocumentChar}{PatternChar}, one of the characters is case sensitive where the other is not.

{DocumentChar} is case sensitive.
{PatternChar} is not case sensitive.

This means s5=0.5 would only adjust the cost to swap a lowercase "s" in the text data for a "5" in the regex pattern.

However, 5s=0.5 would adjust the cost to swap a "5" in the text data for either a lowercase "s" or an uppercase "S" in the regex pattern.

FYI As well as decreasing the cost to perform a swap, you can also increase the cost.

6G=0.5 would decrease the cost to perform a swap from "6" to "G" by 50%.

6G=1.5 would increase the cost to perform a swap from "6" to "G" by 50%.
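
The case sensitivity rule can be expressed as a small lookup. The Python below is a sketch under the assumption that weightings are held in a dictionary keyed by the two-character {DocumentChar}{PatternChar} pair. The function name and data structure are hypothetical, not part of Grooper.

```python
def swap_weight(weights, document_char, pattern_char):
    """Return the weighting value for swapping `document_char` (from the
    text data) for `pattern_char` (from the regex pattern)."""
    for key, weight in weights.items():
        key_document, key_pattern = key[0], key[1]
        # DocumentChar is compared exactly; PatternChar ignores case.
        if key_document == document_char and key_pattern.lower() == pattern_char.lower():
            return weight
    return 1.0  # no entry: the swap keeps its full base cost


print(swap_weight({"s5": 0.5}, "s", "5"))   # 0.5
print(swap_weight({"s5": 0.5}, "S", "5"))   # 1.0
print(swap_weight({"5s": 0.5}, "5", "S"))   # 0.5
```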

Character Inserts

As well as adjusting the cost to make a true "swap" between two characters, you can also adjust the cost to insert a character. For example, we could cause "Groper" to fuzzy match Grooper at a higher confidence score using fuzzy match weighting that decreases the cost of inserting an "o" character.

In the list editor, you will adjust the insert swap cost using the following syntax:

Insert(The character or characters you want to insert) = (Percentage change as a decimal)

or, put another way:

Insert(CharSet)=Cost

So, if we want to tell Fuzzy RegEx the cost to insert an "o" into a text string should be half of what it normally is, you'd use the following syntax:

Insert(o)=0.5
FYI You can also adjust the cost of multiple character inserts all at once by simply including more characters in the character set.

For example, if you wanted to reduce the cost to insert a period (.), comma (,) or semicolon(;) by half, you could use the following fuzzy match weighting:

Insert(.,;)=0.5

Fuzzy-regex-how-to-12.png

Normally, inserting this "o" to match the regex Grooper would cost a confidence of 14%, but the weighting Insert(o)=0.5 multiplied that by "0.5", resulting in a swap cost of only 7%.


  1. In this case, the Minimum Similarity is bumped back up to the default of 90%.
  2. The string "Groper" now returns (mutated to "Grooper") with a confidence score of 93% instead of 86%.


Fuzzy Match Weightings work by first adjusting the normal baseline swap cost of a character (in this case, an insert cost).

  • Base Insert Cost(CharSet) x Weighting Value = Weighted Swap Cost
  • 14%(o) x 0.5 = 7%

Then the weighted swap cost, instead of the baseline swap cost, is subtracted from 100% to determine the confidence score.

  • 100% - Weighted Swap Cost = Confidence Score
  • 100% - 7% = 93%

Fuzzy-regex-how-to-13.png

FYI One common real-world use of the insert fuzzy match weighting is to adjust the cost to insert spaces. Space characters can often get inappropriately inserted or removed during OCR.

To adjust the insert cost of a space character, simply type a literal space between the parentheses, as seen below:

Insert( )=0.25

Character Deletions

You can also adjust the cost to remove or delete a character. For example, we could cause "Groooper" to fuzzy match Grooper at a higher confidence score using fuzzy match weighting that decreases the cost of deleting an "o" character.

In the list editor, you will adjust the delete swap cost using the following syntax:

Delete(The character or characters you want to delete) = (Percentage change as a decimal)

or, put another way:

Delete(CharSet)=Cost

So, if we want to tell Fuzzy RegEx the cost to remove an "o" from a text string should be half of what it normally is, you'd use the following syntax:

Delete(o)=0.5
FYI You can also adjust the cost of multiple character deletions all at once by simply including more characters in the character set.

For example, if you wanted to reduce the cost to delete a period (.), comma (,) or semicolon(;) by half, you could use the following fuzzy match weighting:

Delete(.,;)=0.5

Fuzzy-regex-how-to-14.png

Normally, removing an "o" to match the regex Grooper would cost a confidence of 14%, but the weighting Delete(o)=0.5 multiplied that by "0.5", resulting in a swap cost of only 7%.


  1. In this case, the Minimum Similarity is bumped back up to the default of 90%.
  2. The string "Groooper" now returns (mutated to "Grooper") with a confidence score of 93% instead of 86%.


Fuzzy Match Weightings work by first adjusting the normal baseline swap cost of a character (in this case, a delete cost).

  • Base Delete Cost(CharSet) x Weighting Value = Weighted Swap Cost
  • 14%(o) x 0.5 = 7%

Then the weighted swap cost, instead of the baseline swap cost, is subtracted from 100% to determine the confidence score.

  • 100% - Weighted Swap Cost = Confidence Score
  • 100% - 7% = 93%

Fuzzy-regex-how-to-15.png

FYI One common real-world use of the delete fuzzy match weighting is to adjust the cost to delete extra spaces. Space characters can often get inappropriately inserted or removed during OCR.

To adjust the delete cost of a space character, simply type a literal space between the parentheses, as seen below:

Delete( )=0.25

Multiple Match Weightings

As its name implies, the "List Editor" allows you to collect multiple fuzzy match weightings.

Here, we have six items in our list, each changing the base swap costs of certain characters differently.

Fuzzy-regex-how-to-16.png

You can see "6 entries" listed next to the Local Entries property.

Each of the fuzzy match weightings we've entered into the Local Entries "List Editor" will be used to adjust confidence scores.

You're not just limited to a single match weighting. You can create a fuzzy match weighting list as large or small as you would like.

Fuzzy-regex-how-to-17.png
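
To make the three entry syntaxes concrete, here is a minimal Python sketch that parses a list of weightings like the ones shown in this section. The parser, its regular expression, and the tuple layout it returns are assumptions made for illustration. Grooper's own parsing rules may differ.

```python
import re

# Insert(CharSet)=Cost, Delete(CharSet)=Cost, or {DocumentChar}{PatternChar}=Cost
ENTRY = re.compile(r"^(?:(Insert|Delete)\((.+)\)|(.)(.))=(\d+(?:\.\d+)?)$")


def parse_weighting(entry):
    """Parse one weighting entry into (operation, characters, cost)."""
    match = ENTRY.match(entry)
    if match is None:
        raise ValueError(f"unrecognized weighting: {entry!r}")
    operation, char_set, document_char, pattern_char, cost = match.groups()
    if operation:
        # Every character between the parentheses is part of the set.
        return (operation, set(char_set), float(cost))
    return ("Swap", (document_char, pattern_char), float(cost))


for entry in ["6G=0.5", "Insert(o)=0.5", "Delete(.,;)=0.5", "Insert( )=0.25"]:
    print(parse_weighting(entry))
```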

Fuzzy Match Lexicons

Furthermore, you may find you have a single list of fuzzy match weightings that works well across many different documents. You probably don't want to re-enter that list over and over again every time you write a new regex pattern. This is what the Included Lexicons property under Fuzzy Match Weightings is for.

Rather than manually inputting the weightings into the Local Entries property's list editor, you can just point to a list of these weightings already entered into a Grooper Lexicon.

Grooper ships with a few pre-built "Fuzzy Match Lexicons", or you can build your own. You can find the pre-built Lexicons by navigating the following path in the Grooper Node Tree:

(root node) > Global Resources > Lexicons > Downloads > Weightings

The Lexicon selected here contains some adjusted weightings for common OCR errors such as erroneously recognizing an "O" as a zero.

Fuzzy-regex-how-to-18.png

To reference a Lexicon containing fuzzy match weightings:

  1. Under the Fuzzy Match Weightings sub-properties, select the Included Lexicons property.
  2. Using the dropdown menu, select one or more Lexicons containing fuzzy match weightings by checking the box next to the Lexicon.

Fuzzy-regex-how-to-19.png

Two fuzzy matches are returned using the weightings in this Lexicon.

  1. The string "6rooper" returns a result due to the weighting 6G=0.5
    • The cost to swap the "6" for a "G" is reduced by 50%. The normal cost to swap the character is reduced from 14% to 7%, resulting in an overall confidence score of 93% for the result.
  2. The string "Gr00per" returns a result due to the weighting 0O=0.25
    • The cost to swap each "0" for an "o" is reduced to 25% of its normal value. Each swap drops from 14% to 3.5%, and the two swaps together cost 7%, resulting in an overall confidence score of 93% for the result.

Fuzzy-regex-how-to-20.png
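
Both results can be reproduced with a weighted version of the Levenshtein calculation. The Python below is a sketch under several assumptions: the pattern is treated as a literal string, only swap weightings are supported (inserts and deletes keep their full cost), exact character matches are compared case sensitively, and the function name is hypothetical. It is not Grooper's implementation.

```python
def weighted_similarity(text, pattern, swap_weights):
    """Similarity of `text` to a literal `pattern`, where swap costs are
    scaled by entries keyed as {DocumentChar}{PatternChar}."""

    def swap_cost(document_char, pattern_char):
        if document_char == pattern_char:
            return 0.0
        # DocumentChar is case sensitive; PatternChar is not.
        for key, weight in swap_weights.items():
            if key[0] == document_char and key[1].lower() == pattern_char.lower():
                return weight
        return 1.0

    previous = [float(j) for j in range(len(pattern) + 1)]
    for i, document_char in enumerate(text, start=1):
        current = [float(i)]
        for j, pattern_char in enumerate(pattern, start=1):
            current.append(min(
                previous[j - 1] + swap_cost(document_char, pattern_char),
                previous[j] + 1.0,      # delete document_char
                current[j - 1] + 1.0,   # insert pattern_char
            ))
        previous = current
    return 1.0 - previous[-1] / len(pattern)


weights = {"6G": 0.5, "0O": 0.25}
for text in ("6rooper", "Gr00per"):
    print(text, f"{weighted_similarity(text, 'Grooper', weights):.0%}")
```

One swap at half cost and two swaps at quarter cost both total half of a single base swap, which is why the two strings land on the same 93% score.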

Furthermore, you are not limited to either Local Entries or Included Lexicons. You can use both.

Here, the Fuzzy Match Weightings properties are configured to reference the "Fuzzy Match Weightings" Lexicon in the Global Resources folder as well as Insert(o)=0.5 and Delete(o)=0.5 in the Local Entries.

Fuzzy-regex-how-to-21.png

Required Mode

Now that we know the basics, let's look at a more advanced problem and how to solve it. The solution will involve "Required Mode", which will throw out fuzzy match results if portions of the regular expression pattern are swapped.

The Goal

We have a simple goal. We want to extract the label "Taxable Value" from each of the five repeating sections of this document. At the end of this, we should have an extractor that returns five instances of the result "Taxable Value", one for each section.


However, this document was purposefully degraded to have some troubling OCR results. We will need to use Fuzzy RegEx to return most of these labels.

Fuzzy-regex-how-to-22.png

Much like any other extraction problem, we will need to look for ways to throw out false positive results. For example, we won't want to match "Taxable Value" in the label at the bottom of the page, "Total Taxable Value".


Since Fuzzy RegEx swaps characters to mutate text data to match a regular expression pattern, there is a danger that some text in the document is improperly mutated, giving a false positive result.

Fuzzy-regex-how-to-23.png

The Fuzzy Pattern

With a standard regex pattern of taxable value, only two (out of our five desired) results are returned.

We definitely need to use Fuzzy RegEx mode in order to match our other labels.

Fuzzy-regex-how-to-24.png

  1. Here, the Mode property has been changed to Fuzzy RegEx, using the default settings.
  2. Notice we are now returning labels whose text does not perfectly match the regex pattern taxable value.
    • This label's text data is "Taxable Valuo", just one character off.
  3. We also are not returning the second of our five "Taxable Value" labels.
    • This label's text data is "Taxabte Valuo", two characters off.
  4. We are also improperly returning part of the "Total Taxable Value" label.

Fuzzy-regex-how-to-25.png

Let's deal with throwing out the false positive result. We should be able to do this easily with Tab Marking. Tab Marking can assist in regular expression pattern matching by providing a text character (\t) that can match large amounts of whitespace between characters.

  1. Expand Preprocessing Options and enable the Tab Marking property.

Fuzzy-regex-how-to-26.png

  1. Switch back to the "Pattern Editor" tab.

Tab characters are markers of large gaps of horizontal space. What distinguishes the label "Taxable Value" from the words "Taxable Value" in the label "Total Taxable Value" is the fact that the word "Total" comes right before those words. The gaps of space on the left and right sides of "Taxable Value" denote that it is its own label on the document.

Knowing this, we can add a \t character to the Prefix and Suffix Patterns to distinguish the text in the two different labels. The tab character can act as an "anchor" for us, only returning a result if the Value Pattern is anchored between large gaps of whitespace on either side.

  1. Enter \t in the Prefix Pattern
  2. Enter \t in the Suffix Pattern
  3. Now our false positive result is thrown out!
    • Since the space before "Taxable Value" in "Total Taxable Value" is just a regular space character and not a tab character, it does not match the \t Prefix Pattern and the result is thrown out.
  4. However, we haven't resolved the problem of the second of our five "Taxable Value" labels not returning. It is still too dissimilar to return with the default 90% minimum similarity.
    • Most of our labels are just off by one character, but this label's text data is "Taxabte Valuo" which is off by two.
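
The anchoring idea can be sketched with ordinary (non-fuzzy) Python regex. This is only an analogy: it assumes the Prefix and Suffix Patterns behave like a lookbehind and a lookahead, and the two sample strings (including their numbers) are hypothetical text in which Tab Marking has already written \t for each large gap.

```python
import re

# Hypothetical tab-marked text: "\t" stands in for a large horizontal gap.
standalone_label = "\tTaxable Value\t1,250.00"
total_label = "\tTotal Taxable Value\t6,250.00"

# Prefix and Suffix Patterns are modeled as lookbehind and lookahead (an assumption).
anchored = re.compile(r"(?<=\t)taxable value(?=\t)", re.IGNORECASE)

print(anchored.findall(standalone_label))  # ['Taxable Value']
print(anchored.findall(total_label))       # []
```

The second string returns nothing because the character before "Taxable Value" is a regular space, not a tab.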

Fuzzy-regex-how-to-27.png

The Problem

Knowing the label not being returned is just two characters different from the regex pattern, let's try dropping the minimum similarity to allow for the extra swap.

  1. Navigate back to the "Properties" tab.
  2. Adjust the Minimum Similarity property to 85%.
  3. Success! The label is returned with a confidence of 87%.
  4. However... Failure. Our false positive is returned as well, also with a confidence of 87%.

Fuzzy-regex-how-to-28.png

Why is that the case? What about our tab characters in the Prefix and Suffix Patterns? Why aren't they throwing out this result?

Here's the rub. Fuzzy RegEx matches the entire regular expression pattern, not just what's in the Value Pattern but what's in the Prefix and Suffix Patterns as well. In this case the \t in the Prefix Pattern is being swapped with the space character before "Taxable Value" in "Total Taxable Value".

You can see this clearly using the "Fuzzy Extraction Visualizer" tab. This is a visualization of the math behind the Levenshtein distance equation used to calculate the minimum number of character swaps to produce a match.

  1. Switch to the "Fuzzy Extraction Visualizer" tab.
  2. Select the last result in the "Results" panel. This is our false positive result.
  3. The characters in the full regular expression pattern \ttaxable value\t make up the rows of this grid.
  4. The characters in the text data producing the fuzzy match make up the columns of this grid.

Fuzzy-regex-how-to-29.png

  1. Swaps are highlighted in red where the pattern character on the left meets the text data character from the top.
  2. Here, the \t character in the Prefix Pattern of the regex...
  3. ...is swapped for a space character in the text.

Fuzzy-regex-how-to-30.png

The Solution - Required Mode

However, we don't want this swap to occur. We want to use these tab characters as hard anchors for returning a result. They must be present in the text for a result to be returned.

We can accomplish this through "Required Mode". Required Mode allows you to place portions of the regular expression in a "required group". If any character inside this group is swapped by Fuzzy RegEx the result is discarded from the results list.

  1. Right click before the \t character in the Prefix Pattern to start the required group.
  2. Select "Start Required Mode"
    • Note there is also a keyboard shortcut for this command, Ctrl + R.

Fuzzy-regex-how-to-31.png

  1. This will insert (?r) into the regex pattern. Anything after these characters will be required. This will prevent fuzzy character swaps from occurring.
  2. But if we start something, we better stop it too. With Required Mode starting at the start of the Prefix Pattern this is making the entire regex required, effectively nullifying our fuzzy matches.
  3. We only want to make the \t character required. Right-click at the end of the character to mark where Required Mode ends.
  4. Select "End Required Mode".
    • Note there is a keyboard shortcut for this command as well, Ctrl + Shift + R

Fuzzy-regex-how-to-32.png

  1. This will insert (?-r) into the regex pattern. This terminates the required group. Anything falling between (?r) and (?-r) will be required, preventing Fuzzy RegEx from returning a result if the characters are swapped.
  2. Now, with the \t Prefix Pattern placed in a required group, the false positive is discarded.
    • Seeing that the \t character was swapped for a space character and the \t character is required, the fuzzy match is ignored and the result is no longer returned.
  3. It would also be a good idea to place the Suffix Pattern in a required group in this case as well.
    • In effect, we only want to fuzzy match the Value Pattern.
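
One way to picture Required Mode is as a Levenshtein calculation in which required pattern characters can never be swapped or dropped. The sketch below rests on that assumption and is not Grooper's implementation. It treats the pattern as literal text, understands only the (?r) and (?-r) markers, and uses hypothetical function names and hypothetical text data.

```python
import math


def parse_required(pattern: str) -> list[tuple[str, bool]]:
    """Split a literal pattern into (character, is_required) pairs."""
    chars, required, i = [], False, 0
    while i < len(pattern):
        if pattern.startswith("(?r)", i):
            required, i = True, i + 4
        elif pattern.startswith("(?-r)", i):
            required, i = False, i + 5
        else:
            chars.append((pattern[i], required))
            i += 1
    return chars


def fuzzy_distance(pattern: str, text: str) -> float:
    """Levenshtein distance where required pattern characters must match exactly."""
    prev = [float(j) for j in range(len(text) + 1)]
    for ch, required in parse_required(pattern):
        blocked = math.inf if required else 1.0
        curr = [prev[0] + blocked]                   # drop the pattern character
        for j, t in enumerate(text, start=1):
            swap = 0.0 if ch == t else blocked
            curr.append(min(prev[j] + blocked,       # drop the pattern character
                            curr[j - 1] + 1.0,       # extra text character
                            prev[j - 1] + swap))     # keep or swap
        prev = curr
    return prev[-1]


plain = "\ttaxable value\t"
anchored = "(?r)\t(?-r)taxable value(?r)\t(?-r)"

false_positive = " taxable valuo\t"   # hypothetical: space where the tab should be
real_label = "\ttaxabte valuo\t"      # hypothetical: tabs intact, two OCR errors

print(fuzzy_distance(plain, false_positive))     # 2.0
print(fuzzy_distance(anchored, false_positive))  # inf
print(fuzzy_distance(anchored, real_label))      # 2.0
```

An infinite distance stands in for "discard this result". The real label keeps its distance of 2 because both of its swaps fall inside the Value Pattern, which is still fuzzy.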

Fuzzy-regex-how-to-33.png

Immutable Characters

You may run into a situation where you have a character or set of characters you don't want a pattern to mutate into. There is a special fuzzy match weighting called an "Immutable" that allows you to do just this. These weightings can prevent results from returning if a certain character is swapped for another in your regex pattern.

Introduction

Imagine you have a document like the one here and you want to match the column header "TOTAL SALES".

The "TOTAL SALES" header in the top table looks very clean and will likely OCR with no issues. However, the "A" in "TOTAL" in the bottom table is less likely to be OCR'd correctly. We'll probably need to use Fuzzy RegEx to match it.

Fuzzy-regex-how-to-34.png

Let's see what happens in a Pattern Editor. With this document imported and OCR'd in Grooper, this bears out, but with an additional problem.

  1. As predicted, the bottom "TOTAL SALES" header is not matched by normal regular expression.
  2. However, due to this table's format, we're getting some false positives where the "TOTAL" in "TOTAL REVENUE" spans to the "SALES" in "SALES TAX", producing a match for our regex, total sales

Fuzzy-regex-how-to-35.png

As we've seen in the previous example, large whitespace characters, such as \r, \n, \t, and \f, are often used as spatial anchors to distinguish one result from another.

The typical method for throwing out these false positives is to enable Tab Marking. The large amount of whitespace will be recognized as a tab character instead of a single space.

  1. Navigate to the "Properties" tab.
  2. Expand the Preprocessing Options property and enable the Tab Marking property.
  3. This leaves us with a single result, the column header perfectly matching the regex total sales
  4. However, the bottom table's "TOTAL SALES" header does indeed have some OCR errors.
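
The effect of Tab Marking on this kind of false positive can be imitated in Python. This is a rough stand-in, assuming that without Tab Marking a wide gap reads as a single space and with Tab Marking it reads as \t. The header row and the three-space gap rule are hypothetical; Grooper's own gap detection is not described here.

```python
import re

# Hypothetical header row. The wide gaps separate columns, so the first
# "TOTAL" and "SALES" belong to two different column headers.
raw_line = "TOTAL      SALES      TOTAL SALES"

flattened = re.sub(r" +", " ", raw_line)        # every gap becomes one space
tab_marked = re.sub(r" {3,}", "\t", raw_line)   # assumed stand-in for Tab Marking

pattern = re.compile(r"total sales", re.IGNORECASE)

print(len(pattern.findall(flattened)))   # 2
print(len(pattern.findall(tab_marked)))  # 1
```

Once the wide gap is a tab, the space in the pattern no longer matches it, so only the genuine header remains.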

Fuzzy-regex-how-to-36.png

The Problem - The Fuzzy Result

Let's enable Fuzzy RegEx and see if we get a match.

  1. Select the Mode property, and select Fuzzy RegEx.
  2. We are indeed getting a match for the "TOTAL SALES" header in the bottom table.
  3. However, we are also getting a match where the labels span across columns again.

In this case, the tab character in the text data is swapped for a space character to match the space character in the regular expression total sales.

The Solution - Immutable Fuzzy Weightings

The "Immutable" fuzzy match weighting allows you to list a set of characters you do not want to mutate. In this case, it lets us tell Grooper: if you see a \t character in the text data, do not swap it for any character in the regex pattern. It makes the swap cost infinite, throwing out the fuzzy match.

  1. Expand the Fuzzy Match Weightings properties.
  2. Select Local Entries and press the ellipsis button at the end.
  3. This will bring up the Fuzzy Match Weightings List Editor.

Fuzzy-regex-how-to-37.png

The syntax for the Immutable character set is as follows:

Immutable={CharacterList}

In this case, we just have a single character we do not want to mutate. Our Immutable weighting is as follows:

Immutable=\t

Fuzzy-regex-how-to-38.png

With this fuzzy match weighting in place, the two results that mutated a \t character into a single space character are discarded, leaving us the desired results.
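
The infinite swap cost can be sketched as a weighted Levenshtein calculation. This is an illustration under stated assumptions, not Grooper's implementation: the function names are hypothetical, "tot4l sales" is a hypothetical OCR error, and the sketch also refuses to skip over an immutable character in the text, which goes beyond the swap rule described above.

```python
import math


def parse_immutable(entry: str) -> set[str]:
    """Read an Immutable=... weighting into a set of real characters."""
    escapes = {"r": "\r", "n": "\n", "t": "\t", "f": "\f"}
    _, _, spec = entry.partition("=")
    chars, i = set(), 0
    while i < len(spec):
        if spec[i] == "\\" and i + 1 < len(spec) and spec[i + 1] in escapes:
            chars.add(escapes[spec[i + 1]])
            i += 2
        else:
            chars.add(spec[i])
            i += 1
    return chars


def fuzzy_distance(pattern: str, text: str, immutable: set[str]) -> float:
    """Levenshtein distance where immutable text characters cannot be swapped."""
    prev = [0.0]
    for t in text:
        prev.append(prev[-1] + (math.inf if t in immutable else 1.0))
    for p in pattern:
        curr = [prev[0] + 1.0]
        for j, t in enumerate(text, start=1):
            locked = t in immutable
            swap = 0.0 if p == t else (math.inf if locked else 1.0)
            skip = math.inf if locked else 1.0     # assumption: no skipping either
            curr.append(min(prev[j] + 1.0,         # drop the pattern character
                            curr[j - 1] + skip,    # extra text character
                            prev[j - 1] + swap))   # keep or swap
        prev = curr
    return prev[-1]


immutable = parse_immutable(r"Immutable=\t")
pattern = "total sales"

print(fuzzy_distance(pattern, "tot4l sales", immutable))   # 1.0
print(fuzzy_distance(pattern, "total\tsales", immutable))  # inf
print(fuzzy_distance(pattern, "total\tsales", set()))      # 1.0
```

The ordinary OCR error still costs a single swap, while the text containing \t only matches when no Immutable weighting is in place.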

FYI: Often, all of the large whitespace characters, including tab characters, should be considered immutable. Since they are used to distinguish large gaps in whitespace from a single space, it very rarely (if ever) makes sense to swap them for a single space character.

The most common Immutable weighting is as follows:

Immutable=\r\n\t\f

Fuzzy-regex-how-to-39.png

It can be easy to confuse Required Mode with Immutable Characters and vice versa. While they both throw out fuzzy matches, they do it in very different ways.


The difference to keep in mind between Required Mode and Immutable Characters is:

  • Required Mode prevents a pattern character in the regex from matching a different character in the text data.
    • Another way of thinking about it is Required Mode makes that portion of the regex behave as if it were not fuzzy. An exact match is required.
  • Immutable Characters prevents a document character in the text data from matching a different character in the regex.
    • Another way of thinking about it is Immutable Characters make the cost to swap a character in the text data infinite. Remember the normal swap weighting syntax:
      {Document Character}{Pattern Character}=Cost
      Immutable characters make the cost to swap infinite, or:
      {Immutable Character}{Any Pattern Character}=Infinite
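
The two protections can be put side by side as a single swap-cost function. This is a hypothetical sketch (the name swap_cost and its parameters are not Grooper settings) meant only to show which side of the comparison each feature guards.

```python
import math


def swap_cost(doc_char: str, pattern_char: str, *,
              pattern_char_required: bool = False,
              immutable: frozenset[str] = frozenset()) -> float:
    """Cost of letting pattern_char stand for doc_char in a fuzzy match."""
    if doc_char == pattern_char:
        return 0.0           # an exact match is always free
    if pattern_char_required:
        return math.inf      # Required Mode guards the pattern side
    if doc_char in immutable:
        return math.inf      # Immutable guards the document side
    return 1.0               # ordinary swap


whitespace = frozenset("\r\n\t\f")

# A required "\t" in the pattern meets a space in the document.
print(swap_cost(" ", "\t", pattern_char_required=True))  # inf
# A space in the pattern meets an immutable "\t" in the document.
print(swap_cost("\t", " ", immutable=whitespace))         # inf
# Neither protection applies to an ordinary OCR slip.
print(swap_cost("o", "e", immutable=whitespace))          # 1.0
```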