Fuzzy RegEx (Concept)


Fuzzy RegEx is Grooper's use of fuzzy logic within Extractor Types that leverage regular expressions to match patterns. Fuzzy RegEx allows extractors to overcome defects in a document's OCR results and still return accurate results. Fuzzy RegEx is turned on by enabling the Fuzzy Matching property.

Typically, a regular expression will either match a string of text or it won't. If you're trying to match a word and the text data is even a single character off from the regex pattern, you will not get a result.

Fuzzy RegEx uses a Levenshtein distance equation to measure the difference between the regular expression and potential text matches. The similarity between the regex pattern and the matched text is expressed as a "confidence score" (a percentage). If the confidence is above a set threshold, the result is returned. If it is below the threshold, it is discarded.

For example, a text string that is 95% similar to the regex pattern may be off by just a single character. If the Minimum Similarity threshold is set to 90%, the result would be returned, even though the pattern doesn't match the text exactly.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2024). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

Fuzzy RegEx is a Match Mode option for an extractor's regular expression pattern. Any time you can get to an Editor Window in Grooper, you can take advantage of Fuzzy RegEx for Pattern Match. This can be done on:

  • Data Fields
  • Value Readers

In the "Local Extractor" Editor window, you can enable Fuzzy RegEx in two steps:

  1. Navigate to the "Properties" tab.
  2. Enable Fuzzy Matching.

The Problem

Standard regular expression matching is binary. Either a result will match the pattern or it won't. The only "wiggle room" is baked into the standard syntax, such as the wildcard "dot" (.) metacharacter or the question mark (?) metacharacter optionally matching a character or character group. Even then, it's all or nothing. Either the text matches the pattern as written (100%), or it doesn't (0%).

What's the difference between these two strings of text: "Grooper" and "6rooper"?

One starts with a "G". The other starts with the number six.

If we used a simple regular expression to literally match Grooper, the first would match, and the second would not.
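To see this all-or-nothing behavior outside of Grooper, here is a minimal Python sketch using the standard re module (just an illustration of ordinary regex, not anything Grooper-specific):

import re

pattern = re.compile("Grooper")
print(bool(pattern.search("Grooper")))   # True  - an exact match
print(bool(pattern.search("6rooper")))   # False - one wrong character, no match at all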


However, OCR results are imperfect. Characters get recognized as something different from what is on the page. Spaces get inserted or removed between characters. Even with Image Processing and OCR Synthesis software as good as Grooper's, you can't assume 100% accurate OCR results on every document. And even if the OCR results are one tiny character off, the pattern will fail to produce a match.

For example, it's not outside of the realm of possibility an OCR engine would see a cluster of pixels representing a "G" and erroneously recognize it as a "6", depending on the font used or quality of the scanned image.

How Fuzzy RegEx Solves the Problem

The string "Grooper" is composed of seven characters. For the regex pattern Grooper, each character in the pattern matches each character in the string, from the first "G" through the last "r". The only difference between "Grooper" and "6rooper" is that the "G" is replaced by a "6".

But otherwise, six out of the seven characters match the regex pattern Grooper.

Being so close, it seems like the regex should match (or at least we'd surely like it to!). Fuzzy RegEx gives you the ability to return the results of these "near matches". As long as they are similar enough to the regular expression pattern, the result is still returned, allowing you to extract data from imperfect OCR results!

So, how do you say what should match and what shouldn't? After all, "6200932" is also seven characters long but certainly shouldn't match "Grooper". Should "Gopher" match? Should "Grapple"? Probably not. What is "similar enough"? How do you measure it? How do you control it?

Fuzzy RegEx measures the similarity between a regular expression pattern and a potential match in terms of percentage similarity. The regular expression pattern Grooper matches the text string "Grooper" at a 100% similarity. It is an exact match. Being seven characters long, a single character in the string "Grooper" makes up roughly 14% of the word.

Fuzzy RegEx matches the string "6rooper" at the cost of that one differing character's share of the word: 100% - 14% = 86%. The string "6rooper" is therefore 86% similar to the regex pattern Grooper.

The result is then assigned a "Confidence Score" of 86%. The extractor is 86% "sure", or confident, that the result matches the regex pattern.

Whether or not this result is actually returned by the extractor is determined by the Minimum Similarity property of Fuzzy RegEx. If this is set below 86%, the result will be returned. If it is set above 86%, it is not.
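As a rough illustration of this scoring (a minimal sketch only, not Grooper's actual algorithm; the function names here are invented for the example), the math can be approximated in Python like this:

def levenshtein(a, b):
    # Minimum number of single-character edits (insertions, deletions,
    # substitutions) needed to turn string a into string b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_similarity(pattern, text):
    # Each edit costs one character's share of the pattern, as in the example above.
    return 1 - levenshtein(pattern, text) / len(pattern)

score = fuzzy_similarity("Grooper", "6rooper")        # 1 - 1/7 = ~0.86
print("returned" if score >= 0.85 else "discarded")   # returned at an 85% Minimum Similarity
print("returned" if score >= 0.90 else "discarded")   # discarded at a 90% Minimum Similarity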


If you've ever used autocorrect on your cell phone, you're engaging in something similar. Your cell phone's software has a dictionary of common English words. If you type something that doesn't match that dictionary, it will automatically swap the word you typed with the most similar word in the dictionary. While the exact algorithm may vary from the one described here, it is based on some variant of a Levenshtein distance equation.

Character Swaps and Swap Cost

Another way to define the Levenshtein distance between two words is the "minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other" [1]. We often informally refer to these edits as "swaps" and to the effect these edits have on a match's confidence as the "swap cost". The fuzzy-matched text most similar to the regular expression will have the lowest "swap cost" and therefore the highest confidence score.

This also means Fuzzy RegEx not only matches imperfect text, but mutates it as well. In our example where our regex Grooper fuzzy matches "6rooper", the "6" would be "swapped" for a "G", resulting in an 86% confident match.

Furthermore, the final output result would be altered to match the regex. In this case, "Grooper".

In this way, Fuzzy RegEx can be used to cleanse imperfect (or "dirty") OCR data, manipulating it into what you want.

Levenshtein distance will also measure the distance between two words where characters are missing or extra characters are present. The difference between "Grooper" and "Groooper" is just one extra "o" character. The difference between "Grooper" and "Groper" is one less "o" character.

In addition to swapping characters to match the string, Fuzzy RegEx will insert or delete characters as well. Whether a true swap, an insertion, or a deletion, the "swap cost" of a character is the same. In the examples below, a single character is deleted or inserted. Just as changing a single character in "6rooper" to match "Grooper" resulted in an 86% confidence score, inserting or deleting a single character to match the regex Grooper would also result in a confidence score of 86%.

Examples: Fuzzy Character Deletion, Fuzzy Character Insertion
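Continuing the illustrative fuzzy_similarity sketch from earlier (again, not Grooper's actual implementation), insertions and deletions score exactly the same way as swaps:

print(fuzzy_similarity("Grooper", "Groper"))     # one insertion needed: 1 - 1/7 = ~0.86
print(fuzzy_similarity("Grooper", "Groooper"))   # one deletion needed:  1 - 1/7 = ~0.86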

Adjusting the Swap Cost: Fuzzy Match Weightings

Certain OCR errors are more common than others. Even a human being can have difficulty distinguishing between certain characters. For example, a capital "O" and a zero look extremely similar, even indistinguishable depending on the font. Especially if the data is a mix of numbers and letters, it may be difficult to say whether or not the character should be an "O" or a "0".

However, sometimes it's exceptionally easy to determine whether the character should be an "O" or a "0". If you're looking for some kind of label or phrase on a document, it's highly unlikely a zero is going to be in the words you're looking for. The English language doesn't use numeric characters as part of its alphabet. It doesn't make sense to have a "0" in the middle of a word. The context just doesn't provide for it.

Knowing this, it seems like some character swaps should not be as severe, depending on the circumstances. Fuzzy Match Weightings allow you to capitalize on this idea. This functionality allows you to manually adjust the base swap cost for swapping one character for another (or inserting or deleting a character). By adjusting the swap cost, you adjust the confidence score, resulting in a more confident match with the same number of character edits to match your regular expression pattern.

For example, the letter "G" can get misrecognized as a "6" during OCR, depending on the document, the fonts used, the OCR engine, or any of the other things that affect character recognition. As we've seen in our example above, the cost to swap a single character between the pattern Grooper and the string "6rooper" is roughly 14%.

What if we only want to return fuzzy matches with a confidence above 90%? We'd be out of luck here. The result would be discarded.

However, we know a "G" can get misrecognized for a "6" during OCR, and the pattern we're trying to match shouldn't even have numerical characters. We might say it should not cost the full 14% confidence to swap the characters.

Perhaps swapping a "6" in the document text data for the G in the pattern should only cost half the swap cost.

With a Fuzzy Match Weighting, we can do just that. This weighting would adjust the base swap cost for a "6" for a G. The base swap cost of 14% would be halved, down to a weighted swap cost of 7%.

Now, the fuzzy match's confidence score is 93%. If we're only returning fuzzy matches with a confidence above 90%, the result would be returned.
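The arithmetic behind that, sketched in Python (illustrative only):

base_swap_cost = 1 / len("Grooper")          # ~0.14, one of seven characters
weighted_swap_cost = base_swap_cost * 0.5    # the 6G=0.5 weighting halves it, ~0.07
confidence = 1 - weighted_swap_cost          # ~0.93, now above a 90% Minimum Similarity
print(round(confidence, 2))                  # 0.93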

For more information on Fuzzy Match Weightings, visit the How To section of this article.

Words of Caution

Fuzzy RegEx is an exceptional way to resolve unforeseen and unpreventable OCR errors when extracting data. However, it is not a "magic bullet". There are a few potential drawbacks to keep in mind when using Fuzzy RegEx.

False Positive Potential

The basic idea behind Fuzzy RegEx is that it performs character edits (swaps, insertions, and deletions) to allow a string of text to match a regular expression pattern where it otherwise would not. It looks for matches that are "close enough" and returns them as valid. Often "close enough" is just fine, but there is also the potential that you return something you do not want.

How different are the phrases "consecutive years" and "consecutive days"? Semantically, they are very different. You'll find this type of language in legal agreements. If you agree to do something every day for a number of consecutive days, that's very different from doing something every year for a number of consecutive years.

However, Fuzzy RegEx doesn't know what "years" and "days" mean, only how similar they are, i.e., how few character edits it takes to turn one into the other (their Levenshtein distance). So, what's the difference? Just three character edits.

Fuzzy Pattern: consecutive days
Text Data: consecutive years
  1. Swap the "y" in "years" for a "d"
    • Gives you "consecutive dears"
  2. Next, delete the "e" in "dears"
    • Gives you "consecutive dars"
  3. Last, swap the "r" in "dars" for a "y"
    • Gives you "consecutive days"

There are a total of 16 characters in the pattern (including the space). Therefore, each edit costs 6.25% confidence, and three edits cost a total of 18.75%. Fuzzy RegEx would match "consecutive years" for the pattern consecutive days with an 81.25% confidence.
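Plugging these strings into the illustrative fuzzy_similarity sketch from earlier gives the same number:

print(fuzzy_similarity("consecutive days", "consecutive years"))   # 1 - 3/16 = 0.8125, i.e. 81.25%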

It's important to recognize this potential to produce false positive results. Depending on your situation, this can affect the accuracy of your results. You may be able to eliminate false positives by adjusting your Minimum Similarity threshold, only allowing higher-confidence results. You may need to exempt these results through other methods, such as with an exclusion extractor. You may need human review of Grooper's automated results to ensure total accuracy. Or, you may be fine with the margin of error this produces.

Whatever your take on the situation, be aware Fuzzy RegEx's ability to capture more data comes at the potential cost of capturing inaccurate data.

Incompatibility With Infinite Quantifiers

In regular expression, single characters can be pattern matched, as can ranges or sets of characters. For example, \d will match a single digit character (0 through 9), and \d{2,5} will match a string of digit characters two to five digits in length.

There are two infinite length quantifiers:

  • The plus quantifier +, matching a character or character set one to any number in length.
  • The star quantifier *, matching a character or character set zero to any number in length.

Fuzzy RegEx looks at each character in the pattern, one after the other, compared to the text data to see if a character swap is required. If there are 20 characters in the regex pattern, Fuzzy RegEx is looking for 20 potential character swaps for every string of characters in the text data. With the + and * quantifiers, the regex pattern is potentially infinite in length, and Fuzzy RegEx would never stop looking to see if characters could be swapped.

Fuzzy RegEx is therefore incompatible with infinite quantifiers. Grooper will give you an error if the Mode is set to Fuzzy RegEx and these quantifiers are used in your pattern.
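If your pattern needs a variable-length run of characters, one common workaround (a general regex practice, not something mandated by this article) is to replace the infinite quantifier with a bounded range large enough for any realistic value:

  • Instead of \d+, use something like \d{1,9}
  • Instead of \w*, use something like \w{0,30}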

It's Slower Than Standard RegEx

While this may or may not ultimately matter, it's worth pointing out. Because Fuzzy RegEx considers much more of the text data for potential matches than the binary "match" or "doesn't match" approach of standard regex, it requires more processing power. It is, therefore, slower.

This may be the difference between a Data Type taking 2 ms to extract a value from a document compared to 20 ms. This may not seem like much, but when you're processing thousands of documents and/or using hundreds of extractors, the processing time can add up. Furthermore, depending on the complexity of the extractor, the time for a single extractor to return a single value may be even greater.

How To

Fuzzy RegEx Basics

Note from the Author: Fuzzy RegEx is intended to resolve issues around OCR errors. However, for the purposes of this tutorial, the text is simply misspelled on the "document". Either way, whether misspelled or mis-OCR'd, Fuzzy RegEx works the same: it swaps characters between the regular expression pattern and whatever text data was obtained via the Recognize activity.

Getting to a Pattern Editor

You can enable Fuzzy RegEx any time you are writing a regular expression pattern in a "Pattern Editor". The "Pattern Editor" can be accessed at several points in Grooper:


When configuring a Value Reader:

  1. Select or create a Value Reader.
  2. Click the drop-down for the Extractor property and choose a text parsing extractor like Pattern Match, then click the ellipsis button for the property.


  3. This will open the Extractor window. From here you can edit the pattern on the fields of the Expressions tab.


When configuring the Local Extractor property of a Data Type:

  1. Click the drop-down button for the Local Extractor property of a Data Type.
  2. Select a text parsing extractor from the drop-down list like Pattern Match.
  3. Click the ellipsis button for the property to open the Local Extractor window. From here you can edit the pattern on the fields of the Expressions tab.


When configuring the Value Extractor property of a Data Field:

  1. Click the drop-down for the Value Extractor property of a Data Field.
  2. Select a text parsing extractor from the drop-down list like Pattern Match.
  3. Click the ellipsis button for the property to open the Extractor window. From here you can edit the pattern on the fields of the Expressions tab.


When configuring any object in Grooper that contains a Value Extractor property:

  1. Click the drop-down for the Value Extractor property of the respective object.
  2. Select a text parsing extractor from the drop-down list like Pattern Match.
  3. Click the ellipsis button for the property to open the Extractor window. From here you can edit the pattern on the fields of the Expressions tab.

Enable Fuzzy RegEx

Once you are in a window, like the Local Extractor window of a Data Type, which contains the Expressions and Properties tabs, enabling Fuzzy RegEx is very simple.


  1. For a text parsing extractor like Pattern Match, enter a regular expression into the Value Pattern field on the Expressions tab.
    • For a text parsing extractor like List Match, the Value Pattern field is called Local Entries instead.
  2. A result will be highlighted on your document, as well as listed in the Result List view.


"Fuzzy RegEx" is turned on by enabling the Fuzzy Matching property within the Properties tab of an expression editing window. Once Fuzzy Matching is enabled, you will see that the property displays its default setting of "90% Unweighted".

Adjust Minimum Similarity

To adjust the "weighting" of the Fuzzy Matching property, you need to expand its sub-properties by clicking the drop-down arrow to the left of the Fuzzy Matching property once it is enabled. From there you can adjust its sub-property, Minimum Similarity, to a percentage that will allow matches that meet the given threshold.

Be careful with the Minimum Similarity property, as setting the percentage too low can allow matches that are too dissimilar to the desired pattern.
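To put rough numbers on that caution, using the same per-character arithmetic as the Grooper example above:

  • 100% - (1/7) is roughly 86%, so one edit to a seven-character pattern survives an 80% threshold.
  • 100% - (2/7) is roughly 71%, so two edits do not.

Under this scoring, the shorter the pattern, the more each edit costs, so short patterns deserve the most care when lowering Minimum Similarity.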



Fuzzy Match Weightings Basics

As in the previous tutorial, the text on this "document" is simply misspelled rather than mis-OCR'd. Either way, Fuzzy RegEx works the same, swapping characters between the regular expression pattern and whatever text data was obtained via the Recognize activity.

Adjust Fuzzy Match Weightings

By default, it costs the same to swap, insert, or delete a character regardless of the character. However, certain OCR mistakes are more common than others. For example, an "O" might get mis-recognized as a zero (or vice versa).

You can manually change the swap cost associated with certain characters using the Weightings sub-property of the Fuzzy Matching property.

  1. Once the Fuzzy Matching property is enabled, expand its sub-properties to expose the Weightings property.
  2. Expand the Weightings property to expose its sub-properties.
  3. Click the ellipsis button of the Local Entries property to open the "Local Entries" window.


You can alter the match weightings using either the Local Entries or the Included Lexicons properties (or both). We will focus on Local Entries for the time being. Depending on what you want to do, what you will enter here will change somewhat. Character swaps, inserts and deletions all have different syntax to alter the swap cost of certain characters.

Character Swaps

Imagine a "6" is often mis-OCR'd as a "G", resulting in "6rooper" in the text data instead of "Grooper". Knowing this is a common error, we might want to say the swap should not cost the full amount in this case.

This is exactly the reason why Fuzzy Match Weightings exist! In the list editor, you will adjust the swap cost using the following syntax:

(The mis-OCR'd character on the document) (The correct character in the regex pattern) = (Cost multiplier as a decimal)

or, put another way:

{DocumentChar}{PatternChar}=Cost

So, if we wanted to say whenever Fuzzy RegEx swaps a "6" for a "G" it should only cost half the amount, you'd use the following syntax:

6G=0.5

Fuzzy Match Weightings work by first adjusting the normal baseline swap cost of a character.

  • Base Swap Cost(DocumentChar, PatternChar) x Weighting Value = Weighted Swap Cost

Then the weighted swap cost is subtracted, instead of the baseline swap cost to determine the confidence score.

  • 100% - Weighted Swap Cost = Confidence Score


Keep in mind the cost of swapping, inserting, or deleting characters in fuzzy matching (which affects the confidence of the returned result) is entirely dependent on the length of the string you are attempting to match.

For the supplied example, the swap would normally cost 14% confidence, but the weighting 6G=0.5 multiplies that by 0.5, resulting in a swap cost of only 7%.


Fuzzy Match Weightings work by first adjusting the normal baseline swap cost of a character.

  • Base Swap Cost({DocumentChar}{PatternChar}) x Weighting Value = Weighted Swap Cost
  • 14%(6, G) x 0.5 = 7%

Then the weighted swap cost is subtracted, instead of the baseline swap cost to determine the confidence score.

  • 100% - Weighted Swap Cost = Confidence Score
  • 100% - 7% = 93%

A message concerning case sensitivity:

For the character pair {DocumentChar}{PatternChar}, one of the characters is case sensitive while the other is not.

{DocumentChar} is case sensitive.
{PatternChar} is not case sensitive.

This means s5=0.5 would only adjust the cost to swap a lowercase "s" in the text data for a "5" in the regex pattern.

However, 5s=0.5 would adjust either the cost to swap a "5" in the text data for a lowercase "s" or uppercase "S" in the regex pattern.

FYI

As well as decreasing the cost to perform a swap, you can also increase the cost.

6G=0.5 would decrease the cost to perform a swap from "6" to "G" by 50%.

6G=1.5 would increase the cost to perform a swap from "6" to "G" by 50%.

Character Inserts

As well as adjusting the cost to make a true "swap" between two characters, you can also adjust the cost to insert a character. For example, we could cause "Groper" to fuzzy match Grooper at a higher confidence score using a fuzzy match weighting that decreases the cost of inserting an "o" character.

In the list editor, you will adjust the insert swap cost using the following syntax:

Insert(The character or characters you want to insert) = (Cost multiplier as a decimal)

or, put another way:

Insert(CharSet)=Cost

So, if we want to tell Fuzzy RegEx the cost to insert an "o" into a text string should be half of what it normally is, you'd use the following syntax:

Insert(o)=0.5

FYI

You can also adjust the cost of multiple character inserts all at once by simply including more characters in the character set.

For example, if you wanted to reduce the cost to insert a period (.), comma (,), or semicolon (;) by half, you could use the following fuzzy match weighting:

Insert(.,;)=0.5


Normally, inserting this "o" to match the regex Grooper would cost 14% confidence, but the weighting Insert(o)=0.5 multiplies that by 0.5, resulting in an insert cost of only 7%.

Fuzzy Match Weightings work by first adjusting the normal baseline swap cost of a character (in this case, an insert cost).

  • Base Insert Cost(CharSet) x Weighting Value = Weighted Swap Cost
  • 14%(o) x 0.5 = 7%

Then the weighted swap cost is subtracted, instead of the baseline swap cost to determine the confidence score.

  • 100% - Weighted Swap Cost = Confidence Score
  • 100% - 7% = 93%


FYI

One common real-world use of the insert fuzzy match weighting is to adjust the cost to insert spaces. Space characters can often get inappropriately inserted or removed during OCR.

To adjust the insert cost of a space character, simply type a literal space between the parentheses, as seen below:

Insert( )=0.25

Character Deletions

You can also adjust the cost to remove or delete a character. For example, we could cause "Groooper" to fuzzy match Grooper at a higher confidence score using a fuzzy match weighting that decreases the cost of deleting an "o" character.

In the list editor, you will adjust the delete swap cost using the following syntax:

Delete(The character or characters you want to delete) = (Cost multiplier as a decimal)

or, put another way:

Delete(CharSet)=Cost

So, if we want to tell Fuzzy RegEx the cost to remove an "o" from a text string should be half of what it normally is, you'd use the following syntax:

Delete(o)=0.5

FYI

You can also adjust the cost of multiple character deletions all at once by simply including more characters in the character set.

For example, if you wanted to reduce the cost to delete a period (.), comma (,), or semicolon (;) by half, you could use the following fuzzy match weighting:

Delete(.,;)=0.5


Normally, removing an "o" to match the regex Grooper would cost 14% confidence, but the weighting Delete(o)=0.5 multiplies that by 0.5, resulting in a delete cost of only 7%.

Fuzzy Match Weightings work by first adjusting the normal baseline swap cost of a character (in this case, a delete cost).

  • Base Delete Cost(CharSet) x Weighting Value = Weighted Swap Cost
  • 14%(o) x 0.5 = 7%

Then the weighted swap cost is subtracted, instead of the baseline swap cost to determine the confidence score.

  • 100% - Weighted Swap Cost = Confidence Score
  • 100% - 7% = 93%


FYI

One common real-world use of the delete fuzzy match weighting is to adjust the cost to delete extra spaces. Space characters can often get inappropriately inserted or removed during OCR.

To adjust the delete cost of a space character, simply type a literal space between the parentheses, as seen below:

Delete( )=0.25

Multiple Match Weightings

  1. The Local Entries property may contain swap, insert, and delete weightings all at once. You need only get the syntax of each type correct, and you can put as many entries into the list as you desire. Each entry simply goes on its own line of the list.


Next to the Local Entries property, you can see that 6 entries are listed.

Each of the fuzzy match weightings we've entered into the Local Entries "List Editor" will be used to adjust confidence scores.

You're not just limited to a single match weighting. You can create a fuzzy match weighting list as large or small as you would like.
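For illustration only (these are hypothetical entries, not necessarily the six referenced above), a Local Entries list combining all three kinds of weightings might look like this:

6G=0.5
0O=0.5
5S=0.5
Insert(o)=0.5
Delete(o)=0.5
Insert( )=0.25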

Fuzzy Match Lexicons

Furthermore, you may find you have a single list of fuzzy match weightings that works well across many different documents. You probably don't want to re-enter that list over and over again every time you write a new regex pattern. This is what the Included Lexicons property under Fuzzy Match Weightings is for.

Rather than manually inputting the weightings into the Local Entries property's list editor, you can just point to a list of these weightings already entered into a Grooper Lexicon.

Grooper ships with a few pre-built "Fuzzy Match Lexicons", or you can build your own. You can find the pre-built Lexicons by navigating to the following path in the Grooper Node Tree:

(root node) > Global Resources > Lexicons > Downloads > Weightings

The Lexicon selected here contains some adjusted weightings for common OCR errors such as erroneously recognizing an "O" as a zero.


To reference a Lexicon containing fuzzy match weightings, select it in the Included Lexicons property (found under the Weightings property).

In this example, two fuzzy matches are returned using the weightings in this Lexicon.


Furthermore, you are not limited to either Local Entries or Included Lexicons. You can use both.

Here, the Fuzzy Match Weightings properties are configured to reference the "Fuzzy Match Weightings" Lexicon in the Global Resources folder as well as Insert(o)=0.5 and Delete(o)=0.5 in the Local Entries.

Required Mode

Now that we know the basics, let's look at a more advanced problem and how to solve it. The solution will involve "Required Mode", which throws out fuzzy match results if designated portions of the regular expression pattern are swapped.

The Goal

We have a simple goal. We want to extract the label "Taxable Value" from each of the five repeating sections of this document. At the end of this, we should have an extractor that returns five instances of the result "Taxable Value", one for each section.


However, this document was purposefully degraded to have some troubling OCR results. We will need to use Fuzzy RegEx to return most of these labels.

Much like any other extraction problem, we will need to look for ways to throw out false positive results. For example, we won't want to match "Taxable Value" inside the label at the bottom of the page, "Total Taxable Value".


Since Fuzzy RegEx swaps characters to mutate text data to match a regular expression pattern, there is a danger that some text in the document is improperly mutated, giving a false positive result.

The Fuzzy Pattern

With a standard regex pattern of taxable value, only two (out of our five desired) results are returned.

We definitely need to use Fuzzy RegEx mode in order to match our other labels.



Let's deal with throwing out the false positive result. We should be able to do this easily with Tab Marking. Tab Marking can assist in regular expression pattern matching by providing a text character (\t) that can match large amounts of whitespace between characters.


Tab characters are markers of large gaps of horizontal space. What distinguishes the standalone label "Taxable Value" from the words "Taxable Value" inside the label "Total Taxable Value" is the fact that the word "Total" sits right before them. The gaps of space on the left and right sides of the standalone "Taxable Value" denote that it is its own label on the document.

Knowing this, we can add a \t character to the Prefix and Suffix Patterns to distinguish the text in the two different labels. The tab character can act as an "anchor" for us, only returning a result if the Value Pattern is anchored between large gaps of whitespace on either side.
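Concretely, the extractor's expressions might look something like this (an illustrative configuration consistent with the description above, not a screenshot of the exact setup):

Value Pattern: taxable value
Prefix Pattern: \t
Suffix Pattern: \t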

The Problem

Knowing the label that isn't being returned is just two characters different from the regex pattern, let's try dropping the Minimum Similarity to allow for the extra swaps.


This returns the previously missing label, but it also brings back the false positive from "Total Taxable Value". Why is that the case? What about our tab characters in the Prefix and Suffix Patterns? Why aren't they throwing out this result?

Here's the rub. Fuzzy RegEx matches the entire regular expression pattern, not just what's in the Value Pattern but what's in the Prefix and Suffix Patterns as well. In this case, the \t in the Prefix Pattern is being swapped with the space character before "Taxable Value" in "Total Taxable Value".

You can see this clearly using the "Fuzzy Extraction Visualizer" tab. This is a visualization of the math behind the Levenshtein distance equation used to calculate the minimum number of character swaps to produce a match.


The Solution - Required Mode

However, we don't want this swap to occur. We want to use these tab characters as hard anchors for returning a result. They must necessarily be there for a result to be returned.

We can accomplish this through "Required Mode". Required Mode allows you to place portions of the regular expression in a "required group". If any character inside this group is swapped by Fuzzy RegEx, the result is discarded from the results list.



Immutable Characters

You may run into a situation where there are characters in the text data that you never want Fuzzy RegEx to mutate. There is a special fuzzy match weighting called an "Immutable" that allows you to enforce just that. These weightings prevent results from returning if one of the listed characters would be swapped for another to match your regex pattern.

Introduction

Imagine you have a document like the one here and you want to match the column header "TOTAL SALES".

The "TOTAL SALES" header in the top table looks very clean and will likely OCR with no issues. However, the "A" in "TOTAL" in the bottom table is less likely to be OCR'd correctly. We'll probably need to use Fuzzy RegEx to match it.


Let's see what happens in a Pattern Editor. With this document imported and OCR'd in Grooper, that prediction bears out, but with an additional problem.


As we've seen in the previous example, large whitespace characters, such as \r, \n, \t, and \f, are often used as spatial anchors to distinguish one result from another.

The typical method for throwing out these false positives is to enable Tab Marking. The large amount of whitespace will then be recognized as a tab character instead of a single space.

The Problem - The Fuzzy Result

Let's enable Fuzzy RegEx and see if we get a match.


In this case, the tab character in the text data is swapped for a space character to match the space character in the regular expression total sales.

The Solution - Immutable Fuzzy Weightings

The "Immutable" fuzzy match weighting allows you to list a character set you do not want to mutate. This will allow us to tell Grooper if you see a \t character in the text data, do not swap it for any character in the regex pattern. It makes the swap cost infinite, throwing out the fuzzy match.


The syntax for the Immutable character set is as follows:

Immutable={CharacterList}

In this case, we just have a single character we do not want to mutate. Our Immutable weighting is as follows:

Immutable=\t


With this fuzzy match weighting in place, the two results that mutated a \t character into a single space character are discarded, leaving us the desired results.

FYI

Often it is the case that all the large whitespace characters, including tab characters, should be considered immutable. Since they are used to distinguish large gaps in whitespace from a single space, very rarely (if ever) does it make sense to swap them for a single space character.

The most common Immutable weighting is as follows:

Immutable=\r\n\t\f

It can be easy to confuse Required Mode with Immutable Characters and vice versa. While they both throw out fuzzy matches, they do it in very different ways.


The difference to keep in mind between Required Mode and Immutable Characters is:

Required Mode prevents a pattern character in the regex from matching a different character in the text data.

  • Another way of thinking about it is Required Mode makes that portion of the regex behave as if it were not fuzzy. An exact match is required.

Immutable Characters prevent a document character in the text data from matching a different character in the regex.

  • Another way of thinking about it is Immutable Characters make the cost to swap a character in the text data infinite. Remember the normal swap weighting syntax:
    • {Document Character}{Pattern Character}=Cost
  • Immutable characters make the cost to swap infinite, or:
    • {Immutable Character}{Any Pattern Character}=Infinite