2.90:Fuzzy RegEx (Concept): Difference between revisions

Revision as of 14:48, 12 November 2020

Fuzzy RegEx (also referred to as "fuzzy matching" or "fuzzy mode" or even just "fuzzy") allows regular expression patterns to match text within a set percentage of similarity. This can allow Grooper users to overcome unpredictable OCR errors when extracting data from documents.

Typically, a regular expression either matches a string of text or it doesn't. If you're trying to match a word and the regex pattern is even a single character off from the text data, no result is returned.

Fuzzy RegEx uses Levenshtein distance to measure the difference between the regular expression and potential text matches. The similarity between the regex pattern and the matched text is expressed as a percentage, called the "confidence score". If the confidence is above a set threshold, the result is returned. If it is below the threshold, it is discarded.

For example, a text string that is 95% similar to the regex pattern may be off by just a single character. If the Minimum Similarity threshold is set to 90%, the result would be returned, even though the pattern doesn't match the text exactly.
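The scoring described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pattern is treated as a literal string rather than a full regular expression, similarity is assumed to be one minus the edit distance divided by the longer string's length, and the names `levenshtein`, `similarity`, and `fuzzy_match` are hypothetical. Grooper's actual scoring formula is not given here and may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(pattern: str, text: str) -> float:
    """Assumed scoring: 1 - distance / length of the longer string."""
    longest = max(len(pattern), len(text))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pattern, text) / longest


def fuzzy_match(pattern: str, text: str, minimum_similarity: float = 0.90) -> bool:
    # Assumption: a score exactly equal to the threshold counts as a match.
    return similarity(pattern, text) >= minimum_similarity


pattern = "Purchase Order Total"   # 20 characters
ocr_text = "Purchase 0rder Total"  # the letter "O" misread as a zero
print(levenshtein(pattern, ocr_text))           # one substitution
print(round(similarity(pattern, ocr_text), 2))  # 19 of 20 characters agree
print(fuzzy_match(pattern, ocr_text))           # compared against a 90% threshold
```

With a 20-character string, one wrong character costs 5%, which is where a 95% score against a 90% threshold comes from in the example above.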

About

Fuzzy RegEx is a Match Mode option for an extractor's regular expression pattern. Any time you can get to a "Pattern Editor" in Grooper, you can take advantage of Fuzzy RegEx, including:

When configuring a Data Format
When configuring the Pattern property of a Data Type
When choosing the Internal option for various extractors in a property panel and configuring its Pattern property
When choosing the Text Pattern option for various extractors in a property panel and configuring its Pattern property

In the "Pattern Editor" window, you can enable Fuzzy RegEx in two steps:

Navigate to the "Properties" tab.
Select the Mode property and choose Fuzzy RegEx.

The Problem

Standard regular expressions are binary. Either a result matches the pattern or it doesn't. The only "wiggle room" is baked into the standard syntax, such as the wildcard "dot" (.) metacharacter or the question mark (?) metacharacter optionally matching a character or character group. Even then, it's all or nothing. The text either matches the pattern as written 100%, or it does not match at all.

What's the difference between these two strings of text, "Grooper" and "6rooper"? One starts with a "G". The other starts with the number six. If we used a simple regular expression to literally match `Grooper`, the first would match, and the second would not.

However, OCR results are imperfect. Characters are sometimes recognized as something other than what is on the page. Spaces get inserted or removed between characters. Even with Image Processing and OCR Synthesis as good as Grooper's, you can't assume 100% accurate OCR results on every document. And if the OCR results are off by even a single character, the pattern will fail to produce a match.

For example, an OCR engine could plausibly see a cluster of pixels representing a "G" and erroneously recognize it as a "6", depending on the font used or the quality of the scanned image.
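The all-or-nothing behavior described in this section is easy to check. The sketch below uses Python's standard `re` module purely as a stand-in regex engine (an assumption for illustration, not Grooper's own engine) to compare the literal pattern `Grooper` with a pattern that uses the "dot" wildcard in the first position.

```python
import re

# Compare a literal pattern with one that uses the "dot" wildcard.
for text in ("Grooper", "6rooper"):
    literal = re.fullmatch(r"Grooper", text) is not None
    wildcard = re.fullmatch(r".rooper", text) is not None
    print(f"{text!r}: literal={literal}, wildcard={wildcard}")
```

The literal pattern rejects "6rooper" outright. The wildcard version accepts it, but only because the tolerance was written into the pattern ahead of time for that one position, and it would accept any other character there just as readily. Neither pattern reports how close a failed match was.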

How Fuzzy RegEx Solves the Problem

If you break up the string "Grooper" into characters, there are seven of them. For the regex pattern Grooper, each character in the pattern matches the corresponding character in the string, from the first "G" through the last "r". The only difference between "Grooper" and "6rooper" is that the "G" is replaced by a "6".

But otherwise, six out of the seven characters match the regex pattern Grooper.

Being so close, it seems like the regex should match (or at least we might like it to!). Fuzzy RegEx gives you the ability to return the results of these "near matches". As long as they are similar enough to the regular expression pattern, the result is still returned, allowing you to extract data from imperfect OCR results!

So, how do you say what should match and what shouldn't? After all, "6200932" is also seven characters long but certainly shouldn't match "Grooper". Should "Gopher" match? Should "Grapple"? Probably not. What is "similar enough"? How do you measure it? How do you control it?

Fuzzy RegEx measures the similarity between a regular expression pattern and a potential match in terms of percentage similarity. The regular expression pattern Grooper matches the text string "Grooper" at 100% similarity. It is an exact match. Because the string is seven characters long, a single character in "Grooper" makes up roughly 14% of the word.
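As a worked version of that arithmetic, assume similarity is computed as one minus the number of character edits d divided by the pattern length n. This formula is an assumption for illustration; the exact scoring Grooper uses is not stated here. For "6rooper" against the pattern Grooper, d = 1 and n = 7:

```latex
\text{similarity} = 1 - \frac{d}{n}, \qquad
\frac{1}{7} \approx 14.3\%, \qquad
1 - \frac{1}{7} = \frac{6}{7} \approx 85.7\%
```

Under the same assumption, "6200932" shares no characters with "Grooper", so d = 7 and the similarity is 0%.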

@@ Line 55: / Line 55: @@
 {|
 |
-If you break up the string "Grooper" into characters, it is composed of seven characters. For the regex pattern <code>Grooper</code>, each character in the pattern matches each character in the string starting with "G" through the last "r".  The only difference between "Grooper" and "6rooper" is the "G" is replaced by a "6".  But otherwise, six out of the seven characters match the regex pattern <code>Grooper</code>.
+If you break up the string "Grooper" into characters, it is composed of seven characters. For the regex pattern <code>Grooper</code>, each character in the pattern matches each character in the string starting with "G" through the last "r".  The only difference between "Grooper" and "6rooper" is the "G" is replaced by a "6".
+But otherwise, six out of the seven characters match the regex pattern <code>Grooper</code>.
 |style="width:30%" valign=top|
 [[File:Fuzzy-regex-about-05.png]]
 |}
-Being so close, it sure seems like the regex ''should'' match (Or at least we surely might like it to!).  ''Fuzzy RegEx'' gives you the ability to return the results of these "near matches".  As long as they are similar enough to the regular expression pattern, the result is still returned, allowing you to extract data from imperfect OCR results!
+Being so close, it seems like the regex ''should'' match (Or at least we surely might like it to!).  ''Fuzzy RegEx'' gives you the ability to return the results of these "near matches".  As long as they are similar enough to the regular expression pattern, the result is still returned, allowing you to extract data from imperfect OCR results!
+So, how do you say what should match and what ''shouldn't''?  After all, "6200932" is also seven characters long but certainly shouldn't match "Grooper".  Should "Gopher" match?  Should "Grapple"?  Probably not.  What is "similar enough"?  How do you measure it?  How do you control it?
+''Fuzzy RegEx'' measures the similarity between a regular expression pattern and a potential match in terms of ''percentage similarity''.  The regular expression pattern <code>Grooper</code> matches the text string "Grooper" at a 100% similarity.  It is an exact match.  Being seven characters long, a single character in the string "Grooper" makes up roughly 14% of the word.
+[[File:Fuzzy-regex-about-06.png|center]]