2.90:Fuzzy RegEx (Concept): Difference between revisions
Dgreenwood (talk | contribs) No edit summary |
Dgreenwood (talk | contribs) |
||
| Line 56: | Line 56: | ||
| | | | ||
If you break up the string "Grooper" into characters, it is composed of seven characters. For the regex pattern <code>Grooper</code>, each character in the pattern matches each character in the string starting with "G" through the last "r". The only difference between "Grooper" and "6rooper" is the "G" is replaced by a "6". But otherwise, six out of the seven characters match the regex pattern <code>Grooper</code>. | If you break up the string "Grooper" into characters, it is composed of seven characters. For the regex pattern <code>Grooper</code>, each character in the pattern matches each character in the string starting with "G" through the last "r". The only difference between "Grooper" and "6rooper" is the "G" is replaced by a "6". But otherwise, six out of the seven characters match the regex pattern <code>Grooper</code>. | ||
|style="width: | |style="width:30%" valign=top| | ||
[[File:Fuzzy-regex-about-05.png]] | [[File:Fuzzy-regex-about-05.png]] | ||
|} | |} | ||
Being so close, it sure seems like the regex ''should'' match (Or at least we surely might like it to!). ''Fuzzy RegEx'' gives you the ability to return the results of these "near matches". As long as they are similar enough to the regular expression pattern, the result is still returned, allowing you to extract data from imperfect OCR results! | Being so close, it sure seems like the regex ''should'' match (Or at least we surely might like it to!). ''Fuzzy RegEx'' gives you the ability to return the results of these "near matches". As long as they are similar enough to the regular expression pattern, the result is still returned, allowing you to extract data from imperfect OCR results! | ||
Revision as of 15:18, 12 November 2020
Fuzzy RegEx (also referred to as "fuzzy matching" or "fuzzy mode" or even just "fuzzy") allows regular expression patterns to match text within a set percentage of similarity. This can allow Grooper users to overcome unpredictable OCR errors when extracting data from documents.
Typically, regular expression will either match a string of text or it won't. If you're trying to match a word and the regex pattern is even a single character off from the text data, you will not return a result.
Fuzzy RegEx uses a Levenshtein distance equation to measure the difference between the regular expression and potential text matches. The percentage difference between the regex pattern and the matched text is expressed as a "confidence score" (also as a percentage). If the confidence is above a set threshold, the result is returned. If it is below the threshold, it is discarded.
For example, a text string that is 95% similar to the regex pattern may be off by just a single character. If the Minimum Similarity threshold is set to 90% the result would be returned, even thought the pattern doesn't match the text exactly.
About
|
Fuzzy RegEx is a Match Mode option for an extractor's regular expression pattern. Any time you can get to a "Pattern Editor" in Grooper you can take advantage of Fuzzy RegEx including:
|
The Problem
Standard regular expression is binary. Either a result will match the pattern or it won't. The only "wiggle room" is baked into the standard syntax, such as the wildcard "dot" (.) metacharacter or the question mark (?) metacharacter optionally matching a character or character group. Even then, it's all or nothing. The text will 100% match the pattern as written, or it 0% will not.
However, OCR results are imperfect. Characters are recognized as different from what is on the page. Spaces get inserted or removed between characters. Even with Image Processing and OCR Synthesis software as good as Grooper has, you can't assume 100% accurate OCR results on every document. And, even if the OCR results are one tiny character off, the pattern will fail to produce a match.
For example, it's not outside of the realm of possibility an OCR engine would see a cluster of pixels representing a "G" and erroneously recognize it as a "6", depending on the font used or quality of the scanned image.
How Fuzzy RegEx Solves the Problem
|
If you break up the string "Grooper" into characters, it is composed of seven characters. For the regex pattern |
Being so close, it sure seems like the regex should match (Or at least we surely might like it to!). Fuzzy RegEx gives you the ability to return the results of these "near matches". As long as they are similar enough to the regular expression pattern, the result is still returned, allowing you to extract data from imperfect OCR results!




