2.90:Fuzzy RegEx (Concept): Difference between revisions
Dgreenwood (talk | contribs) |
Dgreenwood (talk | contribs) No edit summary |
||
| Line 29: | Line 29: | ||
Standard regular expression is binary. Either a result will match the pattern or it won't. | Standard regular expression is binary. Either a result will match the pattern or it won't. The only "wiggle room" is baked into the standard syntax, such as the wildcard "dot" (<code>.</code>) metacharacter or the question mark (<code>?</code>) metacharacter optionally matching a character or character group. Even then, it's all or nothing. The text will 100% match the pattern as written, or it 0% will not. | ||
{| | {| | ||
|style="width:33%; text-align:center" valign=top| | |style="width:33%; text-align:center" valign=top| | ||
Revision as of 13:05, 12 November 2020
Fuzzy RegEx (also referred to as "fuzzy matching" or "fuzzy mode" or even just "fuzzy") allows regular expression patterns to match text within a set percentage of similarity. This can allow Grooper users to overcome unpredictable OCR errors when extracting data from documents.
Typically, regular expression will either match a string of text or it won't. If you're trying to match a word and the regex pattern is even a single character off from the text data, you will not return a result.
Fuzzy RegEx uses a Levenshtein distance equation to measure the difference between the regular expression and potential text matches. The percentage difference between the regex pattern and the matched text is expressed as a "confidence score" (also as a percentage). If the confidence is above a set threshold, the result is returned. If it is below the threshold, it is discarded.
For example, a text string that is 95% similar to the regex pattern may be off by just a single character. If the Minimum Similarity threshold is set to 90% the result would be returned, even thought the pattern doesn't match the text exactly.
About
|
Fuzzy RegEx is a Match Mode option for an extractor's regular expression pattern. Any time you can get to a "Pattern Editor" in Grooper you can take advantage of Fuzzy RegEx including:
|
Standard regular expression is binary. Either a result will match the pattern or it won't. The only "wiggle room" is baked into the standard syntax, such as the wildcard "dot" (.) metacharacter or the question mark (?) metacharacter optionally matching a character or character group. Even then, it's all or nothing. The text will 100% match the pattern as written, or it 0% will not.



