2.90:Fuzzy RegEx (Concept)

From Grooper Wiki
Revision as of 13:01, 12 November 2020 by Dgreenwood (talk | contribs) (→‎About)

Fuzzy RegEx (also referred to as "fuzzy matching" or "fuzzy mode" or even just "fuzzy") allows regular expression patterns to match text within a set percentage of similarity. This can allow Grooper users to overcome unpredictable OCR errors when extracting data from documents.

Typically, a regular expression either matches a string of text or it doesn't. If you're trying to match a word and the regex pattern is off from the text data by even a single character, no result is returned.

Fuzzy RegEx uses a Levenshtein distance equation to measure the difference between the regular expression and potential text matches. The degree of similarity between the regex pattern and the matched text is expressed as a percentage, called the "confidence score". If the confidence is above a set threshold, the result is returned. If it is below the threshold, it is discarded.

For example, a text string that is 95% similar to the regex pattern may be off by just a single character. If the Minimum Similarity threshold is set to 90%, the result would be returned, even though the pattern doesn't match the text exactly.
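The scoring idea above can be sketched in a few lines of Python. This is only an illustration, not Grooper's implementation: the function names are hypothetical, and normalizing the Levenshtein distance by the length of the longer string is an assumption, since the exact formula Grooper uses is not given here. It also compares two literal strings rather than evaluating a full regex pattern.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, or substitutions needed to turn `a` into `b`."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def similarity(expected: str, candidate: str) -> float:
    """Similarity in the range 0.0 to 1.0.
    ASSUMPTION: distance is normalized by the longer string's length."""
    longest = max(len(expected), len(candidate))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(expected, candidate) / longest


def fuzzy_accept(expected: str, candidate: str,
                 minimum_similarity: float = 0.90) -> bool:
    """Keep the candidate only if it meets the similarity threshold."""
    return similarity(expected, candidate) >= minimum_similarity


if __name__ == "__main__":
    # A 20-character string with one character misread ("5" read as "S").
    expected = "Invoice Number 12345"
    ocr_text = "Invoice Number 1234S"
    print(similarity(expected, ocr_text))
    print(fuzzy_accept(expected, ocr_text, minimum_similarity=0.90))
```

Under this assumed normalization, one wrong character in a 20-character string gives a similarity of 95%, which clears a 90% threshold. The same single error in a much shorter string would score lower, so the threshold has to be chosen with the length of the expected text in mind.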

About

Fuzzy RegEx is a Match Mode option for an extractor's regular expression pattern. Any time you can get to a "Pattern Editor" in Grooper, you can take advantage of Fuzzy RegEx, including:

  • When configuring a Data Format
  • When configuring the Pattern property of a Data Type
  • When choosing the Internal option for various extractors in a property panel and configuring its Pattern property
  • When choosing the Text Pattern option for various extractors in a property panel and configuring its Pattern property


In the "Pattern Editor" window, you can enable Fuzzy RegEx in two steps:

  1. Navigate to the "Properties" tab.
  2. Select the Mode property and choose Fuzzy RegEx.


Standard regular expression matching is binary. Either a result matches the pattern or it doesn't. There is no "wiggle room" beyond what the standard syntax provides, such as the wildcard "dot" (.) character.

What's the difference between these two strings of text?

One starts with a "G". The other starts with the number six.

If we used a simple regular expression to literally match Grooper, the first would match, and the second would not.
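The all-or-nothing behavior of a literal pattern can be reproduced with Python's standard `re` module. This is a minimal sketch, not Grooper's engine. The string `6rooper` is an assumed stand-in for the OCR-misread text described above (a "G" read as the number six), and `matches_exactly` is a hypothetical helper name.

```python
import re


def matches_exactly(pattern: str, text: str) -> bool:
    """Return True only if the whole text matches the regex pattern."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    clean_text = "Grooper"
    # ASSUMPTION: illustrative OCR misread where "G" was read as "6".
    ocr_error_text = "6rooper"

    print(matches_exactly(r"Grooper", clean_text))      # True
    print(matches_exactly(r"Grooper", ocr_error_text))  # False
```

The two strings differ by one character, but a standard regex has no notion of "almost": the second string is rejected outright.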