2.90:Fuzzy RegEx (Concept): Difference between revisions

From Grooper Wiki

Revision as of 12:58, 12 November 2020

Fuzzy RegEx (also referred to as "fuzzy matching", "fuzzy mode", or just "fuzzy") allows regular expression patterns to match text within a set percentage of similarity. This helps Grooper users overcome unpredictable OCR errors when extracting data from documents.

Typically, a regular expression will either match a string of text or it won't. If you're trying to match a word and the regex pattern is even a single character off from the text data, no result is returned.

Fuzzy RegEx uses Levenshtein distance (the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another) to measure the difference between the regular expression and potential text matches. The similarity between the regex pattern and the matched text is expressed as a percentage "confidence score". If the confidence is above a set threshold, the result is returned. If it is below the threshold, it is discarded.

For example, a text string that is 95% similar to the regex pattern may be off by just a single character. If the Minimum Similarity threshold is set to 90%, the result would be returned, even though the pattern doesn't match the text exactly.
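
The scoring idea can be sketched in a few lines of Python. This is only an illustration, not Grooper's implementation: the function names, the formula `1 - distance / longer length`, the "at or above" threshold comparison, and the sample strings are all assumptions made for the example. It also compares two whole strings, whereas a real fuzzy regex engine searches for approximate matches inside a larger text.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(pattern: str, text: str) -> float:
    """Assumed scoring: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(pattern), len(text))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(pattern, text) / longest


def fuzzy_accept(pattern: str, text: str, minimum_similarity: float = 0.90) -> bool:
    """Assumed rule: keep a result whose score is at or above the threshold."""
    return similarity(pattern, text) >= minimum_similarity


# A 20-character string with one wrong character is 95% similar.
expected = "Invoice Number: 0001"
ocr_read = "Invoice Numbcr: 0001"  # hypothetical OCR error: "e" read as "c"
print(levenshtein(expected, ocr_read))   # 1
print(fuzzy_accept(expected, ocr_read))  # True at a 90% threshold
```

Under the same assumed formula, short strings are penalized more heavily for a single error: one wrong character in a 7-character word scores about 86%, which would fall below a 90% threshold.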

About

Fuzzy RegEx is a Match Mode option for an extractor's regular expression pattern. Any time you can get to a "Pattern Editor" in Grooper, you can take advantage of Fuzzy RegEx, including:

  • When configuring a Data Format
  • When configuring the Pattern property of a Data Type
  • When choosing the Internal option for various extractors in a property panel and configuring its Pattern property
  • When choosing the Text Pattern option for various extractors in a property panel and configuring its Pattern property


In the "Pattern Editor" window, you can enable Fuzzy RegEx in two steps:

  1. Navigate to the "Properties" tab.
  2. Select the Mode property and choose Fuzzy RegEx.

What's the difference between these two strings of text?

One starts with a "G". The other starts with the number six.

If we used a simple regular expression to literally match Grooper, the first would match, and the second would not.
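
The all-or-nothing behavior of a literal pattern can be reproduced with Python's standard `re` module. This is a sketch outside Grooper; the second string, `6rooper`, is an assumption based on the description above (the same word with the leading "G" read as the number six).

```python
import re

pattern = re.compile(r"Grooper")  # literal pattern, no fuzzy matching

clean_text = "Grooper"
ocr_text = "6rooper"  # assumed OCR misread: "G" recognized as "6"

# A literal regex either matches or it does not.
print(pattern.fullmatch(clean_text) is not None)  # True
print(pattern.fullmatch(ocr_text) is not None)    # False

# The two strings differ in only one character position.
mismatches = sum(1 for a, b in zip(clean_text, ocr_text) if a != b)
print(mismatches)  # 1
```

A single misread character is enough to lose the result entirely, which is the gap Fuzzy RegEx is meant to close.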