2.90:Fuzzy RegEx (Concept): Difference between revisions

From Grooper Wiki

Revision as of 11:19, 12 November 2020

Fuzzy RegEx allows regular expression patterns to match text within a set percentage of similarity. This can allow Grooper users to overcome unpredictable OCR errors when extracting data from documents.

Typically, a regular expression either matches a string of text or it doesn't. If you're trying to match a word and the regex pattern is off from the text data by even a single character, no result is returned.

Fuzzy RegEx uses a Levenshtein distance equation to measure the difference between the regular expression and potential text matches. That difference is converted into a percentage of similarity between the regex pattern and the matched text, which is reported as a "confidence score". If the confidence is above a set threshold, the result is returned. If it is below the threshold, it is discarded.
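
The idea can be illustrated with a minimal Python sketch. It is not Grooper's implementation: it compares two literal strings rather than a regex pattern against text, and it assumes the similarity is normalized by the length of the longer string and that a score exactly at the threshold is kept. The function names (levenshtein, similarity, fuzzy_match) are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(expected: str, found: str) -> float:
    """Similarity in [0, 1]. Assumption: normalize by the longer length."""
    longest = max(len(expected), len(found))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(expected, found) / longest


def fuzzy_match(expected: str, found: str, minimum_similarity: float = 0.90) -> bool:
    """Keep the candidate when its score reaches the threshold.
    Assumption: a score exactly equal to the threshold is kept."""
    return similarity(expected, found) >= minimum_similarity


# An OCR-style error: capital "I" read as lowercase "l".
print(fuzzy_match("Invoice Number", "lnvoice Number"))  # one edit in 14 characters
print(fuzzy_match("Invoice Number", "lnvoice Numbar"))  # two edits in 14 characters
```

With one wrong character out of 14, the score is roughly 93% and the candidate is kept at a 90% threshold. With two wrong characters it drops to roughly 86% and the candidate is discarded.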

For example, a text string that is 95% similar to the regex pattern may be off by just a single character. If the Minimum Similarity threshold is set to 90%, the result would be returned, even though the pattern doesn't match the text exactly.
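
The arithmetic behind this example can be sketched as follows, assuming (this is not confirmed by the source) that similarity is computed as the share of characters left unedited, and that a score exactly at the threshold is kept. Both function names are hypothetical.

```python
def similarity_percent(length: int, edits: int) -> float:
    """Percent similarity for a string of `length` characters that differs
    from the pattern by `edits` single-character changes (assumed formula)."""
    return 100 * (length - edits) / length


def max_edits_allowed(length: int, minimum_similarity_percent: int) -> int:
    """Largest number of edits that still reaches the threshold.
    Integer arithmetic avoids floating-point rounding surprises."""
    return length * (100 - minimum_similarity_percent) // 100


print(similarity_percent(20, 1))   # 95.0: one character off in a 20-character match
print(max_edits_allowed(20, 90))   # 2: up to two wrong characters still pass at 90%
print(max_edits_allowed(5, 90))    # 0: a 5-character match tolerates no errors at 90%
```

Under this assumption, the same threshold is much stricter for short values than for long ones, which is worth keeping in mind when choosing a Minimum Similarity setting.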

About