2023:Word Match (Value Extractor)
WIP
This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly. This tag will be removed upon draft completion.
Word Match
The Word Match extractor is designed for n-gram extraction. An n-gram is "a contiguous sequence of n items from a given sample of text or speech." [1] Typically in Grooper, this refers to extracting words or phrases from a lexicon of terms. Often, this is done to collect features for Lexical Classification. The Word Match extractor can capture 1-grams (single words) up to 5-grams (five-word phrases). Lexicons are commonly used to restrict results to a dictionary of allowable words. This could be a general Lexicon of common English words or a custom Lexicon, such as one with industry-specific terms.
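The idea of lexicon-filtered n-gram extraction can be sketched outside Grooper in a few lines of Python. This is only an illustration, not Grooper's implementation: the function name `extract_ngrams`, the `\w+` tokenization, and the rule that every word of a phrase must appear in the lexicon are all assumptions made for the example.

```python
import re

def extract_ngrams(text, n, lexicon=None):
    """Return every contiguous n-word phrase in text, in reading order.

    Assumption: when a lexicon is given, a phrase is kept only if each of
    its words appears in the lexicon. Grooper's own rule may differ.
    """
    if not 1 <= n <= 5:
        raise ValueError("n must be between 1 and 5")
    # Hypothetical tokenization: lowercase runs of word characters.
    words = re.findall(r"\w+", text.lower())
    # Slide a window of n words across the token list.
    grams = [words[i:i + n] for i in range(len(words) - n + 1)]
    if lexicon is not None:
        grams = [g for g in grams if all(w in lexicon for w in g)]
    return [" ".join(g) for g in grams]

sample = "Invoice Number 12345 Invoice Date"
lexicon = {"invoice", "number", "date"}
print(extract_ngrams(sample, 2, lexicon))
```

With this sample, the bigrams "number 12345" and "12345 invoice" are dropped because "12345" is not in the lexicon, leaving "invoice number" and "invoice date".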
FYI
An n-gram is often referred to by a different name depending on the size of n. A 1-gram is a unigram, a 2-gram is a bigram, and a 3-gram is a trigram.
As an additional FYI, four-grams are not called "tetragrams" because the term already has usage as a single word consisting of four letters or characters. "Quadrigram" is occasionally used, but four-gram is the more common terminology. Five-grams are not called "pentagrams" because that already has common usage for a geometric figure.
Just like with Pattern Match, you can enter Prefix and Suffix Patterns to only return an n-gram if a regex pattern also matches before or after. These are useful for anchoring the n-gram you want to return next to some other piece of text. For example, a Prefix Pattern of \n could be used to only return n-grams at the start of a new line because the \n character precedes every new line in the text data. Furthermore, only the n-gram is returned, not the text matched by the Prefix and Suffix Patterns.
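The anchoring behavior described above can be approximated with ordinary regular expressions. The sketch below is an assumption-laden illustration rather than Grooper's engine: `find_anchored` and `BIGRAM` are hypothetical names, the prefix is matched but left out of the returned value, and the suffix is checked with a lookahead.

```python
import re

def find_anchored(text, value_pattern, prefix=None, suffix=None):
    """Return only the value text; prefix and suffix act as anchors."""
    parts = []
    if prefix:
        # The prefix must match immediately before the value.
        parts.append(f"(?:{prefix})")
    parts.append(f"(?P<value>{value_pattern})")
    if suffix:
        # Lookahead: the suffix must follow but is not consumed.
        parts.append(f"(?={suffix})")
    regex = re.compile("".join(parts))
    return [m.group("value") for m in regex.finditer(text)]

# Hypothetical stand-in for a bigram: two alphabetic words and a space.
BIGRAM = r"[A-Za-z]+ [A-Za-z]+"
text = "Total Due 100\nInvoice Number 55\n12 Main Street"

print(find_anchored(text, BIGRAM))
print(find_anchored(text, BIGRAM, prefix=r"\n"))
```

Without a prefix, the sketch returns "Total Due", "Invoice Number", and "Main Street". With a prefix of `\n`, only "Invoice Number" is returned, since it is the only bigram that directly follows a newline character.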
The Join Pattern property is unique to the Word Match extractor. It determines how the terms of bigrams, trigrams, four-grams, and five-grams can be joined. Most often, terms (or grams) are simply joined by a single space, as in the bigram "first second". If you leave this property blank, Grooper will assume the terms of an n-gram are always separated by a single space. However, you may want to include n-grams whose terms are separated by other characters, such as hyphenated words like "first-second". The Join Pattern allows you to enter a regular expression for the allowable characters between two grams. For example, a Join Pattern of [ -] would allow a single space or hyphen between each term, matching "first second" as well as "first-second".
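A rough way to picture the Join Pattern is as the regex placed between each pair of terms when the phrase pattern is assembled. The following is a minimal sketch under that assumption; `phrase_regex` is a hypothetical helper, not a Grooper function, and the blank-means-single-space default mirrors the behavior described above.

```python
import re

def phrase_regex(terms, join_pattern=None):
    """Build a regex for a phrase whose terms are separated by join_pattern."""
    # A blank join pattern falls back to a single space.
    joiner = join_pattern or " "
    body = f"(?:{joiner})".join(re.escape(t) for t in terms)
    return re.compile(rf"\b{body}\b", re.IGNORECASE)

text = "first second, then first-second"
default = phrase_regex(["first", "second"])
flexible = phrase_regex(["first", "second"], join_pattern="[ -]")

print(default.findall(text))   # only the space-joined bigram
print(flexible.findall(text))  # space-joined and hyphen-joined bigrams
```

Keeping the joiner as a separate parameter means the term list stays unchanged while the allowable separators are widened or narrowed.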
The Output Format allows you to alter the output result for data cleansing or other purposes.
The "Properties" tab allows you to further configure the n-gram matching. Most importantly, the n-gram size is set here, as well as any Lexicon that returned values are looked up against. You can also enable Tab Marking, enable Fuzzy RegEx mode, filter results based on page location, set case sensitivity, and more.
In this example, a Value Reader is configured to return bigram field labels, using the Word Match Extractor Type.
In this case, we also used the "Properties" tab to set the n-gram size to collect bigrams and to only return grams found in an English language dictionary.
FYI
Prior to Grooper version 2021, n-gram extraction configuration was lumped into other regular expression pattern configurations. As with the Pattern Match extractor, this was delivered in one of two ways:
Each of these methods used a "Pattern Editor" UI screen to configure a regular expression. The n-gram size and referenced term lexicons were set in the "Properties" tab. In version 2021, the Data Format object and the Internal and Text Pattern extractor types are gone. The Word Match extractor replaces their functionality to return n-grams in an effort to simplify n-gram extraction setup and distinguish it from general regex pattern matching.

