2023:Word Match (Value Extractor)
Revision as of 14:17, 6 December 2023
The Word Match is an Extractor Type found in Grooper. This extractor is designed to collect full words and is often used in n-gram extraction.
About
The Word Match extractor is designed for n-gram extraction. An n-gram is "a contiguous sequence of n items from a given sample of text or speech." [1] Typically in Grooper, this refers to extracting words or phrases from a lexicon of terms.
Grooper generally uses n-grams to collect features for Lexical Classification. The Word Match extractor can capture anything from 1-grams (single words) up to 5-grams (five-word phrases). A Lexicon is commonly used as a dictionary of the words the extractor is allowed to return. This could be a general Lexicon of common English words or a custom Lexicon, such as one with industry-specific terms.
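To make the idea of an n-gram concrete, here is a minimal Python sketch. It is not Grooper's implementation (Grooper is configured through its own interface); the function name `ngrams` and the letters-only definition of a "word" are assumptions made for illustration only.

```python
import re

def ngrams(text, n):
    """Return every contiguous run of n words in text, joined by single spaces.

    Assumption: a "word" is a run of ASCII letters. Grooper's own word
    definition may differ.
    """
    words = re.findall(r"[A-Za-z]+", text)
    # Slide a window of size n across the word list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = "oil and gas lease agreement"
unigrams = ngrams(sample, 1)    # single words
bigrams = ngrams(sample, 2)     # two-word phrases
five_grams = ngrams(sample, 5)  # the whole five-word sample as one phrase
```

A five-word sample yields five unigrams, four bigrams, and exactly one five-gram. Asking for a larger n than the text has words returns an empty list.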
FYI
An n-gram is often referred to by a different name depending on its n size.
As an additional FYI, four-grams are not called "tetragrams" because that term is already used for a single word consisting of four letters or characters. "Quadrigram" is occasionally used, but "four-gram" is the more common term. Five-grams are not called "pentagrams" because that word is already in common use for a geometric figure.
How To
Setup
First, let's set the Word Match extractor on a Data Type.
Adding a Lexicon
You can add a Lexicon to a Word Match extractor to aid in extraction. On its own, Word Match often collects "words" that aren't actually in the English language. Let's say you want to change this to only collect words that are included in the English dictionary.
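The effect of a Lexicon can be sketched in a few lines of Python. This is only an illustration of the filtering idea, not how Grooper stores or applies a Lexicon; `match_words`, the sample lexicon, and the case-insensitive comparison are all assumptions.

```python
import re

def match_words(text, lexicon):
    """Return the words in text that also appear in lexicon.

    Assumptions: words are runs of ASCII letters, and matching is
    case-insensitive.
    """
    allowed = {term.lower() for term in lexicon}
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if w.lower() in allowed]

# Hypothetical lexicon and OCR-style input containing non-words.
lexicon = ["invoice", "total", "due"]
hits = match_words("Invoice totl due: 1O0 xqz", lexicon)
```

Without the lexicon, fragments such as "totl" and "xqz" would be collected as words. With it, only "Invoice" and "due" are returned.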
Changing the N-Gram
By default, the Word Match extractor collects single words, or unigrams. Let's say you want to change this to bigrams, trigrams, four-grams, and so on. We can do this from the "Properties" tab on the extractor window.
Join Patterns
Sometimes words are separated by something other than a single space. Words can be hyphenated, have commas between them, be connected with an ampersand, or be joined in any number of other ways. To still include these words in an n-gram, you might want to write a Join Pattern using regex for your extractor.
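The following Python sketch shows what a join pattern changes. It is not Grooper's matching logic: the function `ngrams_with_join`, its whitespace-only default, and the example regex are assumptions chosen to illustrate the idea. A window of n words counts as an n-gram only if the text between each adjacent pair of words fully matches the join regex.

```python
import re

def ngrams_with_join(text, n, join_pattern=r"\s+"):
    """Collect n-grams whose words are separated by text matching join_pattern.

    Assumption: words are runs of ASCII letters; the default join is
    whitespace only.
    """
    join = re.compile(join_pattern)
    tokens = list(re.finditer(r"[A-Za-z]+", text))
    results = []
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        # Text sitting between each adjacent pair of words in the window.
        gaps = [text[a.end():b.start()] for a, b in zip(window, window[1:])]
        if all(join.fullmatch(gap) for gap in gaps):
            # Return the original span, separators included.
            results.append(text[window[0].start():window[-1].end()])
    return results

text = "well-known oil & gas companies, Inc"
default = ngrams_with_join(text, 2)
# Hypothetical join regex: a hyphen, comma, or ampersand with optional
# surrounding whitespace, or plain whitespace.
relaxed = ngrams_with_join(text, 2, r"\s*[-,&]\s*|\s+")
```

With the whitespace-only default, only "known oil" and "gas companies" qualify as bigrams. With the relaxed pattern, "well-known", "oil & gas", and "companies, Inc" are collected as well.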
⚠ For information on the Prefix and Suffix patterns, visit the Data Context wiki page.







