2023:Word Match (Value Extractor): Difference between revisions

From Grooper Wiki


== How To ==
=== Setup ===


First, let's set the ''Word Match'' extractor on a '''Data Type'''.  


[[File:2023 Word Match - 2023 01 How To 01 Setup 01.png]]
[[File:2023 Word Match - 2023 01 How To 01 Setup 03.png]]


=== Changing the N-Gram ===
By default, the ''Word Match'' extractor collects single words (unigrams), and it often collects "words" that aren't actually in the English language. Let's say you want to collect bigrams instead, and only want to collect words that are included in an English dictionary.
[[File:2023 Word Match - 2023 01 How To 02 Changing the N-Gram 01.png]]
[[File:2023 Word Match - 2023 01 How To 02 Changing the N-Gram 02.png]]
[[File:2023 Word Match - 2023 01 How To 02 Changing the N-Gram 03.png]]


<!--
<!--

Revision as of 12:08, 6 December 2023

WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Word Match is an Extractor Type found in Grooper. This extractor is designed to collect full words and is often used in n-gram extraction.

About

The Word Match extractor is designed for n-gram extraction. An n-gram is "a contiguous sequence of n items from a given sample of text or speech." [1] Typically in Grooper, this refers to extracting words or phrases from a lexicon of terms.

Grooper generally uses n-grams for feature collection in Lexical Classification. The Word Match extractor can capture anything from 1-grams (single words) up to 5-grams (five-word phrases). A Lexicon is commonly used to restrict the returned results to a dictionary of allowable words. This could be a general Lexicon of common English words or a custom Lexicon, such as one with industry-specific terms.
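To make the idea of an n-gram concrete, here is a minimal Python sketch of collecting word n-grams from a line of text. It is not Grooper's implementation. The function name `word_ngrams`, the letters-only tokenizing rule, and the sample sentence are all assumptions made for illustration.

```python
import re


def word_ngrams(text, n):
    """Return every contiguous run of n words in text, lowercased.

    Illustrative only: the 1-to-5 limit mirrors the range described above,
    and the letters-only tokenizer is an assumption, not Grooper's rule.
    """
    if not 1 <= n <= 5:
        raise ValueError("n must be between 1 and 5")
    # Treat each unbroken run of letters as one word.
    words = re.findall(r"[a-z]+", text.lower())
    # Slide a window of n words across the list; too-short text yields [].
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


sample = "Please remit payment within thirty days"
unigrams = word_ngrams(sample, 1)  # six single words
bigrams = word_ngrams(sample, 2)   # five overlapping word pairs
```

Note that the windows overlap: every word except the first and last appears in two different bigrams.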


FYI

An n-gram is often referred to by a different name depending on the size of n.

1-grams (single words) - unigrams
2-grams (word pairs) - bigrams
3-grams (three-word phrases) - trigrams
4-grams (four-word phrases) - four-grams
5-grams (five-word phrases) - five-grams

As an additional FYI, four-grams are not called "tetragrams" because the term already has usage as a single word consisting of four letters or characters. "Quadrigram" is occasionally used, but four-gram is the more common terminology. Five-grams are not called "pentagrams", because that already has common usage for a geometric figure.

How To

Setup

First, let's set the Word Match extractor on a Data Type.




Changing the N-Gram

By default, the Word Match extractor collects single words (unigrams), and it often collects "words" that aren't actually in the English language. Let's say you want to collect bigrams instead, and only want to collect words that are included in an English dictionary.
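The goal described here can be sketched outside of Grooper in a few lines of Python: collect bigrams, but keep only those made of dictionary words. This is a rough illustration, not how the Word Match extractor is configured or implemented. `ENGLISH_LEXICON` is a tiny hypothetical stand-in for a real English Lexicon, `lexicon_bigrams` is a made-up name, and requiring every word in the pair to be in the lexicon is an assumption.

```python
import re

# Hypothetical stand-in for an English Lexicon; a real one would be far larger.
ENGLISH_LEXICON = {"invoice", "total", "amount", "due", "upon", "receipt"}


def lexicon_bigrams(text, lexicon):
    """Return two-word phrases whose words are both found in lexicon.

    Assumption: a bigram is kept only if each of its words is a lexicon entry.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return [
        f"{first} {second}"
        for first, second in zip(words, words[1:])  # adjacent word pairs
        if first in lexicon and second in lexicon
    ]


# "qzx" plays the part of a non-word, such as an OCR misread.
matches = lexicon_bigrams("Invoice total amount due qzx upon receipt", ENGLISH_LEXICON)
```

In this sketch, any pair that includes the non-word "qzx" is dropped, so `matches` contains only pairs of dictionary words.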