2023:Word Match (Value Extractor): Difference between revisions

From Grooper Wiki
Revision as of 08:08, 13 March 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Word Match is an Extractor Type in Grooper. It is designed to collect full words and is often used in n-gram extraction.

About

The Word Match extractor is designed for n-gram extraction. An n-gram is "a contiguous sequence of n items from a given sample of text or speech." [1] Typically in Grooper, this refers to extracting words or phrases from a lexicon of terms.

Grooper generally uses n-grams to collect features for Lexical Classification. The Word Match extractor can capture 1-grams (single words) up to 5-grams (five-word phrases). Lexicons are commonly used as a dictionary of the words the extractor is allowed to return. This could be a general Lexicon of common English words or a custom Lexicon, such as one with industry-specific terms.
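To make the idea of an n-gram concrete, here is a minimal Python sketch that builds n-grams from a line of text. It is only an illustration of the concept, not how Grooper implements Word Match. The `ngrams` function, the letters-only word pattern, and the sample sentence are all assumptions made for this example.

```python
import re

def ngrams(text, n):
    """Return every contiguous run of n words in text, joined by single spaces."""
    # Assumption: a "word" is a run of letters. Grooper's default Word Pattern may differ.
    words = re.findall(r"[A-Za-z]+", text)
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sample = "Please remit payment within thirty days"

unigrams = ngrams(sample, 1)    # six single words
bigrams = ngrams(sample, 2)     # five word pairs, starting with "Please remit"
five_grams = ngrams(sample, 5)  # two five-word phrases
```

A six-word sentence yields six unigrams but only two five-grams, so larger values of n produce fewer, more specific features.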


FYI

An n-gram is often referred to by a different name depending on its size n.

1-grams (single words) - unigrams
2-grams (word pairs) - bigrams
3-grams (three-word phrases) - trigrams
4-grams (four-word phrases) - four-grams
5-grams (five-word phrases) - five-grams

As an additional FYI, four-grams are not called "tetragrams" because the term already has usage as a single word consisting of four letters or characters. "Quadrigram" is occasionally used, but four-gram is the more common terminology. Five-grams are not called "pentagrams", because that already has common usage for a geometric figure.


Word Match sees the most use in Feature Extractors.

How To

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains a Project with resources used in examples throughout this article. The second contains one or more Batches of sample documents.

Setup

First, let's set the Word Match extractor on a Data Type.

  1. The Word Match extractor can be used on a variety of objects wherever an extractor is needed. Select the object in the Node Tree.
  2. Click the hamburger icon next to the extractor property to access the dropdown and select the extractor.
  3. Select Word Match.




  1. Click the ellipsis icon.




  1. Notice that a pattern has already been entered into the Word Pattern box by default.
  2. The default pattern is designed to collect individual words on a document.
  3. Click over to the Properties tab for more options to customize your extractor.


Adding a Lexicon

You can add a Lexicon to a Word Match extractor to aid in extraction. Word Match often collects "words" that aren't actually in the English language. Let's say you want to change it to collect only words that are included in an English dictionary.
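The effect of a Lexicon can be pictured as a simple filter over candidate words. The Python sketch below is a conceptual illustration only and does not use any Grooper API. The `LEXICON` set, the `words_in_lexicon` helper, and the sample OCR text are hypothetical.

```python
import re

# Hypothetical stand-in for a Lexicon: a small set of allowed words.
LEXICON = {"invoice", "total", "amount", "due", "date"}

def words_in_lexicon(text, lexicon):
    """Keep only the words from text that appear in the lexicon (case-insensitive)."""
    # Assumption: a candidate "word" is any run of letters or digits.
    words = re.findall(r"[A-Za-z0-9]+", text)
    return [w for w in words if w.lower() in lexicon]

# "T0tal" (an OCR misread) and "xqz" are collected as candidates but are not real words.
ocr_text = "Invoice T0tal amount due xqz"
kept = words_in_lexicon(ocr_text, LEXICON)
```

Here `kept` holds only "Invoice", "amount", and "due". The misread "T0tal" and the nonsense string "xqz" are dropped because neither is in the word list.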



Changing the N-Gram

By default, the Word Match extractor collects single words (unigrams). Let's say you want to change this to bigrams, trigrams, four-grams, etc. You can do this from the "Properties" tab on the extractor window.



Join Patterns

Sometimes words are separated by something other than a single space. Words can be hyphenated, have commas between them, be connected with an ampersand, or be joined in any number of other ways. To still include these words in an n-gram, you might want to write a Join Pattern using regex for your extractor.
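As a rough sketch of the idea behind a Join Pattern, the Python below pairs adjacent words only when the text between them matches a regular expression. This is an assumption-laden illustration, not Grooper's implementation. The `joined_bigrams` function, the example pattern, and the choice to return each pair joined by a single space are all hypothetical.

```python
import re

def joined_bigrams(text, join_pattern=r" "):
    """Pair adjacent words whose separating text fully matches join_pattern."""
    tokens = list(re.finditer(r"[A-Za-z]+", text))
    joiner = re.compile(join_pattern)
    pairs = []
    for left, right in zip(tokens, tokens[1:]):
        # The "gap" is whatever sits between two neighboring words.
        gap = text[left.end():right.start()]
        if joiner.fullmatch(gap):
            pairs.append(f"{left.group()} {right.group()}")
    return pairs

text = "oil & gas lease, well-site"

# Only a single space joins words: just "gas lease" qualifies.
strict = joined_bigrams(text)

# Optional hyphen, ampersand, or comma, with optional surrounding whitespace.
loose = joined_bigrams(text, r"\s*[-&,]?\s*")
```

With the single-space pattern only one bigram is found. The looser pattern also joins "oil & gas", "lease, well", and "well-site", which shows how the pattern controls which neighboring words count as part of the same n-gram.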



For information on the Prefix and Suffix patterns, visit the Data Context wiki page.

Additional Information

Now that you know how to set up the features that can assist your Word Match extractor, let's look at some additional information regarding some of these features that will help you build a better Word Match extractor.

N-Grams: Uni vs Multi

When it comes to n-grams, which approach is best: unigrams, which extract a single word at a time, or larger n-grams, which pick up multi-word phrases? It depends on the type of document you're trying to extract data from (and potentially classify). Let's look at some examples where a unigram may work better than a multi-word n-gram, and vice versa.
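One way to see the trade-off is with two lines of text that contain exactly the same words in a different order. The Python sketch below is a hypothetical illustration and not Grooper output. The `ngram_set` helper and both sample sentences are invented for this example.

```python
import re

def ngram_set(text, n):
    """Return the set of lowercase n-grams found in text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

doc_a = "Transfer from savings to checking"
doc_b = "Transfer from checking to savings"

# Both documents contain the same five words, so unigrams cannot tell them apart.
same_unigrams = ngram_set(doc_a, 1) == ngram_set(doc_b, 1)

# Bigrams keep word order, so some phrases appear in only one document.
only_in_a = ngram_set(doc_a, 2) - ngram_set(doc_b, 2)
```

In this case `same_unigrams` is true, while `only_in_a` contains phrases such as "from savings". When documents use distinct vocabulary, unigrams are usually enough and give a smaller feature set. When documents share vocabulary and differ mainly in phrasing, multi-word n-grams carry the distinguishing information.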

See Also: