Word Match (Extractor Type)

Word Match is an Extractor Type that extracts individual words or phrases from documents. It is used for n-gram extraction. Each gram may optionally be checked against a dictionary Lexicon so that only words and phrases from a set vocabulary are returned.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

The Word Match extractor is designed for n-gram extraction. An n-gram is "a contiguous sequence of n items from a given sample of text or speech." [1] Typically in Grooper, this refers to extracting words or phrases from a lexicon of terms.

Grooper generally uses n-grams for feature collection in Lexical Classification. The Word Match extractor can capture 1-grams (single words) up to 5-grams (five-word phrases). Lexicons are commonly used to define the dictionary of words that may be returned. This could be a general Lexicon of common English words or a custom Lexicon, such as one with industry-specific terms.
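To make the idea of an n-gram concrete, here is a minimal Python sketch of the general technique. It is not Grooper's implementation: the function name `ngrams` and the word pattern `[A-Za-z]+` are assumptions made for illustration, and the 1 to 5 limit simply mirrors the range described above.

```python
import re

def ngrams(text, n):
    """Return every contiguous n-word phrase in text, in reading order."""
    if not 1 <= n <= 5:  # mirrors the 1-gram to 5-gram range described above
        raise ValueError("n must be between 1 and 5")
    # Assumed word pattern: runs of letters. Grooper's default may differ.
    words = re.findall(r"[A-Za-z]+", text)
    # Slide a window of n words across the text, one word at a time.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("the interest rate may increase", 2))
# ['the interest', 'interest rate', 'rate may', 'may increase']
```

Because the window moves one word at a time, neighboring phrases share words. A text with fewer than n words yields no phrases at all.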


FYI

An n-gram is often referred to by a different name depending on its size (the value of n).

1-grams (single words) - unigrams
2-grams (word pairs) - bigrams
3-grams (three word phrases) - trigrams
4-grams (four word phrases) - four-grams
5-grams (five word phrases) - five-grams

As an additional FYI, four-grams are not called "tetragrams" because the term already has usage as a single word consisting of four letters or characters. "Quadrigram" is occasionally used, but four-gram is the more common terminology. Five-grams are not called "pentagrams", because that already has common usage for a geometric figure.


One area where Word Match sees the most use is in Feature Extractors.

How To

Setup

First, let's set the Word Match extractor on a Data Type.

  1. The Word Match extractor can be used on a variety of objects wherever an extractor is needed. Select the object in the Node Tree.
  2. Click the hamburger icon next to the extractor property to access the dropdown and select the extractor.
  3. Select Word Match.



  1. Click the ellipsis icon.



  1. Notice that a pattern has already been entered into the Word Pattern box by default.
  2. The default pattern is designed to collect individual words on a document.
  3. Click over to the properties tab for more options to customize your extractor.

Adding a Lexicon

You can add a Lexicon to a Word Match extractor to aid in extraction. Word Match often collects "words" that aren't actually in the English language. Let's say you want to change it to only collect words that are included in an English dictionary.

  1. Click over to the Properties tab.
  2. Under the 'Options' category, click the ellipsis icon to the right of the Word Lookup property.
  3. When the Word Lookup window pops up, open the Vocabulary sub-properties and click on the ellipsis icon to the right of the Included Lexicons property.
  4. When the Included Lexicons window pops up, search through the folders to find the desired Lexicon and click the checkbox next to it to select it.
  5. Click "OK" on both the Included Lexicons and Word Lookup windows to save your changes.
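The effect of a Lexicon lookup can be sketched in a few lines of Python. This is an illustration of the idea only, not how Grooper performs the lookup: the `LEXICON` set stands in for a real Lexicon, `lexicon_words` is a hypothetical helper, and the case-insensitive comparison is an assumption.

```python
import re

# Hypothetical stand-in for a Lexicon: a small set of allowed words.
LEXICON = {"this", "amount", "may", "increase"}

def lexicon_words(text, lexicon):
    """Return only the words in text that appear in the lexicon."""
    # Assumed word pattern: runs of letters and digits.
    words = re.findall(r"[A-Za-z0-9]+", text)
    # Assumption: the lookup ignores case.
    return [w for w in words if w.lower() in lexicon]

# "rnay" imitates an OCR misread of "may" and is not in the lexicon.
print(lexicon_words("This amount rnay increase", LEXICON))
# ['This', 'amount', 'increase']
```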

Changing the N-Gram

By default, the Word Match extractor collects single words, or unigrams. Let's say you want to change this to bigrams, trigrams, four-grams, etc. We can do this from the "Properties" tab on the extractor window.

  1. To change the extractor from a unigram to a bigram (or trigram, four-gram, etc.), click the hamburger icon to the right of the Phrase Size property.
  2. Select the desired number from the dropdown.



  1. Now we see that Grooper is returning bigrams and only grabbing English words.
  2. Notice that here we have bigrams that overlap. For example, "this amount" and "amount increase" are two separate bigrams and both are captured.

Join Patterns

Sometimes words are separated by something other than a single space. Words can be hyphenated, have commas between them, be connected with an ampersand, or be joined in any number of other ways. To still include these words in an n-gram, you might want to write a Join Pattern using regex for your extractor.

  1. Notice that on this page, we are not collecting the word "interest" as part of a bigram. This is because of the comma between the words. We can account for this with a Join Pattern.



  1. We have written a regex pattern to tell Grooper that a space or a comma can separate the words in a bigram. We have also given it a quantifier of 1 to 2, so that more than one separator can join a bigram, such as both a comma and a space in the example.
  2. Now "interest" is included as part of a bigram.
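The join idea above can be tried out in plain Python regex. The exact pattern from the screenshot is not reproduced in the text, so `[ ,]{1,2}` below is an assumption that matches the description (space or comma, one to two times). The word pattern, the `bigrams` helper, and the sample text are also hypothetical.

```python
import re

WORD = r"[A-Za-z]+"             # assumed word pattern, not Grooper's default
SPACE_ONLY = r" "               # words separated by exactly one space
SPACE_OR_COMMA = r"[ ,]{1,2}"   # space or comma, one to two times

def bigrams(text, join):
    """Return overlapping two-word phrases whose words are separated by join."""
    # The lookahead captures a phrase without consuming it, so the next
    # phrase can start at the second word of the previous one.
    pattern = re.compile(
        r"(?<![A-Za-z])(?=(" + WORD + join + WORD + r")(?![A-Za-z]))"
    )
    return [m.group(1) for m in pattern.finditer(text)]

text = "interest, payable monthly"
print(bigrams(text, SPACE_ONLY))      # ['payable monthly']
print(bigrams(text, SPACE_OR_COMMA))  # ['interest, payable', 'payable monthly']
```

With a single space as the only separator, "interest" never appears in a bigram because a comma follows it. Allowing a comma and a space together brings it back in.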

For information on the Prefix and Suffix patterns, visit the Data Context wiki page.

Additional Information

Now that you know how to set up the features that can assist your Word Match extractor, let's look at some additional information regarding some of these features that will help you build a better Word Match extractor.

N-Grams: Uni vs Multi

When it comes to using N-Grams, which is the best approach: unigrams, which extract one word at a time, or larger N-Grams, which pick up multi-word phrases? The answer depends on how specific you want to be, or rather, what level of specificity gives you the results you want. Let's look at an example where a unigram may work better than a multi-word N-Gram, and vice versa.

Below we have an example document titled "Data Information Sheet". There are three ways the text data of the title can be extracted:

  1. Each word as individual pieces, i.e. a unigram:
    • Data|Information|Sheet
  2. An overlapping bigram:
    • Data Information|Information Sheet
  3. A trigram that extracts the entire title as one piece of data:
    • Data Information Sheet



Phrase Size is the property used to determine the size of the N-Gram used. 1 denotes a unigram, 2 a bigram, 3 a trigram, and so on. Normally, you won't need any N-Gram beyond a trigram, but the Phrase Size can go up to 5, as will be shown further down. For this example, we'll be looking at a case where we want to use N-Grams for Classification. Hence, we'll be focusing on the titles of the documents being used.

Extracting the title as one piece of data appears to be the best approach. Remember, it's all about how specific we want to be. Since we want the N-Grams to assist in classification, being as specific as possible and using a trigram would be the best approach.


  1. Testing out the application of an N-Gram, we've bumped the phrase size up to 2 - a bigram.
  2. Odd. Now we're not extracting anything. What happened?



  1. The title of the document is three words. Perhaps a trigram will work?
  2. ...and it didn't.



OCR errors are poison to multigrams. It's difficult to extract multi-word phrases when some of the text has been read incorrectly by Grooper. In a case like this, where one of the words of the title was not OCR'd properly, we are left with only the unigram as an option, as this way we can at least extract "Data" and "Sheet" with no issue.

  1. As we can see, we have an OCR error where "INFORMATION" was read as "INF0RMATION".
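A short Python sketch shows why a single misread word removes every phrase that contains it. It rests on an assumed rule, that a phrase is returned only when every word in it is found in the lexicon, which is consistent with the results described here but is not confirmed as Grooper's exact behavior. The `LEXICON` set and the `lexicon_ngrams` helper are hypothetical.

```python
import re

# Hypothetical stand-in for a Lexicon of allowed words.
LEXICON = {"data", "information", "sheet"}

def lexicon_ngrams(text, n, lexicon):
    """Return n-word phrases in which every word is in the lexicon."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    phrases = []
    for i in range(len(words) - n + 1):
        window = words[i:i + n]
        # Assumed rule: one unknown word rejects the whole phrase.
        if all(w.lower() in lexicon for w in window):
            phrases.append(" ".join(window))
    return phrases

clean = "DATA INFORMATION SHEET"
misread = "DATA INF0RMATION SHEET"  # a zero in place of the letter O

for n in (1, 2, 3):
    print(n, lexicon_ngrams(clean, n, LEXICON), lexicon_ngrams(misread, n, LEXICON))
# 1 ['DATA', 'INFORMATION', 'SHEET'] ['DATA', 'SHEET']
# 2 ['DATA INFORMATION', 'INFORMATION SHEET'] []
# 3 ['DATA INFORMATION SHEET'] []
```

In a three-word title, the middle word takes part in both bigrams and in the only trigram, so a misread there leaves nothing but the two surviving unigrams.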



  1. Here we have a second folder with an identical document.
  2. As you can see, there are no OCR errors; every word in the title is being extracted.



  1. Without the issue of OCR errors, we are able to increase our N-Gram from a unigram to a trigram to capture the entire title as one piece of text data.
  2. And voila! That's exactly what we've done.




To sum up, deciding whether to use a unigram or a multigram depends on two things: how specific you want to be, and whether there are any OCR errors. Mind, you can have a perfectly OCR'd document and not need to be specific, instead settling for a unigram. Just be aware that if you do want to be more specific and extract multi-word phrases, you will need to be vigilant for OCR errors.
