2023:Word Match (Value Extractor): Difference between revisions

From Grooper Wiki
Created page with "{|class="wip-box" | '''WIP''' | This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly. This tag will be removed upon draft completion. |} === Word Match === The ''Word Match'' extractor is designed for n-gram extraction. An n-gram is "a contiguous sequence of n items from a given sample of text or speech." [https://en.wikipedia.org/w..."
 
m Dgreenwood moved page 2023:Word Match (Extractor Type) to 2023:Word Match (Value Extractor) without leaving a redirect
 
(53 intermediate revisions by 4 users not shown)
Line 1: Line 1:
{|class="wip-box"
{{AutoVersion}}
 
<blockquote>{{#lst:Glossary|Word Match}}</blockquote>
 
{|class="download-box"
|
|
'''WIP'''
[[File:Asset 22@4x.png]]
|
|
This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.
You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains one or more '''Batches''' of sample documents. The second contains one or more '''Projects''' with resources used in examples throughout this article.
* [[MEdia:2023 Wiki Word-Match Batches.zip]]
* [[MEdia:2023 Wiki Word-Match Projects.zip]]
|}


This tag will be removed upon draft completion.
== About ==
|}
The ''Word Match'' extractor is designed for n-gram extraction.  An n-gram is "a contiguous sequence of n items from a given sample of text or speech." [https://en.wikipedia.org/wiki/N-gram]  Typically in Grooper, this refers to extracting words or phrases from a lexicon of terms. 


=== Word Match ===
Grooper generally uses n-grams for the purpose of feature collection for Lexical Classification.  The ''Word Match'' extractor can capture 1-grams (single words) up to 5-grams (five-word phrases).  '''''Lexicons''''' are commonly used to dictate a dictionary of allowable returned words.  This could be a general '''Lexicon''' of common English words or a custom '''Lexicon''', such as one with industry-specific terms.


The ''Word Match'' extractor is designed for n-gram extraction.  An n-gram is "a contiguous sequence of n items from a given sample of text or speech." [https://en.wikipedia.org/wiki/N-gram]  Typically in Grooper, this refers to extracting words or phrases from a lexicon of terms.  Often, this is for the purposes of feature collection for ''[[Lexical]]'' [[Classification]].  The ''Word Match'' extractor can capture 1-grams (single words) up to 5-grams (five-word phrases).  '''''Lexicons''''' are commonly used to dictate a dictionary of allowable returned words.  This could be a general '''Lexicon''' of common English words or a custom '''Lexicon''', such as one with industry-specific terms.


{|cellpadding="10" cellspacing="5"
{|class="fyi-box"
|-style="background-color:#36b0a7; color:white"
|
|style="font-size:14pt"|'''FYI'''
'''FYI'''
|
|
An n-gram is often referred to by a different name depending on its ''n'' size.
Line 26: Line 32:
As an additional FYI, four-grams are ''not'' called "tetragrams" because the term already has usage as a single word consisting of four letters or characters.  "Quadrigram" is occasionally used, but four-gram is the more common terminology. Five-grams are not called "pentagrams", because that already has common usage for a geometric figure.
|}
|}
<br>
One area where Word Match sees the most use is in Feature Extractors.
== How To ==


Just like with ''Pattern Match'', you can enter '''''Prefix''''' and '''''Suffix Patterns''''' to only return an n-gram if a regex pattern ''also'' matches before or after.  These are useful for anchoring the n-gram you want to return next to some other piece of text.  For example, a '''''Prefix Pattern''''' of <code>\n</code> could be used to ''only'' return n-grams at the start of a new line because the <code>\n</code> character precedes every new line in the text data.  Furthermore, ''only'' the n-gram is returned, not the text matched by the '''''Prefix''''' and '''''Suffix Patterns'''''.


The '''''Join Pattern''''' property is unique to the ''Word Match'' extractor.  This determines how terms of bigrams, trigrams, four-grams, and five-grams can be joined.  Most often, terms (or grams) are simply joined by a single space, as in the bigram "''first second''".  If you leave this property blank, Grooper will assume n-grams are always separated by a single space.  However, you may want to include n-grams that are separated by other characters, such as hyphenated words like "''first-second''".  The '''''Join Pattern''''' allows you to enter a regular expression for the allowable characters between two grams.  For example, a '''''Join Pattern''''' of <code>[ -]</code> would allow for a single space or hyphen to be between each term, matching "''first second''" as well as "''first-second''".
=== Setup ===


The '''''Output Format''''' allows you to alter the output result for data cleansing or other purposes.
First, let's set the ''Word Match'' extractor on a '''Data Type'''.
<br>
<br>
# The Word Match extractor can be used on a variety of objects wherever an extractor is needed. Select the object in the Node Tree.
# Click the hamburger icon next to the extractor property to access the dropdown and select the extractor.
# Select Word Match.
[[File:2023 Word Match - 2023 01 How To 01 Setup 01.png]]
<br>
<br>
# Click the ellipsis icon.
[[File:2023 Word Match - 2023 01 How To 01 Setup 02.png]]
<br>
<br>
# Notice that a pattern has already been entered into the Word Pattern box by default.
# The default pattern is designed to collect individual words on a document.
# Click over to the properties tab for more options to customize your extractor.
[[File:2023 Word Match - 2023 01 How To 01 Setup 03.png]]


The "Properties" tab allows you to further configure the n-gram matching.  Most importantly, the n-gram size is set here as well as any '''Lexicon''' used to lookup against the returned values.  You can also enable [[Tab Marking]], [[Fuzzy RegEx]] mode, filter results based on page location, determine case sensitivity, and more.
=== Adding a Lexicon ===


{|cellpadding=10 cellpadding=5
You can add a '''Lexicon''' to a ''Word Match'' extractor to aid in extraction. ''Word Match'' often collects "words" that aren't actually in the English language. Let's say you want to change it to only collect words that are included in the English dictionary.
|valign=top style="width:40%"|
<br>
In this example, a '''Value Reader''' is configured to return bigram field labels, using the ''Word Match'' '''''Extractor Type'''''.
<br>
# Click over to the Properties tab.
# Under the 'Options' category, click the ellipsis icon to the right of the Word Lookup property.
# When the Word Lookup window pops up, open the Vocabulary sub-properties and click on the ellipsis icon to the right of the Included Lexicons property.
# When the Included Lexicons window pops up, search through the folders to find the desired Lexicon and click the checkbox next to it to select it.
# Click "OK" on both the Included Lexicons and Word Lookup windows to save your changes.
[[File:2023 Word Match - 2023 01 How To 02 Changing the N-Gram 01.png]]


# ''Word Match'' is selected as the '''''Extractor Type'''''
=== Changing the N-Gram ===
# The '''''Word Pattern''''' is entered here.
 
#* The regex pattern entered here is used to match each single gram in the n-gram. The default pattern <code>\p{L}+</code> matches any combination of letter characters in any language of any length.  In most cases, this pattern will perfectly suit your n-gram extraction needs. However, you can alter this pattern if you need. For example, <code>[a-zA-Z]+</code> is a very similar pattern that could be used to match English only words, as it does not include characters of foreign scripts. For example, it would not match Greek characters, such as Ω, where <code>\p{L}+</code> would.
By default the ''Word Match'' extractor collects single words or unigrams. Let's say you want to change this to bigrams, trigrams, four-grams, etc. We can do this from the "Properties" tab on the extractor window.
# The '''''Prefix Pattern''''' is entered here.
<br>
#* In this case, the pattern entered will only match n-grams if they are preceded by a <code>\n</code> <code>\t</code> or beginning of string <code>^</code> character.
<br>
# The '''''Suffix Pattern''''' is entered here.
# Now to change the extractor from a unigram to a bigram (or trigram, four-gram, etc.) click the hamburger icon to the right of the Phrase Size property.
#* In this case, the pattern entered will only match n-grams if they are followed by a <code>\r</code> <code>\t</code> or end of string <code>$</code> character.
# Select the desired number from the dropdown.
# The '''''Join Pattern''''' is entered here.
[[File:2023 Word Match - 2023 01 How To 02 Changing the N-Gram 02.png]]
#* The pattern here, <code>[ \-]</code>, will return n-grams whose grams are separated by a single space character or a hyphen. If left blank, only n-grams whose grams are separated by a single space character are returned.
<br>
# The '''''Output Format''''' is formatted here.
<br>
#* Unused in this example.
# Now we see that Grooper is returning bigrams and only grabbing English words.
|
# Notice that here we have bigrams that overlap. For example, "this amount" and "amount increase" are two separate bigrams and both are captured.
[[File:Value-reader-extractor-types-05.png]]
[[File:2023 Word Match - 2023 01 How To 02 Changing the N-Gram 03.png]]
 
=== Join Patterns ===
 
Sometimes words can be separated by something other than a single space. Words can be hyphenated, have commas between them, be connected with an ampersand, or any number of other things. To still include these words in an n-gram, you might want to write a Join Pattern using regex for your extractor.
<br>
<br>
#Notice that on this page, we are not collecting this word "interest" as part of a bigram. This is because of the comma between words. We can account for this with a Join Pattern.
[[File:2023 Word Match - 2023 01 How To 03 Join Patterns 01.png]]
<br>
<br>
# We have written a regex pattern to tell Grooper that a space or comma can separate words in a bigram. We have also given it a quantifier of 1 to 2, so multiple characters can join bigrams, such as both a comma and a space in the example.
# Now "interest" is included as part of a bigram.
[[File:2023 Word Match - 2023 01 How To 03 Join Patterns 02.png]]
 
{|class="attn-box"
|-
|-
|valign=top|
|
In this case, we also used the "Properties" tab to set the n-gram size to collect bigrams, and only return grams in an English language dictionary.
 
# Navigate to the "Properties" tab.
# The '''''Word Lookup''''' property can be used to reference a '''Lexicon''' of allowable terms for each gram in the n-gram.
#* Here, we reference the "English Words" '''Lexicon''' that ships with every Grooper install in the "Essentials" folder of the '''Global Resources''' folder.
# The '''''Phrase Size''''' property allows you to specify the size of the n-gram.
#* Here, it is set to ''2'' to capture bigrams.
|
|
[[File:Value-reader-extractor-types-06.png]]
For information on the Prefix and Suffix patterns, visit the [[Data Context]] wiki page.
|}
|}


== Additional Information ==
Now that you know how to set up the features that can assist your Word Match extractor, let's look at some additional information regarding some of these features that will help you build a better Word Match extractor.
=== N-Grams: Uni vs Multi ===
When it comes to using N-Grams, which is the best approach? Unigrams that will only extract one word, or N-Grams that will pick up multi-word phrases? The ultimate answer lies in how specific you want to be. Or rather, what level of specificity would give you the results you want. Let's look at some examples where a unigram may work over a multi-n-gram and vice versa.
<br>
Below we have an example document titled, "Data Information Sheet". There are three ways the text data of the title can be extracted:
# Each word as individual pieces, i.e. a unigram:
#* Data|Information|Sheet
# An overlapping bigram:
#* Data Information|Information Sheet
# A trigram that extracts the entire title as one piece of data:
#* Data Information Sheet
<br>
<br>
Phrase Size is the property used to determine the size of the N-Gram used. 1 denotes a unigram, 2 a bigram, 3 a trigram, and so on. Normally, you won't need any N-Gram beyond a trigram, but the Phrase Size can go up to 5, as will be shown further down. For this example, we'll be looking at a case where we want to use N-Grams for Classification. Hence, we'll be focusing on the titles of the documents being used.
<br>
<br>
Extracting the title as one piece of data appears to be the best approach. Remember, it's all about how specific we want to be. For example, let's say we wanted to use the N-Grams for assistance in classification. In that case, being as specific as possible and using a trigram would be the best approach.
<br>
[[File:2023_Word_Match_02_Additional_Information_N_-_Grams_Uni_vs_Multi_01.png]]
<br>
<br>
# Testing out the application of an N-Gram, we've bumped the phrase size up to 2 - a bigram.
# Odd. Now we're not extracting anything. What happened?
[[File:2023_Word_Match_02_Additional_Information_N_-_Grams_Uni_vs_Multi_02.png]]
<br>
<br>
#<li value=3> The title of the document is three words. Perhaps a trigram will work?
# ...and it didn't.
[[File:2023_Word_Match_02_Additional_Information_N_-_Grams_Uni_vs_Multi_03.png]]
<br>
<br>
OCR errors are poison to multigrams. It's difficult to extract multiple pieces of text data when some of the text has been read incorrectly by Grooper. In a case like this, where one of the words of the title was not OCR'd properly, we are left with only the unigram as an option, as this way we can at least extract "Data" and "Sheet" with no issue.
<br>
#<li value=5> As we can see, we have an OCR error where "INFORMATION" was read as "INF0RMATION".
[[File:2023_Word_Match_02_Additional_Information_N_-_Grams_Uni_vs_Multi_07.png]]
<br>
<br>
#<li value=6> Here we have a second folder with an identical document.
# As you can see, there are no OCR errors; every word in the title is being extracted.
[[File:2023_Word_Match_02_Additional_Information_N_-_Grams_Uni_vs_Multi_08.png]]
<br>
<br>
#<li value=8> Without the issue of OCR errors, we are able to increase our N-Gram from a unigram to a trigram to capture the entire title as one piece of text data.
# And voila! That's exactly what we've done.
[[File:2023_Word_Match_02_Additional_Information_N_-_Grams_Uni_vs_Multi_09.png]]
<br>
<br>
<br>
To sum up, deciding whether to use a unigram or a multigram depends on two things: how specific you want to be, and whether there are any OCR errors. Mind, you can have a perfectly OCR'd document and not need to be specific, instead settling for a unigram. Just be aware that if you do want to be more specific and extract more than one piece of text data, you will need to be vigilant for OCR errors.


{|cellpadding="10" cellspacing="5"
=== ===
|-style="background-color:#36b0a7; color:white"
== See Also: ==
|style="font-size:14pt"|'''FYI'''||Prior to Grooper version 2021, n-gram extraction configuration was lumped into other regular expression pattern configurations.  As with the ''Pattern Match'' extractor, this was delivered in one of two ways:


:1. By the '''Data Format''' object.
* [[Value Reader]]
:2. Configuring extractor properties and selecting ''Internal'' or ''Text Pattern'' as the extractor type.
* [[Data Context]]
 
Each of these methods used a "Pattern Editor" UI screen to configure a regular expression.  The n-gram size and referenced term lexicons were set in the "Properties" tab.  In version 2021, the '''Data Format''' object and the ''Internal'' and ''Text Pattern'' extractor types are gone.  The ''Word Match'' extractor replaces their functionality to return n-grams in an effort to simplify n-gram extraction setup and distinguish it from general regex pattern matching.
|}

Latest revision as of 16:13, 27 August 2025

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Word Match is a Value Extractor that extracts individual words or phrases from documents. It is used for n-gram extraction. Each gram may be optionally executed against a dictionary Lexicon to ensure words and phrases only match a set vocabulary.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

The Word Match extractor is designed for n-gram extraction. An n-gram is "a contiguous sequence of n items from a given sample of text or speech." [1] Typically in Grooper, this refers to extracting words or phrases from a lexicon of terms.

Grooper generally uses n-grams for the purpose of feature collection for Lexical Classification. The Word Match extractor can capture 1-grams (single words) up to 5-grams (five-word phrases). Lexicons are commonly used to dictate a dictionary of allowable returned words. This could be a general Lexicon of common English words or a custom Lexicon, such as one with industry-specific terms.


FYI

An n-gram is often referred to by a different name depending on its n size.

1-grams (single words) - unigrams
2-grams (word pairs) - bigrams
3-grams (three word phrases) - trigrams
4-grams (four word phrases) - four-grams
5-grams (five word phrases) - five-grams

As an additional FYI, four-grams are not called "tetragrams" because the term already has usage as a single word consisting of four letters or characters. "Quadrigram" is occasionally used, but four-gram is the more common terminology. Five-grams are not called "pentagrams", because that already has common usage for a geometric figure.


One area where Word Match sees the most use is in Feature Extractors.

How To

Setup

First, let's set the Word Match extractor on a Data Type.

  1. The Word Match extractor can be used on a variety of objects wherever an extractor is needed. Select the object in the Node Tree.
  2. Click the hamburger icon next to the extractor property to access the dropdown and select the extractor.
  3. Select Word Match.



  1. Click the ellipsis icon.



  1. Notice that a pattern has already been entered into the Word Pattern box by default.
  2. The default pattern is designed to collect individual words on a document.
  3. Click over to the properties tab for more options to customize your extractor.
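
To get a feel for what the Word Pattern does, here is a minimal Python sketch rather than Grooper's own implementation. Python's built-in re module does not support <code>\p{L}</code>, so the sketch assumes <code>[^\W\d_]+</code> as a rough stand-in for runs of letters in any script, and compares it with the English-only pattern <code>[a-zA-Z]+</code>. The sample text is made up for illustration.

```python
import re

# Assumption: [^\W\d_]+ approximates the default Word Pattern \p{L}+,
# because Python's re module has no \p{L}. It matches runs of word
# characters that are neither digits nor underscores.
ANY_SCRIPT_WORD = re.compile(r"[^\W\d_]+")

# English-only alternative: plain ASCII letters.
ASCII_WORD = re.compile(r"[a-zA-Z]+")

# Hypothetical sample text containing a Greek letter and a number.
text = "Resistance 47 Ω"

any_script_words = ANY_SCRIPT_WORD.findall(text)
ascii_words = ASCII_WORD.findall(text)

print(any_script_words)  # the Greek letter is kept, the number is skipped
print(ascii_words)       # only the ASCII word is kept
```

Neither pattern returns the number, and only the any-script pattern returns the Greek character.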

Adding a Lexicon

You can add a Lexicon to a Word Match extractor to aid in extraction. Word Match often collects "words" that aren't actually in the English language. Let's say you want to change it to only collect words that are included in the English dictionary.

  1. Click over to the Properties tab.
  2. Under the 'Options' category, click the ellipsis icon to the right of the Word Lookup property.
  3. When the Word Lookup window pops up, open the Vocabulary sub-properties and click on the ellipsis icon to the right of the Included Lexicons property.
  4. When the Included Lexicons window pops up, search through the folders to find the desired Lexicon and click the checkbox next to it to select it.
  5. Click "OK" on both the Included Lexicons and Word Lookup windows to save your changes.
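
Conceptually, the Lexicon acts as a filter on the words the Word Pattern matches. The sketch below illustrates that idea in Python under a few assumptions: the lexicon is a tiny hypothetical set of words (not the Lexicon that ships with Grooper), the lookup is case-insensitive, and <code>[^\W\d_]+</code> stands in for the default Word Pattern. It is not how Grooper performs the lookup internally.

```python
import re

# Assumption: stand-in for the default Word Pattern (letters only).
WORD_PATTERN = re.compile(r"[^\W\d_]+")

# Hypothetical, very small lexicon used only for this illustration.
LEXICON = {"the", "interest", "rate", "amount"}


def lexicon_words(text, lexicon):
    """Return matched words that also appear in the lexicon.

    The comparison is lowercased, which is an assumption about
    case handling made for this sketch.
    """
    return [w for w in WORD_PATTERN.findall(text) if w.lower() in lexicon]


# "rte" and "xqz" match the word pattern but are not in the lexicon.
words = lexicon_words("The interest rte xqz amount", LEXICON)
print(words)
```

Strings such as "rte" still match the word pattern, but they are dropped because they are not in the vocabulary.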

Changing the N-Gram

By default the Word Match extractor collects single words or unigrams. Let's say you want to change this to bigrams, trigrams, four-grams, etc. We can do this from the "Properties" tab on the extractor window.

  1. Now to change the extractor from a unigram to a bigram (or trigram, four-gram, etc.) click the hamburger icon to the right of the Phrase Size property.
  2. Select the desired number from the dropdown.



  1. Now we see that Grooper is returning bigrams and only grabbing English words.
  2. Notice that here we have bigrams that overlap. For example, "this amount" and "amount increase" are two separate bigrams and both are captured.
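
The overlap described above falls out naturally if you think of n-gram collection as a sliding window over the words. Here is a small Python sketch of that idea. The function name and the <code>phrase_size</code> parameter are hypothetical and simply mirror the Phrase Size property; this is an illustration, not Grooper's code.

```python
def ngrams(words, phrase_size):
    """Slide a window of phrase_size words across the list, one word at a time."""
    return [
        " ".join(words[i:i + phrase_size])
        for i in range(len(words) - phrase_size + 1)
    ]


words = ["this", "amount", "increase"]

unigrams = ngrams(words, 1)
bigrams = ngrams(words, 2)

print(unigrams)
print(bigrams)  # "amount" appears in both bigrams because the windows overlap
```

Because the window moves one word at a time, each interior word appears in more than one n-gram.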

Join Patterns

Sometimes words can be separated by something other than a single space. Words can be hyphenated, have commas between them, be connected with an ampersand, or any number of other things. To still include these words in an n-gram, you might want to write a Join Pattern using regex for your extractor.

  1. Notice that on this page, we are not collecting this word "interest" as part of a bigram. This is because of the comma between words. We can account for this with a Join Pattern.



  1. We have written a regex pattern to tell Grooper that a space or comma can separate words in a bigram. We have also given it a quantifier of 1 to 2, so multiple characters can join bigrams, such as both a comma and a space in the example.
  2. Now "interest" is included as part of a bigram.
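
The effect of a Join Pattern can be approximated with an ordinary regular expression: one word, followed by the join pattern and another word, repeated for the rest of the phrase. The Python sketch below assumes a join pattern of <code>[ ,]{1,2}</code> (a space or comma, one to two times, as described above), a hypothetical sample text, and <code>[^\W\d_]+</code> as a stand-in for the default Word Pattern. The function <code>find_ngrams</code> is made up for illustration and is not part of Grooper.

```python
import re


def find_ngrams(text, phrase_size, join_pattern=" ", word_pattern=r"[^\W\d_]+"):
    """Find (possibly overlapping) n-grams whose grams are separated by join_pattern."""
    # One word, then (join + word) repeated phrase_size - 1 times.
    body = f"{word_pattern}(?:{join_pattern}{word_pattern}){{{phrase_size - 1}}}"
    # The lookahead lets overlapping n-grams be reported;
    # \b keeps a match from starting or ending in the middle of a word.
    regex = re.compile(rf"(?=\b({body})\b)")
    return [m.group(1) for m in regex.finditer(text)]


# Hypothetical text with a comma between two words.
text = "accrued interest, rate"

space_only = find_ngrams(text, 2)
space_or_comma = find_ngrams(text, 2, join_pattern=r"[ ,]{1,2}")

print(space_only)      # the comma blocks the second bigram
print(space_or_comma)  # the join pattern bridges the comma and the space
```

With the default single-space join, the comma breaks the phrase. With the wider join pattern, the comma and the space together count as one separator, so the second bigram is found.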

For information on the Prefix and Suffix patterns, visit the Data Context wiki page.

Additional Information

Now that you know how to set up the features that can assist your Word Match extractor, let's look at some additional information regarding some of these features that will help you build a better Word Match extractor.

N-Grams: Uni vs Multi

When it comes to using N-Grams, which is the best approach? Unigrams that will only extract one word, or N-Grams that will pick up multi-word phrases? The ultimate answer lies in how specific you want to be. Or rather, what level of specificity would give you the results you want. Let's look at some examples where a unigram may work over a multi-n-gram and vice versa.
Below we have an example document titled, "Data Information Sheet". There are three ways the text data of the title can be extracted:

  1. Each word as individual pieces, i.e. a unigram:
    • Data|Information|Sheet
  2. An overlapping bigram:
    • Data Information|Information Sheet
  3. A trigram that extracts the entire title as one piece of data:
    • Data Information Sheet



Phrase Size is the property used to determine the size of the N-Gram used. 1 denotes a unigram, 2 a bigram, 3 a trigram, and so on. Normally, you won't need any N-Gram beyond a trigram, but the Phrase Size can go up to 5, as will be shown further down. For this example, we'll be looking at a case where we want to use N-Grams for Classification. Hence, we'll be focusing on the titles of the documents being used.

Extracting the title as one piece of data appears to be the best approach. Remember, it's all about how specific we want to be. For example, let's say we wanted to use the N-Grams for assistance in classification. In that case, being as specific as possible and using a trigram would be the best approach.


  1. Testing out the application of an N-Gram, we've bumped the phrase size up to 2 - a bigram.
  2. Odd. Now we're not extracting anything. What happened?



  3. The title of the document is three words. Perhaps a trigram will work?
  4. ...and it didn't.



OCR errors are poison to multigrams. It's difficult to extract multiple pieces of text data when some of the text has been read incorrectly by Grooper. In a case like this, where one of the words of the title was not OCR'd properly, we are left with only the unigram as an option, as this way we can at least extract "Data" and "Sheet" with no issue.

  5. As we can see, we have an OCR error where "INFORMATION" was read as "INF0RMATION".



  6. Here we have a second folder with an identical document.
  7. As you can see, there are no OCR errors; every word in the title is being extracted.



  8. Without the issue of OCR errors, we are able to increase our N-Gram from a unigram to a trigram to capture the entire title as one piece of text data.
  9. And voila! That's exactly what we've done.




To sum up, deciding whether to use a unigram or a multigram depends on two things: how specific you want to be, and whether there are any OCR errors. Mind, you can have a perfectly OCR'd document and not need to be specific, instead settling for a unigram. Just be aware that if you do want to be more specific and extract more than one piece of text data, you will need to be vigilant for OCR errors.
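
The walkthrough above can be reproduced with a simplified Python model. It assumes that text is split on whitespace and that an n-gram is only returned when every gram in it is found in the lexicon. The lexicon here is a hypothetical three-word set, and the function name is made up for illustration, so treat this as a sketch of the behavior rather than Grooper's actual matching logic.

```python
# Hypothetical lexicon containing only the words of the title.
LEXICON = {"data", "information", "sheet"}


def lexicon_ngrams(text, phrase_size, lexicon):
    """Return n-grams in which every gram is a lexicon word.

    Simplifying assumptions: grams are whitespace-separated tokens
    and the lexicon lookup ignores case.
    """
    tokens = text.split()
    results = []
    for i in range(len(tokens) - phrase_size + 1):
        window = tokens[i:i + phrase_size]
        if all(token.lower() in lexicon for token in window):
            results.append(" ".join(window))
    return results


clean_title = "DATA INFORMATION SHEET"
ocr_title = "DATA INF0RMATION SHEET"  # the letter O was read as a zero

for size in (1, 2, 3):
    print(size, lexicon_ngrams(ocr_title, size, LEXICON),
          lexicon_ngrams(clean_title, size, LEXICON))
```

In this model, one misread word is enough to invalidate every bigram and trigram that includes it, while the unigrams for the correctly read words survive.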

See Also:

    • Value Reader
    • Data Context