2023.1:Lexicon (Node Type)
This article is a stub. It contains minimal information on the topic and should be expanded.
A lexicon is a list of words, phrases, or other information. In Grooper, a Lexicon is likewise a text-based list that other objects reference in various ways.
About
Lexicons are divided into two parts, Type and Language.
Type specifies how data entered into the Lexicon will be interpreted. There are three Types.
Language specifies the language of the Lexicon's entries. It should match the language of your set of Documents.
The Uses of a Lexicon
Lexicons can be used to:
- Look up values during data extraction
  - For example, an extractor could be set up to return first or last names from a Lexicon of common first or last names.
- Translate extracted values from one value to another
  - For example, an extractor could be set up to look up the full name of a company (ACME Document Corporation) in a Lexicon and translate the result to an abbreviated version (ADC).
- Assign weighting values for fuzzy matching
- Determine the frequency of values within a document set
- And more.
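As a conceptual illustration only (plain Python, not Grooper's API or configuration), the lookup, translation, and frequency uses above can be sketched with a set and a dictionary. The names `FIRST_NAMES`, `COMPANY_ABBREVIATIONS`, `lookup`, `translate`, and `frequencies` are hypothetical, as are all sample entries other than the ACME example.

```python
import re
from collections import Counter

# Hypothetical list-style lexicon of first names, stored in lowercase.
FIRST_NAMES = {"alice", "bob", "carol"}

# Hypothetical mapping-style lexicon: full company name -> abbreviation.
COMPANY_ABBREVIATIONS = {"acme document corporation": "ADC"}


def lookup(text, lexicon):
    """Return the words in text that appear in the lexicon, in order."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w for w in words if w.lower() in lexicon]


def translate(value, mapping):
    """Return the mapped value for an entry, or the value unchanged if absent."""
    return mapping.get(value.strip().lower(), value)


def frequencies(documents, lexicon):
    """Count how often each lexicon entry occurs across a document set."""
    counts = Counter()
    for doc in documents:
        counts.update(w.lower() for w in lookup(doc, lexicon))
    return counts
```

Under these assumptions, `translate("ACME Document Corporation", COMPANY_ABBREVIATIONS)` yields `"ADC"`, and a value that is not in the mapping is returned unchanged.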
Lexicons in Classification
Classification is one of the places where Lexicons are most useful. Classification relies heavily on the language a document uses, and Lexicons make that language easier to work with.
Consider two Lexicons used for Classification: English Words and English Stop Words. English Words contains every word in the English language, while English Stop Words contains words that are considered unimportant and could impede Classification, such as the articles "a", "an", and "the". These words appear frequently simply because of how English works. While they're vital for sentence construction, they can interfere with Classification.
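The role of the two Lexicons can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Grooper implements Classification. The sets below are tiny hypothetical stand-ins for English Words and English Stop Words, and `classification_terms` is a made-up helper name.

```python
import re

# Tiny hypothetical stand-ins for the two Lexicons; real ones would be far larger.
ENGLISH_WORDS = {"a", "an", "the", "invoice", "total", "amount", "is", "due"}
ENGLISH_STOP_WORDS = {"a", "an", "the", "is"}


def classification_terms(text, vocabulary=ENGLISH_WORDS, stop_words=ENGLISH_STOP_WORDS):
    """Keep recognised words and drop stop words, preserving their order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t in vocabulary and t not in stop_words]
```

With these sample sets, `classification_terms("The total amount is due")` keeps `total`, `amount`, and `due`, leaving only the words that are likely to distinguish one kind of document from another.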