2023.1:Lexicon (Node Type)

From Grooper Wiki
Revision as of 16:37, 26 February 2024 by Dsmith (talk | contribs) (→‎About)

This article is a stub. It contains minimal information on the topic and should be expanded.

A lexicon is a list of words, phrases, or other information. In Grooper, a Lexicon is a text-based list that other objects can reference in various ways.

About

Lexicons are divided into two parts, Type and Language.

Type specifies how data entered into the Lexicon will be interpreted. There are three Types:

  • Lookup
    • A Lookup Lexicon contains key-value pairs denoted by an equal sign, '='.
  • Vocabulary
    • A Vocabulary Lexicon consists of a list of values, one per line.
  • Frequency
    • A Frequency Lexicon is made up of key-count pairs. The key is the string value, and the count is the frequency at which the string appears.
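
To make the three Types concrete, here is a minimal Python sketch of how each kind of list could be read. This is not Grooper's implementation. The function names are hypothetical, and the '=' separator used for Frequency entries is an assumption (the article only specifies '=' for Lookup Lexicons).

```python
from collections import Counter


def parse_vocabulary(text):
    """Vocabulary: one value per line; blank lines are skipped."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_lookup(text):
    """Lookup: key=value pairs, split on the first '='."""
    pairs = {}
    for line in parse_vocabulary(text):
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that have no '=' at all
            pairs[key.strip()] = value.strip()
    return pairs


def parse_frequency(text, sep="="):
    """Frequency: key-count pairs.

    ASSUMPTION: the separator between key and count is '='.
    The real on-disk format may differ.
    """
    counts = Counter()
    for line in parse_vocabulary(text):
        key, found, count = line.rpartition(sep)
        if found and count.strip().isdigit():
            counts[key.strip()] += int(count.strip())
    return counts
```

Splitting Lookup entries on the first '=' keeps any later '=' characters as part of the value, while Frequency entries are split on the last separator so that the count is always the trailing number.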

Language specifies the language of the Lexicon's entries. This should match the language of your set of Documents.

The Uses of a Lexicon

Lexicons can be used to:

  • Look up values during data extraction
    • For example, an extractor could be set up to return first or last names from a Lexicon of common first or last names.
  • Translate extracted values from one value to another
    • For example, an extractor could be set up to look up the full name of a company (ACME Document Corporation) in a Lexicon and translate the result to an abbreviated version (ADC).
  • Assign weighting values for fuzzy matching
  • Determine the frequency of values within a document set
  • and more.
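
The first two uses can be illustrated with a short Python sketch. It is only an approximation of the idea, not how a Grooper extractor is configured. The names find_entries and translate, and the sample entries other than the ACME example, are made up for illustration.

```python
import re

# Hypothetical Vocabulary entries (a real list of first names would be much longer).
first_names = ["Alice", "Bob"]

# Hypothetical Lookup entries: full company name -> abbreviation.
abbreviations = {"ACME Document Corporation": "ADC"}


def find_entries(text, entries):
    """Return every whole-word occurrence of a Lexicon entry in text."""
    if not entries:
        return []
    # Try longer entries first so a long phrase wins over a shorter prefix.
    ordered = sorted(entries, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(entry) for entry in ordered) + r")\b"
    )
    return [match.group(0) for match in pattern.finditer(text)]


def translate(value, lookup):
    """Replace an extracted value with its Lookup translation, if one exists."""
    return lookup.get(value, value)


text = "Invoice from ACME Document Corporation, attn: Bob"
names = find_entries(text, first_names)
companies = [translate(c, abbreviations) for c in find_entries(text, abbreviations)]
```

Here names holds the first names found in the text, and companies holds each matched company name after translation. A value with no Lookup entry is returned unchanged.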

Lexicons in Classification

One of the places where Lexicons shine most is Classification. After all, making use of a document's language is a key point of classification. Lexicons are a great way to help ease the process.

In this example, we have two Lexicons we're using for Classification: English Words and English Stop Words. English Words is a list of English-language words, while English Stop Words lists words that are considered unimportant and could impede Classification, such as the articles "a", "an", and "the". These words appear frequently simply because of how English sentences are built. While they're vital for sentence construction, they can interfere with Classification.
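
As a rough illustration of why stop words are filtered out, the Python sketch below counts only the words that appear in a word list and are not stop words. The tiny word sets and the function name classification_features are hypothetical, and this is a simplified stand-in for whatever Grooper does internally.

```python
import re
from collections import Counter

# Tiny stand-ins for the English Words and English Stop Words Lexicons.
english_words = {"invoice", "total", "amount", "due", "the", "a", "an", "is"}
stop_words = {"a", "an", "the", "is"}


def classification_features(text, known_words, stop):
    """Count known words in text, skipping stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t in known_words and t not in stop)


features = classification_features(
    "The total amount due is the invoice total", english_words, stop_words
)
```

Without the stop-word filter, "the" would tie with "total" as the most frequent word in this sentence even though it says nothing about the document's type.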