2023.1:Lexicon (Node Type)

From Grooper Wiki
Revision as of 14:15, 1 March 2024 by Dsmith

This article is a stub. It contains minimal information on the topic and should be expanded.

A lexicon is a list of words, phrases, or other information. Grooper Lexicons are also text-based lists referenced in various ways by other objects.

About

Lexicons are divided into two parts, Type and Language.

Type specifies how data entered into the Lexicon will be interpreted. There are three Types:

  • Lookup
    • A Lookup Lexicon contains key-value pairs denoted by an equal sign, '='.
      • Lookup Lexicons function as translation Lexicons, telling Grooper that two pieces of data are the same. For example, "XYZ Company, LLC = XYZ Company" lets Grooper know that these mean the same thing. That way, when the Lexicon is referenced for extraction, both versions of the data are extracted.
  • Vocabulary
    • A Vocabulary Lexicon consists of a list of values, one per line.
      • This is the most commonly used type of Lexicon. Often, it's the only type one will use.
  • Frequency
    • A Frequency Lexicon is made up of key-count pairs. The key is the string value, and the count is the frequency at which the string appears.
      • The only time one will ever really use a Frequency Lexicon is when using the Train Lexicon activity to build a key-count list of words for one or more documents.
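The three Types can be pictured as three ways of reading the same kind of text list. The Python sketch below is only an illustration of that idea, not Grooper's implementation: the function names are hypothetical, and the tokenizing rule used to build the key-count list is an assumption, since this article does not describe how the Train Lexicon activity splits words.

```python
import re
from collections import Counter


def parse_vocabulary(text):
    """Vocabulary: one value per line; blank lines are ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_lookup(text):
    """Lookup: key-value pairs split on the first '=' of each line."""
    pairs = {}
    for line in parse_vocabulary(text):
        key, sep, value = line.partition("=")
        if sep:  # skip lines that contain no '='
            pairs[key.strip()] = value.strip()
    return pairs


def build_frequency(documents):
    """Frequency: count how often each word appears across documents.

    Assumption: words are lowercase runs of letters and apostrophes.
    """
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return counts


lookup = parse_lookup("XYZ Company, LLC = XYZ Company")
names = parse_vocabulary("Alice\n\nBob\n")
freq = build_frequency(["the invoice total", "The invoice"])
```

Here `lookup` maps the long company name to the short one, `names` is a plain list of values, and `freq` holds a count for each word seen in the sample documents.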

FYI

Language tells Grooper what specific language you are working with. While this may seem like something important that will need to be set every time a Lexicon is created, the Language property really only comes in handy when it comes to Frequency Lexicons. Otherwise, it's not a necessity.

The Uses of a Lexicon

Lexicons can be used to:

  • Look up values during data extraction
    • For example, an extractor could be set up to return first or last names from a Lexicon of common first or last names.
  • Translate extracted values from one value to another
    • For example, an extractor could be set up to look up the full name of a company (ACME Document Corporation) in a Lexicon and translate the result to an abbreviated version (ADC).
  • Assign weighting values for fuzzy matching
  • Determine the frequency of values within a document set
  • and more.
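The translation use case can be sketched in a few lines of Python. This is a simplified illustration rather than Grooper's actual behavior: the `translate` function is hypothetical, and the case-insensitive comparison is an assumption made for the example.

```python
def translate(value, lookup):
    """Return the translated value if the Lexicon has a matching key.

    Assumption: keys are compared case-insensitively. Values with no
    matching key are returned unchanged.
    """
    normalized = {key.casefold(): target for key, target in lookup.items()}
    return normalized.get(value.strip().casefold(), value)


# Hypothetical Lookup Lexicon with a single key-value pair.
company_lookup = {"ACME Document Corporation": "ADC"}

short_name = translate("ACME Document Corporation", company_lookup)
untouched = translate("Some Other Company", company_lookup)
```

In this sketch `short_name` becomes "ADC", while `untouched` keeps its original text because the Lexicon has no entry for it.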

Lexicons in Classification

One of the places where Lexicons shine most is Classification. After all, making use of a document's language is a key point of classification. Lexicons are a great way to help ease the process.

In this example, two Lexicons are used for Classification: English Words and English Stop Words. English Words contains every word in the English language, while English Stop Words contains every word that would be considered unimportant and could impede Classification, such as the articles "a", "an", and "the". These words appear frequently simply because of their role in the English language. While they're vital for sentence construction, they can interfere with Classification. As for the English Words Lexicon, it contains the words most commonly used throughout the English language. Unlike the Stop Words, these words could aid in Classification.
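The role of the two Lexicons can be illustrated with a short Python sketch. It is not a description of how Grooper classifies documents internally. It only shows the general idea of keeping words found in a vocabulary list while discarding stop words. The function name, the tokenizing rule, and the tiny sample lists are all assumptions made for the example.

```python
import re
from collections import Counter


def content_word_counts(text, vocabulary, stop_words):
    """Count words that are in the vocabulary and are not stop words.

    Assumption: words are lowercase runs of letters.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t in vocabulary and t not in stop_words)


# Tiny stand-ins for the English Words and English Stop Words Lexicons.
english_words = {"the", "an", "invoice", "is", "for", "order"}
english_stop_words = {"a", "an", "the"}

counts = content_word_counts(
    "The invoice is an invoice for the order",
    english_words,
    english_stop_words,
)
```

With the stop words removed, "invoice" stands out as the most frequent remaining word, which is the kind of signal a classifier can use.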



Of course, a Lexicon's only job is to be the dictionary where words are stored. How do they aid in Classification?

Lexicons in Data Extraction

Another area where Lexicons are useful is Word and List Matching. For example, if you have a long list of specific names, words, or phrases that you want to capture without making several different Data Types, a Lexicon can come in handy. Enter your list of string data, one entry per line, then reference your Lexicon for extraction, and Grooper will do the heavy lifting for you.
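As a rough illustration of list matching, the Python sketch below turns a list of entries into a single pattern and returns every entry found in a piece of text. It is a sketch under stated assumptions, not Grooper's extraction engine: the `extract` function is hypothetical, matching is case-insensitive by choice, and longer entries are tried first so that a phrase wins over a shorter entry it begins with.

```python
import re


def extract(text, entries):
    """Return every Lexicon entry found in the text, in order of appearance."""
    unique = {e for e in entries if e}
    if not unique:
        return []
    # Longest entries first so that "John Smith" is preferred over "John".
    ordered = sorted(unique, key=len, reverse=True)
    # The lookarounds keep matches from starting or ending inside a word.
    pattern = r"(?<!\w)(?:" + "|".join(re.escape(e) for e in ordered) + r")(?!\w)"
    return [m.group(0) for m in re.finditer(pattern, text, re.IGNORECASE)]


# Hypothetical Vocabulary Lexicon, one entry per line.
lexicon_entries = ["John", "John Smith", "Acme Corp"]

matches = extract("Invoice from Acme Corp to John Smith and JOHN", lexicon_entries)
```

Each match is returned with the spelling found in the text, so the uppercase "JOHN" at the end of the sample is reported as written.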