2023.1:Lexicon (Object)
Lexicon node objects are dictionary objects that store a list of keys or key-value pairs. Lexicons can define local entries, import entries from other Lexicons, and even import entries using a Data Connection. The entries in a Lexicon can be used in different areas of Grooper, such as data extraction, fuzzy matching, or OCR correction, providing a reference point that improves the accuracy and consistency of the software's operations.
You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.
About
Lexicons are divided into two parts, Type and Language.
Type specifies how data entered into the Lexicon will be interpreted. There are three Types:
- Lookup
- A Lookup Lexicon contains key-value pairs denoted by an equal sign, '='.
- Lookup Lexicons function as translation Lexicons, telling Grooper that two pieces of data are the same. For example, "XYZ Company, LLC = XYZ Company" lets Grooper know that these mean the same thing. That way, when it's time to reference the Lexicon for extraction, both versions of the data are extracted.
- Vocabulary
- A Vocabulary Lexicon consists of a list of values, one per line.
- This is the most commonly used type of Lexicon. Often, it's the only type one will use.
- Frequency
- A Frequency Lexicon is made up of key-count pairs. The key is the string value, and the count is the frequency at which the string appears.
- The only time one will ever really use a Frequency Lexicon is when using the Train Lexicon activity to build a key-count list of words for one or more documents.
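As a rough mental model (not Grooper's internal implementation), the three Types can be pictured as three familiar data structures: a list of values, a key-value mapping, and a key-count mapping. The Python sketch below uses hypothetical helper names. It assumes one entry per line and, for Lookup entries, splits on the first '='.

```python
from collections import Counter

def parse_vocabulary(text):
    """Vocabulary: one value per line; blank lines are ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]

def parse_lookup(text):
    """Lookup: 'key = value' pairs; split on the first '=' in each line."""
    pairs = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip lines that are not key-value pairs
        key, value = line.split("=", 1)
        pairs[key.strip()] = value.strip()
    return pairs

def count_words(words):
    """Frequency: each distinct string mapped to how often it appears."""
    return Counter(words)

vocabulary = parse_vocabulary("Smith\nJones\n\nGarcia")
lookup = parse_lookup("XYZ Company, LLC = XYZ Company")
frequency = count_words(["the", "form", "the"])
```

Here `vocabulary` is a plain list, `lookup` maps "XYZ Company, LLC" to "XYZ Company", and `frequency` records that "the" appeared twice.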
The Uses of a Lexicon
Lexicons can be used to:
- Look up values during data extraction
- For example, an extractor could be set up to return first or last names from a Lexicon of common first or last names.
- Translate extracted values from one value to another
- For example, an extractor could be set up to look up the full name of a company (ACME Document Corporation) in a Lexicon and translate the result to an abbreviated version (ADC).
- Assign weighting values for fuzzy matching
- Determine the frequency of values within a document set
- and more.
Lexicons in Classification
Lexicons can aid Feature Extractors by narrowing down the important terms, making it easier for Grooper to classify Documents. Unlike extractors, a Lexicon works behind the scenes. Its job is to act as a filter for extractors, telling them which words Grooper can and can't examine when performing extraction or training documents.
This is exemplified below with two Lexicons, English Words and English Stop Words. The former contains the most frequently used words in the English language. The latter contains only words that, while part of the English language, carry little meaning on their own, such as articles ("a", "an", "the"). These words are necessary for constructing everyday sentences, but they do little to help Grooper and could even hinder Classification.
Of course, a Lexicon's only job is to be the dictionary where words are stored. So how does it aid in Classification?
Take a look at the image below. Notice anything odd? Given our Pattern Match extractor of [A-z]+, every word should be extracted, right? Wrong. Remember the Lexicons. We used English Words to tell our extractor what we wanted picked up, and set English Stop Words as our Exclusion to tell Grooper what we didn't want. That is why, for example, the word "if" isn't extracted despite its numerous appearances on this W-4.
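The include-then-exclude behavior described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Grooper implements it: the function name and the tiny word lists are hypothetical, matching is assumed to be case-insensitive, and the pattern is written as `[A-Za-z]+` because in most regex engines `[A-z]` also matches a handful of punctuation characters that sit between "Z" and "a" in ASCII.

```python
import re

def extract_words(text, include, exclude):
    """Keep regex matches that are in `include` and not in `exclude`."""
    include_set = {w.lower() for w in include}
    exclude_set = {w.lower() for w in exclude}
    hits = []
    for word in re.findall(r"[A-Za-z]+", text):
        key = word.lower()  # assumption: case-insensitive comparison
        if key in include_set and key not in exclude_set:
            hits.append(word)
    return hits

# Hypothetical stand-ins for the English Words and English Stop Words Lexicons.
english_words = ["if", "you", "claim", "exemption", "from", "withholding"]
stop_words = ["if", "you", "from"]

result = extract_words(
    "If you claim exemption from withholding", english_words, stop_words
)
```

With these inputs, "If", "you", and "from" match the pattern and appear in the inclusion list, but they are dropped because they also appear in the exclusion list.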
With the back-end, behind-the-scenes work established, what does all this mean for the Classification Activity itself? How does helping identify and filter out what are essentially junk words help Grooper classify documents?
Lexicons in Data Extraction
Another area for Lexicons is Word and List Matching. For example, if you have a long list of names, words, or phrases that you want to capture without making several different Data Types, a Lexicon can come in handy. Just enter your list of string data, line by line, and reference your Lexicon for extraction. Grooper will do the heavy lifting for you, as shown below.
- For this example, we'll be looking at the Lexicon titled, Company Names
- This Lexicon will be a Lookup Lexicon. Some companies on our documents appear under different versions of their names. For example, Dos Mangos is also written as dosMangos. Same name, just written differently. Since Grooper doesn't know that, we'll use a Lookup Lexicon, which acts as a translation Lexicon, to help extract both versions of the company name.
- Lookup Lexicons consist of key-value pairs, where the keys are in yellow text and the value in blue; it's a way of telling Grooper that one piece of string data is equivalent to another. "XYZ Company, LLC = XYZ" for instance.
- With the Lexicon set up, let's move down to the Value Reader where the extraction of the company names will take place. To keep it simple, we've named it VE - Company Name.
- For the Extractor, we've chosen a List Match. However, we won't be using any Local Entries. We'll reference our Lexicon and let it do the work for us!
- One thing to note: in order for a Lookup Lexicon to function like a Lookup Lexicon and translate the values given to it, the Translate property MUST be set to True. Otherwise, the Lookup Lexicon will function like a Vocabulary Lexicon and you won't pick up any of the translated values.
- To see the Lexicon in action, navigate to the Tester tab.
- Since we've referenced a Lexicon for our Extractor, we have no need of any Local Entries.
- Note that we're picking up both versions of this company name: Outskirts Territories Electric Supply and Outskirts Territories Electric.
- With the Translate property set to True, instead of each version of the company name being output as it appears on the document, it's been translated to the value we set in the Lexicon.
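To make the effect of the Translate property concrete, here is a minimal Python sketch of a list match against a Lookup Lexicon. It is not Grooper's implementation. The `list_match` function, its `translate` flag, the sample text, and the "OTES" value are all hypothetical, and the sketch assumes that keys are matched literally and that longer keys win over shorter ones.

```python
import re

def list_match(text, lookup, translate=False):
    """Find Lookup keys in `text`; return the mapped values if `translate`."""
    # Try longer keys first so "... Electric Supply" wins over "... Electric".
    keys = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    found = [m.group(0) for m in pattern.finditer(text)]
    return [lookup[k] for k in found] if translate else found

# Hypothetical Lookup entries: both spellings map to one output value.
company_names = {
    "Outskirts Territories Electric Supply": "OTES",
    "Outskirts Territories Electric": "OTES",
}
text = (
    "Bill to: Outskirts Territories Electric Supply. "
    "Ship to: Outskirts Territories Electric."
)

as_written = list_match(text, company_names)
translated = list_match(text, company_names, translate=True)
```

With `translate` left off, the results come back exactly as they appear in the text, which is the Vocabulary-like behavior described above. With it on, both versions collapse to the single value defined in the Lexicon.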