2023.1:Lexicon (Node Type)

From Grooper Wiki
Revision as of 09:55, 27 August 2024 by Randallkinard

Lexicons are dictionaries used throughout Grooper to store lists of words, phrases, weightings for Fuzzy RegEx, and more. Users can add entries to a Lexicon, Lexicons can import entries from other Lexicons by referencing them, and entries can be dynamically imported from a database using a Data Connection. Lexicons are commonly used to aid in data extraction, most often through the "List Match" and "Word Match" extractors.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

Lexicons are divided into two parts, Type and Language.

Type specifies how data entered into the Lexicon will be interpreted. There are three Types:

  • Lookup
    • A Lookup Lexicon contains key-value pairs denoted by an equal sign, '='.
      • Lookup Lexicons function as translation Lexicons, telling Grooper that two pieces of data are the same. For example, "XYZ Company, LLC = XYZ Company" lets Grooper know that these mean the same thing. That way, when it's time to reference the Lexicon for extraction, both versions of the data are extracted.
  • Vocabulary
    • A Vocabulary Lexicon consists of a list of values, one per line.
      • This is the most commonly used type of Lexicon. Often, it's the only type one will use.
  • Frequency
    • A Frequency Lexicon is made up of key-count pairs. The key is the string value, and the count is the frequency at which the string appears.
      • The only time one will ever really use a Frequency Lexicon is when using the Train Lexicon activity to build a key-count list of words for one or more documents.
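
As a rough illustration of how the entry formats differ, the sketch below parses Vocabulary entries (one value per line) and Lookup entries (key-value pairs split on '='). The function names and the choice to split on the first '=' are assumptions made for this example, not Grooper's actual parser.

```python
def parse_vocabulary(text):
    """Treat every non-blank line as one Vocabulary entry."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_lookup(text):
    """Split each non-blank line on its first '=' into a key and a value.

    Assumption for this sketch: whitespace around the key and value is ignored.
    """
    pairs = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in lookup entry: {line!r}")
        pairs[key.strip()] = value.strip()
    return pairs


vocabulary = parse_vocabulary("Dos Mangos\nXYZ Company\n")
lookup = parse_lookup("XYZ Company, LLC = XYZ Company\n")
```

Here `vocabulary` is a plain list of values, while `lookup` maps "XYZ Company, LLC" to "XYZ Company".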



FYI

Language tells Grooper what specific language you are working with. This may seem like a required property, but it is not. The Language property defines the allowable Character Set. It is configured when using Train Lexicon (with its Language Aware Training property enabled) to build Lexicons, to better ensure only words of a certain language are trained.
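
To make the idea of key-count pairs and an allowable character set concrete, here is a minimal sketch that counts words across documents and skips any word containing characters outside a chosen set. The `ALLOWED` set, the `train_frequency` function, and the tokenizing rules are all hypothetical. How Train Lexicon actually tokenizes and filters is not described in this article.

```python
import re
from collections import Counter

# Hypothetical character set standing in for what a Language might allow.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz'")


def train_frequency(documents, allowed=ALLOWED):
    """Build key-count pairs, keeping only words made of allowed characters."""
    counts = Counter()
    for text in documents:
        for token in re.findall(r"\S+", text.lower()):
            word = token.strip(".,;:!?\"()")  # drop surrounding punctuation
            if word and set(word) <= allowed:
                counts[word] += 1
    return counts


counts = train_frequency(["The form. The employee's form.", "año 2024 form"])
```

With this input, "form" is counted three times and "the" twice, while "año" and "2024" are skipped because they contain characters outside the allowed set.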

The Uses of a Lexicon

Lexicons can be used to:

  • Look up values during data extraction
    • For example, an extractor could be set up to return first or last names from a Lexicon of common first or last names.
  • Translate extracted values from one value to another
    • For example, an extractor could be set up to look up the full name of a company (ACME Document Corporation) in a Lexicon and translate the result to an abbreviated version (ADC).
  • Assign weighting values for fuzzy matching
  • Determine the frequency of values within a document set
  • and more.

Lexicons in Classification

Lexicons can aid Feature Extractors by narrowing down the important terms, making it easier for Grooper to classify Documents. Unlike extractors, a Lexicon's work is more behind the scenes. Its job is to act as a filter for extractors, telling them which words Grooper can and can't examine when performing extraction or training documents.

This is exemplified below with the two Lexicons, English Words and English Stop Words. The former contains the most frequently used words in the English language, while the latter only contains words that, while a part of the English language, don't really mean much in the grand scheme of things. Articles ("a", "an", "the") are one example: words that, while necessary in the construction of everyday sentences, won't do much to aid Grooper, and could even hinder Classification.



Of course, a Lexicon's only job is to be the dictionary where words are stored. How do they aid in Classification?

Take a look at the image below. Notice anything odd? Given our Pattern Match extractor of [A-z]+, all word characters should be getting extracted, right? Wrong. Remember the Lexicons. We used English Words to tell our extractor what we wanted picked up, while also telling Grooper what we didn't want by having English Stop Words set as our Exclusion. Hence, for example, why the word "if" isn't being extracted despite its numerous appearances on this W-4.
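
The filtering described above can be sketched as follows. The two sets are tiny hypothetical stand-ins for the English Words and English Stop Words Lexicons, and the case-insensitive comparison is an assumption for this sketch rather than documented Grooper behavior. One side note on the pattern itself: in ASCII, the range `A-z` also covers the six characters between `Z` and `a` (`[`, `\`, `]`, `^`, `_` and the backtick), so `[A-Za-z]+` is the stricter way to say "letters only".

```python
import re

# Hypothetical stand-ins for the two Lexicons used in the article.
english_words = {"if", "you", "claim", "exemption", "withholding", "employee"}
english_stop_words = {"if", "you", "the", "a", "an"}


def extract_features(text, include, exclude):
    """Keep pattern matches that are in `include` and not in `exclude`."""
    candidates = re.findall(r"[A-z]+", text)  # the article's pattern, as-is
    return [
        word
        for word in candidates
        if word.lower() in include and word.lower() not in exclude
    ]


features = extract_features(
    "If you claim exemption from withholding, the Employee signs.",
    english_words,
    english_stop_words,
)
```

"If" and "you" appear in the inclusion set but are dropped because the exclusion set also lists them, which mirrors why "if" is missing from the W-4 results.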



With the back-end, behind-the-scenes work established, what does all this mean for the Classification Activity itself? How does helping identify and filter out what are essentially junk words help Grooper classify documents?

  1. First, create a Batch Process and add a Classify step.
  2. Set the Activity to Classify, the Scope to Folder, and the Folder Level to 1.
  3. For the Content Model scope, choose the one that contains the Document Types you want to use for Classification. This example will use the 03.02 Training Demo (HR Docs) - Content Model.




  1. With the Batch Process Step configured, select the Classification Tester tab.
  2. Right-click on what you wish to train. Select "Classification", followed by "Train As..."




  1. In the Train As window, select the hamburger icon at the far right of the property to access the drop-down menu.
  2. Expand the desired Content Model and select the Document Type you wish to use.




  1. With that done, we have now successfully trained a Document as a Federal W-4, as per the Document Type.
  2. One thing to note is the Similarity Score. Since the Document was trained and classified as the desired Document Type, its score is now coming in at 100%.


Lexicons in Data Extraction

Another area for Lexicons is Word and List Matching. For example, if you have a specific list of numerous names, words, or phrases that you want to capture without making several different Data Types, then a Lexicon can come in handy. Just enter your list of string data, line by line, and reference your Lexicon for extraction, and Grooper will do the heavy lifting for you, as shown below.

  1. For this example, we'll be looking at the Lexicon titled, Company Names
  2. This Lexicon will be a Lookup Lexicon. Some companies on our documents will have different versions of their names. For example, Dos Mangos is also written as dosMangos. Same name, just written differently. Since Grooper doesn't know that, we'll use the translation Lexicon that is the Lookup to help extract both versions of the company name.
  3. Lookup Lexicons consist of key-value pairs, where the keys are in yellow text and the value in blue; it's a way of telling Grooper that one piece of string data is equivalent to another. "XYZ Company, LLC = XYZ" for instance.





  1. With the Lexicon set up, let's move down to the Value Reader where the extraction of the company names will take place. To keep it simple, we've named it VE - Company Name.
  2. For the Extractor, we've chosen a List Match. However, we won't be using any Local Entries. We'll reference our Lexicon and let it do the work for us!
  3. One thing to note: in order for a Lookup Lexicon to function like a Lookup Lexicon and translate the values given to it, the Translate property MUST be turned to True. Otherwise, the Lookup Lexicon will function like a Vocabulary Lexicon and you won't pick up any of the translated values.





  1. To see the Lexicon in action, navigate to the Tester tab.
  2. Since we've referenced a Lexicon for our Extractor, we have no need of any Local Entries.
  3. Note that we're picking up both versions of this company name: Outskirts Territories Electric Supply and Outskirts Territories Electric.
  4. With the Translate property set to True, instead of each version of the company name being output as it appears on the document, it's been translated to the value we set in the Lexicon.
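
The effect of the Translate property can be approximated with a short sketch. Everything here is illustrative: the `list_match` function, the contents of `company_names`, and the decision to list every written variant as its own key are assumptions, not Grooper's implementation or the actual Company Names Lexicon.

```python
import re


def list_match(text, lookup, translate=True):
    """Return each Lexicon key found in `text`, optionally translated."""
    # Try longer keys first so the longer company name wins over its prefix.
    keys = sorted(lookup, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    results = []
    for match in pattern.finditer(text):
        found = match.group(0)
        results.append(lookup[found] if translate else found)
    return results


# Hypothetical Lookup entries: each written variant maps to one output value.
company_names = {
    "Outskirts Territories Electric Supply": "Outskirts Territories Electric",
    "Outskirts Territories Electric": "Outskirts Territories Electric",
    "dosMangos": "Dos Mangos",
    "Dos Mangos": "Dos Mangos",
}

text = (
    "Ship to Outskirts Territories Electric Supply. "
    "Bill to Outskirts Territories Electric. Vendor: dosMangos"
)
translated = list_match(text, company_names, translate=True)
as_written = list_match(text, company_names, translate=False)
```

With `translate=True`, all three hits come back as the Lexicon's values. With `translate=False`, the keys are returned exactly as they appear in the text, which is the Vocabulary-like behavior described in the steps above.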


Glossary

Activity: Grooper Activities define specific document processing operations done to a Batch, Batch Folder, or Batch Page. In a Batch Process, each Batch Process Step executes a single Activity (determined by the step's "Activity" property).

  • Batch Process Steps are frequently referred to by the name of their configured Activity followed by the word "step". For example: "Classify step".

Batch Process Step: Batch Process Steps are specific actions within a Batch Process sequence. Each Batch Process Step performs an "Activity" specific to some document processing task. Each Activity is either a "Code Activity" or a "Review" activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Process Steps are frequently referred to as simply "steps".
  • Because a single Batch Process Step executes a single Activity configuration, they are often referred to by their referenced Activity as well. For example, a "Recognize step".

Batch Process: Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a Batch. Each step consists of a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps, which are added as child nodes.
  • A Batch Process is often referred to as simply a "process".

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

Classify: Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Document Type.

  • Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
  • Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Correct: Correct is an Activity that performs spell correction. It can correct a Batch Folder's text content or specific Data Element values to resolve OCR errors, deidentify data, or otherwise enhance text data.

Data Connection: Data Connections connect Grooper to Microsoft SQL and supported ODBC databases. Once configured, Data Connections can be used to export data extracted from a document to a database, perform database lookups to validate data Grooper collects, and perform other actions related to database management systems (DBMS).

  • Grooper supports MS SQL Server connectivity with the "SQL Server" connection method.
  • Grooper supports Oracle, PostgreSQL, Db2, and MySQL connectivity with the "ODBC" connection method.

Data Extraction: Data Extraction involves identifying and capturing specific information from documents (represented by Batch Folders in Grooper). Extraction is performed by configurable Data Extractors, which transform unstructured or semi-structured data into a structured, usable format for processing and analysis.

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Lexicon: Lexicons are dictionaries used throughout Grooper to store lists of words, phrases, weightings for Fuzzy RegEx, and more. Users can add entries to a Lexicon, Lexicons can import entries from other Lexicons by referencing them, and entries can be dynamically imported from a database using a Data Connection. Lexicons are commonly used to aid in data extraction, most often through the "List Match" and "Word Match" extractors.

List Match: List Match is a Value Extractor designed to return values matching one or more items in a defined list. By default, the List Match extractor does not use or require regular expression, but can be configured to utilize regular expression syntax.

Lookup: A Lookup Specification defines a "lookup operation", where existing Grooper fields (called "lookup fields") are used to query an external data source, such as a database. The results of the lookup can be used to validate or populate field values (called "target fields") in Grooper. Lookup Specifications are created on "container elements" (Data Models, Data Sections and Data Tables) using their Lookups property. Lookups may query using all single-instance fields relative to the container element (including those defined on parent elements up to the root Data Model), but cannot be used to populate a field value on a parent of the container element.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized so that it can be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.

Pattern Match: Pattern Match is a Value Extractor that extracts values from a document that match a specified regular expression, providing data collection following a known format or pattern.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as Content Models, Batch Processes, and profile objects are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Scope: The Scope property of a Batch Process Step, as it relates to an Activity, determines at which level in a Batch hierarchy the Activity runs.

Value Reader: Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions.)

  • Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.