Field Class (Node Type)
STUB: This article is a stub. It contains minimal information on the topic and should be expanded.
Field Classes are NLP (natural language processing) based extractor nodes. They find a value based on the natural language context near it. Training the extractor associates values, positively or negatively, with nearby text-based "features". During extraction, the extractor collects values based on these training weightings.
- Field Classes are most useful when attempting to find values within the flow of natural language.
- Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler extractor nodes like Value Readers or Data Types.
- Advances in large language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with far less setup.
To find values this way, Field Classes use two Data Extractors:
- A Value Extractor
- A Feature Extractor
The Value Extractor finds the kind of value the Field Class should output. It can return multiple possible values (candidates). To find the context that differentiates the right candidate from the wrong ones, the Feature Extractor is written to return words, phrases or other labels that can identify the value in question. During training, the correct value is marked as a positive candidate in the list of value candidates. The features around it, as returned by the Feature Extractor, are given positive weightings using a TF-IDF algorithm. On other documents, the extractor uses these feature weightings to identify the correct value.
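The interplay between the two extractors and the training weightings can be illustrated with a small Python sketch. This is not Grooper's implementation or API. The regular expressions, the "same line" notion of nearby context, the function names and the exact TF-IDF weighting scheme are all assumptions chosen to keep the example short.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for the two extractors (not Grooper's API).
VALUE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")  # "Value Extractor": every date is a candidate
FEATURE_PATTERN = re.compile(r"[a-z]+")           # "Feature Extractor": words near the candidate


def candidates(text):
    """Return (value, features) for each candidate; features are words on the same line (assumed)."""
    results = []
    for line in text.splitlines():
        for match in VALUE_PATTERN.finditer(line):
            context = line[:match.start()] + " " + line[match.end():]
            results.append((match.group(), FEATURE_PATTERN.findall(context.lower())))
    return results


def train(labeled):
    """labeled: list of (features, is_correct). Returns a feature -> weight mapping."""
    n = len(labeled)
    doc_freq = Counter()
    for features, _ in labeled:
        doc_freq.update(set(features))  # number of candidate contexts containing each feature
    weights = Counter()
    for features, is_correct in labeled:
        sign = 1.0 if is_correct else -1.0  # positive vs. negative candidate
        for feature, count in Counter(features).items():
            tf = count / len(features)
            idf = math.log(n / doc_freq[feature])  # 0 for features shared by every context
            weights[feature] += sign * tf * idf
    return weights


def best_value(text, weights):
    """Score each candidate by the summed weights of its features and return the top value."""
    scored = [(sum(weights.get(f, 0.0) for f in feats), value)
              for value, feats in candidates(text)]
    return max(scored)[1] if scored else None


# Training: the agreement date is marked as the positive candidate.
training_text = ("This agreement is effective on 01/15/2020.\n"
                 "The borrower was born on 03/22/1985.")
labeled = [(feats, value == "01/15/2020") for value, feats in candidates(training_text)]
weights = train(labeled)

# Extraction on an unseen document.
new_text = ("Borrower was born on 07/04/1990.\n"
            "This agreement shall be effective on 02/01/2023.")
print(best_value(new_text, weights))  # 02/01/2023
```

In this sketch, features such as "agreement" and "effective" receive positive weights, features such as "born" receive negative weights, and "on" (which appears next to both candidates) receives a weight of zero because it does not help tell them apart.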
As with any extractor, data context can be critical to understanding your documents and building the Field Class extractor. For more information on this topic, visit the Data Context article.
Glossary
Data Context: Data Context refers to contextual information used to extract data, such as a label that identifies the value you want to collect.
Data Extractor: Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).
Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.
Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.
Field Class: Field Classes are NLP (natural language processing) based extractor nodes. They find a value based on the natural language context near it. Training the extractor associates values, positively or negatively, with nearby text-based "features". During extraction, the extractor collects values based on these training weightings.
- Field Classes are most useful when attempting to find values within the flow of natural language.
- Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler extractor nodes like Value Readers or Data Types.
- Advances in large language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with far less setup.
TF-IDF: TF-IDF stands for term frequency-inverse document frequency. It is a statistical calculation intended to reflect how important a word is to a document within a document set (or "corpus"). Grooper uses it as the basis of its machine learning for training-based document classification (via the Lexical method) and data extraction (via the Field Class extractor).
Value Reader: Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions.)
- Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.