Training-Based Approaches to Document Classification (Concept): Difference between revisions

Revision as of 13:28, 10 May 2024

"Training-Based Approaches to Document Classification" refers to Grooper Classify Methods that classify Batch Folders using document examples trained for each Document Type. The Classify activity then assigns each unclassified Batch Folder a Document Type based on how similar the folder is to that Document Type's training data.

There are two training-based approaches in Grooper.

Lexical

This classification method trains on the text features (words and phrases) of example documents. Document samples are trained as examples of a Document Type in a Content Model. Training occurs via user-supervised machine learning using the TF-IDF algorithm. The Lexical method's Text Feature Extractor returns words, phrases or other results that serve as possible identifiers for classifying a document. These identifiers (the words, phrases or other results returned by the Data Extractor used) are called "features." Document training uses TF-IDF to assign weightings to those features. During classification, Grooper compares the weightings list of each trained Document Type to the text features found on the document being classified. The document then receives a percentage similarity score for each candidate Document Type. The Document Type with the highest percentage similarity is used to classify the document.
Note: This is the most common method. It is so common that "training-based approach" and "Lexical classification" are often used interchangeably.
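
The train-then-score flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Grooper's implementation: the tokenizer stands in for a Text Feature Extractor, the smoothed IDF and the cosine similarity scaled to a percentage are assumed choices, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return lowercase word features (a stand-in for a Text Feature Extractor)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def train(samples):
    """Build TF-IDF weightings per Document Type.

    samples maps a Document Type name to a list of example texts.
    Each type's combined examples are treated as one "document" for IDF.
    """
    counts = {
        name: Counter(tok for text in texts for tok in tokenize(text))
        for name, texts in samples.items()
    }
    n_types = len(counts)
    # Number of Document Types in which each feature appears.
    df = Counter(tok for c in counts.values() for tok in c)
    # Smoothed IDF (an assumption) so no feature gets a zero or negative weight.
    idf = {tok: math.log((1 + n_types) / (1 + n)) + 1.0 for tok, n in df.items()}
    profiles = {
        name: {tok: tf * idf[tok] for tok, tf in c.items()}
        for name, c in counts.items()
    }
    return idf, profiles


def cosine(a, b):
    """Cosine similarity between two sparse weight dictionaries."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, idf, profiles):
    """Score the text against every trained type; return (best type, all scores)."""
    tf = Counter(tokenize(text))
    # Features never seen in training carry no weight.
    vec = {tok: n * idf[tok] for tok, n in tf.items() if tok in idf}
    scores = {name: 100.0 * cosine(vec, prof) for name, prof in profiles.items()}
    return max(scores, key=scores.get), scores


# Made-up training examples for two hypothetical Document Types.
samples = {
    "Invoice": ["invoice number total amount due", "invoice date amount due remit to"],
    "Contract": ["this agreement is made between the parties",
                 "agreement terms and conditions parties"],
}
idf, profiles = train(samples)
best, scores = classify("amount due on this invoice", idf, profiles)
print(best, scores)
```

The highest-scoring type wins, mirroring the "highest percentage similarity" rule in the text. A real system would typically also apply a minimum similarity threshold before accepting a match.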

Visual

The Visual classification method uses image data instead of text data to determine the Document Type. Instead of using text extractors, an IP Profile is configured with an Extract Features command to collect data about a document's image. Document samples are trained as examples of a Document Type.
Note: While this method is much less commonly used, it is still technically a training-based approach to classification.
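
To make the idea of image-based training concrete, here is a minimal Python sketch. It makes no claim about what the Extract Features command actually computes: the grid-of-mean-intensities feature, the nearest-centroid comparison, and all function names are assumptions chosen for illustration. Images are plain lists of grayscale pixel rows.

```python
import math


def grid_features(image, rows=2, cols=2):
    """Mean pixel intensity for each cell of a rows x cols grid.

    image is a list of equal-length rows of 0-255 grayscale values and must be
    at least rows pixels tall and cols pixels wide.
    """
    h, w = len(image), len(image[0])
    feats = []
    for r in range(rows):
        for c in range(cols):
            cell = [
                image[y][x]
                for y in range(r * h // rows, (r + 1) * h // rows)
                for x in range(c * w // cols, (c + 1) * w // cols)
            ]
            feats.append(sum(cell) / len(cell))
    return feats


def train_visual(samples):
    """Average the feature vectors of each Document Type's example images."""
    centroids = {}
    for name, images in samples.items():
        vectors = [grid_features(img) for img in images]
        centroids[name] = [sum(col) / len(col) for col in zip(*vectors)]
    return centroids


def classify_visual(image, centroids):
    """Return the Document Type whose centroid is closest to the image's features."""
    feats = grid_features(image)
    return min(centroids, key=lambda name: math.dist(feats, centroids[name]))


# Two made-up layouts: a dark band across the top vs. a dark band down the left.
top_banner = [[0, 0, 0, 0], [0, 0, 0, 0], [255, 255, 255, 255], [255, 255, 255, 255]]
left_margin = [[0, 0, 255, 255]] * 4
centroids = train_visual({"TopBanner": [top_banner], "LeftMargin": [left_margin]})

unknown = [[10, 10, 10, 10], [10, 10, 10, 10], [250, 250, 250, 250], [250, 250, 250, 250]]
print(classify_visual(unknown, centroids))
```

As with the Lexical method, the structure is the same: features are collected from trained examples of each Document Type, and a new document is assigned to the type whose training data it most resembles.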

Glossary

Batch Folder: The Batch Folder is an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or Batch Page nodes as children.

Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Data Extractor: Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Document Type: Document Type nodes each represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

IP Profile: An IP Profile is a step-by-step list of image processing operations (IP Commands). IP Profiles are used for several image processing related operations, but primarily for:

Permanently enhancing an image during the Image Processing activity (usually to get rid of defects in a scanned image, such as skewing or borders).
Cleaning up an image in-memory during the Recognize activity to improve OCR accuracy without altering the stored image.
Computer vision operations that collect layout data (table line locations, OMR checkboxes, barcode values and more) utilized in data extraction.

Lexical: "Lexical" is a Classify Method that classifies Batch Folders based on the text content of trained document examples. This is achieved through the statistical analysis of word frequencies that identify Document Types.

TF-IDF: TF-IDF stands for term frequency-inverse document frequency. It is a statistical calculation intended to reflect how important a word is to a document within a document set (or "corpus"). It is how Grooper uses machine learning for training-based document classification (via the Lexical method) and data extraction (via the Field Class extractor).
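
For reference, the common textbook form of the calculation is shown below. The source does not state which variant Grooper uses, so treat this as a general illustration only. Here tf(t, d) is the number of times term t occurs in document d, N is the number of documents in the corpus, and df(t) is the number of documents that contain t.

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

A term that is frequent in one document but rare across the corpus gets a high weight, while a term that appears in every document gets an IDF of log(1) = 0. Many implementations add smoothing terms to avoid zero weights and division by zero.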

Training-Based Approaches to Document Classification: "Training-Based Approaches to Document Classification" refers to Grooper Classify Methods that classify Batch Folders using document examples trained for each Document Type. The Classify activity then assigns each unclassified Batch Folder a Document Type based on how similar the folder is to that Document Type's training data.

Visual: "Visual" is a Classify Method that uses image analysis instead of text data to determine the Document Type assigned to a Batch Folder during classification. Instead of using text-based extractors, an "Extract Features" IP Command in an IP Profile is used to collect image-based data from a Batch Folder's image(s). This image-based data is compared against that of previously trained document examples of each Document Type to classify the Batch Folder.

@@ Line 10: / Line 10: @@
 * The '''''Visual''''' classification method uses image data instead of text data to determine the '''Document Type'''.  Instead of using text extractors, an '''[[IP Profile]]''' will be set with an '''Extract Features''' command to get data pertaining to a document's image.  Document samples are trained as examples of a '''Document Type'''.
 * ''Note: While this is a much less commonly used method, it is still technically a training based approach to classification.''
+== Glossary ==
+<u><big>'''Batch Folder'''</big></u>: {{#lst:Glossary|Batch Folder}}
+<u><big>'''Classification'''</big></u>: {{#lst:Glossary|Classification}}
+<u><big>'''Content Model'''</big></u>: {{#lst:Glossary|Content Model}}
+<u><big>'''Data Extractor'''</big></u>: {{#lst:Glossary|Data Extractor}}
+<u><big>'''Document Type'''</big></u>: {{#lst:Glossary|Document Type}}
+<u><big>'''Extract'''</big></u>: {{#lst:Glossary|Extract}}
+<u><big>'''IP Profile'''</big></u>: {{#lst:Glossary|IP Profile}}
+<u><big>'''Lexical'''</big></u>: {{#lst:Glossary|Lexical}}
+<u><big>'''TF-IDF'''</big></u>: {{#lst:Glossary|TF-IDF}}
+<u><big>'''Training-Based Approaches to Document Classification'''</big></u>: {{#lst:Glossary|Training-Based Approaches to Document Classification}}
+<u><big>'''Visual'''</big></u>: {{#lst:Glossary|Visual}}