Training-Based Approaches to Document Classification (Concept): Difference between revisions

Latest revision as of 09:10, 25 June 2025

"Training-Based Approaches to Document Classification" refers to Grooper Classify Methods that classify Batch Folders using example documents for each Document Type. The Classify activity then assigns each unclassified Batch Folder a Document Type based on how similar it is to that Document Type's training data.

There are two training-based approaches in Grooper.

Lexical

This classification method trains on the text features (words and phrases) of example documents. Document samples are trained as examples of a Document Type of a Content Model. Training occurs via user-supervised machine learning using the TF-IDF algorithm. The Lexical method's Text Feature Extractor returns words, phrases, or other results that serve as possible identifiers for classifying a document. These identifiers (the results returned by the Data Extractor used) are called "features." Document training uses TF-IDF to assign a weighting to each feature. During classification, Grooper compares the weightings lists of the trained Document Types to the text features on the document being classified. The document is then assigned a percentage similarity score for each possible Document Type match, and the Document Type with the highest percentage similarity is used to classify the document.
Note: This is the most common method. It is so common that "training-based approach" and "Lexical classification" are often used interchangeably.
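
The general idea behind the Lexical method can be illustrated with a minimal Python sketch. This is not Grooper's implementation: the function names (`tokenize`, `train`, `classify`), the whitespace tokenizer standing in for a Text Feature Extractor, the smoothed IDF formula, and the use of cosine similarity to produce the percentage score are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical stand-in for a text feature extractor: lowercase words only.
    return text.lower().split()


def train(samples):
    """samples maps a document type name to a list of example texts.

    Returns (weights, idf): per-type TF-IDF feature weightings and the IDF table.
    """
    # Pool each type's examples so every type has one feature count table.
    counts = {
        doc_type: Counter(tok for text in texts for tok in tokenize(text))
        for doc_type, texts in samples.items()
    }
    # Document frequency: in how many types does each feature appear?
    df = Counter()
    for c in counts.values():
        df.update(c.keys())
    n_types = len(counts)
    # Smoothed IDF (an assumed variant); features shared by all types weigh least.
    idf = {f: math.log((1 + n_types) / (1 + d)) + 1.0 for f, d in df.items()}
    weights = {}
    for doc_type, c in counts.items():
        total = sum(c.values())
        weights[doc_type] = {f: (n / total) * idf[f] for f, n in c.items()}
    return weights, idf


def cosine(a, b):
    # Cosine similarity between two sparse feature-weight dictionaries.
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, weights, idf):
    # Weight the new document's known features, then score it against each type.
    c = Counter(tok for tok in tokenize(text) if tok in idf)
    total = sum(c.values())
    doc = {f: (n / total) * idf[f] for f, n in c.items()} if total else {}
    scores = {t: 100.0 * cosine(doc, w) for t, w in weights.items()}
    best = max(scores, key=scores.get)  # highest percentage similarity wins
    return best, scores


# Toy training data, invented for this example.
samples = {
    "Invoice": ["invoice number total amount due", "invoice total due date"],
    "Resume": ["work experience education skills", "education skills references"],
}
weights, idf = train(samples)
best, scores = classify("total amount due on this invoice", weights, idf)
print(best, {t: round(s, 1) for t, s in scores.items()})
```

In this toy data the test document shares features only with the "Invoice" examples, so "Invoice" receives the highest score and is chosen, mirroring the "highest percentage similarity" rule described above.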

Visual

The Visual classification method uses image data instead of text data to determine the Document Type. Rather than text extractors, it uses an IP Profile configured with an Extract Features command to obtain data about a document's image. Document samples are trained as examples of a Document Type.
Note: Although this method is much less commonly used, it is still a training-based approach to classification.

@@ Line 1: / Line 1: @@
-<section begin="glossary" />
+<blockquote>{{#lst:Glossary|Training-Based Approaches to Document Classification}}</blockquote>
-<blockquote>
-A training based approach to document [[Classification|classification]] classifies documents according to the similarity of unclassified documents to trained examples of that kind of document (or '''[[Document Type]]''' from Grooper's perspective).
-</blockquote>
-<section end="glossary" />
 There are two training based approaches in Grooper.
-'''''[[Lexical (Classification Method)|Lexical]]'''''
+'''''[[Lexical]]'''''
 * This classification method trains text features (words and phrases) of examples documents.  Document samples are trained as examples of a '''Document Type''' of a '''[[Content Model]]'''.  Training occurs via user supervised machine learning using the [[TF-IDF]] algorithm.  The '''''Lexical''''' method's '''''Text Feature Extractor''''' returns words, phrases or other results to provide possible identifiers used to classify a document.  These identifiers (the words, phrases or other results from the Data Extractor used) are called "features."  Document training uses [[TF-IDF]] to assign weightings to those features.  During classification, Grooper looks at the weightings list for the various trained '''Document Types''' and compares them to the text features on the current document to be classified.  The document is then assigned a percentage similarity score to each possible '''Document Type''' match.  Whichever '''Document Type''' has the highest percentage similarity is used to classify the document.
 * ''Note: This is the most common method.  It is so common "training based approach" and "Lexical classification" are often used interchangeably.''
-'''''[[Visual (Classification Method)|Visual]]'''''
+'''''[[Visual]]'''''
 * The '''''Visual''''' classification method uses image data instead of text data to determine the '''Document Type'''.  Instead of using text extractors, an '''[[IP Profile]]''' will be set with an '''Extract Features''' command to get data pertaining to a document's image.  Document samples are trained as examples of a '''Document Type'''.
 * ''Note: While this is a much less commonly used method, it is still technically a training based approach to classification.''
-[[Category:Articles]]