Training-Based Approaches to Document Classification (Concept)
A training-based approach classifies an unclassified document according to its similarity to trained examples of a given kind of document (a Document Type, in Grooper's terms).
There are two training-based approaches in Grooper.
- Lexical
  - This classification method trains on the text features (words and phrases) of example documents.
Document samples are trained as examples of a Document Type of a Content Model. Training occurs via user-supervised machine learning using the TF-IDF algorithm. A Data Extractor set on the Text Feature Extractor property returns words, phrases, or other results from the extractor, providing possible identifiers used to classify a document. These identifiers are called "features." Document training uses TF-IDF to assign weightings to those features. During classification, Grooper compares the weighting lists of the trained Document Types against the features extracted by the Text Feature Extractor from the document being classified. The document is assigned a percentage similarity score for each possible Document Type match, and whichever Document Type has the highest similarity is used to classify the document.
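To illustrate the mechanics described above, the sketch below builds TF-IDF weightings from a few hypothetical trained samples and classifies a new document by its highest similarity score. It is a minimal Python example using scikit-learn, not Grooper's implementation: the Document Type names, sample texts, the classify helper, and the use of cosine similarity as the comparison measure are all illustrative assumptions.

<syntaxhighlight lang="python">
# Minimal sketch of TF-IDF based document classification, analogous to
# (but not identical to) Grooper's Lexical classification method.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical trained examples, keyed by Document Type name.
training_samples = {
    "Invoice": "invoice number due date amount due remit payment total",
    "Purchase Order": "purchase order number ship to vendor quantity unit price",
}

# Stand-in for the Text Feature Extractor: the vectorizer tokenizes each
# sample into word features and assigns TF-IDF weightings to them.
vectorizer = TfidfVectorizer()
doc_types = list(training_samples)
trained_weightings = vectorizer.fit_transform(training_samples.values())

def classify(document_text):
    """Return (best Document Type, similarity score) for an unclassified document."""
    # Extract features from the new document using the trained vocabulary.
    features = vectorizer.transform([document_text])
    # Compare the document's features against each trained Document Type.
    scores = cosine_similarity(features, trained_weightings)[0]
    best = scores.argmax()
    return doc_types[best], scores[best]

doc_type, score = classify("please remit the amount due on invoice 1234 by the due date")
print(f"Classified as {doc_type} with similarity {score:.0%}")
</syntaxhighlight>

In this sketch the document is assigned to the Document Type with the highest similarity score, mirroring the classification step described above; Grooper's internal similarity calculation and thresholds may differ.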