Lexical (Classify Method)

The Lexical Classification Method classifies documents according to their text content, obtained either from OCR or from native PDF text extraction. It uses a Training-Based Approach: Grooper learns to classify a document from trained examples of each Document Type.

A Text Feature Extractor is configured to extract values from document samples. These values (such as words or phrases) serve as identifying features of the document. Each feature is weighted using the TF-IDF algorithm: the more often a feature appears on a document, the higher its weight, but that weight is reduced when the feature is common to multiple Document Types. During a Classify activity, the features of an unclassified document are compared to the weighted features of the trained Document Types, and the document is assigned the Document Type it is most similar to.
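
The weighting and comparison described above can be illustrated with a minimal Python sketch. It is not Grooper's implementation. The function names (extract_features, train, classify) are hypothetical, a plain word tokenizer stands in for the Text Feature Extractor, and the smoothed IDF formula and cosine similarity are assumptions, since the exact formula and similarity measure are not specified here.

```python
import math
import re
from collections import Counter


def extract_features(text):
    # Hypothetical stand-in for a Text Feature Extractor: lowercase word tokens.
    return re.findall(r"[a-z]+", text.lower())


def train(samples_by_type):
    # Term frequency per Document Type, pooled over that type's training samples.
    tf = {
        doc_type: Counter(f for sample in samples for f in extract_features(sample))
        for doc_type, samples in samples_by_type.items()
    }
    n_types = len(tf)
    # Document frequency: how many Document Types contain each feature.
    df = Counter(f for counts in tf.values() for f in counts)
    # Smoothed IDF (an assumption): features shared by many types weigh less.
    idf = {f: math.log((1 + n_types) / (1 + d)) + 1 for f, d in df.items()}
    weights = {
        doc_type: {f: count * idf[f] for f, count in counts.items()}
        for doc_type, counts in tf.items()
    }
    return weights, idf


def cosine(a, b):
    # Cosine similarity between two sparse feature-weight dictionaries.
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, weights, idf):
    counts = Counter(extract_features(text))
    # Features never seen in training have no IDF and are ignored.
    vector = {f: count * idf[f] for f, count in counts.items() if f in idf}
    scores = {doc_type: cosine(vector, w) for doc_type, w in weights.items()}
    # Assign the Document Type with the highest similarity score.
    return max(scores, key=scores.get), scores
```

In this sketch, a word that appears in the samples of only one Document Type receives a higher IDF than a word shared by every type, so distinctive vocabulary dominates the similarity score.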

A Rules-Based Approach can also be combined with the training-based approach by setting a positive extractor on the Document Type. If the extractor yields a result, the document is classified as that type without being compared to training examples. This is useful when you know that an extractable value (such as a header title) will appear on documents of a given Document Type: the positive extractor classifies those documents directly. If the extractor fails for whatever reason, the training data acts as a backup means of classification.
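
The order of operations in this combined approach can be sketched in Python as follows. This is an illustration only, not Grooper's API. The function classify_with_rules, the use of a regular expression as the positive extractor, and the trained_classifier callable are all hypothetical. How multiple matching extractors would be resolved is not covered above; the sketch simply takes the first match.

```python
import re


def classify_with_rules(text, positive_extractors, trained_classifier):
    # Rules first: a positive extractor hit classifies the document outright,
    # without consulting the training data.
    # Assumption: the first matching Document Type wins.
    for doc_type, pattern in positive_extractors.items():
        if re.search(pattern, text, flags=re.IGNORECASE):
            return doc_type, "positive extractor"
    # No extractor produced a result: fall back to the training-based comparison.
    return trained_classifier(text), "training"
```

Here positive_extractors maps each Document Type name to a pattern, for example one that matches a header title, and trained_classifier is any function that returns a Document Type name from the trained examples.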