2.72:What is Classification - DSmith: Difference between revisions
Revision as of 15:32, 10 January 2024
Overview
Classification is an Activity in Grooper that assigns a Content Type to a Document. While we as humans may be able to classify a document by reading it (or its title, should it have one), to Grooper all incoming documents are unclassified, or "blank". If we want Grooper to know what a Purchase Order is, or to tell the difference between a Purchase Order and an Invoice, we have to tell it, and we do that through Classification.
Classification Methods
In order to classify a document, you must choose between four different Classification Methods. They are:
- Rules-Based
- Labelset-Based
- Lexical
- Visual
These methods can be set on the Content Model via the Classification Method property. Which method you choose depends largely on the kind of documents you have: their structure, their complexity, and so on. We will provide a brief overview of each Classification Method here.
Rules-Based
Rules-Based Classification works by using classification rules set up on a Document Type. What exactly are these rules? Whatever you tell Grooper they are. To elaborate, let's say you have a Batch of documents that consists of Invoices, Purchase Orders, and Guest Speaker Agreements. A human would be able to tell these documents apart simply by reading their titles. To Grooper, they are all just documents covered in pixels, unless we tell Grooper how to distinguish the three types. We do this by setting up a Document Type for each of them (Invoice, PO, Guest Speaker Agreement) and telling Grooper what it needs to look for on each one. For example, you can tell Grooper via a Positive Extractor on the Document Type that an Invoice will have a label such as an Invoice Number, a Purchase Order will have a Purchase Order Number, and a Guest Speaker Agreement will be titled as such.
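To make the idea of per-Document-Type rules concrete, here is a minimal Python sketch. It is an illustration only, not how Grooper is configured or implemented: the `RULES` table, its regular expressions, and the `classify_by_rules` function are hypothetical names invented for this example, each pattern standing in for the Positive Extractor you would set on a Document Type.

```python
import re

# Hypothetical rules: one pattern per Document Type, standing in for a Positive Extractor.
RULES = {
    "Invoice": re.compile(r"\binvoice\s+(number|no\.?|#)", re.IGNORECASE),
    "Purchase Order": re.compile(r"\bpurchase\s+order\s+(number|no\.?|#)", re.IGNORECASE),
    "Guest Speaker Agreement": re.compile(r"\bguest\s+speaker\s+agreement\b", re.IGNORECASE),
}

def classify_by_rules(text):
    """Return the first Document Type whose rule matches, or None if no rule matches."""
    for doc_type, pattern in RULES.items():
        if pattern.search(text):
            return doc_type
    return None

print(classify_by_rules("Invoice Number: 1001"))
print(classify_by_rules("Guest Speaker Agreement between the parties"))
```

A document that matches none of the rules stays unclassified (`None` here), which mirrors the point above: Grooper only knows what we tell it.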
Labelset-Based
Labels are a staple of semi-structured documents. Labels can be used to help identify various pieces of information on a document. We can then use those labels for Labelset-Based Classification to help Grooper classify documents. To continue with our previous example, you would naturally expect purchase orders and invoices to have different labels with which they organize their content.
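As a rough illustration of classifying by labels, the Python sketch below scores each Document Type by the share of its expected labels found in the text. Everything in it is an assumption made for the example: the `LABEL_SETS` contents, the `min_score` threshold, and the `classify_by_labels` function are hypothetical and do not describe Grooper's actual Labelset-Based method.

```python
# Hypothetical label sets; real labels would be collected from sample documents.
LABEL_SETS = {
    "Invoice": {"invoice number", "invoice date", "amount due"},
    "Purchase Order": {"purchase order number", "ship to", "order date"},
}

def classify_by_labels(text, min_score=0.5):
    """Pick the Document Type with the largest share of its labels present in the text."""
    lowered = text.lower()
    best_type, best_score = None, 0.0
    for doc_type, labels in LABEL_SETS.items():
        found = sum(1 for label in labels if label in lowered)
        score = found / len(labels)
        if score > best_score:
            best_type, best_score = doc_type, score
    # Below the (assumed) threshold, leave the document unclassified.
    return best_type if best_score >= min_score else None

print(classify_by_labels("Purchase Order Number: 5\nShip To: Main Warehouse"))
```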
Lexical
Unfortunately, not every document is structured, or has labels to help out both humans and Grooper with classification. This is where Lexical Classification comes in. Here, we have some academic papers on statistics.
To a human eye, they have structure, but not in a way that Grooper can make use of with Visual or Rules-Based Classification, and there are certainly no labels for Labelset-Based Classification to make use of. Thus, we can use Lexical Classification. For instance, we might have Grooper look for recurring words or phrases, like "statistics" or "statistical method".
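The idea of looking for recurring words can be sketched in a few lines of Python. This is a simplified word-counting illustration, not Grooper's Lexical Classification algorithm: the `VOCABULARY` table and the `classify_lexically` function are hypothetical, and a real vocabulary would come from sample documents rather than being typed in by hand.

```python
import re
from collections import Counter

# Hypothetical vocabulary per Document Type, invented for this example.
VOCABULARY = {
    "Statistics Paper": {"statistics", "statistical", "variance", "regression"},
    "Legal Document": {"hereby", "plaintiff", "agreement", "court"},
}

def classify_lexically(text):
    """Return the Document Type whose vocabulary occurs most often, or None if none occurs."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        doc_type: sum(counts[word] for word in words)
        for doc_type, words in VOCABULARY.items()
    }
    best_type = max(scores, key=scores.get)
    return best_type if scores[best_type] > 0 else None

print(classify_lexically("A statistical method for regression in applied statistics"))
```

Note that no title or label is needed here: the classification comes entirely from which words the document uses and how often.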
Visual
Visual Classification is different. Unlike the previous three methods mentioned here, Visual Classification relies upon the structure of the document itself rather than the language present on the document. Take a look at these two documents here. Instead of focusing on labels, titles, or a piece of recurring text, we can have Grooper concentrate on how the pixels are grouped together and classify documents that way.
We can tell Grooper that documents structured like this are invoices.
And documents structured like this are legal documents.
Unfortunately, if our documents are similar in structure, then Grooper will have difficulty classifying them, and may even classify them as the same document. Such is the downside to Visual Classification.
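To illustrate classifying by how pixels are grouped rather than by text, the Python sketch below summarizes a page as a coarse grid of ink densities and picks the sample page with the closest grid. It is a toy model under stated assumptions, not Grooper's Visual Classification: pages are assumed to be small binary images (lists of rows of 0/1 pixels, at least as large as the grid), and `ink_grid`, `layout_distance`, and `classify_visually` are hypothetical names.

```python
def ink_grid(page, rows=2, cols=2):
    """Summarize a binary page (list of rows of 0/1 pixels) as ink density per grid cell."""
    height, width = len(page), len(page[0])
    grid = []
    for r in range(rows):
        for c in range(cols):
            cell = [
                page[y][x]
                for y in range(r * height // rows, (r + 1) * height // rows)
                for x in range(c * width // cols, (c + 1) * width // cols)
            ]
            grid.append(sum(cell) / len(cell))
    return grid

def layout_distance(page_a, page_b):
    """Sum of absolute differences between the two pages' ink grids (0 means the grids match)."""
    return sum(abs(a - b) for a, b in zip(ink_grid(page_a), ink_grid(page_b)))

def classify_visually(page, samples):
    """Return the label of the sample page whose ink grid is closest to the given page."""
    return min(samples, key=lambda label: layout_distance(page, samples[label]))

# Tiny 4x4 "pages": the invoice sample has ink in the top half, the legal sample in the left half.
samples = {
    "Invoice": [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]],
    "Legal Document": [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 0]],
}
page = [[1, 1, 1, 1], [1, 1, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(classify_visually(page, samples))
```

The sketch also shows the weakness described above: two Document Types with nearly identical layouts produce nearly identical grids, so the distance between them is small and they are easily confused.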

