2.72:Classification Mockup - RP
Classification is the process of assigning a Document Type (or other Content Type) to a document in a Batch. A document must be assigned a Document Type for Grooper to know what to do with the document.
About
Let's revisit the first three of the five phases of Grooper.
- Acquire
- This involves bringing a Batch into Grooper. Usually, documents are scanned into Grooper, and the initial Batch looks like one long document made up of individual pages.
- Condition
- This involves running Recognize and OCR on the Batch to allow Grooper to read the text and clean up the document if needed.
- Organize
- This is where you separate the documents in the Batch into individual folders.
- After the documents have been separated, they need to go through Classification.
Computer programs have no sense of intuition. Unless we tell Grooper that an invoice is an invoice, it won't know the difference between one document and another. This becomes a problem if we want to extract information from two different types of documents contained within the same Batch. If we want to extract names from a patient intake form and dollar amounts from an Explanation of Benefits (EOB) document that are in the same Batch, we have to tell Grooper which document is which so it extracts the correct information.
We assign a Content Type, such as a Document Type, to each document so Grooper knows that if a document has been assigned X Document Type then it needs to do Y with it. The process of assigning the Document Type to a document is called Classification.
You can assign a Document Type to documents manually, or you can have Grooper automate the process. There are four Classification Methods for automating Classification:
- Rules-Based Classification
- Lexical Classification
- Visual Classification
- Labelset-Based Classification
Which method you use generally depends on the type of documents you have within your Batch. Some methods are better suited to structured documents like invoices or EOBs than to unstructured documents like letters or leases. Let's go through each Classification Method individually.
⚠ Before you can start configuring classification, you need to set the Classification Method on the Content Model. If this isn't set, Grooper won't know which method of classification to use on the Batch.
Rules-Based Classification
FYI: Rules-Based Classification works best on structured or semi-structured documents. For unstructured documents, it might be more advantageous to use Lexical Classification or a mixture of both Rules-Based and Lexical Classification Methods.
How do you tell what a document is? You might notice the document has a specific title or certain wording that is specific to that type of document. For example, you might expect to find an "Invoice Date" label on an invoice, but not on an Explanation of Benefits form. On a Federal W-4, you might actually see "W-4" listed as a title of the document.
You can tell Grooper to classify any document that has an "Invoice Date" label as an invoice, or to classify a document as a W-4 if "W-4" appears on it. We do this by setting a Positive Extractor on each Document Type. If the Positive Extractor returns at least one result, the document will be classified as that Document Type.
What if the Positive Extractor returns a result on two or more different kinds of documents, and there isn't a better expression to use for your extractor? You can add a Negative Extractor to tell Grooper which documents should not be classified as that Document Type. Let's say that we have two W-4 documents, but one is a Federal W-4 and the other is an Iowa W-4. For the Federal W-4, we might set the Positive Extractor to capture the expression "W-4". We might then set the Negative Extractor to "Iowa" so Grooper knows that if the word "Iowa" appears on the document, then it should not be classified as a Federal W-4.
Using a combination of Positive Extractors and Negative Extractors, you can generally classify structured or semi-structured documents well.
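To make the Positive Extractor and Negative Extractor logic concrete, here is a minimal Python sketch of the idea. It is not Grooper's implementation. The `DocumentTypeRule` class, the `classify_by_rules` function, and the regular expressions are all hypothetical and chosen only to mirror the W-4 example above; the "Iowa" pattern for the Iowa W-4 is an assumption.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DocumentTypeRule:
    """Hypothetical stand-in for a Document Type with its two extractors."""
    name: str
    positive: str                   # pattern that must match at least once
    negative: Optional[str] = None  # pattern that must NOT match


def classify_by_rules(text: str, rules: List[DocumentTypeRule]) -> Optional[str]:
    """Return the first Document Type whose positive pattern matches
    and whose negative pattern (if any) does not."""
    for rule in rules:
        if not re.search(rule.positive, text, re.IGNORECASE):
            continue  # the positive pattern returned no result
        if rule.negative and re.search(rule.negative, text, re.IGNORECASE):
            continue  # the negative pattern vetoes this Document Type
        return rule.name
    return None  # no rule applied, so the document stays unclassified


# Illustrative rules only; the patterns are assumptions, not Grooper settings.
RULES = [
    DocumentTypeRule("Federal W-4", positive=r"\bW-4\b", negative=r"\bIowa\b"),
    DocumentTypeRule("Iowa W-4", positive=r"\bIowa\b"),
    DocumentTypeRule("Invoice", positive=r"Invoice Date"),
]

print(classify_by_rules("Form W-4 Employee's Withholding Certificate", RULES))
print(classify_by_rules("Iowa W-4 Employee Withholding Allowance Certificate", RULES))
```

In this sketch, the Iowa form matches the Federal W-4 positive pattern too, and only the negative pattern keeps it from being classified as a Federal W-4.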
Lexical Classification
FYI: Lexical Classification can work well for unstructured documents or documents that you have difficulty classifying using the Rules-Based method. You can also combine both Lexical and Rules-Based Classification to improve your results.
While labels or titles on a document can give a good indication of what the document is, we do not always have that information available. This is especially true on unstructured documents. So, how do we tell documents apart in this type of scenario?
Generally, documents, even unstructured documents, have different language in them. You'd be more likely to see the word "oil" or "lease" on an oil and gas lease document than you would on a cover letter for a job. Using word frequency, we can train Grooper to recognize documents as different Document Types.
The algorithm used to train Grooper to classify documents this way is known as Term Frequency-Inverse Document Frequency, or TF-IDF. For more information on how this works, please see our TF-IDF article.
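As a rough illustration of how word frequency can separate Document Types, here is a small, self-contained TF-IDF sketch in Python. It is a simplification under stated assumptions, not Grooper's training algorithm: the sample texts, the function names, the smoothed IDF formula (`log(N / df) + 1`), and the use of cosine similarity to pick the closest Document Type are all choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_model(samples):
    """samples maps a Document Type name to a list of training texts.
    Returns (TF-IDF vector per Document Type, IDF weight per term)."""
    counts = {name: Counter(tokenize(" ".join(texts)))
              for name, texts in samples.items()}
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())  # count each term once per type
    n_types = len(counts)
    # Terms found in fewer Document Types get a larger weight.
    idf = {term: math.log(n_types / df) + 1.0 for term, df in doc_freq.items()}
    vectors = {name: {t: c * idf[t] for t, c in term_counts.items()}
               for name, term_counts in counts.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, vectors, idf):
    """Return the most similar Document Type, or None if nothing overlaps."""
    term_counts = Counter(tokenize(text))
    vec = {t: c * idf[t] for t, c in term_counts.items() if t in idf}
    scores = {name: cosine(vec, v) for name, v in vectors.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Made-up training samples for illustration only.
SAMPLES = {
    "Oil and Gas Lease": [
        "The lessor grants the lessee the right to drill for oil and gas on the leased premises.",
        "This oil and gas lease covers royalty payments to the lessor.",
    ],
    "Cover Letter": [
        "I am writing to apply for the position and have attached my resume.",
        "Thank you for considering my application for the job.",
    ],
}

VECTORS, IDF = build_model(SAMPLES)
print(classify("The lessee shall pay a royalty on all oil produced under this lease.", VECTORS, IDF))
```

Words such as "the" appear in both sets of samples and so carry little weight, while words such as "oil" and "lease" appear in only one and do most of the work.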
Visual Classification
FYI: Visual Classification generally only works for highly structured documents. Documents of the same type need to be visually similar to each other and visually different from other types.
Visual Classification is different from the previous two methods because it does not involve the language of the document. Rather, it involves the structure and overall look of the document. Grooper looks at the concentration of pixels and how they are arranged on a document to determine what the Document Type should be.
We can see that these two documents have significantly different layouts.
Grooper can look at how pixels are grouped on these documents to tell the difference between the two.
However, if two documents look too similar, Grooper may not be able to differentiate between them. This is the downside of Visual Classification. Use this method only if you know your documents differ significantly from one another in their layout.
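The idea of comparing how pixels are concentrated on a page can be sketched as follows. This is only an illustration under stated assumptions, not how Grooper measures visual similarity: pages are represented as small grids of 0 (white) and 1 (dark) pixels, and the `density_grid`, `distance`, and `classify_visually` functions, the 4x4 grid, and the `max_distance` threshold are all hypothetical.

```python
def density_grid(page, rows=4, cols=4):
    """Split a page (list of rows of 0/1 pixels) into rows x cols cells
    and return the fraction of dark pixels in each cell."""
    height, width = len(page), len(page[0])
    grid = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * height // rows, (r + 1) * height // rows
            x0, x1 = c * width // cols, (c + 1) * width // cols
            cell = [page[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            grid.append(sum(cell) / len(cell) if cell else 0.0)
    return grid


def distance(a, b):
    """Mean absolute difference between two density grids (0 = identical)."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def classify_visually(page, samples, max_distance=0.2):
    """Return the Document Type whose sample page looks most like this page,
    or None if even the closest sample is too different."""
    page_grid = density_grid(page)
    scores = {name: distance(page_grid, density_grid(sample))
              for name, sample in samples.items()}
    best = min(scores, key=scores.get)
    return best if scores[best] <= max_distance else None


def make_page(dark_rows, size=16):
    """Build a synthetic size x size page whose listed rows are fully dark."""
    return [[1 if y in dark_rows else 0 for _ in range(size)] for y in range(size)]


# Two made-up layouts: one with a dark band at the top, one at the bottom.
SAMPLE_PAGES = {
    "Form A": make_page(range(0, 4)),
    "Form B": make_page(range(12, 16)),
}

print(classify_visually(make_page(range(0, 3)), SAMPLE_PAGES))
```

If two sample pages had nearly the same density grid, their distances to a new page would be almost equal, which mirrors the limitation described above.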
Labelset-Based Classification
FYI: Labelset-Based Classification generally works best with structured and semi-structured documents. It relies on documents of the same type having similar labels and on Labelsets being used for extraction. For more information on Labelsets, take a look at our Labeling Behavior article.
Labels are important for understanding data on a document. The way we can tell the difference between an invoice date and an order date on an invoice is by the labels on the document. We can also often tell what type of document we are working with based on the document's labels. You might expect an "Invoice Number" label on an invoice, but you wouldn't expect the same label to be on an Explanation of Benefits (EOB) document. In this way, labels can be used to help classify a document.
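Here is a minimal sketch of classifying by labels, assuming each Document Type comes with a list of labels we expect to see on it. It does not reflect how Grooper's Labelsets are stored or scored. The label lists (especially the EOB labels), the `label_score` and `classify_by_labels` functions, and the `min_score` threshold are hypothetical.

```python
# Hypothetical label lists; a real Labelset would be collected from sample documents.
LABELSETS = {
    "Invoice": ["Invoice Number", "Invoice Date", "Order Date"],
    "Explanation of Benefits": ["Claim Number", "Patient Name", "Amount Billed"],
}


def label_score(text, labels):
    """Return the fraction of the expected labels that appear in the text."""
    lowered = text.lower()
    found = sum(1 for label in labels if label.lower() in lowered)
    return found / len(labels)


def classify_by_labels(text, labelsets, min_score=0.5):
    """Return the Document Type with the highest share of matching labels,
    or None if no type reaches min_score."""
    scores = {name: label_score(text, labels) for name, labels in labelsets.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else None


invoice_text = "Invoice Number: 1001\nInvoice Date: 2024-03-01\nTotal Due: $50.00"
print(classify_by_labels(invoice_text, LABELSETS))
```

The sample invoice contains two of the three expected invoice labels and none of the EOB labels, so it scores highest as an Invoice.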





