2.72:Classification Mockup - RP

From Grooper Wiki

Classification is the process of assigning a Document Type to an unclassified document folder in a Batch. A document folder must be assigned a Document Type for Grooper to know what to do with the document it contains.

About

Let's revisit the five phases of Grooper.

  1. Acquire
    • Either physical pages are scanned into Grooper or digital files are imported into a Batch in Grooper.
  2. Condition
    • This involves running Recognize and OCR on the Batch to allow Grooper to read the text and clean up the pages if needed.
  3. Organize
    • This is where you separate the pages in the Batch into individual document folders.
    • After the pages have been separated, the document folders are classified.
  4. Collect
    • Data is extracted from the documents.
  5. Deliver
    • The extracted data is exported from Grooper to the destination of your choice.

Classification is part of the third phase, Organize, and happens after Separation.

Computer programs have no sense of intuition. Unless we tell Grooper that an invoice is an invoice, it won't know the difference between an invoice and any other document like a college transcript. This becomes problematic if we want to extract information from two different types of documents contained within the same Batch.

If we want to extract names from a patient intake form and dollar amounts from an Explanation of Benefits (EOB) that are in the same Batch, we have to tell Grooper which document is which so it extracts the correct information.

We assign a Document Type to each document folder so Grooper knows that if a document has been assigned X Document Type then it needs to do Y with it. The process of assigning the Document Type to a document is called Classification.

All documents come into Grooper unclassified, so classification is ALWAYS required to move to the next phase of Grooper. Grooper cannot extract information from documents that have not been Classified.
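
The idea that a Document Type decides what happens to a document can be sketched in a few lines of Python. This is only an illustration, not Grooper's API: the `EXTRACTION_PLAN` mapping, the field names, and the `fields_to_extract` function are all hypothetical.

```python
# Hypothetical mapping from Document Type to the fields we want extracted.
EXTRACTION_PLAN = {
    "Patient Intake Form": ["patient_name"],
    "Explanation of Benefits": ["dollar_amounts"],
}


def fields_to_extract(document_type):
    """Return the fields to extract for a classified document folder."""
    # An unclassified folder has no Document Type, so there is nothing to look up.
    if document_type is None:
        raise ValueError("Document folder is unclassified; classify it before extraction.")
    return EXTRACTION_PLAN[document_type]
```

A folder with no Document Type raises an error here, which mirrors the point above: nothing can be extracted until the folder has been classified.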

Classification Methods

Before configuring classification, we need to make sure we have a Classification Method set. This property is found on the Content Model. If we do not set a Classification Method then Grooper won't know how we want to classify the documents.

Documents are actually classified during the Classify Step of a published Batch Process. The Classify Step looks at the Content Model to determine which Classification Method to use.

There are four Classification Methods:

  1. Rules-Based Classification
  2. Lexical Classification
  3. Visual Classification
  4. Labelset-Based Classification

Which method you use generally depends on the type of documents you have within your Batch. Some methods are better suited to structured or semi-structured documents, like W-4s or invoices, than to unstructured documents, like letters or leases.

Let's go through each Classification Method individually.

Rules-Based Classification

FYI

Rules-Based Classification works best on structured or semi-structured documents. For unstructured documents, it might be more advantageous to use Lexical Classification or a mixture of both Rules-Based and Lexical Classification Methods.

How do you tell what a document is? You might notice the document has a specific title or certain wording that is specific to that type of document. For example, you might expect to find an "Invoice Date" label on an invoice, but not on an Explanation of Benefits form. On a Federal W-4, you might actually see "W-4" listed as a title of the document.

You can tell Grooper to classify any document that has an "Invoice Date" label as an invoice or any document that has "W-4" on it as a W-4. We do this by setting a Positive Extractor on each Document Type. If the Positive Extractor returns at least one result, the document will be classified as that Document Type.
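
As a rough sketch of this logic, assume each Document Type's Positive Extractor is a single regular expression. The patterns, the `POSITIVE_EXTRACTORS` dictionary, and the `classify_by_rules` function below are made up for illustration and are not how Grooper is actually configured.

```python
import re

# Hypothetical Positive Extractors: one regular expression per Document Type.
POSITIVE_EXTRACTORS = {
    "Invoice": re.compile(r"\bInvoice\s+Date\b", re.IGNORECASE),
    "W-4": re.compile(r"\bW-4\b"),
}


def classify_by_rules(text):
    """Return the first Document Type whose pattern matches at least once, else None."""
    for document_type, pattern in POSITIVE_EXTRACTORS.items():
        if pattern.search(text):
            return document_type
    # No Positive Extractor returned a result, so the document stays unclassified.
    return None
```

For example, `classify_by_rules("Invoice Date: 01/02/2020")` returns `"Invoice"`, while text containing neither phrase returns `None`.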

For more details on how to set up Rules-Based Classification, please see the Rules Based (Classification Method) article.

Lexical Classification

FYI

Lexical Classification can work well for most types of documents, both structured and unstructured. If the Rules-Based method won't give you the results you want, you can try Lexical Classification. You can also combine Lexical and Rules-Based Classification to improve your results.

While labels or titles on a document can give a good indication of what the document is, we do not always have that information available. This is especially true on unstructured documents. So, how do we tell documents apart in this type of scenario?

Generally, documents, even unstructured documents, have different language in them. You'd be more likely to see the word "oil" or "lease" on an oil and gas lease document than you would on a W-4. Using word frequency, we can train Grooper to recognize documents as different Document Types.

Below, in the first two documents (the Oil & Gas Leases), the words "oil" and "lease" appear fairly frequently throughout the text. The W-4 has only one instance of "lease", and that is as part of the word "release". Looking at the language alone, we can determine which documents are the Oil & Gas Leases.

The algorithm that's used to train Grooper on how to classify documents is known as Term Frequency-Inverse Document Frequency or TF-IDF. For more information on how this works, please see our TF-IDF article.
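
To give a feel for TF-IDF, here is a minimal pure-Python sketch. It is not Grooper's implementation. It assumes each Document Type's training text is pooled into one "document", uses a smoothed IDF formula, and compares documents with cosine similarity. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Whole words only, so "release" is never counted as "lease".
    return re.findall(r"[a-z]+", text.lower())


def weight(counts, idf):
    """Turn raw word counts into TF-IDF weights, ignoring words unseen in training."""
    total = sum(counts.values()) or 1
    return {w: (c / total) * idf[w] for w, c in counts.items() if w in idf}


def train(samples):
    """samples maps Document Type -> list of training texts. Returns (idf, vectors)."""
    counts = {t: Counter(w for s in texts for w in tokenize(s))
              for t, texts in samples.items()}
    n = len(counts)
    vocab = set().union(*counts.values())
    # Smoothed IDF (an assumption): words found in fewer Document Types weigh more.
    idf = {w: math.log((1 + n) / (1 + sum(w in c for c in counts.values()))) + 1
           for w in vocab}
    return idf, {t: weight(c, idf) for t, c in counts.items()}


def cosine(a, b):
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_lexical(text, idf, vectors):
    """Return the Document Type whose trained vector is most similar to the text."""
    vec = weight(Counter(tokenize(text)), idf)
    return max(vectors, key=lambda t: cosine(vec, vectors[t]))
```

After training on a few sample leases and W-4s, a new text that repeats "oil" and "lease" scores closest to the lease vector, because those words carry weight only for that Document Type.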

For more details on how to set up Lexical Classification, please see the Lexical (Classification Method) article.

Visual Classification

FYI

Visual Classification generally only works for highly structured documents. Documents of the same type need to be visually similar to each other and visually different from other types.

Visual Classification is different from the previous two methods because it does not involve the language of the document. Rather, it involves the structure and overall look of the document. Grooper looks at the concentration of pixels and how they are arranged on a document to determine what the Document Type should be.

Here we have a Federal W-4 and an Iowa State W-4. We can see that these two highly structured documents have significantly different layouts.

Visual Classification requires an IP Profile with a configured Extract Features step. For more information on IP Profiles, take a look at our IP Profile article.

The Extract Features step binarizes and intensifies the images. It then analyzes the pixels and creates a grid pattern that Grooper can better understand.

The two documents below are the result when the documents above (the Federal W-4 and the Iowa W-4) have had the Extract Features step applied. We can see a distinct difference between the two images. Grooper will be able to tell the two images apart based on the variations of color in the grid layout.
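
The general idea of turning a page into a comparable grid can be sketched as follows. This is a simplified stand-in, not the actual Extract Features step. It assumes a page is a 2D list of grayscale values (0 to 255), binarizes with a fixed threshold, and measures the share of dark pixels in each grid cell. The function names, grid size, and threshold are all hypothetical.

```python
def extract_features(pixels, grid=2, threshold=128):
    """Return the ratio of dark pixels in each cell of a grid x grid layout.

    pixels is a 2D list of grayscale values and must be at least grid x grid in size.
    """
    height, width = len(pixels), len(pixels[0])
    features = []
    for gy in range(grid):
        for gx in range(grid):
            rows = range(gy * height // grid, (gy + 1) * height // grid)
            cols = range(gx * width // grid, (gx + 1) * width // grid)
            # Binarize: any pixel darker than the threshold counts as "ink".
            dark = sum(1 for y in rows for x in cols if pixels[y][x] < threshold)
            features.append(dark / (len(rows) * len(cols)))
    return features


def distance(a, b):
    # Euclidean distance between two feature lists of equal length.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def classify_visual(pixels, trained):
    """Return the Document Type whose trained features are closest to this page."""
    features = extract_features(pixels)
    return min(trained, key=lambda t: distance(features, trained[t]))
```

Two layouts with ink in different regions produce clearly different feature lists, while two layouts with ink in the same regions produce nearly identical ones. That is the same weakness described in this section.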

However, if two different types of documents look too similar, Grooper may not be able to differentiate between the two. This is the downside to Visual Classification. This method should only really be used if you know your documents of different types significantly differ from one another in their layout.

For more details on how to set up Visual Classification, please see the Visual (Classification Method) article.

Labelset-Based Classification

FYI

Labelset-Based Classification generally works best with semi-structured documents. It relies on documents of the same type having similar labels and on Labelsets being used for extraction. For more information on Labelsets, take a look at our Labeling Behavior article.

Labels are important for understanding data on a document. Even among similar documents, labels that reference the same type of data may be different.

For example, take a look at the two invoices below. Both of these documents have an Invoice Number and an Invoice Date. However, the labels that indicate these fields are different.

On the Stuff and Things invoice, the invoice number field has "Invoice Number:" as its label, whereas the Envoy invoice has "Invoice" as its label for the same information. We can tell these two invoices apart based on what labels are used to collect the same information. In this way, labels can be used to help classify that document.
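
A toy version of this idea is sketched below: score each Document Type by the share of its labels found in the document text. It is not Grooper's Labelset matching. The "Invoice Number:" and "Invoice" labels come from the example above, but the date labels, the `LABELSETS` dictionary, the function name, and the `minimum` cutoff are invented for illustration.

```python
# Hypothetical Labelsets. The date labels are invented for illustration.
LABELSETS = {
    "Stuff and Things Invoice": ["Invoice Number:", "Invoice Date:"],
    "Envoy Invoice": ["Invoice", "Date Issued"],
}


def classify_by_labels(text, minimum=0.75):
    """Return the Document Type with the largest share of its labels present in the text."""
    lowered = text.lower()
    scores = {
        document_type: sum(label.lower() in lowered for label in labels) / len(labels)
        for document_type, labels in LABELSETS.items()
    }
    best = max(scores, key=scores.get)
    # Leave the document unclassified when too few labels match.
    return best if scores[best] >= minimum else None
```

The short "Invoice" label also appears inside "Invoice Number:", so a single matching label is not enough here. Scoring the whole set is what lets the two invoices be told apart.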

For more details on how to set up Labelset-Based Classification, please see the Labelset-Based (Classification Method) article.