2.72:Classification Mockup - RP

From Grooper Wiki
Revision as of 10:00, 16 January 2024 by Rpatton (talk | contribs) (Lexical Update // Edit via Wikitext Extension for VSCode)

Classification is the process of assigning a Document Type to an unclassified document folder in a Batch. A document folder must be assigned a Document Type for Grooper to know what to do with the document it contains.

About

Let's revisit the five phases of Grooper.

  1. Acquire
    • Either physical pages are scanned into Grooper or digital files are imported into a Batch in Grooper.
  2. Condition
    • This involves running Recognize and OCR on the Batch to allow Grooper to read the text and clean up the pages if needed.
  3. Organize
    • This is where you separate the pages in the Batch into individual document folders.
    • After the pages have been separated, the document folders are classified.
  4. Collect
    • Data is extracted from the documents.
  5. Deliver
    • The extracted data is exported from Grooper to the destination of your choice.

Computer programs have no sense of intuition. Unless we tell Grooper that an invoice is an invoice, it won't know the difference between an invoice and any other document like a college transcript. This becomes problematic if we want to extract information from two different types of documents contained within the same Batch.

If we want to extract names from a patient intake form and dollar amounts from an Explanation of Benefits (EOB) that are in the same Batch, we have to tell Grooper which document is which so it extracts the correct information.

We assign a Document Type to each document folder so Grooper knows that if a document has been assigned X Document Type then it needs to do Y with it. The process of assigning the Document Type to a document is called Classification.
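The "X Document Type, so do Y" idea can be sketched as a simple lookup. This is a minimal illustration in Python, not how Grooper is configured: the function names, the Document Type names, and the regular expressions are all hypothetical and chosen only to mirror the patient intake form and EOB example above.

```python
import re

def extract_names(text):
    # Hypothetical rule: capture the value that follows a "Patient Name:" label.
    return re.findall(r"Patient Name:\s*([A-Z][a-z]+(?: [A-Z][a-z]+)*)", text)

def extract_dollar_amounts(text):
    # Hypothetical rule: capture dollar amounts such as $950.00 or $1,200.50.
    return re.findall(r"\$\d[\d,]*(?:\.\d{2})?", text)

# Each (hypothetical) Document Type is tied to the work we want done for it.
EXTRACTION_BY_DOCUMENT_TYPE = {
    "Patient Intake Form": extract_names,
    "Explanation of Benefits": extract_dollar_amounts,
}

def extract(document_type, text):
    """Run the extraction tied to the assigned Document Type."""
    if document_type is None:
        # An unclassified document gives us nothing to look up.
        raise ValueError("document is unclassified; no extraction is defined for it")
    return EXTRACTION_BY_DOCUMENT_TYPE[document_type](text)
```

Without an assigned Document Type there is no entry to look up, which is why an unclassified document cannot move on to extraction.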

All documents come into Grooper unclassified, so classification is ALWAYS required to move to the next phase of Grooper.

Classification Methods

Before configuring classification, we need to make sure we have a Classification Method set. This property is found on the Content Model. If we do not set a Classification Method then Grooper won't know how we want to classify the documents.

Documents are actually classified during the Classify Step of a published Batch Process. The Classify Step looks at the Content Model to determine which Classification Method to use.

There are four Classification Methods:

  1. Rules-Based Classification
  2. Lexical Classification
  3. Visual Classification
  4. Labelset-Based Classification

Which method you use generally depends on the type of documents you have within your Batch. Some methods lend themselves better to structured documents like invoices or EOBs than to unstructured documents like letters or leases.

Let's go through each Classification Method individually.

Rules-Based Classification

FYI

Rules-Based Classification works best on structured or semi-structured documents. For unstructured documents, it might be more advantageous to use Lexical Classification or a mixture of both Rules-Based and Lexical Classification Methods.

How do you tell what a document is? You might notice the document has a specific title or certain wording that is specific to that type of document. For example, you might expect to find an "Invoice Date" label on an invoice, but not on an Explanation of Benefits form. On a Federal W-4, you might actually see "W-4" listed as a title of the document.

You can tell Grooper to classify any document that has an "Invoice Date" label as an invoice or any document that has "W-4" on it as a W-4. We do this by setting a Positive Extractor on each Document Type. If the Positive Extractor returns at least one result, the document will be classified as that Document Type.
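The "at least one result" rule can be sketched in a few lines of Python. This is only an illustration of the idea, assuming each Document Type has a single regular expression standing in for its Positive Extractor. The patterns and the `classify_by_rules` function are hypothetical and are not Grooper configuration.

```python
import re

# Hypothetical Document Types, each with a pattern standing in for its Positive Extractor.
POSITIVE_EXTRACTORS = {
    "Invoice": re.compile(r"\bInvoice\s+Date\b", re.IGNORECASE),
    "W-4": re.compile(r"\bW-4\b"),
}

def classify_by_rules(text):
    """Return the first Document Type whose pattern finds at least one match, else None."""
    for doc_type, pattern in POSITIVE_EXTRACTORS.items():
        if pattern.search(text):
            return doc_type
    return None  # no pattern matched, so the document stays unclassified
```

For example, `classify_by_rules("Invoice Date: 01/16/2024")` returns `"Invoice"`, while text containing neither phrase returns `None`.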


Lexical Classification

FYI

Lexical Classification can work well for most types of documents, both structured and unstructured. If the Rules-Based method won't give you the results you want, you can try Lexical classification. You can also combine both Lexical and Rules-Based Classification to improve your results.

While labels or titles on a document can give a good indication of what the document is, we do not always have that information available. This is especially true on unstructured documents. So, how do we tell documents apart in this type of scenario?

Generally, documents, even unstructured documents, have different language in them. You'd be more likely to see the word "oil" or "lease" on an oil and gas lease document than you would on a W-4. Using word frequency, we can train Grooper to recognize documents as different Document Types.

   In the documents below, the first two, both Oil & Gas Leases, use the words "oil" and "lease" fairly frequently, whereas the W-4 has only one instance of "lease", and that is as part of the word "release". Looking at the language alone, we can determine which documents are the Oil & Gas Leases.

The algorithm that's used to train Grooper on how to classify documents is known as Term Frequency-Inverse Document Frequency or TF-IDF. For more information on how this works, please see our TF-IDF article.
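To give a feel for the idea, here is a small TF-IDF sketch in plain Python. It is not Grooper's implementation. The smoothed IDF formula, the cosine-similarity scoring, and every function name are assumptions made for illustration. Each Document Type is treated as one pool of training text, terms are weighted by how often they appear in that pool and how few other types share them, and a new document is assigned to the most similar type.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Whole-word tokens, so "release" does not count as "lease".
    return re.findall(r"[a-z0-9]+", text.lower())

def weigh(counts, idf):
    """Turn raw term counts into TF-IDF weights, ignoring terms never seen in training."""
    total = sum(counts.values()) or 1
    return {term: (n / total) * idf[term] for term, n in counts.items() if term in idf}

def train(samples):
    """samples maps a Document Type name to a list of training texts."""
    type_counts = {t: Counter(tok for text in texts for tok in tokenize(text))
                   for t, texts in samples.items()}
    n_types = len(type_counts)
    vocab = set().union(*type_counts.values())
    # Smoothed IDF (an assumed variant): terms shared by every type get the lowest weight.
    idf = {term: math.log((1 + n_types) / (1 + sum(term in c for c in type_counts.values()))) + 1
           for term in vocab}
    vectors = {t: weigh(c, idf) for t, c in type_counts.items()}
    return idf, vectors

def cosine(a, b):
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def classify(text, idf, vectors):
    """Return the most similar Document Type, or None if nothing in the text was seen in training."""
    doc = weigh(Counter(tokenize(text)), idf)
    scores = {t: cosine(doc, v) for t, v in vectors.items()}
    best = max(scores, key=scores.get, default=None)
    return best if best is not None and scores[best] > 0 else None
```

Trained on a couple of lease samples and a couple of W-4 samples, `classify` would assign a document that repeats "oil" and "lease" to the lease type, because those terms carry weight there and none in the W-4 pool.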

Visual Classification

FYI

Visual Classification generally only works for highly structured documents. Documents of the same type need to be visually similar to each other and visually different from other types.

Visual Classification is different from the previous two methods because it does not involve the language of the document. Rather, it involves the structure and overall look of the document. Grooper looks at the concentration of pixels and how they are arranged on a document to determine what the Document Type should be.

We can see that these two documents have significantly different layouts.

Grooper can look at how pixels are grouped on these documents to tell the difference between the two.

However, if two documents look too similar, Grooper may not be able to differentiate between them. This is the downside to Visual Classification. Use this method only if you know your documents differ significantly from one another in layout.
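One way to picture "how pixels are grouped" is to split each page into a grid and measure how dark each cell is. The Python sketch below does that and assigns a page to the reference layout with the closest grid. It is an assumed simplification for illustration, not Grooper's algorithm. The grid size, the distance measure, and all names are hypothetical, and pages are represented as lists of 0/1 pixels where 1 is dark.

```python
def density_grid(image, rows=4, cols=4):
    """Return the fraction of dark pixels in each cell of a rows x cols grid, as a flat list."""
    height, width = len(image), len(image[0])
    grid = []
    for r in range(rows):
        y0, y1 = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            grid.append(sum(cell) / len(cell) if cell else 0.0)
    return grid

def distance(a, b):
    # Mean absolute difference between two density grids of equal length.
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def classify_by_layout(image, references, rows=4, cols=4):
    """Return the Document Type whose reference page has the closest density grid."""
    grid = density_grid(image, rows, cols)
    return min(references,
               key=lambda t: distance(grid, density_grid(references[t], rows, cols)))
```

The same sketch also shows the weakness described above: two Document Types with nearly identical layouts produce nearly identical grids, so the closest match becomes unreliable.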

Labelset-Based Classification

FYI

Labelset-Based Classification generally works best with structured and semi-structured documents. Labelset-Based Classification relies on documents of the same type having similar labels and on Labelsets being used for extraction. For more information on Labelsets, take a look at our Labeling Behavior article.

Labels are important for understanding data on a document. The way we can tell the difference between an invoice date and an order date on an invoice is by the labels on the document. We can also often tell what type of document we are working with based on the document's labels. You might expect an "Invoice Number" label on an invoice, but you wouldn't expect the same label to be on an Explanation of Benefits (EOB) document. In this way, labels can be used to help classify that document.
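As a rough sketch of the idea, the Python below scores a document by how many of each Document Type's expected labels appear in its text and picks the best match. This is an assumed simplification, not how Grooper's Labelsets work internally. The label lists (especially the EOB labels), the 0.5 threshold, and the function names are hypothetical.

```python
# Hypothetical labels expected on each Document Type.
LABEL_SETS = {
    "Invoice": ["Invoice Number", "Invoice Date", "Order Date"],
    "EOB": ["Claim Number", "Patient Name", "Amount Billed"],
}

def label_score(text, labels):
    """Fraction of the expected labels that appear somewhere in the text."""
    lowered = text.lower()
    return sum(label.lower() in lowered for label in labels) / len(labels)

def classify_by_labels(text, label_sets=LABEL_SETS, minimum=0.5):
    """Return the Document Type with the highest label score, or None if the best score is below minimum."""
    scores = {doc_type: label_score(text, labels) for doc_type, labels in label_sets.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= minimum else None
```

A page containing "Invoice Number" and "Invoice Date" matches two of the three hypothetical invoice labels and none of the EOB labels, so it is classified as an invoice.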