2.72:Classification Mockup - RP: Difference between revisions

Revision as of 10:30, 8 January 2024

Classification is the process of assigning a Document Type (or other Content Type) to a document in a Batch. A document must be assigned a Document Type for Grooper to know what to do with the document.

About

Let's revisit the first three of the five phases of Grooper.

Acquire
- This involves bringing a Batch into Grooper. Usually, documents are scanned into Grooper, and the initial Batch looks like one long document made up of individual pages.
Condition
- This involves running Recognize and OCR on the Batch so Grooper can read the text, and cleaning up the documents if needed.
Organize
- This is where you separate the documents in the Batch into individual folders.
- After the documents have been separated, they need to go through Classification.

Computer programs have no sense of intuition. Unless we tell Grooper that an invoice is an invoice, it won't know the difference between one document and another. This becomes problematic if we want to extract information from two different types of documents contained within the same Batch. If we want to extract names from a patient intake form and dollar amounts from an Explanation of Benefits document that are in the same Batch, we have to tell Grooper which document is which so it extracts the correct information.

We assign a Content Type, such as a Document Type, to each document so Grooper knows that if a document has been assigned X Document Type then it needs to do Y with it. The process of assigning the Document Type to a document is called Classification.

You can assign a Document Type to documents manually, or you can use Grooper to automate the process. There are four Classification Methods for automating Classification:

Rules-Based Classification
Lexical Classification
Visual Classification
Labelset-Based Classification

Which method you use generally depends on the type of documents you have within your Batch. Some methods are better suited to structured documents, such as invoices or EOBs, than to unstructured documents, such as letters or leases. Let's go through each Classification Method individually.

⚠	Before you can start configuring classification, you need to set the Classification Method on the Content Model. If this isn't set, Grooper won't know which method of classification to use on the Batch.

Rules-Based Classification

FYI

Rules-Based Classification works best on structured or semi-structured documents. For unstructured documents, it might be more advantageous to use Lexical Classification or a mixture of both Rules-Based and Lexical Classification Methods.

How do you tell what a document is? You might notice the document has a specific title or certain wording that is specific to that type of document. For example, you might expect to find an "Invoice Date" label on an invoice, but not on an Explanation of Benefits form. On a Federal W-4, you might actually see "W-4" listed as a title of the document.

You can tell Grooper to classify any document that has an "Invoice Date" label as an invoice, or to classify a document as a W-4 if "W-4" appears on it. We do this by setting a Positive Extractor on each Document Type. If the Positive Extractor returns at least one result, the document will be classified as that Document Type.

What if the Positive Extractor returns a result for two or more different kinds of documents, and there really isn't a better option to choose from for your extractor? You can add a Negative Extractor to tell Grooper which documents should not be classified as that Document Type. Let's say that we have two W-4 documents, but one is a Federal W-4 and the other is an Iowa W-4. For the Federal W-4, we might set the Positive Extractor to capture the expression "W-4". We might then set the Negative Extractor to "Iowa" so Grooper knows that if the word "Iowa" appears on the document, it should not be classified as a Federal W-4.

Using a combination of Positive Extractors and Negative Extractors, you can generally classify structured or semi-structured documents well.
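
The Positive Extractor and Negative Extractor logic described above can be sketched in a few lines of Python. This is only an illustration of the decision rule, not Grooper's API: the DocumentTypeRule class, the classify function, and the regular expressions (including the pattern used for the Iowa W-4) are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional

FLAGS = re.IGNORECASE | re.DOTALL


@dataclass
class DocumentTypeRule:
    """Hypothetical stand-in for one Document Type's rule settings."""
    name: str
    positive: str                   # must match at least once
    negative: Optional[str] = None  # if set, must not match at all


def classify(text: str, rules: list[DocumentTypeRule]) -> Optional[str]:
    """Return the first Document Type whose rule accepts the text, else None."""
    for rule in rules:
        if not re.search(rule.positive, text, FLAGS):
            continue  # the positive pattern returned no result
        if rule.negative and re.search(rule.negative, text, FLAGS):
            continue  # the negative pattern vetoes this Document Type
        return rule.name
    return None


# Illustrative patterns only; real extractors would be tuned to the documents.
RULES = [
    DocumentTypeRule("Federal W-4", positive=r"\bW-4\b", negative=r"\bIowa\b"),
    DocumentTypeRule("Iowa W-4", positive=r"\bIowa\b.*\bW-4\b"),
    DocumentTypeRule("Invoice", positive=r"\bInvoice Date\b"),
]

print(classify("Form W-4 Employee's Withholding Certificate", RULES))
print(classify("State of Iowa\nForm W-4", RULES))
print(classify("Invoice Date: 2024-01-08", RULES))
```

Without the negative pattern on the Federal W-4 rule, the Iowa form would be claimed by the Federal rule, because "W-4" appears on both documents.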

Lexical Classification

FYI

Lexical Classification can work well for unstructured documents or documents that you have difficulty classifying using the Rules-Based method. You can also combine both Lexical and Rules-Based Classification to improve your results.

While labels or titles on a document can give a good indication of what the document is, we do not always have that information available. This is especially true of unstructured documents. So, how do we tell documents apart in this scenario?

Generally, documents, even unstructured documents, have different language in them. You'd be more likely to see the word "oil" or "lease" on an oil and gas lease document than you would on a cover letter for a job. Using word frequency, we can train Grooper to recognize documents as different Document Types.

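The algorithm used to train Grooper to classify documents is known as Term Frequency-Inverse Document Frequency, or TF-IDF. It gives a word more weight the more often it appears in a document and less weight the more documents it appears in. For more information on how this works, please see our TF-IDF article.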
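
To make the idea concrete, here is a minimal TF-IDF classifier sketch in Python. It is not Grooper's implementation. The function names, the smoothed IDF formula, the pooling of training samples per Document Type, and the use of cosine similarity for scoring are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def weigh(counts: Counter, idf: dict[str, float]) -> dict[str, float]:
    """Turn raw word counts into TF-IDF weights, ignoring unknown words."""
    total = sum(counts.values())
    return {t: (c / total) * idf[t] for t, c in counts.items() if t in idf}


def train(samples: dict[str, list[str]]):
    """Build one TF-IDF vector per Document Type from labeled sample texts."""
    # Pool each Document Type's samples into a single bag of words.
    pooled = {
        label: Counter(tok for doc in docs for tok in tokenize(doc))
        for label, docs in samples.items()
    }
    n = len(pooled)
    vocab = set().union(*pooled.values())
    # Smoothed IDF: words that appear under every Document Type weigh least.
    idf = {
        t: math.log((1 + n) / (1 + sum(t in c for c in pooled.values()))) + 1.0
        for t in vocab
    }
    vectors = {label: weigh(counts, idf) for label, counts in pooled.items()}
    return vectors, idf


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(text: str, vectors, idf) -> str:
    """Pick the Document Type whose vector is most similar to the text."""
    query = weigh(Counter(tokenize(text)), idf)
    return max(vectors, key=lambda label: cosine(query, vectors[label]))


SAMPLES = {
    "Lease": [
        "This oil and gas lease grants the lessee rights to drill for oil",
        "The lessor leases the land for oil and gas exploration",
    ],
    "Cover Letter": [
        "I am writing to apply for the position at your company",
        "My experience makes me a strong candidate for this job",
    ],
}
VECTORS, IDF = train(SAMPLES)
print(classify("The lease covers oil and gas rights on the land", VECTORS, IDF))
```

Words such as "the" and "for" occur under both Document Types, so they receive the lowest IDF and contribute little to the score, while words such as "oil" and "lease" pull a document toward the Lease type.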

Visual Classification

FYI

Visual Classification generally only works for highly structured documents. Documents of the same type need to be visually similar to each other and visually different from other types.

@@ Line 35: / Line 35: @@
 === Rules-Based Classification ===
+{|class="fyi-box"
+|-
+|
+'''FYI'''
+|
+''Rules-Based'' Classification works best on structured or semi-structured documents. For unstructured documents, it might be more advantageous to use ''Lexical'' Classification or a mixture of both ''Rules-Based'' and ''Lexical'' '''''Classification Methods'''''.
+|}
 How do you tell what a document is? You might notice the document has a specific title or certain wording that is specific to that type of document. For example, you might expect to find an "Invoice Date" label on an invoice, but not on an Explanation of Benefits form. On a Federal W-4, you might actually see "W-4" listed as a title of the document.
 You can tell Grooper to classify any document that has an "Invoice Date" label as an invoice or that a document is a W-4 if it has "W-4" as part of the document. We do this by setting a '''''Positive Extractor''''' on each '''Data Type'''. If the '''''Positive Extractor''''' returns at least one result, the document will be classified as that '''Data Type'''.
+What if you run into a situation where the '''''Positive Extractor''''' is returning a result on two or more different documents, and there really isn't another good option to choose from for your extractor? You can use a '''''Negative Extractor''''' in addition to tell Grooper which documents should not be classified. Let's say that we have two W-4 documents, but one is a Federal W-4 and the other is an Iowa W-4. For the Federal W-4, we might set the '''''Positive Extractor''''' to capture the expression "W-4". We might then set the '''''Negative Extractor''''' to "Iowa" so Grooper knows that if the word "Iowa" appears on the document, then it should not be classified as a Federal W-4.
+Using a combination of '''''Positive Extractors''''' and '''''Negative Extractors''''', you can generally do a pretty good job of classifying structured or semi-structured documents.
 === Lexical Classification ===
+{|class="fyi-box"
+|-
+|
+'''FYI'''
+|
+''Lexical'' Classification can work well for unstructured documents or documents that you have difficulty classifying using the ''Rules-Based'' method. You can also combine both ''Lexical'' and ''Rules-Based'' Classification to improve your results.
+|}
+While labels or titles on a document can give a good indication of what the document is, we do not always have that information available. This is especially true on unstructured documents. So, how do we tell documents apart in this type of scenario?
+Generally, documents, even unstructured documents, have different language in them. You'd be more likely to see the word "oil" or "lease" on an oil and gas lease document than you would on a cover letter for a job. Using word frequency, we can train Grooper to recognize documents as different '''Document Types'''.
+The algorithm that's used to train Grooper on how to classify documents is known as Term Frequency-Inverse Document Frequency or TF-IDF. For more information on how this works, please see our [[TF-IDF]] article.
 === Visual Classification ===
+{|class="fyi-box"
+|-
+|
+'''FYI'''
+|
+''Visual'' Classification generally only works for highly structured document. Documents of the same type need to be visually similar to each other and visually different from other types.
+|}
 === Labelset-Based Classification ===