2.72:Classification Mockup - RP

Revision as of 13:33, 4 January 2024

Classification is the process of assigning a Document Type (or other Content Type) to a document in a Batch. A document must be assigned a Document Type for Grooper to know what to do with the document.

About

Let's revisit the first three of the five phases of Grooper.

Acquire
- This involves bringing a Batch into Grooper. Usually, documents are scanned into Grooper, and the initial Batch looks like one long document made up of individual pages.
Condition
- This involves running Recognize and OCR on the Batch so Grooper can read the text, and cleaning up the document if needed.
Organize
- This is where you separate the documents in the Batch into individual folders.
- After the documents have been separated, they need to go through Classification.

Computer programs have no sense of intuition. Unless we tell Grooper that an invoice is an invoice, it won't know the difference between one document and another. This becomes a problem when we want to extract information from two different types of documents contained within the same Batch. If we want to extract names from a patient intake form and dollar amounts from an Explanation of Benefits document that are in the same Batch, we have to tell Grooper which document is which so it extracts the correct information.

We assign a Content Type, such as a Document Type, to each document so Grooper knows that if a document has been assigned X Document Type then it needs to do Y with it. The process of assigning the Document Type to a document is called Classification.
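The "if a document is X, do Y" idea can be pictured as a lookup from Document Type to an extraction routine. The following is only an illustrative Python sketch, not Grooper's API: the function names, the `EXTRACTORS` table, and the regular expressions are all hypothetical, chosen to mirror the patient intake form and Explanation of Benefits example above.

```python
import re

# Hypothetical extraction routines, one per Document Type (not Grooper code).
def extract_patient_names(text):
    # Capture whatever follows a "Patient Name:" label on the same line.
    return re.findall(r"Patient Name:\s*([A-Za-z ,.'-]+)", text)

def extract_dollar_amounts(text):
    # Capture amounts such as $950.50 or $1,200.00.
    return re.findall(r"\$\d[\d,]*\.\d{2}", text)

# Hypothetical mapping: the assigned Document Type decides what gets extracted.
EXTRACTORS = {
    "Patient Intake Form": extract_patient_names,
    "Explanation of Benefits": extract_dollar_amounts,
}

def extract(document_type, text):
    # Without an assigned Document Type there is no way to pick a routine.
    if document_type is None:
        raise ValueError("document has not been classified")
    return EXTRACTORS[document_type](text)
```

In this sketch, an unclassified document raises an error, which mirrors the point that a document must be assigned a Document Type before anything else can be done with it.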

You can assign a Document Type to documents manually, or you can have Grooper automate the process. There are four Classification Methods for automating Classification:

Rules-Based Classification
Lexical Classification
Visual Classification
Labelset-Based Classification

Which method you use generally depends on the type of documents you have within your Batch. Some methods are better suited to structured documents, like invoices or EOBs, than to unstructured documents, like letters or leases. Let's go through each Classification Method individually.

⚠	Before you can start configuring classification, you need to set the Classification Method on the Content Model. If this isn't set, Grooper won't know which method of classification to use on the Batch.

Rules-Based Classification

How do you tell what a document is? You might notice that the document has a particular title or wording that is specific to that type of document. For example, you might expect to find an "Invoice Date" label on an invoice, but not on an Explanation of Benefits form. On a Federal W-4, you might actually see "W-4" listed as a title of the document.

You can tell Grooper to classify any document that has an "Invoice Date" label as an invoice, or to classify a document as a W-4 if "W-4" appears in it. We do this by setting a Positive Extractor on each Document Type. If the Positive Extractor returns at least one result, the document will be classified as that Document Type.
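The rule described above can be sketched in a few lines of Python. This is an illustration of the logic only, not how Grooper is configured or implemented: the `POSITIVE_EXTRACTORS` table and the `classify` function are hypothetical, each regular expression stands in for a Positive Extractor, and "first match wins" is an assumption made to keep the sketch short.

```python
import re

# Hypothetical stand-ins for the Positive Extractor on each Document Type.
POSITIVE_EXTRACTORS = {
    "Invoice": re.compile(r"Invoice\s+Date", re.IGNORECASE),
    "W-4": re.compile(r"\bW-4\b"),
}

def classify(text):
    """Return the first Document Type whose pattern finds at least one result."""
    for document_type, pattern in POSITIVE_EXTRACTORS.items():
        if pattern.search(text):
            return document_type
    # No pattern matched, so the document stays unclassified.
    return None
```

A document containing "Invoice Date: 01/04/2024" would be classified as "Invoice" here, while a letter with neither label would come back as `None` and would need another rule or manual classification.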
