2.72:Classification Mockup - RP

From Grooper Wiki
Latest revision as of 09:01, 5 August 2025

Classification is the process of assigning a Document Type to an unclassified document folder in a Batch. A document folder must be assigned a Document Type for Grooper to know what to do with the document it contains.

About

Let's revisit the five phases of Grooper.

  1. Acquire
    • Either physical pages are scanned into Grooper or digital files are imported into a Batch in Grooper.
  2. Condition
    • This involves running Recognize and OCR on the Batch to allow Grooper to read the text and clean up the pages if needed.
  3. Organize
    • This is where you separate the pages in the Batch into individual document folders.
    • After the pages have been separated, the document folders are classified.
  4. Collect
    • Data is extracted from the documents.
  5. Deliver
    • The extracted data is exported from Grooper to the destination of your choice.

Classification, like Separation, is part of the third phase: Organize.

Computer programs have no sense of intuition. Unless we tell Grooper that an invoice is an invoice, it won't know the difference between an invoice and any other document, such as a college transcript. This becomes a problem when we want to extract information from two different types of documents contained within the same Batch.

If we want to extract names from a patient intake form and dollar amounts from an Explanation of Benefits (EOB) that are in the same Batch, we have to tell Grooper which document is which so it extracts the correct information.

We assign a Document Type to each document folder so Grooper knows that if a document has been assigned X Document Type then it needs to do Y with it. The process of assigning the Document Type to a document is called Classification.

All documents come into Grooper unclassified, so classification is ALWAYS required to move to the next phase of Grooper. Grooper cannot extract information from documents that have not been Classified.
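The "X Document Type, then do Y" idea can be pictured as a simple lookup. The snippet below is only an illustrative sketch in Python, not Grooper's API: the `EXTRACTION_BY_TYPE` table, the `collect` function, and the dictionary-based document are all hypothetical names made up for this example.

```python
# Hypothetical lookup: which extraction task belongs to which Document Type.
EXTRACTION_BY_TYPE = {
    "Patient Intake Form": "extract names",
    "Explanation of Benefits": "extract dollar amounts",
}


def collect(document):
    """Return the extraction task for a document, based on its Document Type.

    `document` is a plain dict standing in for a document folder.
    An unclassified document has no "document_type" entry.
    """
    doc_type = document.get("document_type")
    if doc_type is None:
        # Mirrors the rule above: nothing can be extracted before classification.
        raise ValueError("document is unclassified; classify it before extraction")
    return EXTRACTION_BY_TYPE[doc_type]
```

Calling `collect({"document_type": "Explanation of Benefits"})` returns the dollar-amount task, while a document with no assigned type is rejected outright.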

Classification Methods

Before configuring classification, we need to make sure we have a Classification Method set. This property is found on the Content Model. If we do not set a Classification Method then Grooper won't know how we want to classify the documents.

Documents are actually classified during the Classify Step of a published Batch Process. The Classify Step looks at the Content Model to determine which Classification Method to use.

There are four different Classification Methods used for Classification:

  1. Rules-Based Classification
  2. Lexical Classification
  3. Visual Classification
  4. Labelset-Based Classification

Which method you use generally depends on the type of documents you have within your Batch. Some methods are better suited to structured or semi-structured documents, such as W-4s or invoices, than to unstructured documents, such as letters or leases.

Let's go through each Classification Method individually.

Rules-Based Classification

FYI

Rules-Based Classification works best on structured or semi-structured documents. For unstructured documents, it might be more advantageous to use Lexical Classification or a mixture of both Rules-Based and Lexical Classification Methods.

How do you tell what a document is? You might notice the document has a specific title or certain wording that is specific to that type of document. For example, you might expect to find an "Invoice Date" label on an invoice, but not on an Explanation of Benefits form. On a Federal W-4, you might actually see "W-4" listed as a title of the document.

You can tell Grooper to classify any document that has an "Invoice Date" label as an invoice or any document that has "W-4" on it as a W-4. We do this by setting a Positive Extractor on each Document Type. If the Positive Extractor returns at least one result, the document will be classified as that Document Type.
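As a rough illustration of the "at least one result" rule, the sketch below uses Python regular expressions in place of Grooper extractors. It is an assumption-laden stand-in, not how Grooper is configured: `POSITIVE_PATTERNS` and `classify_by_rules` are hypothetical names, and a real Positive Extractor is set on the Document Type in Grooper rather than written in code.

```python
import re

# Hypothetical stand-ins for a Positive Extractor on each Document Type.
POSITIVE_PATTERNS = {
    "Invoice": re.compile(r"\bInvoice Date\b", re.IGNORECASE),
    "W-4": re.compile(r"\bW-4\b"),
}


def classify_by_rules(text):
    """Return the first Document Type whose pattern finds at least one match.

    Returns None when no pattern matches, leaving the document unclassified.
    """
    for doc_type, pattern in POSITIVE_PATTERNS.items():
        if pattern.search(text):
            return doc_type
    return None
```

In this sketch the first matching type wins, which is a simplification chosen for brevity; how Grooper resolves a document that matches more than one Document Type is not covered here.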

For more details on how to set up Rules-Based Classification, please see the Rules-Based (Classify Method) article.

Lexical Classification

FYI

Lexical Classification can work well for most types of documents, both structured and unstructured. If the Rules-Based method won't give you the results you want, you can try Lexical classification. You can also combine both Lexical and Rules-Based Classification to improve your results.

While labels or titles on a document can give a good indication of what the document is, we do not always have that information available. This is especially true on unstructured documents. So, how do we tell documents apart in this type of scenario?

Generally, documents, even unstructured documents, have different language in them. You'd be more likely to see the word "oil" or "lease" on an oil and gas lease document than you would on a W-4. Using word frequency, we can train Grooper to recognize documents as different Document Types.

In the first two documents below, both Oil & Gas Leases, the words "oil" and "lease" appear fairly frequently throughout the document, whereas the W-4 has only one instance of "lease", and only as part of the word "release". Looking at the language alone, we can determine which documents are the Oil & Gas Leases.

The algorithm that's used to train Grooper on how to classify documents is known as Term Frequency-Inverse Document Frequency or TF-IDF. For more information on how this works, please see our TF-IDF article.
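To make the word-frequency idea concrete, here is a minimal TF-IDF sketch in plain Python. It is not Grooper's implementation: the function names (`tokenize`, `weight`, `train`, `cosine`, `classify`), the smoothed IDF formula, and the use of cosine similarity to pick the closest Document Type are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into whole words."""
    return re.findall(r"[a-z]+", text.lower())


def weight(tokens, idf):
    """Build a TF-IDF vector (word -> weight) for a list of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values()) or 1
    return {w: (c / total) * idf[w] for w, c in counts.items() if w in idf}


def train(samples):
    """samples maps a Document Type name to a list of training texts.

    Returns (idf, vectors), where vectors holds one TF-IDF vector per type.
    """
    type_tokens = {
        t: [w for text in texts for w in tokenize(text)]
        for t, texts in samples.items()
    }
    n = len(type_tokens)
    df = Counter()
    for tokens in type_tokens.values():
        df.update(set(tokens))
    # Smoothed IDF: words shared by every type get a lower weight.
    idf = {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}
    vectors = {t: weight(tokens, idf) for t, tokens in type_tokens.items()}
    return idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(text, idf, vectors):
    """Return the Document Type whose trained vector is most similar."""
    doc = weight(tokenize(text), idf)
    return max(vectors, key=lambda t: cosine(doc, vectors[t]))
```

Because the tokenizer works on whole words, "release" on a W-4 is counted as "release" and never as "lease", which matches the observation above about the sample documents.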

For more details on how to set up Lexical Classification, please see the Lexical (Classify Method) article.

Visual Classification

FYI

Visual Classification generally only works for highly structured documents. Documents of the same type need to be visually similar to each other and visually different from other types.

Visual Classification is different from the previous two methods because it does not involve the language of the document. Rather, it involves the structure and overall look of the document. Grooper looks at the concentration of pixels and how they are arranged on a document to determine what the Document Type should be.

Here we have a Federal W-4 and an Iowa State W-4. We can see that these two highly structured documents have significantly different layouts.

Visual Classification requires an IP Profile with a configured Extract Features step. For more information on IP Profiles, take a look at our IP Profile article.

The Extract Features step will binarize and intensify the images. It then analyzes the pixels of those images and creates a grid pattern that Grooper can better understand.

The two documents below are the result when the documents above (the Federal W-4 and the Iowa W-4) have had the Extract Features step applied. We can see a distinct difference between the two images. Grooper will be able to tell the two images apart based on the variations of color in the grid layout.

However, if two different types of documents look too similar, Grooper may not be able to differentiate between the two. This is the downside to Visual Classification. Use this method only when you know that documents of different types differ significantly from one another in their layout.
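The binarize-then-grid idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the Extract Features step itself: images are plain nested lists of grayscale values, the grid size and threshold are arbitrary, and `extract_features`, `distance`, and `classify_visually` are hypothetical names.

```python
def extract_features(image, grid=2, threshold=128):
    """Summarize an image as the fraction of dark pixels in each grid cell.

    image is a 2D list of grayscale values (0 = black, 255 = white) and must
    be at least `grid` pixels wide and tall. Pixels below `threshold` count
    as dark, which is the binarization step.
    """
    height, width = len(image), len(image[0])
    features = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * height // grid, (gy + 1) * height // grid
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            cell = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            dark = sum(1 for pixel in cell if pixel < threshold)
            features.append(dark / len(cell))
    return features


def distance(a, b):
    """Sum of absolute differences between two feature lists."""
    return sum(abs(x - y) for x, y in zip(a, b))


def classify_visually(image, trained):
    """Return the Document Type whose trained features are closest."""
    features = extract_features(image)
    return min(trained, key=lambda t: distance(features, trained[t]))
```

The sketch also shows the weakness described above: two Document Types with nearly identical layouts produce nearly identical feature lists, so the distance between them gives little to separate them.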

For more details on how to set up Visual Classification, please see the Visual (Classify Method) article.

Labelset-Based Classification

FYI

Labelset-Based Classification generally works best with semi-structured documents. It relies on documents of the same type having similar labels, and on Labelsets being used for extraction. For more information on Labelsets, take a look at our Labeling Behavior article.

Labels are important for understanding data on a document. Even among similar documents, labels that reference the same type of data may be different.

For example, take a look at the two invoices below. Both of these documents have an Invoice Number and an Invoice Date. However, the labels that indicate these fields are different.

On the Stuff and Things invoice, the invoice number field has "Invoice Number:" as its label, whereas the Envoy invoice has "Invoice" as its label for the same information. We can tell these two invoices apart based on which labels are used for the same information. In this way, labels can be used to help classify a document.
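One way to picture label-driven classification is to score each Document Type by how many of its known labels appear on the document. The Python sketch below is purely illustrative and does not reflect Grooper's internals: `LABELSETS`, `classify_by_labels`, and the `minimum` score are hypothetical, and the date labels are invented for the example, since only the invoice number labels are given above.

```python
# Hypothetical labelsets. Only the invoice number labels come from the
# example above; the date labels are made up for illustration.
LABELSETS = {
    "Stuff and Things Invoice": {"Invoice Number:", "Invoice Date:"},
    "Envoy Invoice": {"Invoice", "Date"},
}


def classify_by_labels(found_labels, labelsets, minimum=0.5):
    """Return the Document Type whose labelset best matches the found labels.

    found_labels is the set of label strings located on the document.
    The score is the fraction of a type's labels that were found; the best
    type is returned only if its score reaches `minimum`, otherwise None.
    """
    best_type, best_score = None, 0.0
    for doc_type, labels in labelsets.items():
        score = len(labels & found_labels) / len(labels)
        if score > best_score:
            best_type, best_score = doc_type, score
    return best_type if best_score >= minimum else None
```

Labels are compared as exact strings here so that "Invoice" on the Envoy invoice is not confused with "Invoice Number:" on the Stuff and Things invoice.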

For more details on how to set up Labelset-Based Classification, please see the Labelset-Based (Classification Method) article.