2.72:What is Classification - DSmith: Difference between revisions

From Grooper Wiki
No edit summary
 
(35 intermediate revisions by 2 users not shown)
Line 1: Line 1:
__NOINDEX__
== Overview ==
== Overview ==
''Classification'' is an Activity in Grooper that assigns a Content Type to a Document. While we as humans may be able to classify a document by reading it (or its title, should it have one), to Grooper all documents that come in are unclassified, or "blank". If we want Grooper to know what a Purchase Order is, or be able to tell the difference between a Purchase Order and an invoice, we have to tell it; and we do that through Classification.
''Classification'' is an Activity in Grooper that assigns a Content Type to a Document. While we as humans may be able to classify a document by reading it (or its title, should it have one), to Grooper all documents that come in are unclassified, or "blank". If we want Grooper to know what a Purchase Order is, or be able to tell the difference between a Purchase Order and an invoice, we have to tell it; and we do that through ''Classification''.
 
 
== Why Classify? ==
Why is classification necessary? Why does it matter in Grooper? Isn't it enough for humans to look at a document in a document folder and know what it is? No. Why? Well, as a user, you would want properly extracted information, wouldn't you? How else would you expect Grooper to extract an invoice number from an invoice and a patient name from a medical history form when Grooper can't even tell the difference between the two?


== Classification Methods ==
== Classification Methods ==
In order to classify a document, you must choose between four different Classification Methods. They are:
In order to classify a document, you must choose between four different '''Classification Methods'''. They are:
<br>
<br>
<br>
<br>
* Rules-Based
* '''Rules-Based'''
* Labelset-Based
* '''Labelset-Based'''
* Lexical
* '''Lexical'''
* Visual
* '''Visual'''
<br>
<br>
<br>
<br>
These methods can be set on the Content Model via the Classification Method property. Which method you choose depends largely on what sort of document you have: its structure, complexity, and so on. We will provide a brief overview of each Classification Method here.
These methods can be set on the '''Content Model''' via the '''Classification Method''' property. Which method you choose depends largely on what sort of document you have: its structure and complexity. We will provide a brief overview of each '''Classification Method''' here.


For more detailed information about each Classification Method, click the following links:
For more detailed information about each '''Classification Method''', click the following links:


* [[2023:Rules-Based (Classification Method)|Rules-Based]]
* [[Rules-Based]]
* [[Labeling Behavior|Labelset-Based]]
* [[Labeling Behavior|Labelset-Based]]
* [[2023:Lexical (Classification Method)|Lexical]]
* [[Lexical]]
* [[Visual (Classification Method)|Visual]]
* [[Visual]]


=== Rules-Based===
=== Rules-Based===
Rules-Based Classification works by using classification rules set up on a Document Type. What exactly are these rules? Whatever you tell Grooper they are. To elaborate, let's say you have a Batch of documents that consists of Invoices, Purchase Orders, and Guest Speaker Agreements. A human would be able to tell the differences between these documents simply by reading their titles. To Grooper, they are all just documents covered in pixels, unless we tell Grooper how to tell the three types of document apart. We do this by setting up a Document Type for each of the three documents (Invoice, PO, Guest Speaker Agreement) and telling Grooper what it needs to look for to be able to tell the difference between the three. For example, you can tell Grooper via a Positive Extractor on the Document Type that an invoice will have a label such as an Invoice Number, a Purchase Order will have a Purchase Order number, and a Guest Speaker Agreement will be titled as such.
Documents have ways of identifying themselves to a human reader. This can be through a title, labels, content, or all of the above. We can make use of this information and set it up as rules that Grooper can then use to classify documents. What are these rules? Whatever we tell Grooper they are. Take a look at these two documents:
<br>
<br>
{|cellpadding=10 cellspacing=5
|style="width:40%" valign=top|
[[File:Xavier_Transcript.png]]
|
|
[[File:Direct_Deposit_Form_Blank.png]]
|}
<br>
<br>
If we want Grooper to classify the transcript as a transcript, we can tell it to look for information like student names/ID numbers, class information, or school name. With the direct deposit form, we can have Grooper classify it based on its title. Thus, we have established the rules by which those documents will be classified.


=== Labelset-Based===
=== Labelset-Based===
Labels are a staple of semi-structured documents. Labels can be used to help identify various pieces of information on a document. We can then use those labels for Labelset-Based Classification to help Grooper classify documents. To continue with our previous example, you would naturally expect purchase orders and invoices to have different labels with which they organize their content.
Labels are a great way for humans to organize, categorize, and digest information. The same can be said for labels when it comes to Grooper. Labelset-Based Classification is good for document types where the information is exactly the same, but the labels might be different.
<br>
<br>
{|
For example, look at these two invoices. The information is exactly the same, but the labels that categorize the invoice number are different.
|
[[File:Skyvapor_Invoice_Date_Label.png]]
|-
|
[[File:Syergy_Invoice_No_Label.png]]
|}
<br>
<br>
Of course, Labelset-Based Classification can be used with just about any document that organizes its information using labels.


=== Lexical ===
=== Lexical ===
Unfortunately, not every document is structured, or has labels to help out both humans and Grooper with classification. This is where Lexical Classification comes in. Here we have a Guest Speaker Agreement [[insert image here]]. Really, the only thing that identifies it as such is its title. While there is labeled information (the speaker's name and any organization they're associated with), that is not located on the first page, and could therefore tamper with Classification.
Unfortunately, not every document has labels that make classification easy. You might have two documents with different titles, but whose structure is almost identical, thus ruling out Visual Classification. And if their titles are similar? Well, there goes Rules-Based classification. Luckily, there is the Lexical classification method that relies on the text data within a document. Specifically, it relies on how many times a piece of searchable text data appears within a document, and classifies it accordingly.
<br>
<br>
Look at these two documents here. No labels, only one has a title, and their visual structure is similar. So, what do we do?
{|
[[File:oil_and_gas_lease.png|500x550px]]
|
|-
[[File:Short_Cover_Letter_Example.png|500x550px]]
|}
Well, a cover letter and an oil and gas lease have different content. We would expect to see words like "oil" and "gas" appear more frequently on the lease than on the cover letter. Likewise, we expect words related to employment to appear on a cover letter rather than on an oil and gas lease. Thus, we can use the frequency of certain words to help classify each document.


=== Visual ===
=== Visual ===
Visual Classification is different. Unlike the previous three methods mentioned here, Visual Classification relies upon the structure of the document itself rather than the language present on the document. Take a look at these two documents here. Instead of focusing on labels, titles, or a piece of recurring text, we can have Grooper concentrate on how the pixels are grouped together and classify documents that way.
Visual Classification is different. Unlike the previous three methods mentioned here, Visual Classification relies upon the pixel grouping of the document itself rather than the language present on the document. Take a look at these two documents here. Instead of focusing on labels, titles, or a piece of recurring text, we can have Grooper concentrate on how the pixels are grouped together and classify documents that way.
{|
{|cellpadding=10 cellspacing=5
|style="width:40%" valign=top|
We can tell Grooper that documents structured like this are invoices
We can tell Grooper that documents structured like this are invoices
|-
|
|
[[File:Dos_Mangos_Invoice(1).png|275x315px]]
[[File:Dos_Mangos_Invoice(1).png|275x315px]]
Line 42: Line 82:
|
|
[[File:NDA_Example.png|275x315px]]
[[File:NDA_Example.png|275x315px]]
|-
|
Unfortunately, if our documents are similar in how they group their pixels, then Grooper will have difficulty classifying them, and may even classify them as the same document. Such is the downside to Visual Classification.
|
[[File:Rental_Agreement.png]]
|
[[File:Residential_Lease_Agreement_Sample.png]]
|}
|}
Unfortunately, if our documents are similar in structure, then Grooper will have difficulty classifying them, and may even classify them as the same document. Such is the downside to Visual Classification.
[[insert images of two similar documents]]

Latest revision as of 09:04, 5 August 2025

Overview

Classification is an Activity in Grooper that assigns a Content Type to a Document. While we as humans may be able to classify a document by reading it (or its title, should it have one), to Grooper all documents that come in are unclassified, or "blank". If we want Grooper to know what a Purchase Order is, or be able to tell the difference between a Purchase Order and an invoice, we have to tell it; and we do that through Classification.


Why Classify?

Why is classification necessary? Why does it matter in Grooper? Isn't it enough for humans to look at a document in a document folder and know what it is? No. Why? Well, as a user, you would want properly extracted information, wouldn't you? How else would you expect Grooper to extract an invoice number from an invoice and a patient name from a medical history form when Grooper can't even tell the difference between the two?

Classification Methods

In order to classify a document, you must choose between four different Classification Methods. They are:

  • Rules-Based
  • Labelset-Based
  • Lexical
  • Visual



These methods can be set on the Content Model via the Classification Method property. Which method you choose depends largely on what sort of document you have: its structure and complexity. We will provide a brief overview of each Classification Method here.

For more detailed information about each Classification Method, click the following links:

Rules-Based

Documents have ways of identifying themselves to a human reader. This can be through a title, labels, content, or all of the above. We can make use of this information and set it up as rules that Grooper can then use to classify documents. What are these rules? Whatever we tell Grooper they are. Take a look at these two documents:



If we want Grooper to classify the transcript as a transcript, we can tell it to look for information like student names/ID numbers, class information, or school name. With the direct deposit form, we can have Grooper classify it based on its title. Thus, we have established the rules by which those documents will be classified.
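
The idea of rules can be sketched outside of Grooper in a few lines of Python. This is only an illustration of the concept, not how Grooper implements Rules-Based Classification: the rule table, the regular expressions, and the function name classify_by_rules are all hypothetical, and the Document Type names are borrowed from the transcript and direct deposit form example above.

```python
import re

# Hypothetical rule table (an assumption for illustration, not Grooper's format):
# each Document Type maps to patterns that must ALL be found in the text.
RULES = {
    "Transcript": [
        r"\bstudent\s+(name|id)\b",          # student names / ID numbers
        r"\b(university|college|school)\b",  # school name
    ],
    "Direct Deposit Form": [
        r"\bdirect\s+deposit\b",             # title of the form
    ],
}


def classify_by_rules(text, rules=RULES):
    """Return the first Document Type whose patterns all match, else None."""
    for doc_type, patterns in rules.items():
        if all(re.search(p, text, re.IGNORECASE) for p in patterns):
            return doc_type
    return None  # no rule matched: the document stays unclassified


if __name__ == "__main__":
    print(classify_by_rules("Xavier University\nStudent Name: Jane Doe"))
    print(classify_by_rules("DIRECT DEPOSIT AUTHORIZATION FORM"))
```

In Grooper itself, rules of this kind are configured on each Document Type rather than written as code.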

Labelset-Based

Labels are a great way for humans to organize, categorize, and digest information. The same can be said for labels when it comes to Grooper. Labelset-Based Classification is good for document types where the information is exactly the same, but the labels might be different.

For example, look at these two invoices. The information is exactly the same, but the labels that categorize the invoice number are different.



Of course, Labelset-Based Classification can be used with just about any document that organizes its information using labels.

Lexical

Unfortunately, not every document has labels that make classification easy. You might have two documents with different titles, but whose structure is almost identical, thus ruling out Visual Classification. And if their titles are similar? Well, there goes Rules-Based classification. Luckily, there is the Lexical classification method that relies on the text data within a document. Specifically, it relies on how many times a piece of searchable text data appears within a document, and classifies it accordingly.

Look at these two documents here. No labels, only one has a title, and their visual structure is similar. So, what do we do?

Well, a cover letter and an oil and gas lease have different content. We would expect to see words like "oil" and "gas" appear more frequently on the lease than on the cover letter. Likewise, we expect words related to employment to appear on a cover letter rather than on an oil and gas lease. Thus, we can use the frequency of certain words to help classify each document.
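
The word-frequency idea can be sketched in Python as follows. This is a simplified stand-in and not Grooper's Lexical implementation: the keyword lists are hand-written assumptions (a real classifier would typically learn its features from sample documents), and the function name classify_by_word_frequency is hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword lists, chosen by hand for illustration only.
KEYWORDS = {
    "Oil and Gas Lease": {"oil", "gas", "lease", "lessor", "lessee", "royalty"},
    "Cover Letter": {"position", "experience", "resume", "hiring", "employment"},
}


def classify_by_word_frequency(text, keywords=KEYWORDS):
    """Score each Document Type by how often its keywords occur in the text."""
    # Count every lowercase word in the document.
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    # A type's score is the total number of occurrences of its keywords.
    scores = {
        doc_type: sum(counts[word] for word in words)
        for doc_type, words in keywords.items()
    }
    best = max(scores, key=scores.get)
    # If no keyword appears at all, leave the document unclassified.
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    lease = "Lessor grants to Lessee the right to explore for oil and gas."
    letter = "I am applying for the position. My resume lists my experience."
    print(classify_by_word_frequency(lease))
    print(classify_by_word_frequency(letter))
```

Counting raw occurrences keeps the sketch short; it ignores document length and common words, which a production classifier would need to account for.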

Visual

Visual Classification is different. Unlike the previous three methods mentioned here, Visual Classification relies upon the pixel grouping of the document itself rather than the language present on the document. Take a look at these two documents here. Instead of focusing on labels, titles, or a piece of recurring text, we can have Grooper concentrate on how the pixels are grouped together and classify documents that way.

We can tell Grooper that documents structured like this are invoices.

And documents structured like this are legal documents.

Unfortunately, if our documents are similar in how they group their pixels, then Grooper will have difficulty classifying them, and may even classify them as the same document. Such is the downside to Visual Classification.