2023:Rules-Based (Classification Method)
Revision as of 16:45, 24 October 2023
WIP: This article is a work-in-progress or was created as a placeholder for testing purposes. It is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly. This tag will be removed upon draft completion.
The Rules-Based Classification Method is one of three methods of classifying documents available to Grooper. This approach uses Data Extractors to find key words, phrases, or other text-based information in order to identify and classify a document (assigning a Document Type to the Document Folder).
The Rules-Based method classifies documents according to "rules" using the Positive Extractor and Negative Extractor properties of Document Type objects in a Content Model.
If an extractor set as the Positive Extractor returns a result on a document, the document is classified as that Document Type. The Negative Extractor works the opposite way: if this extractor returns a result on a document, the document is prevented from being classified as that Document Type. This type of classification is useful when a document's structure is always predictable, or when it has a fixed title heading or form number, and OCR errors are not an issue.
Classification is then performed using the Classify activity, using the extraction rules established by the Positive and Negative Extractor properties of Document Types in a Content Model.
About
Rules-Based classification can be enabled and configured on any Content Model object. To do so, select the Classification Method property and select Rules-Based.

What are you classifying? - Document Types
Classification is all about distinguishing one kind of document from another. The Rules-Based method uses extractors to do this. Positive Extractors positively identify documents of a certain kind. Negative Extractors prevent a document from being identified as a certain kind of document.
This may be obvious, but before you can assign these positive and negative extractor rules, you have to give a name to the type of document you want to classify. In Grooper, we do this by adding Document Type objects to a Content Model.
For example, imagine you have a collection of human resources documents. For each employee, you'll have a variety of different kinds of documents in their HR file, such as a federal W-4 form, their employment application, various documents pertaining to their health insurance enrollment, and more. In order to distinguish those documents from one another (in other words, classify them), you will need to add a Document Type for each kind of document.
Take the four kinds of documents seen here: a federal W-4, an employee data sheet, an FSA enrollment form, and a pension enrollment form.
[Images: Federal W-4 | Employee Data Sheet | FSA Enrollment Form | Pension Enrollment Form]
If we want to classify a Batch of these documents and assign the federal W-4 documents a "Federal W-4" classification and so on, we would need to create a Content Model and add one Document Type for each kind of document. So, the W-4s would get a "Federal W-4" Document Type. The FSAs would get an "FSA" Document Type, and so on.
A Content Model is how we determine the taxonomy of our document set. Taxonomy is just a fancy word for a classification scheme. Zoological taxonomy organizes organisms into a classification scheme, from domain all the way down to species. We do much the same thing with documents and a Content Model.
The whole set of HR documents belongs to the top level in the hierarchy, the Content Model itself. Each individual kind of document is represented by a Document Type, which is the next level down in that hierarchy. Each Document Type is distinct from the others, but still part of the Content Model's scope, just as insects, spiders, and lobsters are distinct from each other but are all part of the "arthropod" zoological phylum.
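The two-level hierarchy described above can be pictured with a minimal Python sketch. This is only an illustration of the idea and is not Grooper code: the `ContentModel` and `DocumentType` classes, the `add_document_type` method, and the "HR Documents" name are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentType:
    """One kind of document, such as "Federal W-4" (illustrative model only)."""
    name: str


@dataclass
class ContentModel:
    """Top level of the taxonomy; it owns the Document Types beneath it."""
    name: str
    document_types: List[DocumentType] = field(default_factory=list)

    def add_document_type(self, name: str) -> DocumentType:
        """Create a Document Type one level below this Content Model."""
        doc_type = DocumentType(name)
        self.document_types.append(doc_type)
        return doc_type


# Hypothetical model for the HR example: one Document Type per kind of document.
hr_model = ContentModel("HR Documents")
for type_name in (
    "Federal W-4",
    "Employee Data Sheet",
    "FSA Enrollment Form",
    "Pension Enrollment Form",
):
    hr_model.add_document_type(type_name)

type_names = [doc_type.name for doc_type in hr_model.document_types]
print(type_names)
```

Each Document Type lives inside exactly one Content Model here, which mirrors the idea that every kind of document is distinct but still within the Content Model's scope.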
How are the documents classified? - Positive and Negative Extractors
The "rules" in the Rules-Based method are determined by the results of the extractors set on the Positive Extractor and Negative Extractor properties of a Document Type.
These properties are found in the Classification properties in the property panel.
Positive Extractor Rules
If the Positive Extractor returns at least one result on a document, the document will be assigned that Document Type. One common approach to Rules-Based classification is "title-matching". Often, a document's title will correspond to the Document Type you want to classify it as.
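For example, the document here is titled "DATA INFORMATION SECTION", which is easily matched by the regular expression DATA INFORMATION SECTION.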
If we create a Data Type returning this title, we can then assign it to the Positive Extractor property of the "Data Information Sheet" Document Type.
When a Batch is classified, Grooper will execute each Document Type's Positive Extractor against the unclassified Document Folder. When a Positive Extractor returns a result on the document, the Document Folder will be assigned the corresponding Document Type (specifically, it will be assigned that Document Type as its Content Type property).
If the Positive Extractor does not return a result, the Document Folder remains unclassified.
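As a rough illustration of title-matching, the sketch below models each positive rule as a regular expression and returns the first Document Type whose pattern matches. It is a simplified stand-in written in Python, not Grooper's implementation: the `POSITIVE_PATTERNS` table, the `classify_by_title` function, the FSA pattern, and the sample page text are all assumptions invented for the example.

```python
import re
from typing import Dict, Optional

# Hypothetical rules: Document Type name -> regular expression for its title.
POSITIVE_PATTERNS: Dict[str, str] = {
    "Data Information Sheet": r"DATA INFORMATION SECTION",
    "FSA Enrollment Form": r"FSA ENROLLMENT",  # invented pattern for illustration
}


def classify_by_title(text: str) -> Optional[str]:
    """Return the first Document Type whose positive pattern matches the text."""
    for doc_type, pattern in POSITIVE_PATTERNS.items():
        if re.search(pattern, text):
            return doc_type
    # No positive rule returned a result, so the document stays unclassified.
    return None


# Invented sample text standing in for a document's recognized text.
page_text = "DATA INFORMATION SECTION\nName: Jane Doe\nDepartment: Finance"
result = classify_by_title(page_text)
unmatched = classify_by_title("Quarterly sales report")
print(result, unmatched)
```

A literal pattern like this only works when the title is recognized exactly, which is why the approach suits documents with fixed titles and clean OCR.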
Negative Extractor Rules
The Negative Extractor property works the opposite way. If the extractor set here produces a result, the Document Folder will be prevented from being assigned the Document Type.
For example, let's say we use a Value Pattern that matches only the characters "W-4" for our Positive Extractor to classify these Federal W-4 documents as a "Federal W-4" Document Type.
This certainly produces results for the Federal W-4 form here. It will correctly classify this document as a "Federal W-4" Document Type. However, this is a very general pattern. If the characters "W-4" are found on any other document that isn't a Federal W-4 form, that document will also be classified as a "Federal W-4" Document Type.
Upon classification, if the Positive Extractor returns a result on a document it shouldn't, it produces a false positive. The document seen here is not a Federal W-4. It's a state W-4 form specific to the state of Iowa. However, since we were so loose with our regular expression pattern, all that had to match to produce a positive result was the characters "W-4".
So, the state form gets classified as a "Federal W-4" Document Type.
However, with a Negative Extractor, if we can match something on a document we know should not be a "Federal W-4", we can use that as a rule to prevent it from being classified as a "Federal W-4" Document Type.
For example, we wouldn't expect to see the web address "www.iowa.gov/tax" on a federal form. But it definitely is on this state W-4 form.
If we can create an extractor to match and return that web address (or any other text unique to this Iowa W-4 that distinguishes it from the Federal W-4), we can assign it to the Negative Extractor of the "Federal W-4" Document Type. Even if the Positive Extractor produces a result, a result from the Negative Extractor overrides the classification. The document will be prevented from being classified as a "Federal W-4" Document Type.
If we classify this Batch with both the Positive and the Negative Extractors configured as described above, we get a different result.
The Negative Extractor prevents the document from being assigned a "Federal W-4" Document Type.
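The interaction between the two properties can be sketched in a few lines of Python. This is a simplified model rather than Grooper's actual logic: the `Rule` class, the `classify` function, and the two sample texts are assumptions invented for the example. Only the "W-4" characters and the "www.iowa.gov/tax" address come from the scenario above.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rule:
    """Illustrative stand-in for one Document Type's classification rules."""
    doc_type: str
    positive: str                   # regex; a match nominates the Document Type
    negative: Optional[str] = None  # regex; a match vetoes the Document Type


def classify(text: str, rules: List[Rule]) -> Optional[str]:
    """Return the first Document Type that matches positively and is not vetoed."""
    for rule in rules:
        if not re.search(rule.positive, text):
            continue  # no positive result for this Document Type
        if rule.negative and re.search(rule.negative, text):
            continue  # the negative result overrides the positive one
        return rule.doc_type
    return None  # the document remains unclassified


# Invented sample texts standing in for the two forms.
federal_text = "Form W-4 Employee's Withholding Certificate"
iowa_text = "IA W-4 Employee Withholding Allowance Certificate www.iowa.gov/tax"

loose_rules = [Rule("Federal W-4", positive=r"W-4")]
rules = [Rule("Federal W-4", positive=r"W-4", negative=r"www\.iowa\.gov/tax")]

print(classify(iowa_text, loose_rules))  # false positive without a negative rule
print(classify(iowa_text, rules))        # vetoed once the negative rule is set
print(classify(federal_text, rules))     # the federal form is unaffected
```

Keeping the veto separate from the positive pattern lets the positive rule stay simple while exceptions are handled one at a time.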
Mixed Classification: Combining Training-Based and Rules-Based Approaches
Furthermore, a rules-based approach can be combined with a training-based approach when using the Lexical Classification Method. The Lexical method uses trained examples of Document Types to classify documents. It uses a TF-IDF algorithm to weight the importance of text features (such as words and phrases) based on these trained examples. However, even when choosing Lexical for the Classification Method, the Positive Extractor and Negative Extractor properties are still present on Document Types.
Generally, the Positive Extractor's result will "win out" over training-based classification results, because the Positive Extractor's confidence result (as a percentage value) will be higher than the document's similarity to the trained examples (as a percentage value) for a Document Type. This way, if you have a value which can be extracted that you know is going to be on a Document Type, you can take advantage of setting a Positive Extractor on the Document Type to classify those documents. For example, document titles are often used as "rules". If you can extract text to match a title to a corresponding Document Type, this is often a quick and easy way to classify a document. But, if that extractor fails for whatever reason (because of bad OCR or a new title not matching the extractor's regex), you have training data which can act as a backup classification method.
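The "win out" behavior can be pictured as choosing the highest score among the candidates. The Python sketch below is a simplified model under stated assumptions, not Grooper's scoring code: the `pick_document_type` function is hypothetical, a positive rule match is assumed to count as a confidence of 1.0, and the similarity scores are supplied as plain numbers rather than computed with TF-IDF.

```python
from typing import Dict, Optional, Set, Tuple

RULE_CONFIDENCE = 1.0  # assumption: a positive rule match counts as 100%


def pick_document_type(
    rule_matches: Set[str],
    similarities: Dict[str, float],
) -> Optional[Tuple[str, float]]:
    """Pick the Document Type with the highest score from rules and training."""
    scores = dict(similarities)
    for doc_type in rule_matches:
        # A rule match raises that Document Type's score to the rule confidence.
        scores[doc_type] = max(scores.get(doc_type, 0.0), RULE_CONFIDENCE)
    if not scores:
        return None  # nothing matched and nothing was trained
    best = max(scores, key=lambda name: scores[name])
    return best, scores[best]


# The rule fires: it outranks a higher trained similarity for another type.
with_rule = pick_document_type(
    {"Federal W-4"},
    {"Federal W-4": 0.55, "FSA Enrollment Form": 0.70},
)

# The rule fails (for example, bad OCR): trained similarity acts as the backup.
without_rule = pick_document_type(
    set(),
    {"Federal W-4": 0.82, "FSA Enrollment Form": 0.40},
)

print(with_rule, without_rule)
```

In the second call no rule result is available, so the trained similarity alone decides the outcome, which is the fallback role described above.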
Many of the best classification strategies involve combining the training-based Lexical method with a rules-based approach.