Rules Based (Classify Method): Difference between revisions

Revision as of 11:40, 13 October 2020

The Rules-Based Classification Method is one of three methods of classifying documents available to Grooper. This method classifies documents according to their text content, obtained from OCR or extracted native PDF text (via the Recognize activity). It does so by applying "rules" defined with the Positive Extractor and Negative Extractor properties of Document Type objects in a Content Model.

If an extractor set as the Positive Extractor returns a result on a document, the document is classified as that Document Type. The Negative Extractor works the opposite way. If this extractor finds a result on a document, the document is prevented from being classified as that Document Type. This type of classification can be useful when a document's structure is always predictable, or when it has a fixed title heading or form number, and OCR errors are not an issue.

Classification is then performed using the Classify activity, using the extraction rules established by the Positive and Negative Extractor properties of Document Types in a Content Model.
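
The positive/negative rule logic described above can be sketched in a few lines of Python. This is only a conceptual model, not Grooper's implementation: Grooper is configured through its own interface, and the names here (DocumentType, positive_pattern, negative_pattern, classify) are hypothetical. Plain regular expressions stand in for extractors, and returning the first matching Document Type is an assumption made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DocumentType:
    """Hypothetical stand-in for a Document Type and its two rule properties."""
    name: str
    positive_pattern: str                   # plays the role of the Positive Extractor
    negative_pattern: Optional[str] = None  # plays the role of the Negative Extractor


def classify(text: str, document_types: List[DocumentType]) -> Optional[str]:
    """Return the name of the first Document Type whose rules accept the text."""
    for doc_type in document_types:
        # A negative hit prevents classification as this Document Type.
        if doc_type.negative_pattern and re.search(
            doc_type.negative_pattern, text, re.IGNORECASE
        ):
            continue
        # A positive hit classifies the document as this Document Type.
        if re.search(doc_type.positive_pattern, text, re.IGNORECASE):
            return doc_type.name
    return None  # no rule matched, so the document stays unclassified


# Illustrative rules only; real extractors would be tuned to the documents.
content_model = [
    DocumentType("W-4", r"Form\s+W-4", negative_pattern=r"W-4 instructions?"),
    DocumentType("FSA", r"Flexible Spending Account"),
]

print(classify("Form W-4 Employee's Withholding Certificate", content_model))
print(classify("Form W-4 instructions for employers", content_model))
```

In this sketch the negative rule is checked first, so a document that matches both patterns is never assigned that Document Type, which mirrors the "prevented from being classified" behavior described above.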

About

Rules-Based classification can be enabled and configured on any Content Model object. To do so, select the Classification Method property and select Rules-Based.

What are you classifying? - Document Types

Classification is all about distinguishing one kind of document from another. The Rules-Based method uses extractors to do this. Positive Extractors positively identify documents of a certain kind. Negative Extractors prevent a document from being identified as a certain kind of document.

This may be obvious, but before you can assign these positive and negative extractor rules, you have to give a name to the type of document you want to classify. In Grooper, we do this by adding Document Type objects to a Content Model.

For example, imagine you have a collection of human resources documents. For each employee, you'll have a variety of different kinds of documents in their HR file, such as a federal W-4 form, their employment application, various documents pertaining to their health insurance enrollment, and more. In order to distinguish those documents from one another (in other words, classify them), you will need to add a Document Type for each kind of document.

Take the four kinds of documents seen here: a federal W-4, an employee data sheet, an FSA enrollment form, and a pension enrollment form.

Federal W-4	Employee Data Sheet	FSA Enrollment Form	Pension Enrollment Form

If we want to classify a Batch of these documents and assign the federal W-4 documents a "Federal W-4" classification and so on, we would need to create a Content Model and add one Document Type for each kind of document. So, the W-4s would get a "W-4" Document Type. The FSAs would get an "FSA" Document Type, and so on.

A Content Model is how we determine the taxonomy of our document set. Taxonomy is just a fancy word for a classification scheme. Zoological taxonomy organizes organisms into a classification scheme, from domain all the way down to species. We do much the same thing with documents and a Content Model.

The whole set of HR documents belongs to the top level in the hierarchy, the Content Model itself. Each individual kind of document is represented by a Document Type, which sits at the next level down in that hierarchy. Each one is distinct from the others, but still part of the Content Model's scope, just as insects, spiders, and lobsters are distinct from each other but are all part of the "arthropod" zoological phylum.

How are the documents classified? - Positive and Negative Extractors

The "rules" in the Rules-Based method are determined by extraction results set on the Positive and Negative Extractor properties of a Document Type.

@@ Line 1: / Line 1: @@
-This method classifies documents according to rules set up on the [[Document Type]] objects.  A [[Positive Extractor]] will be set to classify documents as the [[Document Type]].  Optionally, a [[Negative Extractor]] can be set to exclude documents from being classified as the [[Document Type]].  This type of classification can be useful if a document's structure is always predictable or has a fixed title heading or form number and [[OCR]] errors are not an issue.
+<blockquote style="font-size:14pt">
+The ''Rules-Based'' '''''[[Classification Method]]''''' is one of three methods of classifying documents available to Grooper.  This method classifies documents according to their text content, obtained from [[OCR]] or extracted native PDF text (via the [[Recognize]] activity).  It does so by applying "rules" defined with the '''''Positive Extractor''''' and '''''Negative Extractor''''' properties of '''[[Document Type]]''' objects in a '''[[Content Model]]'''.
+</blockquote>
+If an extractor set as the '''''Positive Extractor''''' returns a result on a document, the document is classified as that '''Document Type'''.  The '''''Negative Extractor''''' works the opposite way.  If this extractor finds a result on a document, the document is ''prevented'' from being classified as that '''Document Type'''.  This type of classification can be useful when a document's structure is always predictable, or when it has a fixed title heading or form number, and [[OCR]] errors are not an issue.
+Classification is then performed using the '''[[Classify]]''' activity, using the extraction rules established by the '''''Positive''''' and '''''Negative Extractor''''' properties of '''Document Types''' in a '''Content Model'''.
+== About ==
+''Rules-Based'' classification can be enabled and configured on any '''Content Model''' object.  To do so, select the '''''Classification Method''''' property and select ''Rules-Based''.
+[[File:Rules-based-about-01.png|center|1000px]]
+=== What are you classifying? - Document Types ===
+Classification is all about distinguishing one kind of document from another.  The ''Rules-Based'' method uses extractors to do this.  '''''Positive Extractors''''' positively identify documents of a certain kind.  '''''Negative Extractors''''' prevent a document from being identified as a certain kind of document.
+This may be obvious, but before you can assign these positive and negative extractor rules, you have to give a ''name'' to the type of document you want to classify.  In Grooper, we do this by adding '''Document Type''' objects to a '''Content Model'''.
+For example, imagine you have a collection of human resources documents.  For each employee, you'll have a variety of different kinds of documents in their HR file, such as a federal W-4 form, their employment application, various documents pertaining to their health insurance enrollment, and more.  In order to distinguish those documents from one another (in other words, classify them), you will need to add a '''Document Type''' for each kind of document.
+Take the four kinds of documents seen here:  a federal W-4, an employee data sheet, an FSA enrollment form, and a pension enrollment form.
+{|cellpadding=10 cellspacing=5
+|-style="text-align:center"
+|Federal W-4||Employee Data Sheet||FSA Enrollment Form||Pension Enrollment Form
+|-
+|[[File:Lexical-classification-w4.png]]||[[File:Lexical-classification-datasheet.png]]||[[File:Lexical-classification-fsa.png]]||[[File:Lexical-classification-pension.png]]
+|}
+If we want to classify a '''[[Batch]]''' of these documents and assign the federal W-4 documents a "Federal W-4" classification and so on, we would need to create a '''Content Model''' and add one '''Document Type''' for each kind of document.  So, the W-4s would get a "W-4" '''Document Type'''.  The FSAs would get an "FSA" '''Document Type''', and so on.
+{|cellpadding=10 cellspacing=5
+|style="width:25%"|
+[[File:Lexical-classification-content-model.png]]
+|valign=top|
+A '''Content Model''' is how we determine the ''taxonomy'' of our document set.  Taxonomy is just a fancy word for a classification scheme.  Zoological taxonomy organizes organisms into a classification scheme, from domain all the way down to species.  We do much the same thing with documents and a '''Content Model'''.
+The whole set of HR documents belongs to the top level in the hierarchy, the '''Content Model''' itself.  Each individual kind of document is represented by a '''Document Type''', which sits at the next level down in that hierarchy.  Each one is distinct from the others, but still part of the '''Content Model's''' scope, just as insects, spiders, and lobsters are distinct from each other but are all part of the "arthropod" zoological phylum.
+|}
+=== How are the documents classified? - Positive and Negative Extractors ===
+The "rules" in the ''Rules-Based'' method are determined by extraction results set on the '''''Positive''''' and '''''Negative Extractor''''' properties of a '''Document Type'''.