2023:Rules-Based (Classification Method)

From Grooper Wiki

Revision as of 12:43, 10 May 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

The Rules-Based Classification Method employs "rules" defined on each Document Type to classify Batch Folders. Positive Extractor and Negative Extractor properties are configured for each Document Type to positively or negatively associate a Batch Folder based on predefined criteria. Note that while the Positive and Negative Extractors affect the results of every Classification Method, the Rules-Based method classifies using only these properties and nothing else.

The Rules-Based method classifies documents according to "rules" using the Positive Extractor and Negative Extractor properties of Document Type objects in a Content Model.

If an extractor set as the Positive Extractor returns a result on a document, the document would be classified as that Document Type. The Negative Extractor works the opposite way. If this extractor finds a result on a document, it would be prevented from being classified as that Document Type. This type of classification can be useful if a document's structure is always predictable or has a fixed title heading or form number and OCR errors are not an issue.

Classification is then performed using the Classify activity, using the extraction rules established by the Positive and Negative Extractor properties of Document Types in a Content Model.

Glossary

Batch Folder: Batch Folder objects are defined as container objects within a Batch that are used to represent and organize both folders and pages. They can hold other Batch Folders or Batch Page objects as children. The Batch Folder acts as an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents.

  • Batch Folders are frequently referred to simply as "documents".

Batch: Batch objects are fundamental in Grooper's architecture as they are the containers of documents that get moved through Grooper's workflow mechanisms known as Batch Processes.

Classification Method: A Content Model's Classification Method property determines the technique used for document classification. Classification sorts Batch Folders into categories (called "Document Types"). Grooper's various Classification Methods can utilize text-based pattern matching, machine learning models, or other methodologies to identify and organize documents accurately.

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

Classify: Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Content Type (e.g. a Document Type) using patterns, lexical understanding, or rules as defined by a Content Model.

Content Model: Content Model node objects define the taxonomy of document sets in terms of the Document Types they contain. They also house the Data Elements that appear on each Content Category and Document Type within them. Content Models serve as the root of a Content Type hierarchy and are crucial for organizing the different types of documents that Grooper can recognize and process.

Data Type: Data Type objects hold a collection of child, referenced, and locally defined Data Extractors and settings that manage how multiple (even differing) matches from Data Extractors are consolidated (via Collation) into a result set.

Document Type: Document Type objects represent a distinct type of document, like an invoice or contract. Document Types are created as children of a Content Model or a Content Category and are used to classify individual Batch Folders. Each Document Type in the hierarchy defines the Data Elements and Behaviors that apply to Batch Folders of that specific classification.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Find Barcode: Find Barcode is an Extractor Type that searches for and returns barcode values previously stored in a Batch Folder or Batch Page's layout data.

Note: Find Barcode differs slightly from Read Barcode. Read Barcode performs barcode recognition when the extractor executes. Find Barcode can only look up barcode data stored in the document or page's layout data. Find Barcode runs quicker than Read Barcode, but barcode values must have previously been collected in the Batch Process by the Image Processing or Recognize activities.

Labeled OMR: Labeled OMR is an Extractor Type used to output OMR checkbox labels. It determines whether labeled checkboxes are checked or not. If checked, it outputs the label(s) or a Boolean true/false value as the result.

Lexical: The Lexical Classification Method classifies Batch Folders based on the text content of trained document examples. This is achieved through the statistical analysis of word frequencies that identify Document Types.

Node Tree: The Node Tree is the hierarchical list of Grooper node objects found in the left panel in the Design Page. It is the basis for navigation and creation in the Design Page.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized so that it can be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.

Read Barcode: Read Barcode is an Extractor Type that uses barcode recognition technology to read and extract values from barcodes found in the document content.

Note: Read Barcode differs slightly from Find Barcode. Read Barcode performs barcode recognition when the extractor executes. Find Barcode can only look up barcode data stored in the document or page's layout data. Find Barcode runs quicker than Read Barcode, but barcode values must have previously been collected in the Batch Process by the Image Processing or Recognize activities.

Reference: Reference is an Extractor Type used to reference an external extractor object within a Grooper property configuration. This allows users to create re-usable extractors and use the more complex Data Type and Field Class extractors throughout Grooper.

Rules-Based: The Rules-Based Classification Method employs "rules" defined on each Document Type to classify Batch Folders. Positive Extractor and Negative Extractor properties are configured for each Document Type to positively or negatively associate a Batch Folder based on predefined criteria.

Note that while the Positive and Negative Extractors affect the results of every Classification Method, the Rules-Based method classifies using only these properties and nothing else.

TF-IDF: TF-IDF stands for term frequency-inverse document frequency. It is a statistical calculation intended to reflect how important a word is to a document within a document set (or "corpus"). It is how Grooper uses machine learning for training-based document classification (via the Lexical method) and data extraction (via the Field Class extractor).

Zonal OMR: Zonal OMR is an Extractor Type that reads one or more OMR checkboxes using manually-configured zones. The zone may be optionally fixed on the page or anchored to a static text value (such as a label).

BE AWARE: Zonal OMR is outdated compared to Labeled OMR and Ordered OMR. It requires the most manual setup of any OMR extractor to configure. Use this as a last resort when other OMR extractor options have been exhausted.

About

Rules-Based classification can be enabled and configured on any Content Model object. To do so, select the Classification Method property and select Rules-Based.

What are you classifying? - Document Types

Classification is all about distinguishing one kind of document from another. The Rules-Based method uses extractors to do this. Positive Extractors positively identify documents of a certain kind. Negative Extractors prevent a document from being identified as a certain kind of document.

This may be obvious, but before you can assign these positive and negative extractor rules, you have to give a name to that type of document you're wanting to classify. In Grooper, we do this by adding Document Type objects to a Content Model.

For example, imagine you have a collection of human resources documents. For each employee, you'll have a variety of different kinds of documents in their HR file, such as a federal W-4 form, their employment application, various documents pertaining to their health insurance enrollment, and more. In order to distinguish those documents from one another (in other words, classify them), you will need to add a Document Type for each kind of document.

Take the four kinds of documents seen here: a federal W-4, an employee data sheet, an FSA enrollment form, and a pension enrollment form.

Federal W-4 | Employee Data Sheet | FSA Enrollment Form | Pension Enrollment Form


If we want to classify a Batch of these documents and assign the federal W-4 documents a "Federal W-4" classification and so on, we would need to create a Content Model and add one Document Type for each kind of document. So, the W-4s would get a "W-4" Document Type. The FSAs would get an "FSA" Document Type, and so on.

A Content Model is how we determine the taxonomy of our document set. Taxonomy is just a fancy word for a classification scheme. Zoological taxonomy organizes organisms into a classification scheme, from domain all the way down to species. We do much the same thing with documents and a Content Model.

The whole set of HR documents belongs to the top level in the hierarchy, the Content Model itself. Each individual kind of document is represented by a Document Type, which is the next level down in that hierarchy. Each one is distinct from the others, but still part of the Content Model's scope, just as insects, spiders, and lobsters are distinct from each other but are all part of the "arthropod" zoological phylum.
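The two-level hierarchy described above can be pictured as a small data structure. This is only an illustrative sketch: the `ContentModel` and `DocumentType` classes and the "HR Documents" name are hypothetical stand-ins, not Grooper's actual object model or API.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentType:
    """One kind of document, e.g. a W-4 (hypothetical stand-in class)."""
    name: str


@dataclass
class ContentModel:
    """Top of the hierarchy; holds the Document Types in its scope."""
    name: str
    document_types: list[DocumentType] = field(default_factory=list)

    def add_document_type(self, name: str) -> DocumentType:
        # Each new Document Type becomes a child of this Content Model.
        doc_type = DocumentType(name)
        self.document_types.append(doc_type)
        return doc_type


# "HR Documents" is an assumed name for the example Content Model.
hr_model = ContentModel("HR Documents")
for type_name in ("Federal W-4", "Employee Data Sheet",
                  "FSA Enrollment Form", "Pension Enrollment Form"):
    hr_model.add_document_type(type_name)

print([dt.name for dt in hr_model.document_types])
```

Each of the four sample documents ends up with exactly one Document Type under the single Content Model, which mirrors the one-type-per-kind setup described here.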

How are the documents classified? - Positive and Negative Extractors

The "rules" in the Rules-Based method are determined by extraction results set on the Positive and Negative Extractor properties of a Document Type.

  1. With a Document Type selected in the Node Tree...
  2. These properties will be found in the Classification properties in the property panel.

Positive Extractor Rules

If the Positive Extractor returns at least one result on a document, it will be assigned the Document Type. One common approach to Rules-Based classification is "title-matching". Often, a document's title will correspond to what Document Type you want to classify it as.


For example, the document here is titled "DATA INFORMATION SECTION", which is easily matched by the regular expression DATA INFORMATION SECTION.

If we create a Data Type returning this title, we can then assign it to the Positive Extractor property of the "Data Information Sheet" Document Type.

FYI

The Positive Extractor can be set to a variety of extraction options, including Reference, Text Pattern, Read Zone, Find Barcode, Read Barcode, Labeled OMR, Ordered OMR, and Zonal OMR.

While using the Reference option, pointing to a Data Type in the Node Tree, is the most common configuration, any extractor returning a result will positively classify the document.

When a Batch is classified, Grooper will execute each Document Type's Positive Extractor against the unclassified Document Folder. When a Positive Extractor returns a result on the document, the Document Folder will be assigned the corresponding Document Type. Specifically, that Document Type is set as the folder's Content Type property.


For example, all the documents labeled "Data Information Sheet" here returned a value matching the "DATA INFORMATION SECTION" title our extractor located. All the documents labeled "Folder" are unclassified documents. Since the "Data Information Sheet" Document Type's Positive Extractor did not return a result on them, they were not assigned the "Data Information Sheet" Document Type. (We also did not assign Positive Extractors to any of the other Document Types.)


To put it more simply:

  1. If the Positive Extractor returns a result, the Document Folder is classified as the Document Type.
  2. If the Positive Extractor does not return a result, the Document Folder remains unclassified.
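The two rules above can be sketched in a few lines of Python. This is a minimal illustration using plain regular expressions, not Grooper's implementation: the `POSITIVE_RULES` table and `classify` function are hypothetical names, and returning the first matching Document Type is an assumption, since the text does not say what happens when several Positive Extractors match.

```python
import re

# Hypothetical rule table: Document Type name -> positive pattern.
POSITIVE_RULES = {
    "Data Information Sheet": re.compile(r"DATA INFORMATION SECTION"),
}


def classify(text):
    """Return the first Document Type whose positive pattern matches, else None."""
    for doc_type, pattern in POSITIVE_RULES.items():
        if pattern.search(text):
            return doc_type  # rule 1: a result classifies the document
    return None  # rule 2: no result, so the document stays unclassified


print(classify("DATA INFORMATION SECTION\nEmployee name: ..."))
print(classify("Pension Enrollment Form"))
```

The first call finds the title and returns "Data Information Sheet"; the second matches no rule and returns `None`, the analogue of a document left as an unclassified "Folder".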

Negative Extractor Rules

The Negative Extractor property works the opposite way. If the extractor set here produces a result, the Document Folder will be prevented from being assigned the Document Type.

For example, let's say we use the following Value Pattern for our Positive Extractor to classify these Federal W-4 documents as a "Federal W-4" Document Type: W-4

This certainly produces results for the Federal W-4 form here, and it will correctly classify this document as a "Federal W-4" Document Type. However, this is a very general pattern. If the characters "W-4" are found on any other document that isn't a Federal W-4 form, that document will also be classified as a "Federal W-4" Document Type.

Upon classification, if the Positive Extractor returns a result on a document it shouldn't, it's going to produce a false positive result.

The document seen here is not a Federal W-4. It's a state W-4 form specific to the state of Iowa. However, since we were so loose with our regular expression pattern, all that had to match to produce a positive result was the characters "W-4".

  1. Those characters are sure enough on this document.
  2. So, it gets classified as a "Federal W-4" Document Type.

However, with a Negative Extractor, if we can match something on this known document that shouldn't be a "Federal W-4", we can point to that as a rule to prevent it from being classified as a "Federal W-4" Document Type.

For example, we wouldn't expect to see the web address "www.iowa.gov/tax" on a federal form. But it definitely is on this state W-4 form.

If we can create an extractor to match and return that web address (or any other text unique to this Iowa W-4 that distinguishes it from the Federal W-4), we can assign it to the Negative Extractor of the "Federal W-4" Document Type.

Even if the Positive Extractor produces a result, a result from the Negative Extractor overrides it. The document will be prevented from being classified as a "Federal W-4" Document Type.

If we classify this Batch with both the Positive and the Negative Extractors configured as described above, we get a different result.

  1. Even though the "Federal W-4" Document Type's Positive Extractor returns a result here, matching "W-4"
  2. The Negative Extractor returns a result here, matching "www.iowa.gov/tax"
  3. The Negative Extractor prevents the document from being assigned a "Federal W-4" Document Type.
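The override behavior in these three steps can be sketched as follows. This is an illustrative model only, using plain regular expressions in place of Grooper extractors; the `Rule` class, the `matches` function, and the sample text strings are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    """Hypothetical pairing of a Document Type with its two patterns."""
    doc_type: str
    positive: re.Pattern
    negative: Optional[re.Pattern] = None


FEDERAL_W4 = Rule(
    doc_type="Federal W-4",
    positive=re.compile(r"W-4"),
    negative=re.compile(r"www\.iowa\.gov/tax"),
)


def matches(rule, text):
    """True only if the positive pattern hits and the negative one does not."""
    # A negative hit vetoes the Document Type, whatever the positive result.
    if rule.negative is not None and rule.negative.search(text):
        return False
    return rule.positive.search(text) is not None


print(matches(FEDERAL_W4, "Form W-4 Employee's Withholding Certificate"))
print(matches(FEDERAL_W4, "Iowa W-4 ... www.iowa.gov/tax"))
```

The federal form matches only the positive pattern, so it qualifies. The Iowa form matches both patterns, so the negative pattern blocks the "Federal W-4" assignment.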


Mixed Classification: Combining Training-Based and Rules-Based Approaches

Furthermore, a rules-based approach can be combined with a training-based approach when using the Lexical Classification Method. The Lexical method uses trained examples of Document Types to classify documents. It uses a TF-IDF algorithm to weight the importance of text features (such as words and phrases) based on these trained examples. However, even when choosing Lexical for the Classification Method, the Positive Extractor and Negative Extractor properties are still present on Document Types.
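For reference, one common textbook form of the TF-IDF weight is shown below. This page does not state which variant Grooper uses, so treat it as illustrative only; the symbols t, d, and N are introduced here for the example.

```latex
\[
\operatorname{tfidf}(t, d) = \operatorname{tf}(t, d) \cdot \log\frac{N}{\operatorname{df}(t)}
\]
```

Here tf(t, d) is the number of times term t occurs in document d, N is the number of documents in the trained set, and df(t) is the number of those documents that contain t. A term that appears in every document gets a weight of zero (since log 1 = 0), while a term concentrated in a few documents is weighted more heavily.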

Generally, the Positive Extractor's result will "win out" over training-based classification results, because the Positive Extractor's confidence result (as a percentage value) will be higher than the document's similarity to the trained examples (as a percentage value) for a Document Type. This way, if you can extract a value that you know will appear on documents of a given Document Type, you can set a Positive Extractor on that Document Type to classify them. For example, document titles are often used as "rules". If you can extract text to match a title to a corresponding Document Type, this is often a quick and easy way to classify a document. But if that extractor fails for whatever reason (because of bad OCR or a new title not matching the extractor's regex), you have training data which can act as a backup classification method.
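The "win out" and "backup" behavior can be sketched as a comparison of scores. This is a simplified assumption about how the two sources combine, not Grooper's actual scoring: the `pick_document_type` function and the sample scores are hypothetical, and any minimum-similarity threshold the product may apply is not modeled.

```python
def pick_document_type(rule_hits, lexical_scores):
    """Pick the candidate with the highest score from either source.

    rule_hits: Document Type -> confidence (0.0-1.0) of a Positive Extractor match.
    lexical_scores: Document Type -> similarity (0.0-1.0) to trained examples.
    """
    candidates = dict(lexical_scores)
    for doc_type, confidence in rule_hits.items():
        # Keep whichever score is higher for a type seen by both sources.
        candidates[doc_type] = max(confidence, candidates.get(doc_type, 0.0))
    if not candidates:
        return None  # nothing matched and nothing was trained
    return max(candidates, key=candidates.get)


# Made-up similarity scores for illustration.
trained = {"Employee Data Sheet": 0.62, "Federal W-4": 0.55}

# Title rule fires with full confidence: the rule wins.
print(pick_document_type({"Federal W-4": 1.0}, trained))

# Rule fails (e.g. bad OCR): training data acts as the backup.
print(pick_document_type({}, trained))
```

With the rule hit present, "Federal W-4" is chosen even though its trained similarity is lower. Without it, the highest trained similarity ("Employee Data Sheet") decides.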

Many of the best classification strategies involve combining the training-based Lexical method with a rules-based approach.