2023.1:Mixed Classification (Concept)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


You may download the ZIP(s) below and upload it into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

ABOUT

When classifying a Batch that contains varying types of Documents, you will often need to rely on more than just the Classification Method selected on the Content Model, especially with training-based Classification Methods. Grooper classifies Documents based on a confidence value called a Similarity Score. If "Document Type A" scores higher than "Document Type B", then "Document Type A" wins, regardless of whether that classification is actually correct.

Thankfully, you can configure a rules-based Extractor on a particular Document Type to work in tandem with the training-based Classification Method. This "Mixed Classification" approach still classifies the Batch as a whole, but it also avoids false positives: documents that would have been misclassified are assigned the proper Document Type via the Positive Extractor.

The Problem

In this example, we have a Batch of documents that were classified using the Lexical Classification Method. Everything was classified properly, except for one document in the "Assignment" folder.

  1. Here, we have a document that was supposed to be classified as an 'Assignment' Document Type, but has been misclassified as a 'Memo' Document Type.
  2. This is due to the Similarity Score for Memo coming in at 61%, which is higher than the Assignment score that only came in at 55%. Unfortunately, since the incorrect Document Type scored higher, this document was classified as a Memo.
  3. We could train the Document as the "Assignment" Document Type and reclassify everything once more, but since the only real "Assignment" part of the document is the small, highlighted portion, doing so could cause issues for classification overall.
    • In fact, since this particular document bears a striking resemblance to a memo, training on it might degrade classification of the memos as well. So, training and reclassification are out of the question.
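
To make the failure mode concrete, here is a minimal Python sketch of score-based classification. It is not Grooper code: the function name `classify_by_score` and the dictionary layout are assumptions made for illustration. Only the two scores (61% for Memo, 55% for Assignment) come from the example above.

```python
def classify_by_score(scores):
    """Pick the Document Type with the highest Similarity Score.

    `scores` maps a Document Type name to its score (0.0 to 1.0).
    This is a hypothetical stand-in for a training-based classifier.
    """
    return max(scores, key=scores.get)


# Scores from the misclassified document in this example.
scores = {"Memo": 0.61, "Assignment": 0.55}

# The highest score wins, even though the document is really an Assignment.
print(classify_by_score(scores))
```

The sketch shows that nothing in a highest-score-wins rule checks whether the winner is correct, which is the gap the Positive Extractor fills.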

The Solution: Mixing Classification

So, we have an Assignment that was misclassified as a Memo. Training on the document and reclassifying could cause more issues in the long run, so what's to be done? This is where the concept of Mixing Classification comes in. We'll configure the Positive Extractor on the Document Type and have it work in tandem with the Classification Method so that every Document is properly classified.

  1. Select the Document Type which the problem Document was supposed to be classified as. In our case, that would be the "Assignment" Document Type.
  2. Remain on the Document Type tab.
  3. In addition to the Classification Method chosen on the Content Model, the Document Type has its own Classification section. Here, you can configure both Positive and Negative Extractors. A Positive Extractor identifies data that, when found, classifies a document as this Document Type; a Negative Extractor excludes text data from classification. Here, we'll be configuring the Positive Extractor. Select the hamburger icon at the far right of the property to expand the drop-down menu.
  4. Select List Match.




  1. In Local Entries, enter the title of the Document, "ASSIGNMENT OF OIL AND GAS LEASE"
  2. Because the title wraps onto a second line, Vertical Wrap is enabled on the Properties tab. Without it, the title would not be extracted.
  3. With that done, click OK.
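
The effect of Vertical Wrap can be illustrated with a small Python sketch. This is not how Grooper implements List Match or Vertical Wrap; the function `list_match`, its `vertical_wrap` parameter, and the sample page text are all assumptions chosen to show why a title split across two lines is missed by a line-by-line search.

```python
import re


def list_match(text, entries, vertical_wrap=False):
    """Return the list entries found in `text` (hypothetical helper).

    vertical_wrap=False: each line is searched on its own.
    vertical_wrap=True:  line breaks are treated as spaces, so an entry
                         that wraps onto the next line can still match.
    """
    if vertical_wrap:
        haystacks = [re.sub(r"\s+", " ", text)]
    else:
        haystacks = [re.sub(r"[ \t]+", " ", line) for line in text.splitlines()]
    return [entry for entry in entries if any(entry in h for h in haystacks)]


entries = ["ASSIGNMENT OF OIL AND GAS LEASE"]

# Hypothetical page text: the title wraps after "OIL".
page = "ASSIGNMENT OF OIL\nAND GAS LEASE\n(body text)"

print(list_match(page, entries))                      # no line holds the full title
print(list_match(page, entries, vertical_wrap=True))  # wrapped title is found
```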



With the Positive Extractor configured, save and go back to the Classify Batch Process Step.

The Result

Now, we'll re-test classification on the contents of the Assignment folder, and our misclassified document will be corrected.

  1. Voila! Thanks to the configurations on the Positive Extractor, our Document has now been properly classified as an "Assignment" Document Type.



As you can see, by combining the chosen Classification Method with a Positive Extractor configured on the Document Type, we can properly classify problematic documents that neither the Classification Method nor the Positive Extractor could handle alone. By mixing the rules-based Positive Extractor with the training-based Lexical Classification Method, we have ensured that each document is assigned the correct Document Type.
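
The combined behavior can be sketched in Python as follows. This is an illustration only, not Grooper's implementation: it assumes that a Positive Extractor hit simply overrides the score-based result, and the names `classify_mixed` and `positive_entries` are hypothetical.

```python
def classify_mixed(text, scores, positive_entries):
    """Rules first, scores second (assumed precedence).

    text:             document text, possibly with wrapped lines
    scores:           Document Type name -> Similarity Score
    positive_entries: Document Type name -> list of phrases that
                      positively identify that type
    """
    # Collapse line breaks so a wrapped title still matches.
    normalized = " ".join(text.split())

    # Rules-based pass: any positive match decides the Document Type.
    for doc_type, entries in positive_entries.items():
        if any(entry in normalized for entry in entries):
            return doc_type

    # Training-based fallback: highest Similarity Score wins.
    return max(scores, key=scores.get)


scores = {"Memo": 0.61, "Assignment": 0.55}
positive_entries = {"Assignment": ["ASSIGNMENT OF OIL AND GAS LEASE"]}

# Hypothetical text for the problem document and for an ordinary memo.
problem_doc = "ASSIGNMENT OF OIL\nAND GAS LEASE\n(memo-like body text)"
ordinary_memo = "MEMORANDUM\n(body text)"

print(classify_mixed(problem_doc, scores, positive_entries))
print(classify_mixed(ordinary_memo, scores, positive_entries))
```

Documents without a positive match fall through to the score comparison unchanged, so the rule only affects the documents it was written for.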

Glossary

AND: AND is a Collation Provider option for Data Type extractors. AND returns results only when each of its referenced or child extractors gets at least one hit, thus acting as a logical "AND" operator across multiple extractors.

Batch Process Step: Batch Process Steps are specific actions within a Batch Process sequence. Each Batch Process Step performs an "Activity" specific to some document processing task. These Activities are either "Code Activities" or "Review" activities. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Process Steps are frequently referred to as simply "steps".
  • Because a single Batch Process Step executes a single Activity configuration, they are often referred to by their referenced Activity as well. For example, a "Recognize step".

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Classification Method:

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

Classify: Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Document Type.

  • Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
  • Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.

Combine: Combine is a Collation Provider option for Data Type extractors. Combine combines instances from returned results based on a specified grouping, controlling how extractor results are assembled together for output.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Lexical: "Lexical" is a Classify Method that classifies Batch Folders based on the text content of trained document examples. This is achieved through the statistical analysis of word frequencies that identify Document Types.

List Match: List Match is a Value Extractor designed to return values matching one or more items in a defined list. By default, the List Match extractor does not use or require regular expression, but can be configured to utilize regular expression syntax.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as Content Models, Batch Processes, and profile objects are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Vertical Wrap: