2023:Training Batch (Concept): Difference between revisions

From Grooper Wiki

Revision as of 10:33, 27 August 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

This is a snippet of the Grooper Design Studio UI showing the Training Set batch.

The Training Batch is a special Batch created when training document examples using the Lexical classification method. The Training Batch serves two purposes: (1) It is a Batch that holds all previously trained Batch Folders. Designers can go to this Batch to view these documents and copy and paste them into other Batches if needed. (2) Batch Folders in the Training Batch are used to re-train the Content Model's classification data when the Rebuild Training command is executed.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains a Project with resources used in examples throughout this article. The second contains one or more Batches of sample documents.

About

During the development and training of TF-IDF Classification in a Grooper Content Model, it can be challenging to keep track of all of the samples that are used during training. In previous versions, each trained sample was stored under each content type in the Grooper Design Studio node tree. In 2.9, the trained samples are stored both under each content type and in the Training Set batch.


How To

Following is an example of how to perform TF-IDF classification that creates the Training Set batch. In the example content model, there are five different content types from three different batches.

Some of the tabs in this tutorial are longer than others. Please scroll to the bottom of each step's tab before going to the next step.

Prerequisites

These steps assume you have already created a content model with Lexical set as the Classification Method and the appropriate Text Feature Extractor selected. In the example content model, this property is set to Words(Stemmed).

Train Content Types

  1. You will need to create a Batch Process with a "Classify" Batch Process Step.
  2. Go to the "Classification Tester" tab.
  3. Right-click on the folder you wish to train and hover over "Classification".
  4. Click on "Train As..." to train the document.

Repeat these steps for the remaining Content Types. In the example Content Model provided, train all five Content Types from all three example batches.

Review the Training Set batch

As you train your content types, you will see a Training Set batch begin to populate under the Local Resources folder.
A Grooper Designer can review and keep track of all of the documents that have been used for TF-IDF Classification training. As the development cycle of Classification continues and more content types are trained, the Grooper Designer now has a single place to review, test and perform regression testing for Classification.


It is important to understand that the Training Set is not tied to the actual TF-IDF Weightings that are associated with the Content Type or Content Category. Purging the training from a Content Model does not delete any of the documents in the Training Set. Conversely, deleting a document from the Training Set does not remove or purge any TF-IDF Weightings from a Content Type or Content Category.
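
The independence described above can be pictured with a small toy model. This is only a conceptual sketch in Python, not Grooper's implementation or API: the class `ToyContentModel`, its method names, and the use of plain word sets in place of real TF-IDF Weightings are all assumptions made for illustration.

```python
class ToyContentModel:
    """Hypothetical stand-in for a Content Model; not Grooper's API."""

    def __init__(self):
        self.training_batch = {}  # document name -> (content type, text)
        self.weightings = {}      # content type -> set of known words (simplified)

    def train_as(self, name, content_type, text):
        # Training stores the document AND updates the weightings.
        self.training_batch[name] = (content_type, text)
        self.weightings.setdefault(content_type, set()).update(text.split())

    def purge_training(self):
        # Clears the weightings only; stored documents are untouched.
        self.weightings = {}

    def delete_from_training_batch(self, name):
        # Removes the stored document only; weightings are untouched.
        del self.training_batch[name]

    def rebuild_training(self):
        # Recomputes weightings from the documents still in the batch.
        self.weightings = {}
        for content_type, text in self.training_batch.values():
            self.weightings.setdefault(content_type, set()).update(text.split())


model = ToyContentModel()
model.train_as("doc1", "Invoice", "invoice total due")
model.train_as("doc2", "Purchase Order", "purchase order quantity")

model.delete_from_training_batch("doc2")  # "Purchase Order" weightings remain
model.purge_training()                    # "doc1" is still in the training batch
model.rebuild_training()                  # weightings rebuilt from "doc1" only
```

In this sketch the two stores only come back into agreement when `rebuild_training` runs, which mirrors the role the article gives to rebuilding training from the stored documents.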


Glossary

Batch Process Step: Batch Process Steps are specific actions within a Batch Process sequence. Each Batch Process Step performs an "Activity" specific to some document processing task. Each Activity is either a "Code Activity" or a "Review" activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Process Steps are frequently referred to as simply "steps".
  • Because a single Batch Process Step executes a single Activity configuration, they are often referred to by their referenced Activity as well. For example, a "Recognize step".

Batch Process: Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a Batch. Each step is comprised of a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps, which are added as child nodes.
  • A Batch Process is often referred to as simply a "process".

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Classification Method:

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

Classify: Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Document Type.

  • Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
  • Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.

Content Category: A Content Category is a container for other Content Category or Document Type nodes in a Content Model. Content Categories are often used simply as organizational buckets for Content Models with large numbers of Document Types. However, Content Categories are also necessary to create branches in a Content Model's classification taxonomy, allowing for more complex Data Element inheritance and Behavior inheritance.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Content Type: Content Types are a class of node types used to classify Batch Folders. They represent categories of documents (Content Models and Content Categories) or distinct types of documents (Document Types). Content Types serve an important role in defining Data Elements and Behaviors that apply to a document.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Form Type: Form Types represent trained variations of a Document Type. These nodes store machine learning training data for Lexical and Visual document classification methods.

Lexical: "Lexical" is a Classify Method that classifies Batch Folders based on the text content of trained document examples. This is achieved through the statistical analysis of word frequencies that identify Document Types.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as Content Models, Batch Processes, and profile objects are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Review: Review is an Activity that allows user-attended review of Grooper's results. This allows human operators to validate processed Batch Page and Batch Folder content using specialized user interfaces called "Viewers". Different kinds of Viewers assist users in reviewing Grooper's image processing, document classification, and data extraction, and in operating document scanners.

TF-IDF: TF-IDF stands for term frequency-inverse document frequency. It is a statistical calculation intended to reflect how important a word is to a document within a document set (or "corpus"). It is how Grooper uses machine learning for training-based document classification (via the Lexical method) and data extraction (via the Field Class extractor).
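
To make the calculation concrete, here is a minimal Python sketch of the textbook TF-IDF formula (term count divided by document length, multiplied by the log of total documents over documents containing the term). It is an illustration only: the function name `tf_idf` and the sample documents are hypothetical, and Grooper's own weighting, tokenization, and stemming may differ.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: weight} dict per document in corpus.

    corpus is a list of token lists. Textbook definition:
    tf = term count / document length, idf = log(N / document frequency).
    """
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample documents, already split into words.
docs = [
    "invoice total amount due invoice".split(),
    "purchase order total quantity".split(),
    "remittance total paid".split(),
]
scores = tf_idf(docs)
```

In this sample, "total" appears in every document, so its inverse document frequency is log(1) = 0 and it carries no weight, while "invoice" appears twice in the first document and nowhere else, so it scores highest there. That is the sense in which the calculation reflects how important a word is to one document within the corpus.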

Training Batch: The Training Batch is a special Batch created when training document examples using the Lexical classification method. The Training Batch serves two purposes: (1) It is a Batch that holds all previously trained Batch Folders. Designers can go to this Batch to view these documents and copy and paste them into other Batches if needed. (2) Batch Folders in the Training Batch are used to re-train the Content Model's classification data when the Rebuild Training command is executed.