Training Batch (Concept): Difference between revisions

From Grooper Wiki
 
{{Migrated}}
[[File:Training Batch01.PNG|thumb|275px|This is a snippet of the '''Grooper Design Studio UI''' showing the '''Training Set''' batch.]]
{{2023:{{PAGENAME}}}}
 
<blockquote style="font-size:14pt">
The '''Training Set''' batch is a more convenient way to work with all of the samples a Content Model has been trained against.
</blockquote>
</p><br/>
A '''Content Model''' and accompanying set of '''Batches''' can be found by '''[[:Media:Training_Batch_Example.zip|following this link]]''' and downloading the provided file. Downloading the file is not required to understand this article, but it can be helpful for following along with the steps. ''This file was exported from and is meant for use in Grooper 2.9.''
 
==About==
During the development and training of Classification of a Grooper Content Model, it can be challenging to keep track of all of the samples you have trained TF-IDF against.  In previous versions, each trained sample was stored under each content type in the Grooper Design Studio node tree.  In 2.9, the trained samples are stored both under each content type and in the '''Training Set''' batch.
 
==How To==
{|
| style="padding:25px; vertical-align:top" |
Following is an example of how to perform TF-IDF classification that creates the '''Training Set''' batch. This example uses five different content types from three different batches. Format A and B follow a similar enough structure and will not use an override to extract. Format C is different enough that it will override the default extractor to get its data.
|| [[File:data_element_overrides_001.gif]]
|}
 
{|cellpadding="10" cellspacing="5"
|-style="background-color:#f89420; color:white"
|style="font-size:14pt"|'''!'''||Some of the tabs in this tutorial are longer than the others. Please scroll to the bottom of each step's tab before going to the next step.
|}
 
<tabs style="margin:20px">
<tab name="Prerequisites" style="margin:25px">
====Understanding the Forms====
{|
| style="padding:25px; vertical-align:top" |
In the image on the right you can see that Format A and Format B have values that can be captured with simple ''key-value pair'' extractors. In fact, the '''Value Extractor''' '''Data Type''' for the '''Value 1''' '''Data Field''' is simply referencing two different extractors, each in either a horizontal or vertical layout. This one extractor is successfully extracting values for both Format A and Format B, but it fails on Format C because that form is using OMR boxes instead of YES/NO values.
|| [[File:Training Batch02.PNG]]
|}
</tab>
<tab name="Setting up the Override" style="margin:25px">
====Setting up the Override====
{| class="wikitable"
| style="padding:25px" |
Setting up a '''Data Element Override''' is quite simple.<br/>
1. Select a '''Content Type''', in this case, a '''Document Type'''.<br/>
*Yes, '''Data Element Overrides''' can be applied to '''Content Categories'''.
2. Select the '''Data Element Overrides''' tab.<br/>
3. Select a '''Data Element''' you want to set overrides for, in this case a '''Data Field'''.
*Note that '''Data Elements''' that have had properties overridden will be underlined.
|| [[File:data_element_overrides_003a.png|1000px]]
|-
| style="padding:25px" |
4. Select the '''Property Overrides''' tab.<br/>
5. Adjust properties. Any and all properties available to the '''Data Element''' can be changed here. The default settings reflect those of the original '''Data Element'''. Changing any property is considered ''overriding'' the property as established on the original '''Data Element'''.
*In this example the properties were adjusted to allow for the reading of the OMR box, as opposed to the default setup which leveraged two different ''key-value pair'' extractors.<br/>
6. Click the '''Test Extraction''' button to see the results.
|| [[File:data_element_overrides_003b.png|1000px]]
|}
</tab>
<tab name="Testing the Results" style="margin:25px">
====Testing the Results====
{|
| style="padding:25px; vertical-align:top" |
The crux of this all is that you can now use the main '''Data Model''', with the same established '''Data Elements''', and get results from all the forms.<br/>
1. Click on the '''Data Model'''.<br/>
2. Click on the document you want to extract from.<br/>
3. Click '''Test Extraction'''<br/>
*Rinse and repeat for the other documents. Document Format C will now successfully extract due to the overrides.
<p/><br/>
It's important to note that because the '''Data Element Overrides''' are applied to a '''Content Type''', a document must be properly classified in order for the '''Data Model''' to know that overrides should be used for extraction for that document. You may be able to successfully test results from the '''Data Element Overrides''' interface without a classified document, but doing so on the '''Data Model''' will result in no extraction.
|| [[File:data_element_overrides_004.gif]]
|}
</tab>
</tabs>
<br/>
It is worth noting that one could have accomplished the above by simply making another extractor, setting it up for OMR, and then having the '''Value Extractor''' '''Data Types''' for each '''Data Field''' reference it as a third element. Overrides would not be necessary in that case. This example, however, sufficed to provide something to show. As with many things in '''Grooper''' there isn't always a ''right'' or ''wrong'' way. There is perhaps a ''best practice'', and in this case, making the third extractor would be the better thing to do.
</p>
A simpler, perhaps more common, example of where '''Data Element Overrides''' very much come in handy is with the visibility of '''Data Elements'''. One of the properties of a '''Data Element''' is the '''Visible''' property, which defaults to ''True''. Imagine a '''Data Model''' that has five '''Data Fields''', and the '''Content Model''' has three '''Document Types'''. '''Document1''' uses '''Data Fields''' 1-3, '''Document2''' uses '''Data Fields''' 2-4, and '''Document3''' uses '''Data Fields''' 3-5. In '''Data Review''' you want to simplify the job for the person reviewing, so you do not want them to concern themselves with fields that are not relevant. To accomplish this you could use '''Data Element Overrides''' on each of the aforementioned hypothetical '''Document Types''' and set the '''Visible''' property to ''False'' on all the fields you don't need. This would keep only relevant '''Data Fields''' visible upon review.
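
The visibility scenario above can be sketched in a few lines of Python. This is only a conceptual model under stated assumptions: the field names, the dictionaries, and both helper functions are hypothetical and are not part of any Grooper API. It simply shows how a per-Document-Type override of ''Visible'' layers on top of the default value of ''True''.

```python
# Hypothetical model of per-Document-Type visibility overrides.
# None of these names come from Grooper; they only mirror the example above.

ALL_FIELDS = ["Field1", "Field2", "Field3", "Field4", "Field5"]

# Which Data Fields each Document Type actually uses.
USED_FIELDS = {
    "Document1": {"Field1", "Field2", "Field3"},
    "Document2": {"Field2", "Field3", "Field4"},
    "Document3": {"Field3", "Field4", "Field5"},
}


def build_visibility_overrides(used_fields, all_fields):
    """Return {document_type: {field: {"Visible": False}}} for unused fields."""
    overrides = {}
    for doc_type, used in used_fields.items():
        overrides[doc_type] = {
            field: {"Visible": False} for field in all_fields if field not in used
        }
    return overrides


def visible_fields(doc_type, overrides, all_fields):
    """Apply overrides on top of the default (Visible = True)."""
    doc_overrides = overrides.get(doc_type, {})
    return [
        field
        for field in all_fields
        if doc_overrides.get(field, {}).get("Visible", True)
    ]


OVERRIDES = build_visibility_overrides(USED_FIELDS, ALL_FIELDS)
```

In this sketch a document type with no overrides (for example, an unclassified document) falls back to the defaults and shows all five fields, which mirrors the earlier point that overrides only apply once a document is classified.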
 
==Version Differences==
Versions prior to '''Grooper 2.9''' had an initial concept version of overrides in the '''Data Element Profiles''' tab located on the '''Content Model''' or '''Document Type'''. These profiles only allowed modification to a limited number of properties on the data element, as opposed to '''Grooper 2.9''' where all properties can be overridden.
===Where Did Zonal Properties Go?===
All the zonal extraction properties are now set directly on the '''Data Element'''.

Latest revision as of 12:05, 28 August 2024

This article was migrated from an older version and has not been updated for the current version of Grooper.

This tag will be removed upon article review and update.

This article is about the current version of Grooper.

Note that some content may still need to be updated.

This is a snippet of the Grooper Design Studio UI showing the Training Set batch.

The Training Batch is a special Batch created when training document examples using the Lexical classification method. The Training Batch serves two purposes: (1) It is a Batch that holds all previously trained Batch Folders. Designers can go to this Batch to view these documents and copy and paste them into other Batches if needed. (2) Batch Folders in the Training Batch will be used to re-train the Content Model's classification data when the Rebuild Training command is executed.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains a Project with resources used in examples throughout this article. The second contains one or more Batches of sample documents.

About

During the development and training of TF-IDF Classification in a Grooper Content Model, it can be challenging to keep track of all of the samples that are used during training. In previous versions, each trained sample was stored under each content type in the Grooper Design Studio node tree. In 2.9, the trained samples are stored both under each content type and in the Training Set batch.


How To

Following is an example of how to perform TF-IDF classification that creates the Training Set batch. In the example content model, there are five different content types from three different batches.

Some of the tabs in this tutorial are longer than the others. Please scroll to the bottom of each step's tab before going to the next step.

Prerequisites

Following these steps assumes you already have a content model created with Lexical set as the Classification Method and the appropriate Text Feature Extractor selected. In the example content model, this property is set to Words(Stemmed).

Train Content Types

  1. You will need to create a Batch Process with a "Classify" Batch Process Step.
  2. Go to the "Classification Tester" tab.
  3. Right click on the folder you wish to train and hover over "Classification".
  4. Click on "Train As..." to train the document.

Repeat these steps for the remaining Content Types. In the example Content Model provided, train all five Content Types from all three example batches.
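
To give a feel for what training samples contribute, here is a minimal, generic TF-IDF sketch in Python. It is not Grooper's implementation: the crude suffix-stripping "stemmer", the smoothed IDF formula, and the function names train and classify are all assumptions chosen for illustration. Each content type's trained samples are reduced to stemmed word features, and words that are frequent in one content type but rare across the others receive the highest weights.

```python
import math
import re
from collections import Counter


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def features(text):
    """Lowercase, split into words, and stem each word."""
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]


def train(samples):
    """samples: {content_type: [text, ...]} -> {content_type: {term: weight}}."""
    term_counts = {
        ct: Counter(f for text in texts for f in features(text))
        for ct, texts in samples.items()
    }
    n_types = len(term_counts)
    # Document frequency: in how many content types does each term occur?
    df = Counter(term for counts in term_counts.values() for term in counts)
    weights = {}
    for ct, counts in term_counts.items():
        total = sum(counts.values())
        weights[ct] = {
            # term frequency times a smoothed inverse document frequency
            term: (count / total) * (math.log((1 + n_types) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
    return weights


def classify(text, weights):
    """Score text against each content type and return the best match."""
    feats = features(text)
    scores = {ct: sum(w.get(f, 0.0) for f in feats) for ct, w in weights.items()}
    return max(scores, key=scores.get)
```

Calling train with a dictionary of sample texts per content type returns the weights, and classify then picks the content type whose weighted terms best match a new document. The more representative samples each content type has, the more distinctive its weights become, which is why keeping track of the trained samples matters.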

Review the Training Set batch

As you train your content types you will see a Training Set batch begin to populate under the Local Resources folder.
A Grooper Designer can review and keep track of all of the documents that have been used for TF-IDF Classification training. As the development cycle of Classification continues and more content types are trained, the Grooper Designer now has a single place to review, test and perform regression testing for Classification.


It is important to understand that the Training Set is not tied to the actual TF-IDF Weightings that are associated with the Content Type or Content Category. Purging the training from a Content Model does not delete any or all of the documents in the Training Set. Conversely, deleting a document from the Training Set does not remove or purge any TF-IDF Weightings from a Content Type or Content Category.
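
The separation described above can be pictured as two independent stores. The Python sketch below is purely illustrative: ContentModelSketch and its methods are hypothetical names, and plain word counts stand in for the real TF-IDF Weightings. It shows that purging touches only the weightings, deleting a sample touches only the training set, and a rebuild (like the Rebuild Training command mentioned earlier in this article) is what brings the two back in sync.

```python
from collections import Counter


class ContentModelSketch:
    """Hypothetical stand-in for a Content Model; not a Grooper class."""

    def __init__(self):
        self.training_set = []  # list of (content_type, text) samples
        self.weightings = {}    # content_type -> Counter of word counts

    def _add_to_weightings(self, content_type, text):
        words = text.lower().split()
        self.weightings.setdefault(content_type, Counter()).update(words)

    def train_as(self, content_type, text):
        """Store the sample and update the weightings as two separate writes."""
        self.training_set.append((content_type, text))
        self._add_to_weightings(content_type, text)

    def purge_training(self):
        """Clear the weightings only; samples stay in the training set."""
        self.weightings.clear()

    def delete_sample(self, index):
        """Remove a sample only; the weightings are left as they were."""
        del self.training_set[index]

    def rebuild_training(self):
        """Recompute the weightings from whatever samples remain."""
        self.weightings = {}
        for content_type, text in self.training_set:
            self._add_to_weightings(content_type, text)
```

In this model, deleting a sample and then rebuilding is what actually removes that sample's influence on classification; deleting alone leaves the old weightings in place.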