2023:Output Extractor Key (Property): Difference between revisions

Latest revision as of 10:03, 22 November 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


The Output Extractor Key property is another weapon in the arsenal of powerful Grooper classification techniques. It allows Data Types to return results normalized in a way that is more beneficial to document classification.

You may download and import the file below into your own Grooper environment (version 2023) to follow along with this tutorial. This contains a Batch with the example document(s) discussed in this tutorial and a Project to configure.

About

Output Extractor Key is a property on the Data Type extractor. It is exposed when the Collation property is set to Individual. When Output Extractor Key is set to True, each output value is set to a key representing the name of the extractor that produced the match. This is useful when extracting non-word classification features.

The main purpose of this property is to supplement the capabilities of Grooper's classification technology. When using Lexical classification, a Content Model must use an extractor to collect the lexical features upon training. A common use case is to have the extractor collect words, which is beneficial when the semantic content of a document is varied among examples, and indicative of their type. However, this breaks down when a document consists mainly of repeated types of information. Take, for example, a bank statement. With no keywords present on the document, the only way to properly classify the document is to recognize that it contains a high frequency of transaction line items. It would be highly impractical to train Grooper to understand every variation of a transaction line item.

This is where the Output Extractor Key property comes into play. Using this property, one can establish an extractor that pattern matches the various transaction line item formats on the document and returns a single output value for each result, such as "feature_transaction", instead of the myriad values the pattern match would otherwise return. This is then fed to the classification engine. With this approach, a document containing a high frequency of "transaction" features (say, 50) will be treated as though it contained 50 separate occurrences of the phrase "feature_transaction".
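To make the idea concrete, here is a minimal Python sketch of the normalization described above. It is not Grooper code: the regular expression, the sample statement lines, and the function names are assumptions made up for illustration, and only the "feature_transaction" key comes from the example in this article.

```python
import re
from collections import Counter

# Hypothetical pattern for a transaction line item: date, description, amount.
# A real extractor would need to cover many more formats.
TRANSACTION = re.compile(r"^\d{2}/\d{2}\s+.+?\s+-?\d[\d,]*\.\d{2}$", re.MULTILINE)


def raw_features(text):
    """Return the matched text of every transaction line (highly varied)."""
    return [m.group(0) for m in TRANSACTION.finditer(text)]


def keyed_features(text, key="feature_transaction"):
    """Return the same key once per match, discarding the matched text."""
    return [key for _ in TRANSACTION.finditer(text)]


# Made-up sample data standing in for a bank statement page.
statement = "\n".join([
    "01/03 COFFEE SHOP 4.50",
    "01/04 GROCERY MART 82.17",
    "01/07 PAYROLL DEPOSIT 1,250.00",
])

raw_counts = Counter(raw_features(statement))      # three features, each seen once
keyed_counts = Counter(keyed_features(statement))  # one feature, seen three times
```

With the raw values, every line is its own one-off feature. With the key, the same three matches collapse into a single feature with a count of three, which is the signal a frequency-based classifier can actually use.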

How To

The following example shows how to configure a Data Type to use the Output Extractor Key property, then configure the Content Model to leverage it for classification. The example Batch contains a few different document formats, but all are Mineral Ownership Reports. Despite their different formats, they have similar content, so the extractor described above makes their classification quite simple.

⚠	Some of the tabs in this tutorial are longer than the others. Please scroll to the bottom of each step's tab before going to the next step.


Understanding the Content Model

The purpose of this Content Model is to classify the one Document Type it contains. Its Classification Method property is set to Lexical and it is referencing...

...this Data Type, which is configured to find words.
- This configuration is often used to classify documents. In many cases it works very well, but for these types of documents it does a poor job.
These three Data Types are configured to find a selected set of data that is highly prominent on these documents, and will be used to supplement the main lexical extractor.
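As a rough illustration of why a word-only extractor struggles here, the following Python sketch tokenizes the two legal descriptions quoted later in this tutorial and checks which tokens they share. The tokenizer is an assumption for illustration only and is not how the CLAS - Unigrams Data Type is actually configured.

```python
import re


def unigrams(text):
    """Split text into lower-cased word-like tokens (a simplified stand-in)."""
    return set(re.findall(r"[a-z0-9/]+", text.lower()))


# Both strings are sample results quoted in this tutorial.
doc_a = "Section 19: SE/4, except for two acres described in Book 883, Page 229"
doc_b = "Section 9: N2, N2S2, S2SW, SWSE"

shared = unigrams(doc_a) & unigrams(doc_b)
```

Here the only shared token is "section", even though both strings are clearly the same kind of information. Word-level features alone give the classifier very little in common to learn from.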

Configuring the 'Feature Tagging' Extractor

Add another Data Type as a child object of the CLAS - Unigrams Data Type and name it Features. Configure the Referenced Extractors property to point at the three Data Types mentioned previously: Address, Section, and TownshipRange. Save and test, and look at the Results list view. Many results are returned, and you can see which extractor returned each one. It's important to note how these results are being returned at the moment. While the Data Type is getting the information we want, we can't use these results for training because they're so varied. In their current state, they're worthless, but we can transform the information to work for us!
Expand the Collation property, and set the Output Extractor Key property to True. Save and test again. Notice how the results are now different. This is the really critical part. Instead of all the varied results returned from any one extractor, they're now unified and returned as one result: feature_{name of data type}. This matters because of the effect it has on the weightings created during training. All the results from, say, the Section Data Type will now be returned as feature_section, instead of the varied results they were before, such as "Section 19: SE/4, except for two acres described in Book 883, Page 229 and" and "Section 9: N2, N2S2, S2SW, SWSE". This produces a high frequency of that specific feature and, as a result, gives it great significance in the weightings.
Select the parent Data Type named CLAS - Unigrams. The results from the parent are now the combination of its own results and those of its children. Because the Deduplicate By property is set to Area, the longer results from the child extractor supersede the individual short words.
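The steps above can be sketched in Python as a simplified model. This is not Grooper's implementation: the patterns for Section and TownshipRange are hypothetical stand-ins (the real extractors are not shown in this tutorial), the lower-cased key format is inferred from the single feature_section example, and `dedupe_by_area` is only an approximation of what setting Deduplicate By to Area does.

```python
import re
from typing import NamedTuple


class Hit(NamedTuple):
    start: int
    end: int
    value: str


# Hypothetical stand-ins for two of the referenced child extractors.
EXTRACTORS = {
    "Section": re.compile(r"Section \d+: [A-Z0-9/]+(?:, [A-Z0-9/]+)*"),
    "TownshipRange": re.compile(r"T\d+[NS]-R\d+[EW]"),
}
WORDS = re.compile(r"[A-Za-z]+")


def feature_hits(text):
    """Return one hit per match, valued with the extractor's key, not its text."""
    hits = []
    for name, pattern in EXTRACTORS.items():
        key = "feature_" + name.lower()  # assumed key format
        for m in pattern.finditer(text):
            hits.append(Hit(m.start(), m.end(), key))
    return hits


def word_hits(text):
    """Return one hit per word, as a plain unigram extractor would."""
    return [Hit(m.start(), m.end(), m.group(0).lower()) for m in WORDS.finditer(text)]


def dedupe_by_area(hits):
    """Keep the longest hits first and drop any hit overlapping a kept one."""
    kept = []
    for hit in sorted(hits, key=lambda h: h.end - h.start, reverse=True):
        if all(hit.end <= k.start or hit.start >= k.end for k in kept):
            kept.append(hit)
    return sorted(kept)


text = "Tract in Section 9: N2, N2S2, S2SW, SWSE of T12N-R4W"
combined = dedupe_by_area(word_hits(text) + feature_hits(text))
values = [h.value for h in combined]
```

In this sketch the short word hits that fall inside a longer feature hit (for example "swse" inside the Section match) are discarded, so the parent returns ordinary words plus one key per feature occurrence.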


Training and Classifying the Batch

Select the Classify Step of the Batch Process. Make sure the correct Content Model is selected for the Content Model Scope. Click the "Classification Tester" tab.
Right-click the document you want to train and hover over the "Classification" option. Click "Train As...".
When the "Train As" window pops up, use the drop-down to select the Document Type you want to train the document as, and click "Execute".
After training the document, select all of the documents, right-click, and hover over the "Classification" option again. Click "Classify...".
When the "Classify" window pops up, select the correct Content Model from the Content Type drop-down box and click "Execute".
All of the documents should now be appropriately classified after training only one document. The features that CLAS - Unigrams (and its children) collected are highly prominent on these documents, so they have a high feature count and easily classify this type of document.
Click the "MOR" Document Type. On the weightings tab, you can see the features that were collected and how they are weighted when classifying.
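To see why a single trained document can be enough, consider the following Python sketch. It uses plain term frequency as a stand-in weighting scheme. This article does not describe how Grooper actually computes its weightings, so the functions, the feature counts, and the scoring below are all assumptions for illustration.

```python
from collections import Counter


def term_frequency_weights(features):
    """Weight each feature by its share of all features in the trained document."""
    counts = Counter(features)
    total = sum(counts.values())
    return {feature: count / total for feature, count in counts.items()}


def score(weights, features):
    """Score a document by how much of its content matches heavily weighted features."""
    counts = Counter(features)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(weights.get(f, 0.0) * c / total for f, c in counts.items())


# Made-up feature lists: one trained report, one similar report, one unrelated page.
trained = ["feature_section"] * 40 + ["feature_address"] * 12 + ["owner", "interest", "report"]
weights = term_frequency_weights(trained)

similar = ["feature_section"] * 25 + ["feature_address"] * 8 + ["mineral", "lease"]
unrelated = ["invoice", "total", "due", "amount", "feature_address"]

similar_score = score(weights, similar)
unrelated_score = score(weights, unrelated)
```

Because the keyed features dominate the trained document, any other document made up mostly of the same keys scores far higher than one that is not, regardless of how different the underlying text of each match is.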

@@ Line 1: / Line 1: @@
+{{AutoVersion}}
 [[File:Output_extractor_key_000.png|right]]
-<blockquote style="font-size:14pt">Also known as "feature tagging" or "data tagging", this is another weapon in the arsenal of powerful '''Grooper''' classification techniques.</blockquote>
+<blockquote>{{#lst:Glossary|Output Extractor Key}}</blockquote>
 <!--#region About-->
+{|class="download-box"
+|
+[[File:Asset 22@4x.png]]
+|
+You may download and import the file below into your own Grooper environment (version 2023) to follow along with this tutorial.  This contains a '''Batch''' with the example document(s) discussed in this tutorial and a '''Project''' to configure.
+* [[Media:2023 Wiki Output-Extractor-Key Project.zip]]
+* [[Media:2023 Wiki Output-Extractor-Key Batch.zip]]
+|}
 ==About==
 '''''Output Extractor Key''''' is a property on the '''[[Data Type]]''' extractor. It is exposed when the '''''[[Collation Provider|Collation]]''''' property is set to ''Individual''. When the '''''Output Extractor Key''''' is set to ''True'', each output value will be set to a key representing the name of the extractor which produced the match. It  is useful when extracting non-word classification features.
@@ Line 12: / Line 23: @@
 <br clear = all>
+<!--#endregion-->
-{|cellpadding="10" cellspacing="5"
-|-
-|style="font-size:14pt; color:#f89420; border: 2px solid #f89420; width:40px"|[[File:Asset 22@4x.png]]
-|style="border: 2px solid #f89420"|
-You may download and import the file below into your own Grooper environment (version 2023) to follow along with this tutorial.  This contains a '''Batch''' with the example document(s) discussed in this tutorial and a '''Project''' to configure.
-* [[Media:File:2023 Projects - Wiki - Output Extractor Key.zip]]
-* [[Media:File:2023 Batches - Wiki - Output Extractor Key.zip]]
-|}
-<!--#endregion-->
 ==How To==
 {|
@@ Line 30: / Line 32: @@
 |}
-{|cellpadding="10" cellspacing="5"
+{|class="attn-box"
-|-style="background-color:#f89420; color:white"
+|
-|style="font-size:14pt"|'''!'''||Some of the tabs in this tutorial are longer than the others.  Please scroll to the bottom of each step's tab before going to the step.
+&#9888;
+|
+Some of the tabs in this tutorial are longer than the others.  Please scroll to the bottom of each step's tab before going to the next step.
 |}
@@ Line 39: / Line 43: @@
 <tab name="Understanding the Content Model" style="margin:25px">
 ====Understanding the Content Model====
-{|
+{|cellpadding=10 cellspacing=5
-| style="padding:25px; vertical-align:top; width:35%" |
+| style="vertical-align:top; width:40%" |
 The purpose of this '''Content Model''' is to classify the one '''[[Document Type]]''' it contains. Its '''Classification Method''' property is set to '''''Lexical''''' and it is referencing...
 # ...this '''Data Type''', which is configured to find words.
@@ Line 52: / Line 56: @@
 <tab name="Configuring the 'Feature Tagging' Extractor" style="margin:25px">
 ====Configuring the 'Feature Tagging' Extractor====
-{| class="wikitable"
+{|cellpadding=10 cellspacing=5
-| style="padding:25px; vertical-align:center; width:35%" |
+| style="vertical-align:top; width:40%" |
 # Add, as a child object to the '''CLAS - Unigrams''' '''Data Type''', another '''Data Type''' and name it '''Features'''.
 # Configure the '''''Referenced Extractors''''' property to point at the 3 '''Data Types''' mentioned previously: '''Address''', '''Section''', and '''TownshipRange'''.
 # Save and test, and notice the '''Results''' list view. Many results are being returned, and you can see from which extractor they're being returned from.
 #* It's important to note how these results are being returned at the moment. While the '''Data Type''' is getting the information we want, we can't use these results for training purposes because they're so varied. In their current state, they're worthless, but we can transform the information to work for us!
-|| [[File:2023 Output Extractor Key - 2023 01 How To 02.png]]
+|
+[[File:2023 Output Extractor Key - 2023 01 How To 02.png]]
 |-
-| style="padding:25px; vertical-align:center; width:35%" |
+|valign=top|
 # Expand the '''''Collation''''' property, and set the '''''Output Extractor Key''''' property to ''True''. Save and test again.</li>
 # Notice how the results are now different. This is the really critical part. Instead of all the varied results returned from any one extractor, they're now unified and returned as one result: ''<span style="color:#0001fd">feature_<code><span style="color:#ff00ff">{name of data type}</span></code></span>''.
 #* The thing to understand about why this is so important is the effects it will have on the weightings created during training. All the results from, say, the '''Section''' '''Data Type''' will now be returned as ''<span style="color:#0001fd">feature_section</span>'', instead of all the varied results they were before, like ''<span style="color:#0001fd">Section 19: SE/4, except for two acres described in Book 883, Page 229 and</span>'' and ''<span style="color:#0001fd">Section 9: N2, N2S2, S2SW, SWSE</span>'' etc. This will cause a high frequency of that specific feature, and as a result, will give great significance in the weightings.
-|| [[File:2023 Output Extractor Key - 2023 01 How To 03.png]]
+|
+[[File:2023 Output Extractor Key - 2023 01 How To 03.png]]
 |-
-| style="padding:25px; vertical-align:center; width:35%" |
+|valign=top|
 # Select the parent '''Data Type''' named '''CLAS - Unigrams'''.</li>
 #* The results from the parent are now the combination of it and its children.
 # The '''''Deduplicate By''''' property being set to ''Area'' lets the longer results from the child extractor supersede the individual short words.
-|| [[File:2023 Output Extractor Key - 2023 01 How To 04.png]]
+|
+[[File:2023 Output Extractor Key - 2023 01 How To 04.png]]
 |}
 <span style="font-size:14pt">'''[[Output Extractor Key#How To|Back to top to continue to next tab]]'''</span>
@@ Line 78: / Line 85: @@
 <tab name="Training and Classifying the Batch" style="margin:25px">
 ====Training and Classifying the Batch====
-{| class="wikitable"
+{|cellpadding=10 cellspacing=5
-| style="padding:25px; vertical-align:center; width:35%" |
+| style="vertical-align:top; width:40%" |
 # Select the '''Classify Step''' of the '''Batch Process'''.
 # Make sure the correct '''Content Model''' is selected for the '''''Content Model Scope'''''.
 # Click on the "Classification Tester" tab.
-|| [[File:2023 Output Extractor Key - 2023 01 How To 05.png]]
+|
+[[File:2023 Output Extractor Key - 2023 01 How To 05.png]]
 |-
-| style="padding:25px; vertical-align:center; width:35%" |
+|valign=top|
 # Right click the document you want to train and then hover over the "Classification" option.
 # Click "Train As...".
-|| [[File:2023 Output Extractor Key - 2023 01 How To 06.png]]
+|[[File:2023 Output Extractor Key - 2023 01 How To 06.png]]
 |-
-| style="padding:25px; vertical-align:center; width:35%" |
+|valign=top|
 # When the "Train As" window pops up, use the drop down to select which '''Document Type''' you want to train the document as, and click "Execute".
-|| [[File:2023 Output Extractor Key - 2023 01 How To 07.png]]
+|
+[[File:2023 Output Extractor Key - 2023 01 How To 07.png]]
 |-
-| style="padding:25px; vertical-align:center; width:35%" |
+|valign=top|
 # After training the document, select all of the documents, right click, and hover over the "Classification" option again.</li>
 # Click "Classify...".
-|| [[File:2023 Output Extractor Key - 2023 01 How To 08.png]]
+|[[File:2023 Output Extractor Key - 2023 01 How To 08.png]]
 |-
-| style="padding:25px; vertical-align:center; width:35%" |
+|valign=top|
 # When the "Classify" window pops up, select the correct '''Content Model''' from the '''''Content Type''''' drop down box and click "Execute".
-|| [[File:2023 Output Extractor Key - 2023 01 How To 09.png]]
+|
+[[File:2023 Output Extractor Key - 2023 01 How To 09.png]]
 |-
-| style="padding:25px; vertical-align:center; width:35%" |
+|valign=top|
 # All of the documents should now be appropriately classified after only training one document.
 #* The features that the CLAS - Unigrams (and its children) collected are highly prominent on these documents, therefore they have a high feature count and easily classify this type of document.
-|| [[File:2023 Output Extractor Key - 2023 01 How To 10.png]]
+|[[File:2023 Output Extractor Key - 2023 01 How To 10.png]]
 |-
-| style="padding:25px; vertical-align:center; width:35%" |
+|valign=top|
 # Click on the "MOR" '''Document Type'''.
 # If you click on the weightings tab, you can see the features that were collected and how they are weighted when classifying.
@@ Line 117: / Line 127: @@
 <br/>
 <!--#endregion-->
-If you would like a completed version of content linked above, and walked through in this article to compare against yours you can download it [[Media:Output_Extractor_Key_complete.zip|here]]. ''This file was exported from and meant for use in Grooper 2.9''