Output Extractor Key (Property): Difference between revisions

From Grooper Wiki

Revision as of 17:00, 30 April 2020

Also known as "data tagging", this is another weapon in the arsenal of powerful Grooper classification techniques.

A Content Model and accompanying Batch for what will be built can be found here. Downloading them is not required to understand this article, but they can be used to follow along with its content. This file was exported from, and is meant for use in, Grooper 2.9.

About

Output Extractor Key is a property on the Data Type extractor. It is exposed when the Collation property is set to Individual. When the Output Extractor Key property is set to True, each output value will be set to a key representing the name of the extractor which produced the match. It is useful when extracting non-word classification features.
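
The behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Grooper's implementation or API: the child extractor names (feature_date, feature_amount), their patterns, and the extract function are all hypothetical.

```python
import re

# Hypothetical child extractors of a Data Type: name -> pattern.
# The names and patterns are illustrative and not taken from Grooper.
CHILD_EXTRACTORS = {
    "feature_date": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "feature_amount": re.compile(r"\$\d[\d,]*\.\d{2}"),
}

def extract(text, output_extractor_key=False):
    """Return one output per match, ordered by position in the text."""
    hits = []
    for name, pattern in CHILD_EXTRACTORS.items():
        for m in pattern.finditer(text):
            # With the flag on, emit the extractor's name instead of the match.
            value = name if output_extractor_key else m.group(0)
            hits.append((m.start(), value))
    return [value for _, value in sorted(hits)]

line = "03/14/2020 CHECK 1042 $1,250.00"
print(extract(line))                             # the matched text
print(extract(line, output_extractor_key=True))  # the extractor names
```

With the flag off, the outputs are the matched strings themselves; with it on, each output is replaced by the name of the extractor that produced it.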

The main purpose of this property is to supplement the capabilities of Grooper's classification technology. When using Lexical classification, a Content Model must use an extractor to collect the lexical features upon training. A common use case is to have the extractor collect words, which is beneficial when the wording of the documents varies among examples and is indicative of their type. However, this breaks down when a document consists mainly of repeated types of information. Take, for example, a bank statement. If no keywords are present on the document, the only way to classify it properly is to recognize that it contains a high frequency of transaction line items. It would be highly impractical to train Grooper to understand every variation of a transaction line item.

This is where the Output Extractor Key property comes into play. Using this property, one can establish an extractor that pattern matches the various transaction line item formats on the document and returns a single output for each result, such as "feature_transaction", instead of the myriad distinct values returned by the pattern match. This is then fed to the classification engine. With this approach, a document containing a high frequency of "transaction" features (say, 50) will be treated as though it contained 50 separate occurrences of the phrase "feature_transaction".
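
To make the "50 occurrences" point concrete, here is a minimal Python sketch. It assumes a single hypothetical transaction pattern and a made-up sample page; it does not reflect how Grooper's classification engine is implemented, only the effect of replacing each match with one shared key.

```python
import re
from collections import Counter

# Hypothetical pattern for one transaction line: date, description, amount.
TRANSACTION = re.compile(r"^\d{2}/\d{2} .+ -?\d+\.\d{2}$", re.MULTILINE)

# Made-up sample page for illustration: 50 transaction lines, no two alike.
page = "\n".join(
    f"01/{d % 28 + 1:02d} PURCHASE {1000 + d} {d + 0.99:.2f}" for d in range(50)
)

# Without tagging, every match is a different string.
raw_matches = [m.group(0) for m in TRANSACTION.finditer(page)]

# With tagging, every match collapses to the same feature key.
tagged = ["feature_transaction" for _ in raw_matches]

print(len(set(raw_matches)))  # number of distinct raw values
print(Counter(tagged))        # frequency of the single tagged feature
```

The raw matches give the classifier 50 unrelated values that will probably never recur on another statement, while the tagged output gives it one feature with a frequency of 50.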

How To

Following is an example of how to configure a Data Type to use the Output Extractor Key property, then configure the Content Model to leverage it for the purposes of classification. The example uses a few different document formats, all of which are Mineral Ownership Reports. Although their formats differ, their content is similar, so the extractor described above makes classifying them quite simple.