2023.1:Content Model (Node Type): Difference between revisions

From Grooper Wiki
{{stubs}}
{{AutoVersion}}
<section begin="glossary" />
 
<blockquote>{{#lst:Glossary|Content Model}}</blockquote>
 
== About ==
A '''Content Model''' is the digital representation in Grooper of a document set's content. Everything you want to glean from your documents is set up within a '''Content Model''', including the system for classifying documents and the data you want to extract from them.
* Document [[Classification]]
* Data Extraction
<br>
 
<br>
 
{|class="download-box"
|
[[File:Asset 22@4x.png]]
|
You may download the ZIPs below and upload them into your own Grooper environment (version 2023). The first contains one or more '''Batches''' of sample documents. The second contains one or more '''Projects''' with resources used in examples throughout this article.
* [[Media:2023.1_Wiki_Content-Model_Batch.zip]]
* [[Media:2023.1_Wiki_Content-Model_Project.zip]]
|}
 
== Uses of a Content Model ==
Let's look at how Document Classification and Data Extraction can be used on a '''Content Model''':
<br>
 
<tabs>
=== Document Classification ===
<tab name = "Document Classification">
Document Classification is an important task that the Content Model helps facilitate.
{|
 
|valign=top|
# The very first property of a Content Model is the ''Classification Method''. This tells Grooper how to classify documents.
# Classification can be done in one of five ways:
#* [[Rules-Based (Classification Method)|Rules-Based]]
#* [[Visual (Classification Method)|Visual]]
|
 
[[File:20231_Content_Model_About_Documentation_Classification_01.png]]
|-
 
|valign=top|
 
# Another important part of classification, relied upon by both Grooper and the Content Model, is the Document Type. This is a child object of the Content Model that is used to identify certain documents through positive extractors.
#* For more information, see [[Document Type (Object)|Document Type]].
# For documents that are more difficult for Grooper to classify, or if you don't want to set up a Classification Method, you can set a Default Content Type.
#* This is optional. If you have multiple Document Types, it's best to set a Classification Method on the Content Model.
# You will need to have created a Document Type before you can set a Default Content Type.
|
 
[[File:20231_Content_Model_About_Documentation_Classification_02.png]]
|}
{|class="fyi-box"
|-
|}


</tab>
=== Data Extraction ===
<tab name = "Data Extraction">
Data Extraction is another important job for a Content Model. This tells Grooper what you want done with the data from your documents, where you want it to go, and how you want it handled.
<br>
<br>
{|
 
|valign=top|
# Select the Behaviors property.
# This will open the List of Behaviors window.
# To add Behaviors, press the Add button.
# From the drop-down list, select your desired Behavior, based upon how you want to extract your data.
|
 
[[File:20231_Content_Model_About_Data_Extraction_01.png]]
|}
 
</tab>
</tabs>
== Brief Note on Document Types ==
'''Document Types''' are child objects of a '''Content Model'''. Grooper cannot ''classify'' without a '''Document Type'''. The ''Classification Method'' on a '''Content Model''' tells Grooper how to ''classify'', but the '''Document Type''' tells Grooper what label to assign to the document.
{|
<br>
|valign=top|
<br>
# To add a '''Document Type''', right-click the '''Content Model''' in the Node Tree.
# Select Add, then Document Type.
|
 
[[File:20231_Content_Model_A_Brief_Note_on_Document_Types_00.png]]
|-
 
|
 
# This will create the '''Document Type'''.
# Extractors are properties that Grooper uses to help identify and classify documents as different '''Document Types'''.
#* '''[[Rules-Based (Classification Method)#Positive Extractor Rules|Positive Extractors]]''' tell Grooper what to look for.
#** In short, whenever the '''Positive Extractor''' finds the data that Grooper is told to look for, the document is classified as the '''Document Type''' on which that extractor is configured. This is a good tool to use whenever you have documents that are similar to one another, where classification could go awry.
#* Similarly, '''[[Rules-Based (Classification Method)#Negative Extractor Rules|Negative Extractors]]''' tell Grooper what to exclude from being classified as a potential '''Document Type'''.
# The Separation properties on the '''Document Type''' are ''only'' for '''[[ESP Auto Separation]]'''. '''ESP Auto Separation''' is a type of '''Separation Provider''' that both separates and classifies documents.
|
 
[[File:20231_Content_Model_A_Brief_Note_on_Document_Types_01(1).png]]
|}


== Wrap-Up ==


Hand-in-hand with the classification taxonomy, '''Content Models''' also define the hierarchical data structure for the documents and document set (via '''[[Data Model]]s''' of the various '''Content Types''' in the '''Content Model'''). The '''Data Models''' and their '''Data Elements''' define what data is extracted from documents and how that is accomplished.
[[Category:Articles]]

Revision as of 11:56, 21 November 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

About

A Content Model is the digital representation in Grooper of a document set's content. Everything you want to glean from your documents is set up within a Content Model, including the system for classifying documents and the data you want to extract from them.

Content Models are the fundamental Content Type. Other Content Types, such as Document Types, are established within a Content Model. Content Models have two main purposes in Grooper:

  • Document Classification
  • Data Extraction

You may download the ZIPs below and upload them into your own Grooper environment (version 2023). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

Uses of a Content Model

Let's look at how Document Classification and Data Extraction can be used on a Content Model:

Document Classification

Document Classification is an important task that the Content Model helps facilitate.

  1. The very first property of a Content Model is the Classification Method. This tells Grooper how to classify documents.
  2. This can be done one of five ways:


  1. Another important part of classification, relied upon by both Grooper and the Content Model, is the Document Type. This is a child object of the Content Model that is used to identify certain documents through positive extractors.
    • For more information, see the Document Type (Object) article.
  2. For documents that are more difficult for Grooper to classify, or if you don't want to set up a Classification Method, you can set a Default Content Type.
    • This is optional. If you have multiple Document Types, it's best to set a Classification Method on the Content Model.
  3. You will need to have created a Document Type before you can set a Default Content Type.
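
The fallback role of a Default Content Type can be pictured with a small conceptual sketch. This is not Grooper's API: the function names, the keyword check, and the "Unknown" label are all hypothetical, chosen only to illustrate the idea that a default label applies when the classification step finds no match.

```python
from typing import Callable, Optional


def classify_with_default(
    text: str,
    classifier: Callable[[str], Optional[str]],
    default_type: Optional[str] = None,
) -> Optional[str]:
    """Return the classifier's label, or the default when it finds none."""
    label = classifier(text)
    return label if label is not None else default_type


def keyword_classifier(text: str) -> Optional[str]:
    # Hypothetical stand-in for a configured Classification Method.
    return "Invoice" if "invoice" in text.lower() else None


print(classify_with_default("INVOICE #1001", keyword_classifier, "Unknown"))
print(classify_with_default("Dear customer", keyword_classifier, "Unknown"))
```

In this sketch the default only applies when the classifier returns nothing, which mirrors the optional nature of the setting described above.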

FYI

GPT Embeddings is a fairly new Classification Method and is still currently in beta.

Data Extraction

Data Extraction is another important job for a Content Model. This tells Grooper what you want done with the data from your documents, where you want it to go, and how you want it handled.

  1. Select the Behaviors property.
  2. This will open the List of Behaviors window.
  3. To add Behaviors, press the Add button.
  4. From the drop-down list, select your desired Behavior, based upon how you want to extract your data.

Brief Note on Document Types

Document Types are child objects of a Content Model. Grooper cannot classify without a Document Type. The Classification Method on a Content Model tells Grooper how to classify, but the Document Type tells Grooper what label to assign to the document.

  1. To add a Document Type, right-click the Content Model in the Node Tree.
  2. Select Add, then Document Type.


  1. This will create the Document Type.
  2. Extractors are properties that Grooper uses to help identify and classify documents as different Document Types.
    • Positive Extractors tell Grooper what to look for.
      • In short, whenever the Positive Extractor finds the data that Grooper is told to look for, the document is classified as the Document Type on which that extractor is configured. This is a good tool to use whenever you have documents that are similar to one another, where classification could go awry.
    • Similarly, Negative Extractors tell Grooper what to exclude from being classified as a potential Document Type.
  3. These Separation properties on the Document Type are only for ESP Auto Separation. ESP Auto Separation is a type of Separation Provider that both separates and classifies documents.
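
The interplay of Positive and Negative Extractors described above can be sketched in a few lines of Python. This is an illustration of the concept only, not how Grooper implements it: the `DocumentTypeRule` class, the regular expressions, and the first-match-wins ordering are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DocumentTypeRule:
    """Hypothetical stand-in for a Document Type with extractor rules."""
    name: str
    positive: str                   # pattern that must be found
    negative: Optional[str] = None  # pattern that rules this type out

    def matches(self, text: str) -> bool:
        # A negative hit excludes this type regardless of the positive hit.
        if self.negative and re.search(self.negative, text, re.IGNORECASE):
            return False
        return re.search(self.positive, text, re.IGNORECASE) is not None


def classify(text: str, rules: List[DocumentTypeRule]) -> Optional[str]:
    """Return the name of the first rule that matches, or None."""
    for rule in rules:
        if rule.matches(text):
            return rule.name
    return None


rules = [
    DocumentTypeRule("Invoice", positive=r"\binvoice\b", negative=r"\bcredit memo\b"),
    DocumentTypeRule("Credit Memo", positive=r"\bcredit memo\b"),
]

print(classify("Invoice 1001", rules))
print(classify("Credit Memo for Invoice 1001", rules))
```

The second document mentions an invoice, but the negative pattern keeps it from being labeled as one, which is the kind of look-alike case the text says these extractors are meant for.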

Wrap-Up

Content Models define the classification taxonomy for a set of documents. This means a list of distinct types of documents (via Document Types) and their hierarchical structure within the Content Model (via optional Content Categories). How a document is classified is defined here as well (via the Classification Method and the Document Types).

Hand-in-hand with the classification taxonomy, Content Models also define the hierarchical data structure for the documents and document set (via Data Models of the various Content Types in the Content Model). The Data Models and their Data Elements define what data is extracted from documents and how that is accomplished.
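
As a rough mental model of Data Element inheritance through the Content Type hierarchy, consider the sketch below. It is a simplification under stated assumptions: the `ContentType` class, the example names, and the rule that a child sees its ancestors' elements first are hypothetical and are not taken from Grooper itself.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentType:
    """Hypothetical node in a Content Type hierarchy."""
    name: str
    data_elements: List[str] = field(default_factory=list)
    parent: Optional["ContentType"] = None

    def effective_data_elements(self) -> List[str]:
        # Inherited elements come first, followed by this node's own.
        inherited = self.parent.effective_data_elements() if self.parent else []
        own = [e for e in self.data_elements if e not in inherited]
        return inherited + own


# Content Model -> Content Category -> Document Type
model = ContentType("Accounts Payable", ["Vendor Name"])
category = ContentType("Invoices", ["Invoice Number"], parent=model)
doc_type = ContentType("Utility Invoice", ["Meter Reading"], parent=category)

print(doc_type.effective_data_elements())
```

Here the Document Type ends up with the elements defined on the model and the category in addition to its own, which is the general idea behind defining shared data once at a higher level of the hierarchy.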