2023:GPT Integration (Concept)

OpenAI's GPT model has made waves in the world of computing. Our '''Grooper''' developers recognized the potential for this to grow '''Grooper's''' capabilities. Adding its functionality will allow for users to explore and find creative solutions for processing their documents using this advanced technology.
 
 
You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.
{|class="download-box"
|
* [[Media:2023_Wiki_GPT-Integration_Project.zip]]
|}
<br>

== ABOUT ==
GPT (Generative Pre-trained Transformer) integration can be used for three things in '''Grooper''':
* '''[[#GPT Complete (Extractor Type)|Extraction]]''' - Prompt the GPT model to return information it finds in a document.
* '''[[#GPT Embeddings (Classification Method)|Classification]]''' - GPT has been trained against a massive corpus of information, which allows for a lot of potential when it comes to classifying documents. The idea here is that because it's seen so much, the amount of training required in '''Grooper''' should be less.
* '''[[#GPT Lookup (Lookup)|Lookup]]''' - With a GPT lookup you can provide information collected from a model in '''Grooper''' as <code><span style="color:#ff00ff">@</span></code> variables in a prompt to have GPT generate data.
<br>
In this article you will be shown how '''Grooper''' leverages GPT for the aforementioned methods. Some example use cases will be given to demonstrate a basic approach. Given the nature of the way this technology works, it will be up to the user to get creative about how this can be used for their needs.


== Things to Consider ==
Before moving forward it would be prudent to mention a few things about GPT and how to use it.

=== Prompt Engineering ===
The first thing to consider is how to structure a good prompt so that you get the results you are expecting. There is a bit of an art to knowing how to do this. GPT can tell bad jokes and write accidentally hilarious poems about your life, but it can also help you do your job better. The catch: you need to help it do its job better, too. At its most basic level, OpenAI's GPT-3 and GPT-4 predict text based on an input called a prompt. To get the best results, you need to write a clear prompt with ample context. Further on in this article, when the ''GPT Complete'' '''''Value Extractor''''' is demonstrated, you will see an example of prompt engineering. See OpenAI's documentation on prompt design for more information.

=== Tokens and Pricing ===
Another consideration is the way GPT pricing works. You are going to be charged for the "tokens" used when interacting with GPT. To that end, the prompt that you write, the text that you leverage to get a result, and the result that is returned to you are all considered part of the token consumption. You will need to be considerate of this as you build and use GPT in your models. See OpenAI's documentation for more information on what tokens are and how GPT pricing works.


=== Location Data for Data Extraction ===
The final thing to consider regards the ''GPT Complete'' '''''Value Extractor''''' type (more on this soon). If you have used '''Grooper''' before, then you are probably familiar with how a returned value is highlighted with a green box in the document viewer. One of the main strengths of '''Grooper's''' text synthesis is that it collects location information for each character, which allows this highlighting to occur. The GPT model does not consider location information when generating its results, which means there will be no highlighting on the document for values collected with this method. The main impact this will have is on your ability to validate information returned by the GPT model.


== How To ==
With the discussion of concepts out of the way, it is time to get into Grooper and see how and where to use the GPT integration.

=== Obtain an API Key ===
Grooper is able to integrate with OpenAI's GPT model because they have provided a web API. All we need in order to use the Grooper GPT functionality is an API key. Here you will learn how to obtain an API key for yourself so you can start using GPT with Grooper.


# The first thing you should do is visit the [https://platform.openai.com/ OpenAI API site] and log in or create an account.
# Once logged in, click the "Personal" menu in the top right.
# Within this menu click the "View API Keys" option, which will take you to the "API keys" page.
[[Image:GPT Integration 001.png]]
# <li value=4> On the "API keys" page, click the "+ Create new secret key" button, which will make an "API key generated" pop-up.
[[Image:GPT Integration 002.png]]
# <li value=5> Highlight and copy, or click the copy button to copy the key string to your clipboard.
#* A word of warning here: you '''WILL NOT''' get another chance to copy this string. You can always create a new one, but once you close this pop-up, you will not have another chance to copy the key string out.
[[Image:GPT Integration 003.png]]
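Treat this key like a password. If you want to test the key outside of '''Grooper''' (for example, from a quick script before configuring any extractors), a common pattern is to keep it in an environment variable rather than pasting it into code. Below is a minimal sketch, assuming the conventional <code>OPENAI_API_KEY</code> variable name (a convention of OpenAI's client tooling, not something Grooper requires):

<syntaxhighlight lang="python">
import os

# Read the secret key from an environment variable so it never lands in a script or source control.
# Set it beforehand, e.g.:  export OPENAI_API_KEY="sk-..."
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("OPENAI_API_KEY is not set; create a key at https://platform.openai.com/")

# Every request to the OpenAI web API carries the key as a bearer token in this header.
headers = {"Authorization": f"Bearer {api_key}"}
print("Key loaded, starts with:", api_key[:6] + "...")
</syntaxhighlight>

Within '''Grooper''' itself, the key is simply pasted into the '''''API Key''''' property wherever GPT integration is configured, as shown in the sections below.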


=== GPT Complete (Extractor Type) ===
''GPT Complete'' is an '''''Extractor Type''''' that was added in Grooper 2023. It is the setting you choose to leverage GPT integration on an extractor. Below are some examples of configuration and use. You should be able to follow along using the '''GPT Integration''' ZIP files ('''Batch''' and '''Project''') included in this article. Begin by following along with the instructions. The details of the properties will be explained after.
<br><br>
It is also worth noting that the examples given below ARE NOT a comprehensive list. Provided are only a few examples of prompts used in extraction to get you thinking about what can be done. It is ''highly'' recommended that you not only reference the materials linked above, but also spend time experimenting and testing. Good luck!


==== Basic Configuration ====
# After importing the '''Grooper''' ZIP files provided with this article, expand the Node Tree out and select the '''Data Field''' named "Lessor".
# Click the drop-down menu for the '''''Value Extractor''''' property.
# Select the ''GPT Complete'' option from the menu.
[[Image:GPT Integration 004.png]]
# <li value=4> With the '''''Value Extractor''''' property set, click the ellipsis button to open its configuration window (if you prefer, you can instead click the drop-down arrow to the left of the property to edit its properties without a pop-up window).
[[Image:GPT Integration 005.png]]
# <li value=5> Start by entering your API key into the '''''API Key''''' property.
# Click the "Browse Batches" button.
# Select the "GPT Complete Examples" '''Batch''' in the "GPT Integration - Batches" folder from the menu.
[[Image:GPT Integration 006.png]]
# <li value=8> Select "Lease (1)" from the '''Batch Viewer'''.
# Click the ellipsis button for the '''''Instructions''''' property to open its configuration window (if you prefer, you can instead simply type into the entry field of the property).
[[Image:GPT Integration 007.png]]
# <li value=10> Type the string value <code>Who is the lessor?</code> into the editor.
# Click the "OK" button to accept and close this window.
[[Image:GPT Integration 008.png]]
# <li value=12> When the previous window closes, the extractor will immediately fire (assuming you have automatic testing enabled), and you will see a result returned in the "Results" list view.
[[Image:GPT Integration 009.png]]

From a "prompt engineering" perspective, the input we gave it is as basic as you can get. A result is returned, which is great, but it may not be the exact result that is desired. The value supplied is very conversational, which isn't necessarily a bad thing and is typical of an AI that's trained to emulate language, but considering how data is typically constructed in '''Grooper''', it's not quite right. If you break it down, the result given is really four values: the lessor's name, their marital status, their gender, and their location. In this case the name of the lessor alone will suffice.

The next thing to tackle will be using some prompt engineering to get a more specific result.
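Before moving on, it may help to see the shape of what an ''Instructions'' prompt actually does. Conceptually, the instruction and the document's synthesized text are sent together to a GPT completion endpoint, and whatever text comes back becomes the extractor's value. The sketch below is only an illustration of that pattern against OpenAI's chat completions endpoint, with a placeholder document string; it is not Grooper's internal implementation.

<syntaxhighlight lang="python">
import os
import requests

api_key = os.environ["OPENAI_API_KEY"]
document_text = "OIL AND GAS LEASE ... (placeholder for the document's full synthesized text)"
instructions = "Who is the lessor? Respond only with the lessor's name."

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "gpt-3.5-turbo",   # illustrative model choice
        "temperature": 0,           # deterministic output suits data extraction
        "messages": [
            {"role": "system", "content": instructions},
            {"role": "user", "content": document_text},
        ],
    },
    timeout=60,
)
response.raise_for_status()

# The returned text is the "extracted" value. Note that only text comes back:
# there is no character location data, which is why no green highlighting appears.
value = response.json()["choices"][0]["message"]["content"].strip()
print(value)
</syntaxhighlight>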
 
<br>
==== Getting a More Specific Result with Prompt Engineering ====
# Working with the same material as before, select the '''Data Field''' named "Lessee".
# Click the drop-down menu for the '''''Value Extractor''''' property.
# Select ''GPT Complete'' from the drop-down menu.
[[Image:GPT Integration 010.png]]
# <li value=4> With the '''''Value Extractor''''' set, click the ellipsis button to open its configuration window (if you prefer, you can instead click the drop-down arrow to the left of the property to edit its properties without a pop-up window).
[[Image:GPT Integration 011.png]]
# <li value=5> Start by entering your API key into the '''''API Key''''' property.
# Make sure "Lease (1)" is still selected in the '''Batch Viewer'''.
# Click the ellipsis button for the '''''Instructions''''' property to open its configuration window (if you prefer, you can instead simply type into the entry field of the property).
[[Image:GPT Integration 012.png]]
# <li value=8> Type the string value <code>Who is the lessee?</code> into the editor.
# Click the "OK" button to accept and close this window.
[[Image:GPT Integration 013.png]]
# <li value=10> When the previous window closes, the extractor will immediately fire (assuming you have automatic testing enabled), and you will see a result returned in the "Results" list view.
#* This is clearly a different result from the "Lessor", which is good, but let's address the issue mentioned previously and use some simple prompt engineering to get the specific result desired.
# Click the ellipsis button for the '''''Instructions''''' property to open its configuration window (if you prefer, you can instead simply type into the entry field of the property).
[[Image:GPT Integration 014.png]]
# <li value=12> Add <code>Respond only with the lessee's name.</code> to the string value.
# Click the "OK" button to accept and close this window.
[[Image:GPT Integration 015.png]]
# <li value=14> This is a much better result than before. However, the period at the end is unnecessary and can be removed, again, by prompting the AI appropriately.
# Click the ellipsis button for the '''''Instructions''''' property to open its configuration window (if you prefer, you can instead simply type into the entry field of the property).
[[Image:GPT Integration 016.png]]
# <li value=16> Add <code>Don't include control characters.</code> to the string value.
# Click the "OK" button to accept and close this window.
[[Image:GPT Integration 017.png]]
# <li value=18> Perfect! This is the exact value needed.
[[Image:GPT Integration 018.png]]

This is by no means anything but a simple prompt, but notice how giving context and being more specific alters the result. As a user learning this new technology, it's now time to start experimenting with your prompts and getting creative to get the results you're looking for.
 
<br>
==== Example: Full and Brief Document Summary ====
# Working with the same material as before, select the '''Data Field''' named "Full Summary".
# Click the drop-down menu for the '''''Value Extractor''''' property.
# Select ''GPT Complete'' from the drop-down menu.
[[Image:GPT Integration 019.png]]
# <li value=4> With the '''''Value Extractor''''' set, click the ellipsis button to open its configuration window (if you prefer, you can instead click the drop-down arrow to the left of the property to edit its properties without a pop-up window).
[[Image:GPT Integration 020.png]]
# <li value=5> Start by entering your API key into the '''''API Key''''' property.
# Type <code>tldr</code> into the '''''Instructions''''' property.
# Assuming you have automatic testing enabled, you will see a result returned in the "Results" list view. Click this result.
# Click the "Inspect" button.
[[Image:GPT Integration 021.png]]
# <li value=9> In the "Data Inspector" you will see the number of characters in the result.
# You will also see the full text of the summary.
# Right-click in a blank space to get a list of commands.
# Make sure "Text Wrap" is enabled so that the text will wrap like it does in the screenshot.
[[Image:GPT Integration 022.png]]
# <li value=13> After confirming the previous settings and closing windows, right-click the "Full Summary" '''Data Field''' to get a list of commands.
# Select the "Clone..." command.
[[Image:GPT Integration 023.png]]
# <li value=15> Name the clone "Brief Summary".
# Confirm the clone by clicking the "Execute" button.
[[Image:GPT Integration 024.png]]
# <li value=17> With the clone made, click the ellipsis button of the '''''Value Extractor''''' property to open its configuration window (if you prefer, you can instead click the drop-down arrow to the left of the property to edit its properties without a pop-up window).
[[Image:GPT Integration 025.png]]
# <li value=18> Add <code> in 100 words or less</code> to the '''''Instructions''''' property.
# A result will be returned in the "Results" list view. Select this result.
# Click the "Inspect" button.
[[Image:GPT Integration 026.png]]
# <li value=21> In the "Data Inspector" you will now notice this result's length is much shorter.
# The summary given is much shorter than the previous one due to the additional instruction in the prompt.
[[Image:GPT Integration 027.png]]
 
<br>
==== Example: Sentiment Analysis ====
# Working with the same material as before, select the '''Data Field''' named "Sentiment Analysis".
# Click the drop-down menu for the '''''Value Extractor''''' property.
# Select ''GPT Complete'' from the drop-down menu.
[[Image:GPT Integration 028.png]]
# <li value=4> With the '''''Value Extractor''''' set, click the ellipsis button to open its configuration window (if you prefer, you can instead click the drop-down arrow to the left of the property to edit its properties without a pop-up window).
[[Image:GPT Integration 029.png]]
# <li value=5> Start by entering your API key into the '''''API Key''''' property.
# Click the ellipsis button for the '''''Instructions''''' property to open its configuration window (if you prefer, you can instead simply type into the entry field of the property).
[[Image:GPT Integration 030.png]]
# <li value=7> Type the string <code>Is this document's sentiment positive, negative, or neutral? Respond with only the sentiment and no control characters.</code> into the editor.
# Click the "OK" button to accept and close this window.
[[Image:GPT Integration 031.png]]
# <li value=9> When the previous window closes, click on "Document (4)" in the '''Batch Viewer'''.
# Assuming you have automatic testing enabled, you will see a result ("negative") returned in the "Results" list view.
[[Image:GPT Integration 032.png]]
# <li value=11> Click on "Document (5)" in the '''Batch Viewer'''.
# Assuming you have automatic testing enabled, you will see a result ("positive") returned in the "Results" list view.
[[Image:GPT Integration 033.png]]
<br>


==== Extractor Type Properties ====
Before moving on to seeing how the GPT model is used for classification in '''Grooper''', let's take a look at the properties used in the ''GPT Complete'' extractor type.

===== API Key =====
You must fill this property with a valid API key from OpenAI in order to leverage GPT integration with Grooper. See the '''[[#Obtain an API Key|Obtain an API Key]]''' section above for instructions on how to get a key.

===== Model =====
The API key you use will determine which GPT models are available to you. The different GPT models can affect the text generated based on their size, training data, capabilities, prompt engineering, and fine-tuning potential. GPT-3's larger size and training data, in particular, can potentially result in more sophisticated, diverse, and contextually appropriate text compared to GPT-2. However, the actual performance and quality of the generated text also depend on various other factors, such as prompt engineering, input provided, and specific use case requirements. GPT-4 is the latest version, as of this writing, and takes the GPT model even further.
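If you are unsure which models a given key can reach, the OpenAI API exposes a model-listing endpoint you can call outside of Grooper. A quick sketch (the endpoint and response shape are OpenAI's; everything else here is illustrative):

<syntaxhighlight lang="python">
import os
import requests

api_key = os.environ["OPENAI_API_KEY"]

# Ask the OpenAI API which models this key is allowed to use.
response = requests.get(
    "https://api.openai.com/v1/models",
    headers={"Authorization": f"Bearer {api_key}"},
    timeout=30,
)
response.raise_for_status()

for model in response.json()["data"]:
    print(model["id"])   # e.g. gpt-3.5-turbo, gpt-4, text-embedding-ada-002, ...
</syntaxhighlight>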
 
===== Temperature =====
In the context of text generation using language models like ChatGPT, the temperature parameter is a setting that controls the randomness (or creativity) of the generated text. It is used during the sampling process, where the model selects the next word or token to generate based on its predicted probabilities.

The choice of temperature parameter depends on the desired output. Higher values are useful when you want more creativity and diversity in the generated text, but they may lead to less coherent or nonsensical sentences. Lower values are useful when you want more deterministic and focused text, but they may result in repetitive or overly conservative output. It's a hyperparameter that can be tuned to achieve the desired balance between randomness and coherence in the generated text.
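As a rough illustration of what the parameter does mathematically (this is generic sampling logic, not Grooper-specific code): the model's raw scores are divided by the temperature before being turned into probabilities, so small temperatures sharpen the distribution toward the most likely token and large temperatures flatten it.

<syntaxhighlight lang="python">
import math
import random

def sample_with_temperature(logits, temperature=1.0):
    """Pick one token index from raw model scores, scaled by temperature."""
    scaled = [score / temperature for score in logits]
    peak = max(scaled)                                 # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]     # softmax numerators
    total = sum(weights)
    probabilities = [w / total for w in weights]
    return random.choices(range(len(probabilities)), weights=probabilities, k=1)[0]

# Toy vocabulary of four candidate next tokens with made-up scores.
logits = [2.0, 1.0, 0.5, 0.1]
print(sample_with_temperature(logits, temperature=0.2))   # almost always index 0
print(sample_with_temperature(logits, temperature=1.5))   # noticeably more varied
</syntaxhighlight>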
===== TopP =====
TopP, also known as "nucleus sampling" or "stochastic decoding with dynamic vocabulary," is a text generation technique that is used to improve the diversity and randomness of generated text. It is often used as an alternative to traditional approaches like random sampling or greedy decoding in language models, such as GPT-2 and GPT-3.

In TopP sampling, the next token is chosen as follows:
# The model assigns a probability to every candidate word or token.
# The candidates are sorted from most to least probable.
# Probabilities are accumulated down this list until their sum reaches the threshold "p".
# The remaining set of words or tokens whose probabilities fall within the threshold "p" is considered for sampling.
By using TopP sampling, the model can generate text that is more diverse, as it allows for the possibility of selecting less frequent or rarer words or tokens, and it introduces randomness in the selection process. It can prevent the model from becoming overly deterministic or repetitive in its generated output, leading to more creative and varied text generation results.
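A minimal sketch of the selection step described above (again, generic sampling logic rather than anything Grooper exposes):

<syntaxhighlight lang="python">
import random

def top_p_filter(probabilities, p=0.9):
    """Keep the smallest set of token indices whose cumulative probability reaches p."""
    order = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probabilities[index]
        if cumulative >= p:             # stop once the "nucleus" covers p of the probability mass
            break
    return kept

def sample_top_p(probabilities, p=0.9):
    nucleus = top_p_filter(probabilities, p)
    weights = [probabilities[i] for i in nucleus]
    return random.choices(nucleus, weights=weights, k=1)[0]

# Toy distribution over five candidate tokens.
probs = [0.45, 0.25, 0.15, 0.10, 0.05]
print(top_p_filter(probs, p=0.8))       # -> [0, 1, 2]; the two rarest tokens are excluded
print(sample_top_p(probs, p=0.8))
</syntaxhighlight>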
===== Presence Penalty =====
The "presence penalty" is a technique used in text generation to encourage the model to generate more concise and focused outputs by penalizing the repetition of the same words or tokens in the generated text. It is a regularization technique that aims to reduce redundancy and promote diversity in the generated output.

The magnitude of the presence penalty can be tuned to control the level of repetition allowed in the generated text. A higher penalty value would result in stricter avoidance of repetition, while a lower penalty value would allow for more repetition. The presence penalty is one of the techniques that can be used in combination with other regularization methods, such as temperature scaling, top-k sampling, or fine-tuning, to improve the quality and diversity of generated text.
===== Frequency Penalty =====
Frequency-based regularization techniques in text generation refer to methods that aim to control the distribution of word or token frequencies in the generated text. This can be achieved by adding penalties or constraints while the model generates text, such as limiting the occurrence of certain words or tokens, promoting the use of less frequent words or tokens, or controlling the balance of word or token frequencies in the generated text.
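To make the difference between the two penalties concrete, OpenAI describes both as deductions applied to a token's score before sampling: the presence penalty is a flat deduction for any token that has already appeared, while the frequency penalty scales with how many times it has appeared. A small sketch of that adjustment (an illustration of the published idea, not Grooper code):

<syntaxhighlight lang="python">
from collections import Counter

def apply_penalties(logits, generated_tokens, presence_penalty=0.0, frequency_penalty=0.0):
    """Return adjusted scores for each candidate token id in `logits`."""
    counts = Counter(generated_tokens)          # how often each token has been produced so far
    adjusted = {}
    for token_id, score in logits.items():
        count = counts.get(token_id, 0)
        score -= frequency_penalty * count      # grows with every repeat
        if count > 0:
            score -= presence_penalty           # flat, one-time hit for having appeared at all
        adjusted[token_id] = score
    return adjusted

# Toy example: token 7 has already been generated twice, token 3 never.
logits = {3: 1.2, 7: 1.5}
print(apply_penalties(logits, [7, 7], presence_penalty=0.5, frequency_penalty=0.3))
# Token 7 drops to about 0.4 (1.5 - 2*0.3 - 0.5); token 3 is untouched.
</syntaxhighlight>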
===== Remaining Properties =====
The remaining properties are fairly straightforward and require less description than the previous terms.
* '''''Maximum Content Length''''' - The maximum amount of content from the document to be included, in tokens.
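Because this limit (and GPT pricing) is expressed in tokens rather than characters, it can help to count a document's tokens before sending it. A quick sketch using OpenAI's <code>tiktoken</code> tokenizer library; <code>cl100k_base</code> is the encoding used by the 2023-era GPT-3.5/GPT-4 chat models:

<syntaxhighlight lang="python">
import tiktoken   # pip install tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

document_text = "This Oil and Gas Lease is made and entered into by and between ..."
token_count = len(encoding.encode(document_text))
print(f"{token_count} tokens")

# Content beyond the Maximum Content Length is not included in the request,
# so the model never sees it.
</syntaxhighlight>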


=== GPT Embeddings (Classification Method) ===
'''''GPT Embeddings''''' should be considered a ''BETA'' feature.
* This feature was recently added by the development team without a specific use case in mind.
* Rather, it was developed in response to ChatGPT's growing popularity.
* While it should work in theory, with no specific use case originating the feature, it has not been extensively tested.
* As new use cases emerge that are suited for this feature, this section's documentation will be expanded.

The '''''GPT Embeddings''''' classification method is a training-based classification approach that uses "embeddings" to tell one document from another. An embedding is a vector (list) of numbers. You can determine the difference between embeddings based on the distance between their vectors. A small distance between embeddings suggests they are highly related. A large distance between embeddings suggests they are less related.

When using '''''GPT Embeddings''''' to classify documents, you will train the '''Content Model''' by giving Grooper example documents for each '''Document Type'''. The GPT model will assign the '''Document Types''' embeddings based on the text content from each trained document. When documents are classified (using the '''''Classify''''' activity), embeddings from the unclassified document are compared to the trained embedding values for each '''Document Type'''. Documents are then assigned the '''Document Type''' with the most similar embeddings.
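The comparison itself is simple vector math. As an illustration of the idea, with made-up three-dimensional embeddings (real embeddings from "text-embedding-ada-002" have 1,536 dimensions) and cosine similarity standing in for the distance measure:

<syntaxhighlight lang="python">
import math

def cosine_similarity(a, b):
    """Higher values mean the two embedding vectors point in more similar directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical trained embeddings, one per Document Type.
document_types = {
    "Lease":   [0.12, 0.80, 0.05],
    "Invoice": [0.75, 0.10, 0.40],
}

# Hypothetical embedding computed for an unclassified document.
unclassified = [0.10, 0.78, 0.09]

best_match = max(document_types,
                 key=lambda name: cosine_similarity(unclassified, document_types[name]))
print(best_match)   # -> Lease
</syntaxhighlight>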


For more information on embeddings, see OpenAI's embeddings documentation.

{|
|&#9888;
|
Please be aware embeddings have a maximum number of input tokens per request. This means there is a cutoff point for longer documents. How many input tokens are available depends on the GPT model you're using.
* OpenAI recommends using the "text-embedding-ada-002" model for embeddings.
* This model has 8191 maximum input tokens available.
|}


=== GPT Lookup (Lookup) ===
Following is a simple example that demonstrates how to use the ''GPT Lookup'' functionality. As with everything else regarding GPT integration in Grooper 2023, this is fairly untested and needs more experimentation to see its full potential. If nothing else, this example is intended to give you a basic understanding of how to establish the lookup so you can try things out on your own.
# Start by deleting all fields in the example '''Data Model''' other than "Lessor" and "Lessee".
#* This is meant to reduce the number of calls you will be making to OpenAI for GPT results, as "Lessor" and "Lessee" are the only '''Data Fields''' that will be leveraged in the following lookup example.
[[Image:GPT Integration 034.png]]
# <li value=2> Right-click the '''Data Model'''.
# Add a '''Data Field'''.
[[Image:GPT Integration 035.png]]
# <li value=4> Name it "Letter of Thanks".
# Click the "Execute" button.
[[Image:GPT Integration 036.png]]
# <li value=6> With the newly created '''Data Field''' object selected, set the '''''Display Width''''' property to ''500''.
# Set the '''''Multi-line''''' property to ''Enabled''.
# Expand the sub-properties of the '''''Multi Line''''' property and set the '''''Display Lines''''' property to ''15''.
# Set the '''''Word Wrap''''' property to ''True''.
[[Image:GPT Integration 037.png]]
# <li value=10> Select the '''Data Model'''.
# Click the ellipsis button on the '''''Lookups''''' property.
[[Image:GPT Integration 038.png]]
# <li value=12> In the "Lookups" window, click the "Add new lookups specification" button.
# Select the "GPT Lookup" option.
[[Image:GPT Integration 039.png]]
# <li value=14> With the "GPT Lookup" added to the "List of Lookup Specification" and selected, paste your API key into the '''''API Key''''' property.
# Click the ellipsis button for the '''''Prompt''''' property.
[[Image:GPT Integration 040.png]]
# <li value=16> In the "Prompt" editor, type the following string:
#* <code>Write a letter of thanks regarding the ease of purchase and clean state of the property from <span style="color:#00b400;">@Lessor</span> to <span style="color:#00b400;">@Lessee</span>.</code>
#* As you type this out (if you do, instead of copy-pasting) you will notice IntelliSense pop up when you use the <code style="color:#00b400;">@</code> symbol. Using the <code style="color:#00b400;">@</code> symbol allows you to leverage elements from your '''Data Model''' when creating your lookup. A sketch of this substitution pattern appears after these steps.
# When you have completed writing your prompt, click the "OK" button.
[[Image:GPT Integration 041.png]]
# <li value=18> Click the ellipsis button for the '''''Value Selectors''''' property.
[[Image:GPT Integration 042.png]]
# <li value=19> In the "Value Selectors" window click the "Add new value selector" button.
[[Image:GPT Integration 043.png]]
# <li value=20> With the "Value Selector" added to the "List of Value Selector" and selected, click the drop-down button for the '''''Target Field''''' property.
# Select the "Letter_of_Thanks" field.
#* Based on this configuration, the value generated by our prompt from our lookup will populate this field with the information generated by GPT.
[[Image:GPT Integration 044.png]]
# <li value=22> Back in the "Lookups" menu, scroll down in the property grid, and in the "Lookup Options" area click the drop-down button for the '''''Trigger Mode''''' property.
# Because <code style="color:#00b400;">@</code> symbols are being used in the prompt to leverage elements from the '''Data Model''', the ''Conditional'' setting should be selected.
[[Image:GPT Integration 045.png]]
# <li value=24> At the bottom of the property grid, notice the '''''Lookup Fields''''' and '''''Target Fields''''' are populated because elements were targeted in the prompt, and a field was targeted with the '''''Value Selectors''''' property.
# Click "OK" to close this menu.
[[Image:GPT Integration 046.png]]
# <li value=26> With the lookup configured, it's time to test. Click the "Tester" tab.
# Select "Folder (1)" from the "GPT Complete Examples" batch.
# Notice the "Lessee" value is successfully returned ...
# ... and that it is being leveraged as the salutation in the value created for the "Letter of Thanks" field.
[[Image:GPT Integration 047.png]]
# <li value=31> Also notice the "Lessor" value being returned ...
# ... and that it is being leveraged as the complementary close in the value created for the "Letter of Thanks" field.
# Feel free to take a look at the text created for the letter from the GPT AI.
[[Image:GPT Integration 048.png]]
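As mentioned in step 16, the <code style="color:#00b400;">@</code> variables are placeholders that are filled with values already collected in the '''Data Model''' before the prompt is sent to the GPT model. A rough sketch of that substitution pattern (the field names and values are just the ones from this example, and the template handling shown here is illustrative, not Grooper's own code):

<syntaxhighlight lang="python">
import re

# Prompt template as configured on the GPT Lookup, with @FieldName placeholders.
prompt_template = (
    "Write a letter of thanks regarding the ease of purchase and clean state "
    "of the property from @Lessor to @Lessee."
)

# Values the extractors already produced for this document (hypothetical).
field_values = {"Lessor": "Jane A. Doe", "Lessee": "John Q. Public"}

def fill_prompt(template, values):
    """Replace each @FieldName with its extracted value, leaving unknown names untouched."""
    return re.sub(r"@(\w+)", lambda m: values.get(m.group(1), m.group(0)), template)

prompt = fill_prompt(prompt_template, field_values)
print(prompt)

# The finished prompt is sent to the GPT model, and the generated letter is written
# into the "Letter of Thanks" target field by the Value Selector configured above.
</syntaxhighlight>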


==== Lookup Properties ====
Following are brief descriptions of properties that are unique to ''GPT Lookup''. Properties that overlap with previously explained properties, or are self-explanatory, will be skipped.


===== Response Format =====
: '''XML Notes'''
:: In an XML response, the Record Selector may be used as follows:
::* One record will be generated for each XML element matched by the selector.
::* Leave the property empty to select a singleton record at the root of the XML.
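To illustrate what "one record per matched element" means, here is a hypothetical XML response and a small sketch that pulls a record out of each matched element. The element names and the XPath-style selector are illustrative assumptions, not Grooper's exact selector syntax.

<syntaxhighlight lang="python">
import xml.etree.ElementTree as ET

# Hypothetical XML that a GPT Lookup might be instructed to return.
xml_response = """
<leases>
  <lease><lessor>Jane A. Doe</lessor><lessee>John Q. Public</lessee></lease>
  <lease><lessor>Acme Land Co.</lessor><lessee>Jim Smith</lessee></lease>
</leases>
"""

root = ET.fromstring(xml_response)

# A selector matching "lease" elements yields one record per <lease>;
# an empty selector would treat the whole document as a single record.
records = [
    {"Lessor": lease.findtext("lessor"), "Lessee": lease.findtext("lessee")}
    for lease in root.findall(".//lease")
]
print(records)
</syntaxhighlight>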

Revision as of 12:17, 17 April 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

20252023
Enhancing Grooper by integrating with modern AI technology.

OpenAI GPT integration in Grooper allows users to leverage modern AI technology to enhance their document data integration needs.

OpenAI's GPT model has made waves in the world of computing. Our Grooper developers recognized the potential for this to grow Grooper's capabilities. Adding its functionality will allow for users to explore and find creative solutions for processing their documents using this advanced technology.

You may download the ZIP(s) below and upload it into your own Grooper environment (version 2023). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

ABOUT

GPT (Generative Pre-trained Transformer) integration can be used for three things in Grooper:

  • Extraction - Prompt the GPT model to return information it finds in a document.
  • Classification - GPT has been trained against a massive corpus of information, which allows for a lot of potential when it comes to classifying documents. The idea here is that because it's seen so much, the amount of training required in Grooper should be less.
  • Lookup - With a GPT lookup you can provide information collected from a model in Grooper as @ variables in a prompt to have GPT generate data.

In this article you will be shown how Grooper leverages GPT for the aforementioned methods. Some example use cases will be given to demonstrate a basic approach. Given the nature of the way this technology works, it will be up to the user to get creative about how this can be used for their needs.

Things to Consider

Before moving forward it would be prudent to mention a few things about GPT and how to use it.

Prompt Engineering

This first thing to consider is how to structure a good prompt so that you get the results you are expecting. There is a bit of an art to knowing how to do this. GPT can tell bad jokes and write accidentally hilarious poems about your life, but it can also help you do your job better. The catch: you need to help it do its job better, too. At its most basic level, OpenAI's GPT-3 and GPT-4 predict text based on an input called a prompt. But to get the best results, you need to write a clear prompt with ample context. Further on in this article when the GPT Complete Value Extractor is being demonstrated you will see an example of prompt engineering.

Follow this link, or perhaps even this one, for more information on prompt engineering.

Tokens and Pricing

Another consideration is the way GPT pricing works. You are going to be charged for the "tokens" used when interacting with GPT. To that end, the prompt that you write, the text that you leverage to get a result, and the result that is returned to you are all considered part of the token consumption. You will need to be considerate of this as you build and use GPT in your models.

Follow this link for more information on what tokens are.

Follow this link for more information on GPT pricing.

Location Data for Data Extraction

The final thing to consider is in regards to the GPT Complete Value Extractor type (more on this soon.) If you have used Grooper before then you are probably familiar with how a returned value is highlighted with a green box in the document viewer. One of the main strengths of Grooper's text synthesis is that it collects location information for each character which allows this highlighting to occur. The GPT model does not consider location information when generating its results which means there will be no highlighting on the document for values collected with this method. The main impact this will have is on your ability to validate information returned by the GPT model.

How To

With the discussion of concepts out of the way, it is time to get into Grooper and see how and where to use the GPT integration.

Obtain an API Key

Grooper is able to integrate with OpenAI's GPT model because OpenAI provides a web API. All you need in order to use the Grooper GPT functionality is an API key. Here you will learn how to obtain an API key for yourself so you can start using GPT with Grooper.

  1. The first thing you should do is visit the OpenAI API site and log in or create an account.
  2. Once logged in, click the "Personal" menu in the top right.
  3. Within this menu, click the "View API Keys" option, which will take you to the "API keys" page.


  1. On the "API keys" page, click the "+ Create new secret key" button, which will make an "API key generated" pop-up.


  1. Highlight and copy, or click the copy button to copy the key string to your clipboard.
    • A word of warning here: you WILL NOT get another chance to copy this string. You can always create a new key, but once you close this pop-up, the key string can no longer be copied out.

GPT Complete (Extractor Type)

GPT Complete is an Extractor Type that was added in Grooper 2023. It is the setting you choose to leverage GPT integration on an extractor. Below are some examples of configuration and use. You should be able to follow along using the GPT Integration ZIP files (Batch and Project) included with this article. Begin by following along with the instructions; the details of the properties will be explained afterward.

It is also worth noting that the examples given below ARE NOT a comprehensive list. They are only a few examples of prompts used in extraction, meant to get you thinking about what can be done. It is highly recommended that you not only reference the materials linked above, but also spend time experimenting and testing. Good luck!

Basic Configuration

  1. After importing the Grooper ZIP files provided with this course, expand the Node Tree out and select the Data Field named "Lessor".
  2. Click the drop-down menu for the Value Extractor property.
  3. Select the GPT Complete option from the menu.


  1. With the Value Extractor property set, click the ellipsis button to open its configuration window (if you prefer, you can instead click the drop-down arrow to the left of the property to edit its properties without a pop-up window).


  1. Start by entering your API key into the API Key property.
  2. Click the "Browse Batches" button.
  3. Select "GPT Complete Examples" Batch in the "GPT Integration - Batches" folder from the menu.


  1. Select "Lease (1)" from the Batch Viewer.
  2. Click the ellipsis button for the Instructions property to open its configuration window (if you prefer, you can instead simply type into the entry field of the property.)


  1. Type the string value Who is the lessor? into the editor.
  2. Click the "OK" button to accept and close this window.


  1. When the previous window closes the extractor will immediately fire (assuming you have automatic testing enabled), and you will see a result returned in the "Results" list view.

From a "prompt engineering" perspective the input we gave it is as basic as you can get. A result is returned, which is great, but it may not be the exact result that is desired. The value supplied is very conversational, which isn't necessarily a bad thing and is typical of an AI that's trained to emulate language, but considering how data is typically constructed in Grooper, it's not quite right. If you break it down, the result given is really four values: the lessor's name, their marital status, their gender, and their location. In this case the name of the lessor only will suffice.

The next thing to tackle will be using some prompt engineering to get a more specific result.

Getting a More Specific Result with Prompt Engineering

  1. Working with the same material as before, select the Data Field named "Lessee".
  2. Click the drop-down menu for the Value Extractor property.
  3. Select GPT Complete from the drop-down menu.


  1. With the Value Extractor set, click the ellipsis button to open its configuration window (if you prefer, you can instead click the drop-down arrow to the left of the property to edit its properties without a pop-up window).


  1. Start by entering your API key into the API Key property.
  2. Make sure "Lease (1)" is still selected in the Batch Viewer.
  3. Click the ellipsis button for the Instructions property to open its configuration window (if you prefer, you can instead simply type into the entry field of the property.)


  1. Type the string value Who is the lessee? into the editor.
  2. Click the "OK" button to accept and close this window.


  1. When the previous window closes the extractor will immediately fire (assuming you have automatic testing enabled), and you will see a result returned in the "Results" list view.
    • This is clearly a different result from the "Lessor", which is good, but let's address the issue mentioned previously. Let's use some simple "prompt engineering" to get the specific result desired.
  2. Click the ellipsis button for the Instructions property to open its configuration window (if you prefer, you can instead simply type into the entry field of the property.)


  1. Add to the string value Respond only with the lessee's name.
  2. Click the "OK" button to accept and close this window.


  1. This is a much better result than before. However, the period at the end is unnecessary and can be removed, again, by prompting the AI appropriately.
  2. Click the ellipsis button for the Instructions property to open its configuration window (if you prefer, you can instead simply type into the entry field of the property.)


  1. Add to the string value Don't include control characters.
  2. Click the "OK" button to accept and close this window.


  1. Perfect! This is the exact value needed.

This is still a very simple prompt, but notice how giving context and being more specific alters the result. As a user learning this new technology, it's now time to start experimenting with your prompts and getting creative to get the results you're looking for.
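
For reference, the prompt that ultimately reaches OpenAI is the document's text content followed by the instruction text (see the Instructions property description later in this article). Conceptually, the final "Lessee" prompt from this example would look something like the following (the document text is abbreviated here for illustration):

  <full text of the lease document, which provides the context>

  Who is the lessee? Respond only with the lessee's name. Don't include control characters.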

Example: Full and Brief Document Summary

  1. Working with the same material as before, select the Data Field named "Full Summary".
  2. Click the drop-down menu for the Value Extractor property.
  3. Select GPT Complete from the drop-down menu.


  1. With the Value Extractor set, click the ellipsis button to open its configuration window (if you prefer, you can instead click the drop-down arrow to the left of the property to edit its properties without a pop-up window).


  1. Start by entering your API key into the API Key property.
  2. Type tldr into the Instructions property.
  3. Assuming you have automatic testing enabled, you will see a result returned in the "Results" list view. Click this result.
  4. Click the "Inspect" button.


  1. In the "Data Inspector" you will see the number of characters in the result.
  2. You will also see the full text of the summary.
  3. Right-click in a blank space to get a list of commands.
  4. Make sure "Text Wrap" is enabled so that the text will wrap like it is in the screenshot.


  1. After confirming the previous settings and closing windows, right-click the "Full Summary" Data Field to get a list of commands.
  2. Select the "Clone..." command.


  1. Name the clone "Brief Summary".
  2. Confirm the clone by clicking the "Execute" button.


  1. With the clone made, click the ellipsis button of the Value Extractor property to open its configuration window (if you prefer, you can instead click the drop-down arrow to the left of the property to edit its properties without a pop-up window).


  1. Add in 100 words or less to the Instructions property.
  2. A result will be returned in the "Results" list view. Select this result.
  3. Click the "Inspect" button.


  1. In the "Data Inspector" you will now notice this result's length is much shorter.
  2. The summary given is much shorter than the previous due to the additional instruction given in the prompt.

Example: Sentiment Analysis

  1. Working with the same material as before, select the Data Field named "Sentiment Analysis".
  2. Click the drop-down menu for the Value Extractor property.
  3. Select GPT Complete from the drop-down menu.


  1. With the Value Extractor set, click the ellipsis button to open its configuration window (if you prefer, you can instead click the drop-down arrow to the left of the property to edit its properties without a pop-up window).


  1. Start by entering your API key into the API Key property.
  2. Click the ellipsis button for the Instructions property to open its configuration window (if you prefer, you can instead simply type into the entry field of the property.)


  1. Type the string Is this document's sentiment positive, negative, or neutral? Respond with only the sentiment and no control characters. into the editor.
  2. Click the "OK" button to accept and close this window.


  1. When the previous window closes, click on "Document (4)" in the Batch Viewer.
  2. Assuming you have automatic testing enabled, you will see a result ("negative") returned in the "Results" list view.


  1. Click on "Document (5)" in the Batch Viewer.
  2. Assuming you have automatic testing enabled, you will see a result ("positive") returned in the "Results" list view.

Extractor Type Properties

Before moving on to how the GPT model is used for classification in Grooper, let's take a look at the properties used in the GPT Complete extractor type.

API Key

You must fill this property with a valid API key from OpenAI in order to leverage GPT integration with Grooper. See the Obtain an API Key section above for instructions on how to get a key.

Model

The API Key you use will determine which GPT models are available to you. The different GPT models can affect the text generated based on their size, training data, capabilities, prompt engineering, and fine-tuning potential. GPT-3's larger size and training data, in particular, can potentially result in more sophisticated, diverse, and contextually appropriate text compared to GPT-2. However, the actual performance and quality of the generated text also depend on various other factors, such as prompt engineering, input provided, and specific use case requirements. GPT-4 is the latest version, as of this writing, and takes the GPT model even further.

Temperature

In the context of text generation using language models like ChatGPT, the temperature parameter is a setting that controls the randomness of the generated text. It is used during the sampling process, where the model selects the next word or token to generate based on its predicted probabilities.

When generating text, the language model assigns probabilities to different words or tokens based on their likelihood of occurring next in the context of the input text. The temperature parameter is used to scale these probabilities before sampling from them. A higher temperature value (e.g., 1.0) makes the probabilities more uniform and increases randomness, resulting in more varied and diverse text. On the other hand, a lower temperature value (e.g., 0.2) makes the probabilities more concentrated and biased towards the most likely word, resulting in more deterministic and focused text.

For example, with a higher temperature setting, the model may generate sentences like:

"The weather is hot and sunny. I love to go swimming or hiking."

With a lower temperature setting, the model may generate sentences like:

"The weather is hot. I love to go swimming."

The choice of temperature parameter depends on the desired output. Higher values are useful when you want more creativity and diversity in the generated text, but it may lead to less coherent or nonsensical sentences. Lower values are useful when you want more deterministic and focused text, but it may result in repetitive or overly conservative output. It's a hyperparameter that can be tuned to achieve the desired balance between randomness and coherence in the generated text.
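
A small Python sketch of the idea, using made-up logits (raw scores) for three candidate words; dividing the logits by the temperature before applying softmax is what flattens or sharpens the resulting distribution:

  import math

  def softmax_with_temperature(logits, temperature):
      """Turn raw scores into probabilities, flattened or sharpened by temperature."""
      scaled = [l / temperature for l in logits]
      peak = max(scaled)
      exps = [math.exp(s - peak) for s in scaled]
      total = sum(exps)
      return [e / total for e in exps]

  logits = [4.0, 3.0, 1.0]   # made-up scores for "swimming", "hiking", "skiing"
  print(softmax_with_temperature(logits, 1.0))   # more even spread -> more varied text
  print(softmax_with_temperature(logits, 0.2))   # heavily peaked -> more deterministic text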

TopP

TopP, also known as "nucleus sampling" or "stochastic decoding with dynamic vocabulary," is a text generation technique that is used to improve the diversity and randomness of generated text. It is often used as an alternative to traditional approaches like random sampling or greedy decoding in language models, such as GPT-2 and GPT-3.

In TopP sampling, instead of sampling from the entire probability distribution of possible next words or tokens, the model narrows down the choices to a subset of the most likely options. The subset is determined dynamically based on a predefined probability threshold, denoted as "p". The model considers only the words or tokens whose cumulative probability mass (probability of occurrence) falls within the top "p" value. The remaining words or tokens with lower probabilities are pruned from the selection.

Mathematically, given a probability distribution over all possible words or tokens, TopP sampling works as follows:

  1. Sort the probabilities of all candidate words or tokens in descending order.
  2. Compute the cumulative sum of these probabilities from highest to lowest.
  3. Stop once the cumulative sum reaches the threshold "p". For example, 0.1 means only the tokens comprising the top 10% of probability mass are considered.
  4. Sample the next word or token from this reduced set, with the remaining probabilities renormalized; everything outside the set is pruned.

By using TopP sampling, the model can generate text that is more diverse, as it allows for the possibility of selecting less frequent or rarer words or tokens, and it introduces randomness in the selection process. It can prevent the model from becoming overly deterministic or repetitive in its generated output, leading to more creative and varied text generation results.
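
The sketch below implements the steps above over a toy distribution; the tokens and probabilities are made up purely for illustration:

  import random

  def top_p_sample(token_probs, p):
      """Sample one token from the smallest set whose cumulative probability covers p."""
      ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
      nucleus, cumulative = [], 0.0
      for token, prob in ranked:
          nucleus.append((token, prob))
          cumulative += prob
          if cumulative >= p:          # stop once the top-p mass is covered
              break
      tokens, probs = zip(*nucleus)
      total = sum(probs)               # renormalize within the nucleus
      return random.choices(tokens, weights=[pr / total for pr in probs])[0]

  probs = {"swimming": 0.5, "hiking": 0.3, "skiing": 0.15, "yodeling": 0.05}
  print(top_p_sample(probs, 0.8))      # only "swimming" and "hiking" can be chosen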

Presence Penalty

The "presence penalty" is a technique used in text generation to encourage the model to generate more concise and focused outputs by penalizing the repetition of the same words or tokens in the generated text. It is a regularization technique that aims to reduce redundancy and promote diversity in the generated output.

In OpenAI's API, the presence penalty is applied at generation time rather than during training: the score (logit) of any token that has already appeared in the text so far is reduced by a fixed amount before the next token is sampled. The penalty can be formulated in different ways depending on the model and objectives, but the general idea is to make the model less likely to repeat words or tokens it has already produced.

The presence penalty encourages the model to generate text that is more concise, avoids repetitive patterns, and promotes the use of a wider vocabulary. It helps prevent the model from generating overly verbose or redundant text, which can be undesirable in certain text generation tasks, such as story generation or summarization.

The magnitude of the presence penalty can be tuned to control the level of repetition allowed in the generated text. A higher penalty value would result in stricter avoidance of repetition, while a lower penalty value would allow for more repetition. The presence penalty is one of the techniques that can be used in combination with other regularization methods, such as temperature scaling, top-k sampling, or fine-tuning, to improve the quality and diversity of generated text.

Frequency Penalty

Frequency-based regularization techniques in text generation can refer to methods that aim to control the distribution of word or token frequencies in the generated text. This can be achieved by adding penalties or constraints to the model during training, such as limiting the occurrence of certain words or tokens, promoting the use of less frequent words or tokens, or controlling the balance of word or token frequencies in the generated text.
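
As a rough sketch of how both penalties act at generation time (per OpenAI's API documentation, the frequency penalty scales with how often a token has already appeared, while the presence penalty is a one-time reduction for any token that has appeared at all), the following uses made-up logits and counts:

  def apply_penalties(logits, counts, presence_penalty, frequency_penalty):
      """Reduce the scores of tokens that have already appeared in the output."""
      adjusted = {}
      for token, logit in logits.items():
          count = counts.get(token, 0)
          adjusted[token] = (logit
                             - frequency_penalty * count                    # grows with each repeat
                             - presence_penalty * (1 if count > 0 else 0))  # one-time reduction
      return adjusted

  logits = {"the": 2.0, "lease": 1.5, "lessor": 1.2}   # made-up scores
  counts = {"the": 3, "lease": 1}                      # how often each has appeared so far
  print(apply_penalties(logits, counts, presence_penalty=0.5, frequency_penalty=0.3))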

Remaining Properties

The remaining properties are fairly straightforward and require less description than the previous terms.

  • Timeout - The amount of time, in seconds, to wait for a response from the web service before raising a timeout error.
  • Instructions - The instructions or question to include in the prompt. The prompt sent to OpenAI consists of text content from the document, which provides context, plus the text entered here. This property should ask a question about the content or provide instructions for generating output. For example, "what is the effective date?", "summarize this document", or "Your task is to generate a comma-separated list of assignors".
  • Preprocessing (Paragraph Marking, Tab Marking, Vertical Tab Marking, Ignore Control Characters) - Put simply, these tools allow the insertion (or deletion) of control characters to give textual context to information that would otherwise be purely spatial. GPT has no awareness of the location of the text you feed it. As a person, you can look at a table of information and understand it visually; GPT cannot. However, inserting control characters like tabs or paragraph marks increases the chance that GPT will understand that structure.
  • Overflow Disposition - Specifies the behavior when the document content is longer than the context length of the selected model. May be one of the following:
    • Truncate - The content will be truncated to fit the model's context length.
    • Split - The content will be split into chunks which fit the model's context length. One result will be returned for each chunk. (A conceptual chunking sketch follows this list.)
  • Context Extractor - An optional extractor which filters the document content included in the prompt. All Value Extractor types are available.
  • Max Response Length - The maximum length of the output, in tokens. 1 token is equivalent to approximately 4 characters for English text. Increasing this value decreases the maximum size of the context.
  • Maximum Content Length - The maximum amount of content from the document to be included, in tokens.
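
For a rough idea of what the Split disposition does conceptually, the sketch below cuts a document's text into token-sized chunks using the tiktoken package. This is not Grooper's actual implementation, and the 3,000-token budget is an arbitrary illustrative value:

  import tiktoken

  def split_into_chunks(text, max_tokens, model="gpt-3.5-turbo"):
      """Cut text into pieces that each fit within a token budget."""
      enc = tiktoken.encoding_for_model(model)
      tokens = enc.encode(text)
      return [enc.decode(tokens[start:start + max_tokens])
              for start in range(0, len(tokens), max_tokens)]

  document_text = "...long document text..."   # illustrative placeholder
  chunks = split_into_chunks(document_text, max_tokens=3000)
  # Each chunk would be sent as its own prompt, producing one result per chunk.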

GPT Embeddings (Classification Method)

GPT Embeddings should be considered a BETA feature.

  • This feature was recently added by the development team without a specific use case in mind.
  • Rather, it was developed in response to ChatGPT's growing popularity.
  • While it should work in theory, with no specific use case originating the feature, it has not been extensively tested.
  • As new use cases emerge that are suited for this feature, this section's documentation will be expanded.

The GPT Embeddings classification method is a training-based classification approach that uses "embeddings" to tell one document from another. An embedding is a vector (list) of numbers. You can determine how related two pieces of text are from the distance between their embedding vectors: a small distance suggests they are highly related, while a large distance suggests they are less related.

When using GPT Embeddings to classify documents, you will train the Content Model by giving Grooper example documents for each Document Type. The GPT model will assign the Document Types embeddings based on the text content from each trained document. When documents are classified (using the Classify activity), embeddings from the unclassified document are compared to the trained embedding values for each Document Type. Documents are then assigned the Document Type with the most similar embeddings.

For more information on embeddings, see OpenAI's documentation on the topic.

Please be aware embeddings have a maximum number of input tokens per request. This means there is a cutoff point for longer documents. How many input tokens are available depends on the GPT model you're using.

  • OpenAI recommends using the "text-embedding-ada-002" model for embeddings.
  • This model has 8191 maximum input tokens available.
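
To illustrate the comparison conceptually, the sketch below embeds one example document per Document Type, embeds an unclassified document, and picks the type with the highest cosine similarity. This is only an illustration of the idea, not Grooper's internal training or storage; it assumes the 2023-era openai Python package (pre-1.0 interface) and the "text-embedding-ada-002" model recommended above.

  import math
  import openai

  openai.api_key = "YOUR_API_KEY"

  def embed(text):
      """Request an embedding vector for a piece of text."""
      response = openai.Embedding.create(model="text-embedding-ada-002", input=text)
      return response["data"][0]["embedding"]

  def cosine_similarity(a, b):
      dot = sum(x * y for x, y in zip(a, b))
      norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
      return dot / norms

  # One trained example per Document Type (placeholders for real document text).
  type_embeddings = {
      "Lease": embed("...text of a trained Lease example..."),
      "Invoice": embed("...text of a trained Invoice example..."),
  }
  unclassified = embed("...text of the document to classify...")

  best_type = max(type_embeddings,
                  key=lambda t: cosine_similarity(unclassified, type_embeddings[t]))
  print(best_type)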

GPT Lookup (Lookup)

Following is a simple example that will demonstrate how to use the GPT Lookup functionality. As with everything else regarding GPT Integration in Grooper 2023, this is fairly untested and needs more experimentation to reveal its full potential. If nothing else, this example is intended to give you a basic understanding of how to establish the lookup so you can try things out on your own.

  1. Start by deleting all other fields in the example Data Model other than "Lessor" and "Lessee".
    • This is meant to reduce the number of calls you will be making to OpenAI for GPT results as "Lessor" and "Lessee" are the only Data Fields that will be leveraged in the following lookup example.


  1. Right-click the Data Model.
  2. Add a Data Field.


  1. Name it "Letter of Thanks".
  2. Click the "Execute" button.


  1. With the newly created Data Field object selected, set the Display Width property to 500.
  2. Set the Multi-line property to Enabled.
  3. Expand the sub-properties of the Multi Line property and set the Display Lines property to 15.
  4. Set the Word Wrap property to True.


  1. Select the Data Model.
  2. Click the ellipsis button on the Lookups property.


  1. In the "Lookups" window, click the "Add new lookups specification" button.
  2. Select the "GPT Lookup" option.


  1. With the "GPT Lookup" added to the "List of Lookup Specification" and selected, paste in your API key to the API Key property.
  2. Click the ellipsis button for the Prompt property.


  1. In the "Prompt" editor, type the following string:
    • Write a letter of thanks regarding the ease of purchase and clean state of the property from @Lessor to @Lessee.
    • As you type this out (rather than copy-pasting), you will notice an intellisense pop-up when you use the @ symbol. Using the @ symbol allows you to leverage elements from your Data Model when creating your lookup.
  2. When you have completed writing your prompt, click the "OK" button.


  1. Click the ellipsis button for the Value Selectors property.


  1. In the "Value Selectors" window click the "Add new value selector" button.


  1. With "Value Selector" added to the "list of Value Selector" and selected, click the drop-down button for the Target Field property.
  2. Select the "Letter_of_Thanks" field.
    • Based on this configuration, the value generated by our prompt from our lookup will populate this field with the information generated by GPT.


  1. Back in the "Lookups" menu, scroll down in the property grid, and in the "Lookup Options" area click the drop-down button for the Trigger Mode property.
  2. Because @ symbols are being used in the prompt to leverage elements from the Data Model, the Conditional setting should be selected.


  1. At the bottom of the property grid notice the Lookup Fields and Target Fields are populated because elements were targeted in the prompt, and a field was targeted with the Value Selectors property.
  2. Click "OK" to close this menu.


  1. With the lookup configured it's time to test. Click the "Tester" tab.
  2. Select "Folder (1)" from the "GPT Complete Examples" batch.
  3. Click the "Test the data element" button.
  4. Notice the "Lessee" value is successfully returned ...
  5. ... and that it is being leveraged as the salutation in the value created for the "Letter of Thanks" field.


  1. Also notice the "Lessor" value being returned ...
  2. ... and that it is being leveraged as the complimentary close in the value created for the "Letter of Thanks" field.
  3. Feel free to take a look at the text created for the letter from the GPT AI.

Lookup Properties

Following are brief descriptions of properties that are unique to GPT Lookup. Properties that overlap with previously explained properties, or are self explanatory, will be skipped.

Response Format

This specifies the format in which data will be exchanged with the web service. It can be one of the following values:

  • Text - The response will be plain text. Record and value selectors should be specified using regular expressions.
  • JSON - The response will be in JSON format. Record and value selectors should be specified using JSONPath syntax.
  • XML - The request and response body will be in XML format. Record and value selectors should be specified using XPath syntax.

The format selected here will be used both for sending POST data and interpreting responses. It is currently not possible to send an XML request then interpret the response as JSON, or vice-versa.

Record Selector

This is a JSONPath or XPath expression which selects records in the response.

The record selector is used to specify which JSON or XML entities represent records in the result set.

JSON Notes
In a JSON response, the Record Selector may be used as follows:
  • If the selector matches an array, one record will be generated for each element of the array.
  • If the selector matches one or more objects, one record will be generated for each object.
  • Leave the property empty to select an array or object at the root of the JSON document.
XML Notes
In an XML response, the Record Selector may be used as follows:
  • One record will be generated for each XML element matched by the selector.
  • Leave the property empty to select a singleton record at the root of the XML.
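
As a purely hypothetical illustration, suppose the prompt asked GPT to return a JSON list of assignors and, with Response Format set to JSON, the response came back as:

  {
    "assignors": [
      { "name": "Jane Roe" },
      { "name": "John Doe" }
    ]
  }

A Record Selector of $.assignors[*] would then generate one record per element of the array, and a value selector targeting the name property would supply the value for each record. The response shape and the exact relative-path form expected by value selectors are assumptions for illustration only.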