2023.1:Correct (Activity): Difference between revisions

From Grooper Wiki
(9 intermediate revisions by 3 users not shown)
{{AutoVersion}}


<blockquote>{{#lst:Glossary|Correct}}</blockquote>

{|class="download-box"
|
[[File:Asset 22@4x.png]]
|
You may download the ZIP files below and upload them into your own Grooper environment (version 2023.1). The first contains one or more '''Batches''' of sample documents. The second contains one or more '''Projects''' with resources used in examples throughout this article.

* [[Media:2023.1 Wiki Correct-(Activity) Batch.zip]]
* [[Media:2023.1 Wiki Correct-(Activity) Project.zip]]
|}


== About ==

When running OCR on a document, you don't always get perfect results. The OCR Engine makes mistakes. You can use [[Fuzzy RegEx|fuzzy matching]] to still extract the information that you want, but the OCR text data attached to the document will still contain the errors. The ''Correct'' '''Batch Process Step''' changes the text data attached to the document, so the digital text is more accurate when you export your files.

You can also remove sections of text from the document's digital text using the ''Correct'' '''Batch Process Step'''. However, ''Correct'' does not remove the information from the document image or PDF. To do that, you need to add a [[Redact (Activity)|Redact]] '''Batch Process Step'''.


== How To ==
This tutorial shows how to set up the ''Correct'' '''Batch Process Step''' to do two things: correct bad OCR in a document's text data and completely remove text from that text data. We will start with correcting OCR.


<div style="padding-left: 1.5em";>
=== Correcting OCR ===
Running the ''Correct'' '''Batch Process Step''' changes the text data attached to the document. An extractor with '''''Fuzzy Matching''''' enabled can return the correct value even when the OCR is bad. The ''Correct'' step takes that extracted text and uses it to replace the original OCR text. So the first thing we need is an extractor that uses fuzzy matching to collect what we want to correct.
<div style="padding-left: 1.5em";>
<div style="padding-left: 1.5em";>


<b><big>Adding the Correct Batch Process Step</big></b>


# Right-click on the '''Batch Process''' in the node tree.
# Hover over "Add Activity", then hover over "Cleanup & Recognition". Finally, click on "Correct...".
# When the window pops up, change the '''''Step Name''''' if you would like. We are going to keep the default name of "Correct".
# Click "EXECUTE" in the top-right corner of the pop-up window.
[[File:2023.1 Correct-(Activity) 02 02 BP Step 01.png]]
<b><big>Configuring for Correcting OCR</big></b>
# Now you should have a ''Correct'' '''Batch Process Step''' in your node tree.
# Click the hamburger icon to the right of the '''''Scope''''' property in the "Step Properties" grid.
# Generally you will want to set the '''''Scope''''' property to ''Page'' because ''Recognize'' is usually run at a ''page'' level and ''Correct'' needs text data to run. So, we are going to set the '''''Scope''''' to a ''page'' level.
[[File:2023.1 Correct-(Activity) 02 03 Correction-Extractor 01.png]]
# <li value=4> Set the '''''Scope''''' in the "Activities Properties" grid by clicking the hamburger icon to the right of the property.
# You can set the scope to correct either the whole document or only specific fields. In most cases, correcting the whole document is recommended, so we are going to set the '''''Scope''''' to ''Document''.
[[File:2023.1 Correct-(Activity) 02 03 Correction-Extractor 02.png]]
#<li value=6> Set the '''''Enable Spell Correction''''' to ''True''.
[[File:2023.1 Correct-(Activity) 02 03 Correction-Extractor 03.png]]
# <li value=7> Set your '''''Correction Extractor'''''. We are going to set it to a ''Reference'' in this tutorial.
# We are going to set the reference to the "Field Labels" '''Value Reader''' that has '''''Fuzzy Matching''''' enabled.
[[File:2023.1 Correct-(Activity) 02 03 Correction-Extractor 04.png]]
#<li value=9> Once you have finished configuring your step, click the save icon at the top of the property grid to save your changes.
[[File:2023.1 Correct-(Activity) 02 03 Correction-Extractor 05.png]]
</div>
=== Using A Correct Step for Removal ===
Often you may have sensitive information on a document such as personal contact information, social security numbers (SSNs), account information, etc. The [[Redact (Activity)|Redact Activity]] can black out this extracted information on the document itself so it cannot be read. However, the ''Redact Activity'' does not change the text data of the document. If someone were to look at the text data attached to a redacted PDF, they would see the original information intact. To fully remove the information from the text data, we need to use the '''''Removal Extractor''''' property in the ''Correct'' '''Batch Process Step'''.
In this example we are going to configure the '''''Removal Extractor''''' on the same ''Correct'' '''Step''' that is already correcting bad OCR. You don't necessarily need to configure both for the step to function. If you only need the text data deleted and no corrections made, you can just configure the '''''Removal Extractor''''' by itself (or vice versa).
# For this tutorial we have already put together a '''Data Type''' that is extracting the sensitive information on our document that we want to remove from the text data.
# Two SSNs or account numbers are being returned by the extractor on this document.
[[File:2023.1 Correct-(Activity) 02 04 Removal 01.png]]
#<li value=3> Back in the ''Correct'' '''Batch Process Step''', set the '''''Removal Extractor''''' property by clicking the hamburger icon to the right of the property.
# In our example, we set the '''''Removal Extractor''''' to a ''Reference'' extractor. You can set it to any Value Extractor you wish to collect the information you want removed.
[[File:2023.1 Correct-(Activity) 02 04 Removal 02.png]]
#<li value=5> Open the '''''Removal Extractor''''' property and click the hamburger icon to the right of the '''''Extractor''''' property.
# Select the Value Extractor that is returning the information you want removed from the text data. We are going to select the '''Data Type''' that is extracting the SSNs and account numbers.
[[File:2023.1 Correct-(Activity) 02 04 Removal 03.png]]




#<li value=7> Click on the save icon at the top of the property grid to save your changes.


[[File:2023.1 Correct-(Activity) 02 04 Removal 04.png]]
</div>
</div>

Latest revision as of 16:28, 27 August 2025

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.



Correct is an Activity that performs spell correction. It can correct a Batch Folder's text content or specific Data Element values to resolve OCR errors, deidentify data or otherwise enhance text data.

You may download the ZIP files below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

When running OCR on a document, you don't always get perfect results. The OCR Engine makes mistakes. You can use fuzzy matching to still extract the information that you want, but the original OCRed text data attached to the document file will still reflect that bad OCR. The Correct Batch Process Step will change the text data attached to the document so when you export your files, the digital text will be more accurate.

You can also remove sections of text from the document's digital text using the Correct Batch Process Step. However, Correct does not remove the information from the document image or PDF. To do that, you need to add a Redact Batch Process Step.

How To

In this tutorial we are going to show how to set up the Correct Batch Process Step to do two things: correct bad OCR in text data and completely remove text from the text data of a document. We will start with correcting OCR.

Correcting OCR

Running the Correct Batch Process Step changes the text data attached to the document. An extractor with Fuzzy Matching enabled can return the correct value even when the OCR is bad. The Correct step takes that extracted text and uses it to replace the original OCR text. So the first thing we need is an extractor that uses fuzzy matching to collect what we want to correct.

Fuzzy Matching

  1. "Employer Signature" is an entry in our List Match as pictured below.
  2. However, we are not collecting the "Employer Signature" label from the document.
  3. To find out why, click on the Renditions icon located in the top right corner of the Document Viewer.
  4. Click on Text from the drop down.


  5. We are not getting the result because of bad OCR. We can see in the text that "Employer Signature" was recognized as "Employer Signat re".


  6. If we turn on Fuzzy Matching, we can then get the result that we want. However, the text data remains the same. Only the extraction is corrected.
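
To make the idea behind these steps concrete, here is a minimal Python sketch of fuzzy matching against OCR text. It is only an illustration of the concept, not Grooper's implementation: the function name fuzzy_find, the 0.9 threshold, and the use of Python's difflib similarity ratio are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def fuzzy_find(entry, text, threshold=0.9):
    """Return (start, end, score) for the window of `text` that best
    resembles `entry`, or None if no window reaches `threshold`.

    Hypothetical helper for illustration; it only compares windows of the
    same length as `entry`, so it handles substituted characters but not
    inserted or dropped ones.
    """
    n = len(entry)
    best = None
    for start in range(max(1, len(text) - n + 1)):
        window = text[start:start + n]
        score = SequenceMatcher(None, entry.lower(), window.lower()).ratio()
        if best is None or score > best[2]:
            best = (start, start + n, score)
    if best is not None and best[2] >= threshold:
        return best
    return None


ocr_text = "Date: 01/02/2023\nEmployer Signat re: ________"

# An exact search misses the label because of the OCR error.
exact_hit = "Employer Signature" in ocr_text

# A fuzzy search still locates the damaged label.
match = fuzzy_find("Employer Signature", ocr_text)
matched_text = ocr_text[match[0]:match[1]] if match else None
print(exact_hit, repr(matched_text))
```

Note that the sketch only reports where the near-match is. Like the extractor in the steps above, it leaves ocr_text itself unchanged.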


Adding the Correct Batch Process Step

  1. Right-click on the Batch Process in the node tree.
  2. Hover over "Add Activity", then hover over "Cleanup & Recognition". Finally, click on "Correct...".
  3. When the window pops up, change the Step Name if you would like. We are going to keep the default name of "Correct".
  4. Click "EXECUTE" in the top-right corner of the pop-up window.


Configuring for Correcting OCR

  1. Now you should have a Correct Batch Process Step in your node tree.
  2. Click the hamburger icon to the right of the Scope property in the "Step Properties" grid.
  3. Generally you will want to set the Scope property to Page because Recognize is usually run at a page level and Correct needs text data to run. So, we are going to set the Scope to a page level.


  4. Set the Scope in the "Activities Properties" grid by clicking the hamburger icon to the right of the property.
  5. You can set the scope to correct either the whole document or only specific fields. In most cases, correcting the whole document is recommended, so we are going to set the Scope to Document.


  6. Set the Enable Spell Correction to True.


  7. Set your Correction Extractor. We are going to set it to a Reference in this tutorial.
  8. We are going to set the reference to the "Field Labels" Value Reader that has Fuzzy Matching enabled.


  9. Once you have finished configuring your step, click the save icon at the top of the property grid to save your changes.
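
Conceptually, a step configured this way takes each value the fuzzy extractor finds and writes the clean value back over the damaged OCR text. The Python sketch below illustrates that idea under stated assumptions: correct_text, the label list, and the 0.9 threshold are hypothetical names and values chosen for the example, and the same-length window comparison is a simplification rather than how Grooper actually performs the correction.

```python
from difflib import SequenceMatcher


def correct_text(text, lexicon, threshold=0.9):
    """Overwrite near-miss occurrences of each lexicon entry with the entry.

    Hypothetical illustration: for every entry that is not already present
    verbatim, find the best same-length window and replace it if it is
    similar enough.
    """
    for entry in lexicon:
        if entry in text:
            continue  # already spelled correctly, nothing to fix
        n = len(entry)
        # Score every window of the entry's length, keeping its start index.
        scored = [
            (SequenceMatcher(None, entry.lower(), text[i:i + n].lower()).ratio(), i)
            for i in range(max(1, len(text) - n + 1))
        ]
        score, start = max(scored)
        if score >= threshold:
            text = text[:start] + entry + text[start + n:]
    return text


ocr_text = "Employee Name: Jane Doe\nEmployer Signat re: ________"
field_labels = ["Employee Name", "Employer Signature"]

corrected = correct_text(ocr_text, field_labels)
print(corrected)
```

The difference from plain fuzzy extraction is the final assignment: the text itself is rewritten, so anything exported afterwards carries the corrected label.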

Using A Correct Step for Removal

Often you may have sensitive information on a document such as personal contact information, social security numbers (SSNs), account information, etc. The Redact Activity can black out this extracted information on the document itself so it cannot be read. However, the Redact Activity does not change the text data of the document. If someone were to look at the text data attached to a redacted PDF, they would see the original information intact. To fully remove the information from the text data, we need to use the Removal Extractor property in the Correct Batch Process Step.

In this example we are going to configure the Removal Extractor on the same Correct Step that is already correcting bad OCR. You don't necessarily need to configure both for the step to function. If you only need the text data deleted and no corrections made, you can just configure the Removal Extractor by itself (or vice versa).
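
As a rough illustration of what removal from the text data means, the Python sketch below deletes every value matched by a pattern from a block of text. It rests on assumptions: the SSN pattern, the sample text, and the choice to replace matches with an empty string are made up for the example, and the article does not state what Grooper leaves in place of removed text.

```python
import re

# Hypothetical pattern for U.S. social security numbers written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def remove_matches(text, pattern, replacement=""):
    """Return `text` with every match of `pattern` replaced by `replacement`."""
    return pattern.sub(replacement, text)


text_data = (
    "Name: Jane Doe\n"
    "SSN: 123-45-6789\n"
    "Spouse SSN: 987-65-4321\n"
)

cleaned = remove_matches(text_data, SSN_PATTERN)
print(cleaned)
```

Blacking out the same values on the page image is a separate operation. That is why the Redact Activity and the Removal Extractor complement each other.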

  1. For this tutorial we have already put together a Data Type that is extracting the sensitive information on our document that we want to remove from the text data.
  2. Two SSNs or account numbers are being returned by the extractor on this document.


  3. Back in the Correct Batch Process Step, set the Removal Extractor property by clicking the hamburger icon to the right of the property.
  4. In our example, we set the Removal Extractor to a Reference extractor. You can set it to any Value Extractor you wish to collect the information you want removed.


  5. Open the Removal Extractor property and click the hamburger icon to the right of the Extractor property.
  6. Select the Value Extractor that is returning the information you want removed from the text data. We are going to select the Data Type that is extracting the SSNs and account numbers.


  7. Click on the save icon at the top of the property grid to save your changes.