2023.1:Correct (Activity)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Correct is an Activity that performs spell correction. It can correct a Batch Folder's text content or specific Data Element values to resolve OCR errors, deidentify data, or otherwise enhance text data.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

When running OCR on a document, you don't always get perfect results because the OCR Engine makes mistakes. Fuzzy matching lets you extract the information you want anyway, but the OCR text data attached to the document file still contains the errors. The Correct Batch Process Step changes the text data attached to the document, so when you export your files, the digital text is more accurate.

You can also remove sections of text from the digital text of the document using the Correct Batch Process Step. However, the Correct Batch Process Step will not remove the information from the document image or PDF. To do that, you would need to add a Redact Batch Process Step.

How To

In this tutorial we are going to show how to set up the Correct Batch Process Step to do two things: correct bad OCR in text data and completely remove text from the text data of a document. We will start with correcting OCR.

Correcting OCR

Using the Correct Batch Process Step changes the text data attached to the document. An extractor with Fuzzy Matching enabled can return the correct value even when the OCR is bad. The Correct step takes that extracted value and uses it to replace the badly recognized text in the original OCR data. So, the first thing we need is an extractor with Fuzzy Matching enabled that collects what we want to correct.

Fuzzy Matching

  1. "Employer Signature" is an entry in our List Match as pictured below.
  2. However, we are not collecting the "Employer Signature" label from the document.
  3. To find out why, click on the Renditions icon located in the top right corner of the Document Viewer.
  4. Click on Text from the drop down.


  1. We are not getting the result because of bad OCR. We can see in the text that "Employer Signature" was recognized as "Employer Signat re".


  1. If we turn on Fuzzy Matching, we can then get the result that we want. However, the text data remains the same. Only the extraction is corrected.
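
To make the idea behind Fuzzy Matching more concrete, here is a minimal Python sketch. It is not Grooper's implementation: the function name `fuzzy_find`, the 0.9 threshold, and the sample text are assumptions made for illustration, and the similarity score comes from Python's standard `difflib` module rather than Grooper's own scoring. It shows why an exact search misses "Employer Signat re" while a similarity search still finds it.

```python
from difflib import SequenceMatcher


def fuzzy_find(text, label, threshold=0.9):
    """Return (start, end, score) for the window of text most similar to label.

    Returns None when no window reaches the threshold. Hypothetical helper,
    not a Grooper API.
    """
    best = None
    n = len(label)
    # Try windows one character shorter, equal, and one character longer than
    # the label, so a dropped or an extra character can still line up.
    for size in (n - 1, n, n + 1):
        for start in range(0, max(len(text) - size + 1, 0)):
            window = text[start:start + size]
            score = SequenceMatcher(None, window.lower(), label.lower()).ratio()
            if score >= threshold and (best is None or score > best[2]):
                best = (start, start + size, score)
    return best


# Sample OCR text (made up for this sketch) with the "u" misread as a space.
ocr_text = "Date: 01/02/2023\nEmployer Signat re: ________"
label = "Employer Signature"

exact_found = label in ocr_text      # exact matching misses the label
hit = fuzzy_find(ocr_text, label)    # fuzzy matching still locates it

print("Exact match found:", exact_found)
if hit is not None:
    start, end, score = hit
    print("Fuzzy match:", repr(ocr_text[start:end]), "score", round(score, 2))
```

Note that the sketch only locates the label. Like the extractor in the steps above, it leaves `ocr_text` itself untouched.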


Adding the Correct Batch Process Step

  1. Right-click on the Batch Process in the node tree.
  2. Hover over "Add Activity", then hover over "Cleanup & Recognition". Finally, click on "Correct...".
  3. When the window pops up, change the Step Name if you would like. We are going to keep the default name of "Correct".
  4. Click "EXECUTE" in the top right-hand corner of the pop-up window.


Configuring for Correcting OCR

  1. Now you should have a Correct Batch Process Step in your node tree.
  2. Click the hamburger icon to the right of the Scope property in the "Step Properties" grid.
  3. Generally you will want to set the Scope property to Page because Recognize is usually run at a page level and Correct needs text data to run. So, we are going to set the Scope to a page level.


  1. Next, set the Correct Activity's own Scope property in the "Activities Properties" grid by clicking the hamburger icon to the right of the property. This is separate from the step-level Scope in the "Step Properties" grid.
  2. This Scope determines whether the whole document or only specific fields are corrected. Correcting the whole document is recommended in most cases, so we are going to set the Scope to Document.


  1. Set the Enable Spell Correction property to True.


  1. Set your Correction Extractor. We are going to set it to a Reference in this tutorial.
  2. We are going to set the reference to the "Field Labels" Value Reader that has Fuzzy Matching enabled.


  1. Once you have finished configuring your step, click the save icon at the top of the property grid to save your changes.
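
Conceptually, the configuration above asks the step to write each fuzzy-matched value back over the badly recognized span in the text data. The following Python sketch illustrates that idea only. The `Hit` class, the `apply_corrections` function, and the sample text are hypothetical names and data invented for this example, and Grooper's internal behavior may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    start: int   # offset where the badly recognized span begins
    end: int     # offset one past the end of that span
    value: str   # corrected value returned by the fuzzy extractor


def apply_corrections(text: str, hits: List[Hit]) -> str:
    """Splice each hit's corrected value over its span in the text."""
    # Work from the last hit to the first so that a change in length
    # does not shift the offsets of hits that have not been applied yet.
    for hit in sorted(hits, key=lambda h: h.start, reverse=True):
        text = text[:hit.start] + hit.value + text[hit.end:]
    return text


# Made-up OCR text with two damaged labels.
ocr_text = "Employe Name: Bob\nEmployer Signat re: J. Smith"
bad_name = "Employe Name"
bad_sig = "Employer Signat re"

# In this sketch the spans are located by hand; in the tutorial, the
# "Field Labels" extractor with Fuzzy Matching plays that role.
name_start = ocr_text.index(bad_name)
sig_start = ocr_text.index(bad_sig)
hits = [
    Hit(name_start, name_start + len(bad_name), "Employee Name"),
    Hit(sig_start, sig_start + len(bad_sig), "Employer Signature"),
]

corrected = apply_corrections(ocr_text, hits)
print(corrected)
```

Applying the replacements from right to left is a design choice in this sketch: the corrected values can be longer or shorter than the spans they replace, and working backwards keeps the remaining offsets valid.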

Using A Correct Step for Removal

Often you may have sensitive information on a document such as personal contact information, social security numbers (SSNs), account information, etc. The Redact Activity can black out this extracted information on the document itself so it cannot be read. However, the Redact Activity does not change the text data of the document. If someone were to look at the text data attached to a redacted PDF, they would see the original information intact. To fully remove the information from the text data, we need to use the Removal Extractor property in the Correct Batch Process Step.

In this example we are going to configure the Removal Extractor on the same Correct Step that is already correcting bad OCR. You don't necessarily need to configure both for the step to function. If you only need the text data deleted and no corrections made, you can just configure the Removal Extractor by itself (or vice versa).

  1. For this tutorial we have already put together a Data Type that is extracting the sensitive information on our document that we want to remove from the text data.
  2. Two SSNs or account numbers are being returned by the extractor on this document.


  1. Back in the Correct Batch Process Step, set the Removal Extractor property by clicking the hamburger icon to the right of the property.
  2. In our example, we set the Removal Extractor to a Reference extractor. You can set it to any Value Extractor you wish to collect the information you want removed.


  1. Open the Removal Extractor property and click the hamburger icon to the right of the Extractor property.
  2. Select the Value Extractor that is returning the information you want removed from the text data. We are going to select the Data Type that is extracting the SSNs and account numbers.


  1. Click on the save icon at the top of the property grid to save your changes.
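
As a rough illustration of what removal from the text data means, the Python sketch below deletes every match of a pattern from a string. It is an assumption-laden example, not Grooper's implementation: the `remove_matches` function, the SSN regular expression, and the sample text are all hypothetical, and in Grooper the Removal Extractor you selected decides what is matched. The sketch also assumes matched text is simply deleted rather than replaced with filler characters.

```python
import re

# Hypothetical pattern for SSNs written as 123-45-6789.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def remove_matches(text, pattern=SSN_PATTERN):
    """Return text with every match of pattern deleted."""
    return pattern.sub("", text)


# Made-up text data containing two SSNs and an unrelated phone number.
text_data = "SSN: 123-45-6789\nSpouse SSN: 987-65-4321\nPhone: 555-1234"
cleaned = remove_matches(text_data)
print(cleaned)
```

The image or PDF is untouched by this kind of operation, which is why the Redact Activity is still needed to hide the information on the document itself.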