2023.1:Image Processing (Activity): Difference between revisions

From Grooper Wiki
// via Wikitext Extension for VSCode
Line 93: Line 93:




=== Image Processing Considerations ===
=== PDF Options: Bursting vs. Rendering ===


Imagine you have a mix of image-based and text-based PDFs in a '''Batch'''. You could even have PDF files that mix image-based and text-based pages within a single file. Some of these image-based pages may need permanent image cleanup, using an '''IP Profile''' and the '''Image Processing''' activity. However, there's generally no reason to apply an '''IP Profile''' to a text-based PDF.
# Finally, we have the PDF Options properties. By default they are both enabled. How you configure these properties largely depends on the type of documents you are working with.
* The point of '''Image Processing''' is to clean up an image before handing that image to an OCR engine. You're not going to OCR text-based pages; you're just going to extract their native text data.
 
[[File:2023.1 Image-Processing-(Activity) 02 How-To 02 PDF-Options 01.png]]




So what happens if you feed split pages to the '''Image Processing''' activity, some of which are PDF page objects and others JPEG page objects?
* For JPEG page objects, the '''Image Processing''' activity will apply the '''IP Profile''' no matter what.
* For PDF page objects, it depends.  '''Image Processing''' will ignore PDF pages depending on two things:
*# The PDF page type (image-based or text-based)
*# How two '''Image Processing''' properties are configured: '''''Bursting''''' and '''''Rendering'''''.


{|class="attn-box"
{|class="attn-box"
|
|
⚠
|
{|class="inner-box"
|
|
'''BEWARE OF COMMONLY USED TERMS ACROSS MULTIPLE ACTIVITIES'''


The ''Image Processing'' activity's '''''Bursting''''' and '''''Rendering''''' properties are related to, but distinct from, the '''Split Pages''' activity's and '''''Rasterize''''' command's '''''Bursting''''' and '''''Rendering''''' properties.


The '''Image Processing''' activity's '''''Bursting''''' and '''''Rendering''''' properties are related to, but distinct from, the '''Split Pages''' activity's and '''''Rasterize''''' command's '''''Bursting''''' and '''''Rendering''''' properties.
For ''Image Processing'', the '''''Bursting''''' and '''''Rendering''''' properties only pertain to how different types of PDF files are processed by an '''IP Profile'''.
|}




For '''Image Processing''', the '''''Bursting''''' and '''''Rendering''''' properties only pertain to how different types of PDF files are processed by an '''IP Profile'''.
The point of ''Image Processing'' is to clean up an image before handing that image to an OCR engine. With this in mind, there are three main types of documents you will be processing in Grooper:
|
[[File:2023_Split-Pages_03_Bursting-and-Rendering-PDFs_09.png]]


''The Bursting and Rendering properties in Image Processing's property grid.''
# JPEG page objects, which always need OCR.
|}
# Image-based PDF page objects, which always need OCR.
|}
# Text-based PDF page objects, which NEVER need OCR since they already have native, readable text.


So, what happens in ''Image Processing'' for each of these types of documents? The ''Image Processing'' activity will conditionally apply the '''IP Profile''', given the following:


The '''Image Processing''' activity will conditionally apply the '''IP Profile''', given the following:
{|class=wikitable
{|class=wikitable
|'''Page Type'''||'''Result'''||'''Notes'''
|'''Page Type'''||'''Result'''||'''Notes'''
Line 132: Line 126:
|JPEG pages||The '''IP Profile''' will be applied in all cases, no matter what.
|JPEG pages||The '''IP Profile''' will be applied in all cases, no matter what.
|
|
* This is normal behavior for the '''Image Processing''' activity.  It generally expects to process images.
* This is normal behavior for the ''Image Processing'' activity.  It generally expects to process images.
|-
|-
|Image-based PDF pages||The '''IP Profile''' will only be applied if '''''Bursting''''' is enabled.
|Image-based PDF pages||The '''IP Profile''' will only be applied if '''''Bursting''''' and/or '''''Rendering''''' are enabled.
|
|
* This will overwrite the PDF with a JPEG image with the applied changes.
* The PDF will be copied over or overwritten as an image with the applied changes.
|-
|-
|Text-based PDF pages||ONLY an '''Orient''' or '''Auto-Orient''' step will be applied, if present in the '''IP Profile''', and ONLY IF '''''Rendering''''' is enabled.
|Text-based PDF pages||ONLY '''Orient''' or '''Auto-Orient''' steps will be applied if '''''Rendering''''' is enabled.
|
|
* This will ONLY rotate the PDF's orientation. The page will still remain a PDF after the orientation change is applied.
* This will ONLY rotate the PDF's orientation. Since it does not need OCR, there are no other changes that would need to be made.
|}
|}
{|cellpadding=10 cellspacing=5
|valign=top|
For example, imagine you have a three page PDF file.
* The first page is a text-based page.
* The second page is an image-based page requiring some image cleanup.
** It needs to be de-skewed and have its border cropped.
* The third page is a text-based page, but needs to be re-oriented.
The '''Image Processing''' activity's '''''Bursting''''' and '''''Rendering''''' properties will allow you to do this.
In this scenario, an '''IP Profile''' with the following steps would appropriately clean up the pages' problems:
* '''Auto Orient'''
* '''Auto Deskew'''
* '''Auto Border Cleanup'''
With '''''Bursting''''' and '''''Rendering''''' enabled, the '''Image Processing''' activity would affect the pages in the following ways:
* As a text-based PDF page with no orientation issues, the first page would not be processed at all.
* As an image-based PDF page, the second page would be processed by the '''IP Profile'''.  Since the '''IP Profile''' made changes, the PDF page is overwritten with the updated image.
* As a text-based PDF page with orientation issues, ONLY the '''Auto Orient''' step would be applied to the page.  The remaining steps would be ignored.
|valign=top|
[[File:Split-pages-ip-graphic.png]]
|}
</div>

Revision as of 09:29, 16 July 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Image Processing is an Activity that enhances Batch Page images and optimizes them for better OCR text recognition and data extraction results.

About

The Image Processing Activity (generally applied via a Batch Process Step) applies a preconfigured IP Profile to a document.

An IP Profile lists a series of steps that perform image processing functions, called "IP Commands". There are several IP Commands in Grooper, including ones that remove borders from an image, adjust the skew angle of an image, change the color format of an image, and more. For more information on configuring an IP Profile, visit the IP Profile wiki page.

Permanent vs. Temporary Image Processing

The Image Processing Activity permanently alters a document's image by applying an IP Profile. However, it is possible to temporarily clean up document images to benefit OCR results and then revert to the original document image. This needs to be done during the Recognize Activity rather than the Image Processing Activity.

For example, you may have a document where table lines are getting in the way of accurate OCR. However, if you remove these lines during the Image Processing activity, they will be permanently removed, making it difficult to review the documents in Review and changing the archival image stored later to something that no longer looks like the original document.

Instead, you can use an OCR Profile referencing an IP Profile containing a Line Removal command during Recognize. The image will be temporarily changed according to the IP Profile. Then, OCR will run on the altered image. Finally, the image will revert to its original form.
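
One way to picture the temporary workflow is as "clean a working copy, run OCR on the copy, keep the original". The sketch below is a conceptual model only, not Grooper's API: the functions `remove_lines`, `fake_ocr`, and `recognize_with_temporary_cleanup` are hypothetical stand-ins, and the "image" is just a list of text rows so the example stays self-contained.

```python
def remove_lines(image):
    """Hypothetical stand-in for a Line Removal command: blank out table-line characters."""
    return [row.replace("-", " ").replace("|", " ") for row in image]


def fake_ocr(image):
    """Hypothetical stand-in for an OCR engine: return the words found on each non-empty row."""
    return [" ".join(row.split()) for row in image if row.strip()]


def recognize_with_temporary_cleanup(image, cleanup):
    """Run OCR on a cleaned-up working copy; the stored image is returned unchanged."""
    working_copy = cleanup(list(image))  # the cleanup never touches the original rows
    text = fake_ocr(working_copy)
    return text, image


# A toy "page" whose table lines would get in the way of OCR.
page = [
    "|------|------|",
    "| Name | Total|",
    "|------|------|",
    "| Acme | 42   |",
]

text, stored = recognize_with_temporary_cleanup(page, remove_lines)
print(text)    # text read from the cleaned copy
print(stored)  # the stored image still has its table lines
```

The point of the design is that the cleanup result is used only for OCR, so the image shown in Review and archived later still looks like the original document.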

For more information on Temporary Image Processing, please see the OCR Profile, IP Profile, and Recognize wiki pages.

The Image Processing Activity

If you are more interested in making permanent changes to documents to clean up the pages and improve OCR results, then you might consider adding an Image Processing Batch Process Step to your Batch Process. The following are just a few things you can do by adding an appropriately configured IP Profile to your Image Processing Batch Process Step:

This page is slightly askew. A Deskew IP Step can correct this.
The off-white background of this page can make certain things hard to read. A Binarize IP Step can change the image to black and white.
This page has a dark border around it that can make OCR more difficult. The Auto Border Crop and Border Fill IP Steps can remove the border.

For more examples, instructions, and tips on setting up an IP Profile, take a look at the IP Profile wiki article.

How To

Adding the Image Processing Step

  1. Right click on the Batch Process.
  2. Hover over "Add Activity", then hover over "Cleanup & Recognition". Then click on "Image Processing..."
  3. When the "Add Activity" window pops up, you can change the Step Name if you like. In this tutorial we are going to keep it as the default of "Image Processing".
  4. Click "EXECUTE" at the top right corner of the "Add Activity" window.


  1. Now you should have an Image Processing Batch Process Step in your Batch Process.
  2. By default, the Scope property is set to Page. Generally, keep the Scope at the Page level for Image Processing, because the step permanently edits an image and must do so at the Page level.



Configuring the Batch Process Step

  1. We are using an IP Profile that we have copied and pasted from the "Essentials" Project. The "Essentials" Project comes pre-installed with every Grooper Repository.
  2. Click the hamburger icon to the right of the IP Profile property.
  3. Navigate to and select the IP Profile.


  1. If you want Grooper to save an unedited copy of the file attached to the document, click the check box next to Enable Undo to set the property to True.


  1. You can click the hamburger icon to the right of the Compression property to set a custom format for this Batch Process Step. If left as the default (none), then it will use the compression settings specified on the root node of the repository.


PDF Options: Bursting vs. Rendering

  1. Finally, we have the PDF Options properties. By default they are both enabled. How you configure these properties largely depends on the type of documents you are working with.


BEWARE OF COMMONLY USED TERMS ACROSS MULTIPLE ACTIVITIES

The Image Processing activity's Bursting and Rendering properties are related to, but distinct from, the Split Pages activity's and Rasterize command's Bursting and Rendering properties.

For Image Processing, the Bursting and Rendering properties only pertain to how different types of PDF files are processed by an IP Profile.


The point of Image Processing is to clean up an image before handing that image to an OCR engine. With this in mind, there are three main types of documents you will be processing in Grooper:

  1. JPEG page objects, which always need OCR.
  2. Image-based PDF page objects, which always need OCR.
  3. Text-based PDF page objects, which NEVER need OCR since they already have native, readable text.

So, what happens in Image Processing for each of these types of documents? The Image Processing activity will conditionally apply the IP Profile, given the following:

{|class=wikitable
|Page Type||Result||Notes
|-
|JPEG pages||The IP Profile will be applied in all cases, no matter what.
|
* This is normal behavior for the Image Processing activity. It generally expects to process images.
|-
|Image-based PDF pages||The IP Profile will only be applied if Bursting and/or Rendering are enabled.
|
* The PDF will be copied over or overwritten as an image with the applied changes.
|-
|Text-based PDF pages||ONLY Orient or Auto-Orient steps will be applied if Rendering is enabled.
|
* This will ONLY rotate the PDF's orientation. Since it does not need OCR, there are no other changes that would need to be made.
|}
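
The rules in the table above can be read as a small decision function. The following is a minimal sketch of that logic, not Grooper's implementation: `PageType`, `ORIENT_STEPS`, and `steps_applied` are hypothetical names introduced for illustration, and the step names are plain strings. Both properties default to enabled, matching the defaults described above.

```python
from enum import Enum


class PageType(Enum):
    JPEG = "JPEG page"
    IMAGE_PDF = "image-based PDF page"
    TEXT_PDF = "text-based PDF page"


# Hypothetical set of step names treated as orientation steps in this sketch.
ORIENT_STEPS = {"Orient", "Auto-Orient", "Auto Orient"}


def steps_applied(page_type, profile_steps, bursting=True, rendering=True):
    """Return the IP Profile steps that would run on a page, following the table above."""
    if page_type is PageType.JPEG:
        # Images are always processed.
        return list(profile_steps)
    if page_type is PageType.IMAGE_PDF:
        # Processed only if Bursting and/or Rendering is enabled.
        return list(profile_steps) if (bursting or rendering) else []
    if page_type is PageType.TEXT_PDF:
        # Only orientation steps run, and only if Rendering is enabled.
        if not rendering:
            return []
        return [step for step in profile_steps if step in ORIENT_STEPS]
    raise ValueError(f"unknown page type: {page_type!r}")


profile = ["Auto Orient", "Auto Deskew", "Auto Border Cleanup"]

print(steps_applied(PageType.JPEG, profile, bursting=False, rendering=False))
print(steps_applied(PageType.IMAGE_PDF, profile))
print(steps_applied(PageType.TEXT_PDF, profile))
print(steps_applied(PageType.TEXT_PDF, profile, rendering=False))
```

With both properties enabled, an image-based PDF page gets the whole profile, while a text-based PDF page gets only the orientation step; disabling Rendering leaves text-based pages untouched.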