Image Processing (Concept)

From Grooper Wiki

Grooper's robust suite of image processing operations gives you highly configurable control over how your documents are cleaned up before OCR.

These operations generally fall into three categories:

  1. Archival Adjustments - These are permanent adjustments to the exported document's image.
  2. OCR Cleanup - Image cleanup can dramatically improve OCR results.
    • However, cleanup operations can also drastically alter the document's image. For that reason, image adjustments can be applied temporarily, prior to OCR, when an IP Profile is executed during the Recognize activity. This non-destructive cleanup improves OCR results while keeping each page's original image intact for archival export.
  3. Layout Data Collection - This includes visual information used for data extraction purposes (such as table line locations, barcode information, OMR checkbox states) as well as image features used for Visual classification.
    • Layout Data can be collected either during the Image Processing or the Recognize activities.

About Grooper's Image Processing Software

Regardless of how good an OCR Engine is, OCR is very rarely perfect. Characters can be segmented out of words incorrectly. Artifacts such as table lines, check boxes, or even just specks from image noise can interfere with character segmentation and character recognition. Even when characters are segmented correctly, the OCR engine's character recognition can make the wrong decision about what a character is.

Image Processing (often abbreviated as "IP") can assist the OCR operation by providing a "cleaner" image to the OCR Engine. The general idea is to give the OCR engine just the text pixels, so that is all the engine needs to process.

This image is much easier for OCR to process... ...than this image.
Image-processing-digital-image-processing-after.png Image-processing-digital-image-processing-before.png

IP Profiles

Images are altered using an IP Profile, which contains a step-by-step list of IP Commands, each of which performs a specific alteration to the image. IP Profiles are highly configurable. There are many different IP Commands, and each has its own configurable properties.

In the example above, the image was altered using an IP Profile with six steps, each step containing a different IP Command.

This is the list of steps in this IP Profile, each one named for the IP Command used: Auto Border Crop, Binarize, Shape Removal, Line Removal, Speck Removal, and Blob Removal

Ip profile 1.png
! Order of operations matters! The image is altered step by step, from the first to the last. The first step hands the result of its IP Command to the second step. The second step runs on that altered image, not the original image. The second step then hands its result to the third step, and so on.
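
The hand-off between steps can be sketched in a few lines of Python. This is only an illustration of the idea, not Grooper's implementation: `run_profile`, `binarize`, and `brighten` are hypothetical stand-ins for an IP Profile and two IP Commands, and the image is assumed to be a small grid of grayscale values (0 = black, 255 = white).

```python
from typing import Callable, List

Image = List[List[int]]          # grayscale rows, 0 = black, 255 = white
Step = Callable[[Image], Image]  # hypothetical stand-in for one IP Command


def run_profile(image: Image, steps: List[Step]) -> Image:
    """Apply each step to the result of the previous step, in order."""
    current = image
    for step in steps:
        current = step(current)  # the next step never sees the original
    return current


def binarize(image: Image, cutoff: int = 128) -> Image:
    """Turn every pixel pure black or pure white."""
    return [[0 if p < cutoff else 255 for p in row] for row in image]


def brighten(image: Image, amount: int = 100) -> Image:
    """Lighten every pixel, capping at white."""
    return [[min(255, p + amount) for p in row] for row in image]


page = [[60, 200], [120, 250]]

binarize_first = run_profile(page, [binarize, brighten])
brighten_first = run_profile(page, [brighten, binarize])

print(binarize_first)  # [[100, 255], [100, 255]]
print(brighten_first)  # [[255, 255], [255, 255]]
```

The same two steps produce different images depending on their order: brightening first pushes the dark pixels past the cutoff, so the later binarization wipes them out.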

For this IP Profile, the first step runs the Auto Border Crop command. It crops the image, removing its border.

In this case, the border was actually part of the document. Usually, borders appear around documents because of how they were scanned. Either way, the goal is the same: remove superfluous, non-text pixels that interfere with the OCR operation.

Before After
Ip profile 3.png Ip profile 2.png
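
As a rough sketch of what a border crop does, the Python below trims outer rows and columns that consist entirely of dark pixels. It assumes a dark scanner border around a lighter page; `auto_border_crop` and its `border_max` parameter are hypothetical names for illustration and are not Grooper's Auto Border Crop command or its properties.

```python
from typing import List

Image = List[List[int]]  # grayscale rows, 0 = black, 255 = white


def auto_border_crop(image: Image, border_max: int = 40) -> Image:
    """Trim outer rows and columns whose pixels are all <= border_max."""

    def is_border(line: List[int]) -> bool:
        return all(p <= border_max for p in line)

    # Walk inward from the top and bottom until a non-border row is found.
    top, bottom = 0, len(image)
    while top < bottom and is_border(image[top]):
        top += 1
    while bottom > top and is_border(image[bottom - 1]):
        bottom -= 1
    rows = image[top:bottom]
    if not rows:
        return []  # the whole image was border

    # Do the same for columns, using only the rows that were kept.
    left, right = 0, len(rows[0])
    while left < right and is_border([r[left] for r in rows]):
        left += 1
    while right > left and is_border([r[right - 1] for r in rows]):
        right -= 1
    return [r[left:right] for r in rows]


scanned = [
    [0,   0,   0, 0],
    [0, 255,  30, 0],
    [0, 200, 255, 0],
    [0,   0,   0, 0],
]
cropped = auto_border_crop(scanned)
print(cropped)  # [[255, 30], [200, 255]]
```

Note that the cropped image has different dimensions than the input, since pixels were removed rather than recolored.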

The second step runs the Binarize command. The Binarize command turns the image black and white.

OCR requires a black and white image to analyze pixels and segment them out into characters. It needs a binary representation of the image: "is text" or "is not text". While OCR engines will binarize an image as part of their pre-processing phase, doing so in an IP Profile gives you control over how that operation is done. There is more than one way to binarize an image, and you have no control over which one is used if you let the OCR engine do it for you.

Before After
Ip profile 2.png Ip profile 4.png
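
To illustrate why the choice of binarization method matters, here is a minimal Python sketch comparing a fixed cutoff with a cutoff chosen by Otsu's method (a well-known technique that picks the cutoff best separating dark pixels from light ones). This is an illustration only: the source does not say which methods Grooper's Binarize command offers, and the function names and sample values are assumptions.

```python
from typing import List

Image = List[List[int]]  # grayscale rows, 0 = black, 255 = white


def binarize(image: Image, cutoff: int) -> Image:
    """Pixels at or below the cutoff become black; the rest become white."""
    return [[0 if p <= cutoff else 255 for p in row] for row in image]


def otsu_cutoff(image: Image) -> int:
    """Pick the cutoff that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for row in image:
        for p in row:
            hist[p] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_cutoff, best_variance = 0, -1.0
    dark_count, dark_sum = 0, 0
    for level in range(256):
        dark_count += hist[level]
        dark_sum += level * hist[level]
        light_count = total - dark_count
        if dark_count == 0:
            continue  # no dark pixels yet
        if light_count == 0:
            break     # every pixel is already in the dark class
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_variance:
            best_variance, best_cutoff = variance, level
    return best_cutoff


# A washed-out scan: "text" is around 150-160, background around 200-210.
faded = [[205, 150, 210],
         [155, 200, 160]]

fixed = binarize(faded, 128)
adaptive = binarize(faded, otsu_cutoff(faded))

print(fixed)     # [[255, 255, 255], [255, 255, 255]]
print(adaptive)  # [[255, 0, 255], [0, 255, 0]]
```

With the fixed cutoff, the faded text disappears entirely. With the cutoff derived from the image itself, the text survives. Having that choice is the kind of control you give up when the OCR engine binarizes for you.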

The next step runs the Shape Removal command. Shape Removal takes trained examples of shapes, in this case the company logo, locates them on a document and removes them.

While probably not strictly necessary for this example, removing the shape gives the OCR engine one less thing to look at when segmenting out characters and deciding which characters the pixels represent.

Before A dropout mask is created for the detected sample shape.
Ip profile 4.png Ip profile 5.png Ip profile 6.png

The last three commands are Line Removal, Speck Removal, and Blob Removal. These three commands find three different types of pixel artifacts (lines, specks and blobs) and remove them, further isolating pixels that are only text characters.

Line Removal
Before A dropout mask is created for detected lines.
Ip profile 6.png Ip profile 7.png Ip profile 8.png
Speck Removal
Before A dropout mask is created for small specks.
Here, mostly getting rid of dotted lines.
Ip profile 8.png Ip profile 9.png Ip profile 10.png

Blob Removal
Before A dropout mask is created for detected blobs of a defined size.
Ip profile 10.png Ip profile 11.png Ip profile 12.png
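
Speck and blob removal both come down to finding connected groups of black pixels and dropping the ones whose size falls in a chosen range. The Python below is a minimal sketch of that idea using 4-connected components. It is not Grooper's algorithm; `find_components`, `remove_by_size`, and the size limits are hypothetical. The pixel list collected for each removed group plays a role similar to the dropout masks shown above.

```python
from collections import deque
from typing import List, Tuple

BLACK, WHITE = 0, 255
Image = List[List[int]]


def find_components(image: Image) -> List[List[Tuple[int, int]]]:
    """Return the (row, col) pixels of each 4-connected group of black pixels."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if image[r][c] != BLACK or seen[r][c]:
                continue
            # Breadth-first flood fill starting from this unvisited black pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == BLACK and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append(pixels)
    return components


def remove_by_size(image: Image, min_pixels: int, max_pixels: int) -> Image:
    """Return a copy with black groups of min_pixels..max_pixels turned white."""
    cleaned = [row[:] for row in image]
    for pixels in find_components(image):
        if min_pixels <= len(pixels) <= max_pixels:
            for y, x in pixels:
                cleaned[y][x] = WHITE
    return cleaned


B, W = BLACK, WHITE
page = [
    [B, W, W, W, W],
    [W, W, B, B, W],
    [W, W, B, B, W],
    [W, W, B, W, B],
]
# Treat groups of one or two pixels as specks; the five-pixel "character" stays.
despeckled = remove_by_size(page, 1, 2)
for row in despeckled:
    print(row)
```

A speck pass would use a small size range, while a blob pass would target a larger one, which is why the two can sit as separate steps in the same profile.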

You are left with an image that the OCR engine can much more easily break up into line, word, and finally character segments, vastly improving the accuracy of character recognition.

A portion of OCR results without applying the IP Profile The same results with the IP Profile applied.
Ip profile 13.png Ip profile 14.png

However, for the example above, the IP Profile's result is drastically different from the original image. While it certainly helps the OCR result, it's likely, at the end of the process, you want to export a document that looks more like the "before" picture than the "after". Luckily, Image Processing can be performed in two ways:

  1. Permanent for archival purposes.
  2. Temporary for non-destructive OCR cleanup.

Permanent IP

The Image Processing activity's property panel. Permanent IP is done via the Image Processing activity, using an IP Profile set on the IP Profile property.

Permanent Image Processing is done via the Image Processing activity. It is, as the name implies, a permanent alteration of the document's image. The Image Processing activity will reference an IP Profile and permanently apply its IP Commands to the document images. Once that image is changed, it is changed for the remainder of the Batch Process. IP Profiles used by the Image Processing activity should only use commands acceptable for final export.

FYI The Image Processing activity permanently alters the image. There is no going back! Except when there is. By default, there is no "Undo" option to revert to the original, unprocessed image. However, if you set the Enable Undo property to True, Grooper will save the original image to the filestore. Note this stores a second copy of the image, which can significantly increase the storage requirements of your filestore.

The three most commonly used categories of IP Commands for permanent cleanup are:

  • Border Cleanup - These commands clean up border artifacts around an image by cropping the image or filling in the border with a given color.
  • Color Adjustment - These commands adjust the color values of the image, including brightness, color saturation, and contrast.
  • Image Transforms - These commands change the image's size and orientation.

Of the commands in those categories, two in each are particularly common:

  • Border Cleanup - Auto Border Crop, Border Fill
  • Color Adjustment - Brightness Contrast, Contrast Stretch
  • Image Transforms - Auto Deskew, Auto Orient

Temporary IP

There's a great deal of image processing that is extremely helpful for OCR but ultimately renders the document almost unrecognizable compared to the original. Grooper allows you to temporarily alter the image before OCR is performed during the Recognize activity. In this case, an IP Profile is applied to a temporary copy of the image and its result is given to the OCR Engine to recognize characters on the page. Once text characters are obtained from the image, the temporary copy is discarded, leaving the original image to be used during Data Review and final export.

The IP Profile used for temporary IP will be set on an OCR Profile, using the IP Profile property.

Temp ip 1.png

The IP Profile is then executed for OCR during the Recognize activity. On the Recognize activity, set the OCR Profile property to the OCR Profile that uses the IP Profile.

Temp ip 2.png

After the Recognize activity is finished, the temporary IP image is discarded, leaving only the original image.
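
The copy-process-discard flow described above can be sketched as follows. This is a conceptual illustration in Python, not how Grooper is implemented: `recognize`, `threshold_in_place`, and `count_ink` are hypothetical, and `count_ink` merely stands in for an OCR engine.

```python
import copy
from typing import Callable, List

Image = List[List[int]]  # grayscale rows, 0 = black, 255 = white


def recognize(original: Image,
              ip_steps: List[Callable[[Image], Image]],
              ocr_engine: Callable[[Image], str]) -> str:
    """Run cleanup on a temporary copy, hand it to OCR, keep only the text."""
    working = copy.deepcopy(original)  # temporary copy; the original is untouched
    for step in ip_steps:
        working = step(working)
    return ocr_engine(working)         # the working copy is discarded on return


def threshold_in_place(image: Image) -> Image:
    """A deliberately destructive step: overwrites the pixels it is given."""
    for row in image:
        for i, p in enumerate(row):
            row[i] = 0 if p < 128 else 255
    return image


def count_ink(image: Image) -> str:
    """Stand-in for an OCR engine: reports how many black pixels it received."""
    return f"{sum(p == 0 for row in image for p in row)} ink pixels"


page = [[90, 200], [30, 140]]
text = recognize(page, [threshold_in_place], count_ink)

print(text)  # 2 ink pixels
print(page)  # [[90, 200], [30, 140]]  (original image unchanged)
```

Even though the cleanup step overwrites pixels, only the temporary copy is affected, so the image available for review and export is still the original.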

Any IP Command can be used for temporary IP except those that physically transform the image, resulting in the addition or subtraction of pixels. These commands are as follows:

  • Auto Deskew
  • Resize
  • Rotate
  • Warp
  • Auto Border Crop
  • Crop
  • Extract Page

Common IP Commands used in a temporary IP Profile include those in the "Feature Removal" category, such as Negative Region Removal, Line Removal, and Speck Removal. However, an OCR Engine's ability to recognize characters depends on being fed a black and white image. While OCR Engines will turn an image black and white on their own during their pre-processing phase, users do not have any control over how that is done. Grooper puts that control in your hands through the Threshold and Binarize commands.

Most, if not all, temporary IP Profiles will include a Threshold or Binarize command to temporarily turn the image black and white.

FYI The only difference between the Threshold and Binarize commands is the processed image's bit depth. Threshold produces a 1-bit black and white image. Binarize produces an 8-bit grayscale image that only uses the pure black and pure white values. Functionally, both images are black and white. However, the image produced by Binarize can be passed to other IP Commands that require a bit depth greater than 1 bit, such as commands using filters like a Gaussian blur. OCR engines, however, will use both images in the same way.
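
The bit-depth difference can be made concrete with a short Python sketch. The function names mirror the two commands only for readability. The cutoff, the bit order, and the choice of 1 = white are assumptions made for this illustration, not details taken from Grooper.

```python
from typing import List


def binarize_8bit(gray: List[int], cutoff: int = 128) -> bytes:
    """One byte per pixel, but only the values 0 (black) and 255 (white) appear."""
    return bytes(0 if p < cutoff else 255 for p in gray)


def threshold_1bit(gray: List[int], cutoff: int = 128) -> bytes:
    """One bit per pixel, packed eight to a byte, leftmost pixel in the high bit.

    Assumption for this sketch: a set bit means white.
    """
    packed = bytearray((len(gray) + 7) // 8)
    for i, p in enumerate(gray):
        if p >= cutoff:
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)


row = [10, 240, 30, 220, 200, 15, 250, 5]

eight_bit = binarize_8bit(row)
one_bit = threshold_1bit(row)

print(list(eight_bit))          # [0, 255, 0, 255, 255, 0, 255, 0]
print(format(one_bit[0], "08b"))  # 01011010
print(len(eight_bit), len(one_bit))  # 8 1
```

Both results describe the same black and white pattern. The 8-bit form simply leaves room for in-between values, which is what a later step such as a blur needs in order to produce grays.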

Furthermore, binarization is used as a pre-processing step for several IP Commands. Many of Grooper's Image Processing capabilities work better if an image is in black and white instead of color (or simply won't work on a color image). Take Line Detection for example. At its most basic level, a line on an image is a string of pixels one after the other. On a color image, Grooper would have to pick out one string of colored pixels in a row and decide if it was a line or just a string of pixels that happened to be the same color. The result would be catastrophic. However, if you convert the color image into a black and white one, there are only two questions to ask: Is the pixel black or white? Is it a string of black pixels in a row? That's a line. Easy.
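
The "string of black pixels in a row" idea translates almost directly into code once the image is binary. The Python below is a minimal sketch that reports horizontal runs of black pixels at least a given length. `find_horizontal_lines` and `min_length` are hypothetical names, and real line detection (including Grooper's) would also need to handle vertical lines, gaps, skew, and line thickness.

```python
from typing import List, Tuple

BLACK, WHITE = 0, 255
Image = List[List[int]]


def find_horizontal_lines(image: Image, min_length: int) -> List[Tuple[int, int, int]]:
    """Return (row, start_col, length) for each black run of at least min_length."""
    lines = []
    for r, row in enumerate(image):
        run_start = None
        # A trailing white pixel acts as a sentinel that closes any run
        # still open at the end of the row.
        for c, p in enumerate(row + [WHITE]):
            if p == BLACK:
                if run_start is None:
                    run_start = c
            elif run_start is not None:
                length = c - run_start
                if length >= min_length:
                    lines.append((r, run_start, length))
                run_start = None
    return lines


B, W = BLACK, WHITE
page = [
    [W, B, W, B, B, W, W, W],  # short runs: parts of characters
    [B, B, B, B, B, B, B, W],  # one long run: a line
    [W, B, B, W, W, W, B, W],
]

print(find_horizontal_lines(page, min_length=5))  # [(1, 0, 7)]
```

Because every pixel is either black or white, the only decision left is how long a run must be before it counts as a line rather than part of a character.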