2023:Image Processing (Concept)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


"Image processing", as a general term, refers to software techniques that manipulate and enhance images. Image processing removes imperfections and adjusts images to improve OCR accuracy. In Grooper, images are processed primarily by two Activities:

  • Image Processing - This Activity permanently adjusts the image. It is primarily used to compensate for defects produced by a document scanner (like border artifacts and skewed images). It does so by applying the IP Commands in an IP Profile.
  • Recognize - This Activity performs OCR. When an OCR Profile references an IP Profile, the image is processed temporarily. A temporary image is handed to the OCR engine and discarded once characters are recognized.

Grooper also has "computer vision" capabilities that analyze and interpret images. These capabilities are also executed during Grooper's image processing. For example, Grooper's "Line Removal" command will locate lines on an image (computer vision), remove those artifacts to improve OCR results during Recognize (image processing) and store that data for later use in Grooper (computer vision).

These operations generally fall into three categories:

  1. Archival Adjustments - These are permanent adjustments to the exported document's image.
  2. OCR Cleanup - Image cleanup can dramatically improve OCR results.
    • However, image cleanup can also drastically alter the document's image. To avoid this, image adjustments can be applied temporarily, prior to OCR, when an IP Profile is executed during the Recognize activity. This gives you non-destructive image cleanup that improves OCR results while keeping each page's original image for archival export.
  3. Layout Data Collection - This includes visual information used for data extraction purposes (such as table line locations, barcode information, OMR checkbox states) as well as image features used for Visual classification.
    • Layout Data can be collected either during the Image Processing or the Recognize activities.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains a Project with resources used in examples throughout this article. The second contains one or more Batches of sample documents.

Glossary

Batch Process: Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a Batch. Each step consists of a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps, which are added as child nodes.
  • A Batch Process is often referred to as simply a "process".

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Binarize: Binarize is an IP Command that converts a color or grayscale image to a bi-tonal (black and white) image using various thresholding methods.

Extract Page: Extract Page is an IP Command that removes an image from a carrier image while simultaneously removing any image warping or skewing.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Image Processing: Image Processing is an Activity that enhances Batch Page images and optimizes them for better OCR text recognition and data extraction results.


IP Command: IP Commands specify an image processing (IP) operation (such as image cleanup, format conversion or feature detection) and are used to construct IP Steps in an IP Profile. IP Commands are configured using an IP Step's Command property.

IP Profile: IP Profiles are a step-by-step list of image processing operations (IP Commands). They are used for several image processing related operations, but primarily for:

  1. Permanently enhancing an image during the Image Processing activity (usually to get rid of defects in a scanned image, such as skewing or borders).
  2. Cleaning up an image in-memory during the Recognize activity without altering the image to improve OCR accuracy.
  3. Computer vision operations that collect layout data (table line locations, OMR checkboxes, barcode value and more) utilized in data extraction.

Layout Data: Layout Data refers to visual information that certain Grooper IP Commands collect, such as lines, checkboxes, barcodes, and detected shapes. This data is stored in a "Grooper.Layout.json" file attached to Batch Pages. Layout data is used by certain extractors and other features that rely on the presence of that data to function.

Line Removal: Line Removal is an IP Command that locates and removes horizontal and vertical lines from documents. The detected line locations are stored as part of the page's layout data.

OCR Engine: An "OCR engine" is the part of OCR software that recognizes text from images. OCR engines analyze the image's pixels to determine where text is on the page and what each character is. In Grooper, OCR engines are selected when configuring an OCR Profile's OCR Engine property.

OCR Profile: OCR Profiles store configuration settings for optical character recognition (OCR). They are used by the Recognize activity to convert images of text on Batch Pages into machine-encoded text. OCR Profiles are highly configurable, allowing fine-grained control over how OCR occurs, how pre-OCR image cleanup occurs, and how Grooper's OCR Synthesis occurs. All this works toward the end goal of highly accurate OCR text data, which is used to classify documents, extract data and more.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized so that it can be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as Content Models, Batch Processes, and profile objects are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Recognize: Recognize is an Activity that obtains machine-readable text from Batch Pages and Batch Folders. When properly configured with an OCR Profile, Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also reference an IP Profile to collect "layout data" like lines, checkboxes, and barcodes. Other Activities then use this machine-readable text and layout data for document analysis and data extraction.

Review: Review is an Activity that allows user-attended review of Grooper's results. This allows human operators to validate processed Batch Page and Batch Folder content using specialized user interfaces called "Viewers". Different kinds of Viewers assist users in reviewing Grooper's image processing, document classification, and data extraction, and in operating document scanners.

Shape Removal: Shape Removal is an IP Command that detects and removes shapes from documents. Common shapes targeted by this command are stamps, seals, logos or other graphical marks that interfere with OCR and/or can serve as triggers for document separation or anchors for data extraction. The detected shapes' locations are stored as part of the page's layout data.

Visual: "Visual" is a Classify Method that uses image analysis instead of text data to determine the Document Type assigned to a Batch Folder during classification. Instead of using text-based extractors, an "Extract Features" IP Command in an IP Profile is used to collect image-based data from a Batch Folder's image(s). This image-based data is compared against that of previously trained document examples of each Document Type to classify the Batch Folder.

About Grooper's Image Processing Software

Regardless of how good an OCR Engine is, OCR is very rarely perfect. Characters can be segmented out from words incorrectly. Artifacts such as table lines, checkboxes or even just specks from image noise can interfere with character segmentation and character recognition. Even when characters are segmented out correctly, the OCR engine's character recognition can make the wrong decision about what the character is.

Image Processing (often abbreviated as "IP") can assist the OCR operation by providing a "cleaner" image to the OCR Engine. The general idea is to give the OCR engine just the text pixels, so that is all the engine needs to process.

This image is much easier for OCR to process... ...than this image.


IP Profiles

Images are altered using an IP Profile, which contains a step-by-step list of IP Commands, each of which performs a specific alteration to the image. IP Profiles are highly configurable. There are many different IP Commands, and each has its own configurable properties.

In the example above, the image was altered using an IP Profile with six steps, each step containing a different IP Command.

This shows the list of steps in this IP Profile, each one named for the IP Command used: Auto Border Crop, Binarize, Shape Removal, Line Removal, Speck Removal, and Blob Removal.

Order of operation matters! The image is altered step by step, from the first to the last. The first step hands the second step the results of its IP Command. The second step runs using the mutated image, not the original image. The second step then hands its result to the third step, and so on.
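The hand-off between steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Grooper's implementation: the image is a plain list of grayscale rows, and `crop_border`, `contrast_stretch`, `binarize` and `run_profile` are hypothetical toy functions standing in for IP Commands and an IP Profile.

```python
from typing import Callable, List

Image = List[List[int]]          # rows of grayscale pixels: 0 = black, 255 = white
Step = Callable[[Image], Image]  # hypothetical stand-in for an IP Step


def crop_border(img: Image) -> Image:
    """Drop one pixel from every edge (a toy border crop)."""
    return [row[1:-1] for row in img[1:-1]]


def contrast_stretch(img: Image) -> Image:
    """Rescale pixel values so the darkest becomes 0 and the brightest 255."""
    pixels = [px for row in img for px in row]
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [row[:] for row in img]
    return [[(px - lo) * 255 // (hi - lo) for px in row] for row in img]


def binarize(img: Image, threshold: int = 128) -> Image:
    """Map every pixel to pure black or pure white."""
    return [[0 if px < threshold else 255 for px in row] for row in img]


def run_profile(img: Image, steps: List[Step]) -> Image:
    """Run the steps in order; each step receives the previous step's output."""
    result = img
    for step in steps:
        result = step(result)
    return result


# A 2x2 page surrounded by a black scanner border.
page = [
    [0,   0,   0, 0],
    [0, 200,  40, 0],
    [0, 120, 220, 0],
    [0,   0,   0, 0],
]

crop_first = run_profile(page, [crop_border, contrast_stretch, binarize])
stretch_first = run_profile(page, [contrast_stretch, crop_border, binarize])
```

With the same three toy steps, the pixel that starts at 120 ends up black in `crop_first` and white in `stretch_first`, because the contrast stretch sees the black border in one ordering and not in the other.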

For this IP Profile, the first step runs the Auto Border Crop command. It crops the image, removing its border.

In this case, the border was actually part of the document. Usually, borders appear around documents because of how they were scanned. However, the goal is still the same: remove superfluous, non-text pixels that interfere with the OCR operation.

Figure: the image before and after Auto Border Crop.

The second step runs the Binarize command. The Binarize command turns the image black and white.

OCR requires a black and white image to analyze pixels and segment them out into characters. It needs a binary representation of the image: "is text" or "is not text". While OCR engines will binarize an image as part of their pre-processing phase, doing so in an IP Profile gives you control over how that operation is done. There is more than one way to binarize an image, and you have no control over it if you let the OCR engine do it for you.

Figure: the image before and after Binarize.
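To illustrate why the choice of binarization method matters, here is a minimal Python sketch comparing two generic thresholding approaches on a dim scan. The functions `fixed_threshold` and `mean_threshold` are hypothetical illustrations of common techniques; they are not Grooper's Binarize methods, and the pixel values are made up for the example.

```python
def fixed_threshold(img, threshold=128):
    """Pixels darker than a fixed cutoff become black, the rest white."""
    return [[0 if px < threshold else 255 for px in row] for row in img]


def mean_threshold(img):
    """Use the image's own average brightness as the cutoff."""
    pixels = [px for row in img for px in row]
    cutoff = sum(pixels) / len(pixels)
    return [[0 if px < cutoff else 255 for px in row] for row in img]


# A dim scan: gray background (110) with two dark "text" pixels (30).
dim_scan = [
    [110, 110, 110, 110],
    [110,  30,  30, 110],
    [110, 110, 110, 110],
]

fixed_result = fixed_threshold(dim_scan)  # background falls below 128 too
mean_result = mean_threshold(dim_scan)    # cutoff adapts to the dim page
```

On this sample, the fixed cutoff turns the whole page black, so the text is lost, while the mean-based cutoff keeps the background white and the two text pixels black.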

The next step runs the Shape Removal command. Shape Removal takes trained examples of shapes, in this case the company logo, locates them on a document and removes them.

While probably not strictly necessary for this example, removing the shape gives the OCR engine one less thing to look at when segmenting out characters and deciding which characters the pixels represent.

Figure (Shape Removal): Before | A dropout mask is created for the detected sample shape. | After

The last three commands are Line Removal, Speck Removal, and Blob Removal. These three commands find three different types of pixel artifacts (lines, specks and blobs) and remove them, further isolating pixels that are only text characters.

Figure (Line Removal): Before | A dropout mask is created for detected lines. | After
Figure (Speck Removal): Before | A dropout mask is created for small specks. Here, mostly getting rid of dotted lines. | After
Figure (Blob Removal): Before | A dropout mask is created for detected blobs of a defined size. | After
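The idea of a dropout mask can be sketched as follows. This is a simplified illustration, assuming a binarized image stored as a list of rows and 4-connected components; `find_specks`, `apply_dropout` and the `max_size` parameter are hypothetical names, not Grooper's Speck Removal implementation.

```python
from collections import deque

B, W = 0, 255  # black and white pixel values


def find_specks(img, max_size=2):
    """Return a dropout mask: coordinates of black components no larger than max_size."""
    height, width = len(img), len(img[0])
    seen, mask = set(), set()
    for r in range(height):
        for c in range(width):
            if img[r][c] != B or (r, c) in seen:
                continue
            # Flood-fill one connected group of black pixels.
            component, queue = [], deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and img[ny][nx] == B and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            if len(component) <= max_size:
                mask.update(component)
    return mask


def apply_dropout(img, mask):
    """Paint every masked pixel white, leaving the rest of the image as-is."""
    return [[W if (r, c) in mask else px for c, px in enumerate(row)]
            for r, row in enumerate(img)]


page = [
    [W, W, W, W, W, W],
    [W, B, B, B, W, W],
    [W, B, W, B, W, B],  # the lone pixel at the right edge is a speck
    [W, B, B, B, W, W],
]

mask = find_specks(page)
cleaned = apply_dropout(page, mask)
```

Only the isolated pixel lands in the mask; the larger ring of black pixels, standing in for a character, is left untouched.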

You are left with an image that the OCR engine can much more easily break up into line, word and finally character segments, vastly improving the accuracy of character recognition.

Figure: A portion of OCR results without applying the IP Profile. | The same results with the IP Profile applied.


However, for the example above, the IP Profile's result is drastically different from the original image. While it certainly helps the OCR result, it's likely, at the end of the process, you want to export a document that looks more like the "before" picture than the "after". Luckily, Image Processing can be performed in two ways:

  1. Permanent for archival purposes.
  2. Temporary for non-destructive OCR cleanup.

Permanent IP

The Image Processing activity's property panel. Permanent IP is done via the Image Processing activity, using an IP Profile set on the IP Profile property.

Permanent Image Processing is done via the Image Processing activity. It is, as the name implies, a permanent alteration of the document's image. The Image Processing activity will reference an IP Profile and permanently apply its IP Commands to the document images. Once that image is changed, it is changed for the remainder of the Batch Process. IP Profiles used by the Image Processing activity should only use commands acceptable for final export.

FYI

The Image Processing activity permanently alters the image. There is no going back! Except when there is. By default, there is no "Undo" option to revert to the original, pre-processed image. However, if you set the Enable Undo property to True, Grooper will save the original image to the filestore. Note this keeps a second copy of the image, which can significantly increase the storage requirements of your filestore.

The three most commonly used categories of IP Commands for permanent cleanup are:

  • Border Cleanup - These commands clean up border artifacts around an image by cropping the image or filling in the border with a given color.
  • Color Adjustment - These commands adjust the color values of the image, including brightness, color saturation, and contrast.
  • Image Transforms - These commands change the image's size and orientation.

Of the commands in those categories, there are one or two that are particularly common:

  • Border Cleanup - Auto Border Crop, Border Fill
  • Color Adjustment - Brightness Contrast, Contrast Stretch
  • Image Transforms - Auto Deskew, Auto Orient

Temporary IP

A great deal of image processing is extremely helpful for OCR but ultimately renders the document almost unrecognizable compared to the original. Grooper allows you to temporarily alter the image before OCR is performed during the Recognize activity. In this case, an IP Profile is applied to a temporary copy of the image and its result is given to the OCR Engine to recognize characters on the page. Once text characters are obtained from the image, the temporary copy is discarded, leaving the original image to be used during Data Review and final export.

The IP Profile used for temporary IP will be set on an OCR Profile, using the IP Profile property.

  1. Select an OCR Profile.
  2. Click the hamburger menu to the right of the IP Profile property to access the drop-down menu.
  3. Navigate and select the IP Profile you wish to add to the OCR Profile.

The IP Profile is then executed for OCR during the Recognize activity. On the Recognize activity, set the OCR Profile property to the OCR Profile that references the IP Profile.



After the Recognize activity is finished, the temporary IP image is discarded, leaving only the original image.
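The difference between permanent and temporary IP can be sketched in Python. This is a conceptual illustration only: `Page`, `image_processing`, `recognize` and `stub_ocr` are hypothetical stand-ins, not Grooper objects or APIs, and the "OCR engine" here is a placeholder that simply counts black pixels.

```python
import copy
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[int]]
Step = Callable[[Image], Image]


@dataclass
class Page:
    """Hypothetical stand-in for a page: a stored image plus recognized text."""
    image: Image
    text: str = ""


def binarize(img: Image, threshold: int = 128) -> Image:
    return [[0 if px < threshold else 255 for px in row] for row in img]


def apply_steps(img: Image, steps: List[Step]) -> Image:
    for step in steps:
        img = step(img)
    return img


def image_processing(page: Page, steps: List[Step]) -> None:
    """Permanent IP: the stored image is replaced by the processed one."""
    page.image = apply_steps(page.image, steps)


def recognize(page: Page, steps: List[Step], ocr: Callable[[Image], str]) -> None:
    """Temporary IP: process a copy, run OCR on it, then let the copy go."""
    temp = apply_steps(copy.deepcopy(page.image), steps)
    page.text = ocr(temp)  # only the text is kept; page.image is untouched


def stub_ocr(img: Image) -> str:
    """Placeholder for a real OCR engine: reports the number of black pixels."""
    return str(sum(px == 0 for row in img for px in row))


page = Page(image=[[200, 40], [210, 220]])
recognize(page, [binarize], stub_ocr)
image_after_recognize = copy.deepcopy(page.image)  # still the grayscale original
image_processing(page, [binarize])                 # now the stored image changes
```

After `recognize`, the page has text but its stored image is still the grayscale original. Only the call to `image_processing` replaces the stored image.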

Any IP Command can be used for temporary IP except those that physically transform the image, resulting in the addition or subtraction of pixels. These commands are as follows:

  • Auto Deskew
  • Resize
  • Rotate
  • Warp
  • Auto Border Crop
  • Crop
  • Extract Page

Common IP Commands used in a temporary IP Profile include those in the "Feature Removal" category such as Negative Region Removal, Line Removal and Speck Removal. However, an OCR Engine needs to be fed a black and white image in order to recognize characters. While OCR Engines will turn an image black and white on their own during their pre-processing phase, users do not have any control over how that is done. Grooper puts that control in your hands through the Threshold and Binarize commands.

Most, if not all, temporary IP Profiles will include a Threshold or Binarize command to temporarily turn the image black and white.

FYI

The only difference between the Threshold and Binarize commands is the processed image's bit depth. Threshold produces a 1-bit black and white image. Binarize produces an 8-bit grayscale image that only uses the black and white gradations. Functionally, both images are black and white. However, the image produced by Binarize can be passed to other IP Commands requiring an image with a larger bit depth than a single bit black and white image, such as commands using filters like a Gaussian blur. OCR Readers, however, will use both images in the same way.
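The bit-depth distinction can be illustrated with a small Python sketch. The function names describe only the output format and are hypothetical; which Grooper command produces which bit depth is as stated above, and the blur is a generic 3-pixel average, not a Grooper filter.

```python
def to_1bit(img, threshold=128):
    """1 bit per pixel: each value is 0 (black) or 1 (white)."""
    return [[0 if px < threshold else 1 for px in row] for row in img]


def to_8bit_bw(img, threshold=128):
    """8 bits per pixel, but only the extremes 0 and 255 are used."""
    return [[0 if px < threshold else 255 for px in row] for row in img]


def box_blur_row(row):
    """3-pixel average with edge clamping; the output needs in-between grays."""
    last = len(row) - 1
    return [(row[max(i - 1, 0)] + row[i] + row[min(i + 1, last)]) // 3
            for i in range(len(row))]


gray = [[230, 210, 35, 190, 240]]

one_bit = to_1bit(gray)
eight_bit = to_8bit_bw(gray)
blurred = box_blur_row(eight_bit[0])
```

Both outputs show the same black and white picture, but blurring the 8-bit row produces intermediate gray values, which a 1-bit image has no room to store.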

Furthermore, binarization is used as a pre-processing command for several IP Commands. Many of Grooper's Image Processing capabilities work better if an image is in black and white instead of color (or simply won't work when using a color image). Take Line Detection for example. At its most basic level, a line on an image is a string of pixels one after the other. On a color image, Grooper would have to pick out one string of colored pixels in a row and decide if it was a line or just a string of pixels that happened to be the same color. The result would be catastrophic. However, if you convert the color image into a black and white one, you only have two questions to answer. Is the pixel black or white? Is it a string of black pixels in a row? That's a line. Easy.
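The "string of black pixels in a row" idea can be sketched as a run-length scan over a binarized image. This is a deliberately simplified illustration that only handles perfectly horizontal runs; `find_horizontal_lines` and `min_length` are hypothetical names, and this is not how Grooper's Line Detection is implemented.

```python
B, W = 0, 255  # black and white pixel values


def find_horizontal_lines(img, min_length=4):
    """Return (row, start_col, end_col) for each black run of at least min_length pixels."""
    lines = []
    for r, row in enumerate(img):
        start = None
        # A white sentinel after the last pixel closes any run touching the edge.
        for c, px in enumerate(row + [W]):
            if px == B and start is None:
                start = c
            elif px != B and start is not None:
                if c - start >= min_length:
                    lines.append((r, start, c - 1))
                start = None
    return lines


page = [
    [W, B, W, B, B, W, W],  # short runs, like strokes of text
    [B, B, B, B, B, B, B],  # a full-width rule
    [W, W, B, B, B, B, W],  # a shorter rule
]

lines = find_horizontal_lines(page)
```

Because every pixel is either black or white, the only decision left is how long a run must be before it counts as a line rather than part of a character.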