Extract Page

From Grooper Wiki
1573742604267-578.png

Extract Page is an IP Command that removes a document image from a carrier image while also correcting any image warping or skewing.

A carrier image, for our purposes here, is simply the original input image containing a document. If you've ever deposited a check from a mobile phone, you've sent your bank a carrier image of the check. The document, in this case a check, is "carried" to another application where the check is removed from the background and straightened out to extract the deposit amount and account number.

Extract Page works much the same way. This IP Command extracts a page from a dark background, or from a light background where the page's border edges are visible. It detects the edges of the page, which form a quadrilateral, and "cuts out" the page from the background. At the same time, it repairs any skew, shear, or perspective warping, producing a straight, flat image.
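The correction described above, mapping a detected quadrilateral onto a flat rectangle, can be illustrated with a planar perspective transform (a homography). The following is a minimal sketch of that general technique, not Grooper's implementation. The function names and the corner coordinates are hypothetical and chosen only for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("corner points are degenerate")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """Return 8 coefficients that map each src corner onto its dst corner."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), and likewise for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b)


def apply_homography(h, x, y):
    """Map one point from the carrier image into the flattened page."""
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)


# Hypothetical corners of a skewed page inside a carrier image (pixels),
# listed top-left, top-right, bottom-right, bottom-left.
page_corners = [(12, 30), (410, 18), (425, 520), (20, 540)]
flat_corners = [(0, 0), (400, 0), (400, 500), (0, 500)]
h = homography(page_corners, flat_corners)
print(apply_homography(h, 12, 30))  # lands on (or very near) the top-left corner
```

Once the four edges are known, every pixel of the flat output can be located in the carrier image by a transform of this kind, which is why a single operation can undo skew, shear, and perspective together.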


About

Below is an example of how Extract Page works. The document on the left is from the carrier image. It is skewed and has a hefty black border around it. The document on the right is the result of the Extract Page operation. The document is straightened out and the border is removed.

The original image
1573659273321-758.png
The extracted image
1573832146767-464.png

Version Differences

Extract Page is new in version 2.80. In previous versions, this functionality had to be approximated by combining multiple IP Commands, including Auto Border Crop, Auto Deskew, and Warp.

Use Cases

Documents on a microfiche card

Extract Page was developed specifically for Grooper's microfiche processing capabilities. Documents exist on a microfiche card in rows and columns. Grooper's microfiche activities make individual document images out of the matrix of documents on the fiche card. However, a slight black border persists to account for variations in how the documents were originally reproduced on the fiche card. Furthermore, since a fiche card is created by taking a film image of documents, it is common for slight skewing and warping to occur, especially towards the edges of the film. The Extract Page command resolves both of these issues.

That does not mean the command is limited to microfiche processing. Extract Page can be used any time documents need to be removed from a dark background (or a light background where the edges of the document are discernible from the background) and warped or skewed images need adjusting.


How To: Add the command to an IP Profile

Before you begin

This guide assumes you've created an IP Profile and have a Test Batch ready to configure the Extract Page command.

Add Extract Page to the IP Profile

1. Navigate to your IP Profile in the "IP Profiles" folder in the "Global Resources" folder in the Node Tree.

2. Press the "Add" button to add a new IP Command to your IP Profile.


1573832404164-783.png


3. Select the "Image Transforms" category. Then, select "Extract Page".


1573832950498-787.png


Turn the document black and white & define its borders

Binarization
1573833028046-933.png


1. For Extract Page to work, color and grayscale images must first be "binarized", or converted to black and white. Binarization does this by "thresholding" the image, which means setting a threshold value on the pixel intensity of the original image. Pixel intensity is a pixel's "lightness" or "brightness". Essentially, once a midpoint between the most intense ("whitest") and least intense ("blackest") pixel on a page is established, pixels with an intensity above that threshold are converted to white, and those below it are converted to black. The threshold can be set manually or found automatically by the software. The Thresholding Method can be set to one of four options:

  • Simple - Thresholds an image to black and white using a fixed threshold value between 0 and 255.
  • Auto - Selects a threshold value automatically using Otsu's Method.
  • Adaptive - Thresholds pixels based on the intensity of pixels in the local neighborhood.
  • Dynamic - Performs adaptive thresholding, while preserving dark areas on the page.

Each method has its own configurable properties. For more information on binarization and these methods, visit the Binarize article.
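To make the "Simple" and "Auto" methods above more concrete, here is a minimal sketch of fixed thresholding and of Otsu's Method in plain Python. It is an illustration of the general techniques, not Grooper's code. The image is assumed to be a flat list of grayscale intensities from 0 to 255, and the function names are hypothetical.

```python
def simple_threshold(pixels, threshold):
    """Map intensities above the threshold to white (255), the rest to black (0)."""
    return [255 if p > threshold else 0 for p in pixels]


def otsu_threshold(pixels):
    """Pick the threshold that maximizes between-class variance (Otsu's method)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * n for i, n in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_dark, sum_dark = 0, 0
    for t in range(256):
        w_dark += hist[t]          # pixels at or below t
        sum_dark += t * hist[t]
        w_light = total - w_dark   # pixels above t
        if w_dark == 0:
            continue
        if w_light == 0:
            break
        mean_dark = sum_dark / w_dark
        mean_light = (sum_all - sum_dark) / w_light
        var_between = w_dark * w_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t


# Hypothetical sample: four dark border pixels and four light page pixels.
sample = [10, 12, 15, 20, 200, 210, 220, 230]
t = otsu_threshold(sample)
print(t, simple_threshold(sample, t))
```

With a dark border and a light page, the intensities fall into two clear groups, which is the situation an automatically chosen threshold handles well. The Adaptive and Dynamic methods vary the threshold by neighborhood and are not sketched here.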

Border Size
1573836254056-603.png


2. You also must set where on the page edge detection will be performed. This is done using the "Border Size" property. The idea here is to define a rectangle that will fall inside the page you want to extract. When configuring this property, use the "Zoning" diagnostic image. The blue rectangle is the defined border region.

The default Border Size of 0.25 inches from each edge is not going to work here. The region only overlaps part of the document.
1573833789329-699.png
Here, the Border Size is set to 0.5 inches from each edge. The region completely overlaps the document. With further configuration, Extract Page will work successfully with this Border Size.
1573833796480-805.png
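The relationship between a Border Size in inches and the rectangle drawn on the image can be sketched as follows. This assumes the region is simply the image inset by the Border Size on every side and that the image resolution (DPI) is known. Both are assumptions made for illustration, and the function name and the sample dimensions are hypothetical.

```python
def border_region(width_px, height_px, dpi, border_in):
    """Return (left, top, right, bottom) of the image inset by border_in inches."""
    inset = round(border_in * dpi)
    left, top = inset, inset
    right, bottom = width_px - inset, height_px - inset
    if right <= left or bottom <= top:
        raise ValueError("Border Size is too large for this image")
    return left, top, right, bottom


# Hypothetical 4" x 5" carrier image scanned at 300 DPI (1200 x 1500 pixels).
print(border_region(1200, 1500, 300, 0.25))  # (75, 75, 1125, 1425)
print(border_region(1200, 1500, 300, 0.5))   # (150, 150, 1050, 1350)
```

A larger Border Size produces a smaller rectangle, which is why increasing the value from 0.25 to 0.5 inches can bring the region inside the page.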


Set line detection settings

The image is extracted by first finding the document's edges. Once Grooper knows where the lines around a document are inside the carrier, it can digitally cut around those lines to remove it. There are two configurable properties for line detection, "Angle Precision" and "Threshold".

Angle Precision
1573836517291-285.png


"Angle Precision" sets the angle increment used when detecting each of the four lines that make up the extracted image. "1/64 degrees" is the most precise. "1 degree" is the least. In the example below, Angle Precision was first set to "1 degree". Only the right edge was detected because it was at least a full degree away from vertical. The other three edges are each less than a full degree away from horizontal or vertical. The operation could not reliably determine their angles, and they were not detected. When set to "1/4 degree" precision, it found the remaining three sides, and the image could be extracted.

1 degree angle precision. Only one line is found (seen in red). The page cannot be extracted.
1573598823380-456.png
1573598833416-651.png
1/4 degree angle precision. All four lines are found (seen in red). The page is extracted.
1573598828027-424.png
1573598838574-603.png
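One way to build intuition for why a coarse angle increment can miss a slightly tilted edge is to look at how far a line drawn at the nearest allowed angle drifts from the true edge. The sketch below is only an illustration of that quantization effect. It is not a description of Grooper's line detector, and the function names, the 0.4 degree tilt, and the 1500 pixel edge length are hypothetical.

```python
import math


def snap_angle(angle_deg, step_deg):
    """Round an angle to the nearest multiple of the angle increment."""
    return round(angle_deg / step_deg) * step_deg


def drift_px(angle_deg, step_deg, length_px):
    """Pixels a line at the snapped angle drifts from the true edge over its length."""
    error_deg = abs(angle_deg - snap_angle(angle_deg, step_deg))
    return length_px * math.tan(math.radians(error_deg))


# Hypothetical edge tilted 0.4 degrees, 1500 pixels long.
for step in (1, 1 / 4, 1 / 64):
    print(step, snap_angle(0.4, step), round(drift_px(0.4, step, 1500), 2))
```

In this sketch, the 1 degree increment leaves the candidate line several pixels away from the true edge by its far end, while the finer increments keep it much closer.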


Threshold
1573836429089-612.png


"Threshold" determines what counts as a line by setting the percentage of the image width or height a detected line must occupy to be considered an edge. The default setting is 75%. So, if the left (vertical) line of the image you want to extract is at least 75% as long as the height of the whole image, it will be detected as an edge.

Again, the "Zoning" diagnostic image will help you configure this setting. Detected lines are shown in red. You want all four edges of the extracted image lined with a red line. Here, the image is (roughly) 4" by 5" and the white rectangle is 3" by 4". The top and bottom edges of the white rectangle are 75% of the width of the image. The left and right edges are 80% of its height. So, a Threshold of 75% should work here.
1573835538057-538.png
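The arithmetic in this example can be checked with a few lines of code. This is a small sketch of the percentage test as described in the text, with a hypothetical function name. It is not Grooper's implementation.

```python
def edge_is_long_enough(edge_len, image_len, threshold_pct):
    """True when the edge spans at least threshold_pct percent of the image side."""
    return edge_len * 100 >= threshold_pct * image_len


# Example from the text: a 3" x 4" page inside a 4" x 5" image.
print(edge_is_long_enough(3, 4, 75))  # top/bottom edges: 3/4 = 75% -> True
print(edge_is_long_enough(4, 5, 75))  # left/right edges: 4/5 = 80% -> True
print(edge_is_long_enough(3, 4, 80))  # a stricter 80% Threshold -> False
```

The shortest edge relative to its image side sets the highest Threshold that can still detect all four lines, which here is 75%.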


As a word of caution, if the image is sheared, rotated, or otherwise warped, you may need to set the Threshold lower than you might expect. This is the same page, same size, sheared slightly to the left. The Threshold had to be lowered to 69% in order to extract the page. Also note that warped images like these are where finer Angle Precision can help improve Extract Page's results.
1573835871816-103.png


De-warp the image

Once Extract Page finds the document's lines, it cuts the document out of the carrier image. The image is then automatically de-warped. Any skewing, shearing, or rotation will be automatically fixed. There are three different "Interpolation Modes" to choose from: Linear, Cubic, or NearestNeighbor.

1573836834194-409.png


This choice affects the speed at which the image is de-warped and the quality of the final image. The better the quality of the final image, the slower the operation. "Cubic" is the slowest but most accurate. "NearestNeighbor" is the fastest but least accurate. "Linear" falls in between in terms of both speed and accuracy, and is the default setting.

You can really see the difference between NearestNeighbor and the other two below. If your document is fairly simple like the one we've been using as an example, it's likely there won't be much difference between Linear and Cubic. However, if your document has images or complicated table lines or a lot going on otherwise, Cubic may be necessary if you want the most accurate de-warping performed.


Cubic
1573838498967-406.png
Linear
1573838502838-341.png
NearestNeighbor
1573838505844-197.png
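The difference between the fastest and the default mode comes down to how a pixel value is looked up at a position that falls between pixels. The following is a minimal sketch of nearest-neighbor and linear (bilinear) sampling on a grayscale image stored as a list of rows. It illustrates the general techniques only. The function names and the tiny sample image are hypothetical, and cubic interpolation is omitted for brevity.

```python
import math


def sample_nearest(img, x, y):
    """Take the value of the single closest pixel."""
    h, w = len(img), len(img[0])
    xi = min(max(int(math.floor(x + 0.5)), 0), w - 1)
    yi = min(max(int(math.floor(y + 0.5)), 0), h - 1)
    return img[yi][xi]


def sample_linear(img, x, y):
    """Blend the four surrounding pixels, weighted by distance."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


# Hypothetical 2 x 2 image: a black column next to a white column.
img = [[0, 255],
       [0, 255]]
print(sample_nearest(img, 0.25, 0.5))  # 0
print(sample_linear(img, 0.25, 0.5))   # 63.75
```

Nearest-neighbor sampling can only return values already in the image, which is what produces blocky, jagged lines. Linear sampling blends neighbors, giving smoother results at the cost of more arithmetic per pixel.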


The last property available is "Apply Edge Smoothing". When images are de-warped, pixels from lines that were straight on the original document are put back together and made straight again. However, the operation is never 100% perfect, and some distortion will necessarily occur. As seen above, even the most accurate interpolation mode produces slightly jagged lines. Edge smoothing attempts to even out these lines while preserving their edges. This is a "True" or "False" property, so when turned on it will either successfully smooth edges or it won't. There are no configurable properties to make it work "better".


1573839430384-312.png


Property Details

General Properties

Property: Binarization
Default Value: Auto
Information: Binarization converts color images to black and white by "thresholding" the image. Thresholding is the process of setting a threshold value on the pixel intensity of the original image. Pixel intensity is a pixel's "lightness" or "brightness". Essentially, once a midpoint between the most intense ("whitest") and least intense ("blackest") pixel on a page is established, pixels with an intensity above that threshold are converted to white, and those below it are converted to black. The threshold can be set manually or found automatically by the software. The Thresholding Method can be set to one of four options:
  • Simple - Thresholds an image to black and white using a fixed threshold value between 1 and 255.
  • Auto - Selects a threshold value automatically using Otsu's Method.
  • Adaptive - Thresholds pixels based on the intensity of pixels in the local neighborhood.
  • Dynamic - Performs adaptive thresholding, while preserving dark areas on the page.
Each method has its own set of configurable properties. For more information on binarization and these methods, visit the Binarize article.

Property: Border Size
Default Value: 0.25in
Information: Controls the border size for detecting the edges of a document.

Line Detection Properties

Property: Angle Precision
Default Value: 1/8
Information: The precision at which edge angles are detected, ranging from 1/64 degrees to 1 degree.

Property: Threshold
Default Value: 75%
Information: "Threshold" determines what counts as a line by setting the percentage of the image width or height a detected line must occupy to be considered an edge. The default setting is 75%. So, if the left (vertical) line of the image you want to extract is at least 75% as long as the height of the whole image, it will be detected as an edge. (Note: skewed, sheared, or otherwise warped images might need a lower Threshold than you might expect. The "Zoning" diagnostic image will help configure this property.)

Warp Settings

Property: Interpolation Mode
Default Value: Linear
Information: Determines the accuracy and speed of the warp operation. NearestNeighbor is the fastest and least accurate. Cubic is the most accurate but slowest. Linear falls in between.

Property: Apply Edge Smoothing
Default Value: False
Information: When images are de-warped, pixels from lines that were straight on the original document are put back together and made straight again. However, the operation is never 100% perfect, and some distortion will necessarily occur. Edge smoothing attempts to even out these lines while preserving their edges. This is a "True" or "False" property, so when turned on it will either successfully smooth edges or it won't. There are no configurable properties to make it work "better".