Extract Page (IP Command)

From Grooper Wiki

This article is about the current version of Grooper.

Note that some content may still need to be updated.


Extract Page is an IP Command that extracts a page image from a larger carrier image while removing any image warping or skewing.

The Extract Page IP Command in Grooper is an image processing tool designed to automatically locate, extract, and de-warp a document page from a larger carrier image. This command is especially useful when a document page is scanned or photographed against a background, such as the bed of a flatbed scanner or a desk surface captured by a camera, and the page is not perfectly aligned or is subject to perspective distortion.

What is Extract Page?

The Extract Page command is an IP Command that detects the four edges of a document page within a larger image, even if the page is skewed, rotated, or warped. It then applies a warp operation that de-warps the page, producing a new, rectangular image of the page and correcting for any perspective or alignment issues.

This process is essential for preparing images for downstream tasks such as OCR, data extraction, or archival, ensuring that the page content is properly aligned and free from background noise or distortion.

Before: Skewed page on black carrier background

After: Extracted and de-warped page

When and why to use Extract Page

Use the Extract Page command when you have images where the document page is not perfectly cropped or aligned, such as:

  • Scans of pages placed on a flatbed scanner with visible background.
  • Camera-captured images of documents on a desk or other surface.
  • Documents whose page outline is visible but does not appear perfectly rectangular in the image.

Extract Page is ideal for automating the cleanup of such images, removing unwanted backgrounds, and correcting for skew, shear, or perspective distortion. This results in higher quality images for recognition and extraction processes.

How Extract Page works

The Extract Page command follows these general steps:

  1. The input image is binarized using the configured "Binarization" settings to enhance edge contrast.
  2. Edge detection is performed in the border regions of the image to locate the top, right, bottom, and left page edges.
  3. Each edge is detected independently, allowing for accurate extraction even if the page is skewed, rotated, or subject to perspective distortion.
  4. The four detected edges define a quadrilateral, which is then de-warped using a warp operation to produce a new, rectangular image of the page.
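
The geometry behind step 4 can be illustrated with a short sketch. This is not Grooper's implementation: the line representation (a, b, c), the function names, and the sample edge values are all assumptions made for illustration, and the edge detection itself is not shown. The sketch intersects four edge lines to get the corners of the quadrilateral, then solves for a perspective transform that maps each pixel of the output rectangle back to a position inside the quadrilateral.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Line = Tuple[float, float, float]  # hypothetical form: a*x + b*y = c


def intersect(l1: Line, l2: Line) -> Point:
    """Intersect two lines given as a*x + b*y = c (Cramer's rule)."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        raise ValueError("edges are parallel and do not intersect")
    return ((c1 * b2 - c2 * b1) / det, (a1 * c2 - a2 * c1) / det)


def quad_corners(top: Line, right: Line, bottom: Line, left: Line) -> List[Point]:
    """Corners in order: top-left, top-right, bottom-right, bottom-left."""
    return [intersect(top, left), intersect(top, right),
            intersect(bottom, right), intersect(bottom, left)]


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a*x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        if abs(m[col][col]) < 1e-12:
            raise ValueError("degenerate quadrilateral")
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for k in range(col, n + 1):
                    m[r][k] -= f * m[col][k]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src: Sequence[Point], dst: Sequence[Point]) -> List[float]:
    """Eight coefficients of the perspective transform taking src[i] to dst[i]."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -x * u, -y * u])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -x * v, -y * v])
        rhs.append(v)
    return solve(rows, rhs)


def apply_homography(h: Sequence[float], p: Point) -> Point:
    """Map one point through the transform."""
    x, y = p
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w, (h[3] * x + h[4] * y + h[5]) / w)


# Made-up edges: top y=10, bottom y=110, left x=20, right x = 100 + 0.1*y.
corners = quad_corners((0, 1, 10), (1, -0.1, 100), (0, 1, 110), (1, 0, 20))
width, height = 80.0, 100.0  # output page size chosen for the example
rect = [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height)]
# Inverse mapping: for each output pixel, find where to sample in the input.
h = homography(rect, corners)
print(apply_homography(h, (width / 2, height / 2)))
```

Mapping from the output rectangle back to the input (rather than the other way around) means every output pixel gets exactly one sample position, which is the usual way a warp is combined with an interpolation step.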

Configuration and usage

Extract Page is typically used as part of an IP Profile, which is a sequence of image processing steps. Each step is represented by an IP Step, which specifies an IP Command to execute, such as Extract Page.

To use Extract Page:

  1. Add an IP Step to your IP Profile and set its "Command" property to Extract Page.
  2. Configure the following key properties to tune the extraction process:
    1. "Binarization": Controls how the image is converted to black and white before edge detection. Proper binarization is essential for reliable page extraction.
    2. "Border Size": Defines the thickness of the region along each edge of the image where the algorithm searches for page edges. You can specify unit-aware values (e.g., 0.25in, 10px, 6mm).
    3. "Angle Precision": Sets the angular increment (in degrees) for edge angle detection. Lower values increase precision but may slow processing.
    4. "Threshold": Defines the minimum strength required for a detected line to be considered a valid page edge, as a percentage of the image width or height.
    5. "Interpolation Mode": Determines the algorithm used to resample pixels during the warp operation. "Cubic" is recommended for most documents.
    6. "Apply Edge Smoothing": Enables or disables additional smoothing of the page edges during the warp operation.
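
To make the numeric properties above more concrete, the following sketch shows how such values could translate into pixel quantities. It is illustrative only: the function names, the 300 DPI resolution, the maximum search angle, and the choice of which image dimension a threshold applies to are assumptions, not documented Grooper behavior.

```python
import re

# Units per inch for the unit suffixes mentioned above (in, mm); px is handled separately.
_UNITS_PER_INCH = {"in": 1.0, "mm": 25.4}


def to_pixels(value: str, dpi: float) -> int:
    """Convert a unit-aware value such as '0.25in', '10px' or '6mm' to pixels."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(px|in|mm)\s*", value)
    if match is None:
        raise ValueError(f"unrecognized length: {value!r}")
    number, unit = float(match.group(1)), match.group(2)
    if unit == "px":
        return round(number)
    return round(number / _UNITS_PER_INCH[unit] * dpi)


def min_line_length(threshold_percent: float, extent_px: int) -> float:
    """Minimum line strength in pixels for a threshold given as a percentage
    of an image dimension (assumed: width for top/bottom, height for left/right)."""
    return threshold_percent / 100.0 * extent_px


def candidate_angles(precision_deg: float, max_angle_deg: float) -> list:
    """Angles tested between -max and +max in steps of the angle precision.
    max_angle_deg is a hypothetical search limit, not a Grooper property."""
    steps = int(max_angle_deg / precision_deg + 1e-9)
    return [i * precision_deg for i in range(-steps, steps + 1)]


print(to_pixels("0.25in", 300), to_pixels("10px", 300), to_pixels("6mm", 300))
print(min_line_length(50, 2000))
print(len(candidate_angles(0.5, 2.0)), len(candidate_angles(0.1, 2.0)))
```

The last line shows the trade-off described for "Angle Precision": a smaller increment means more candidate angles to test for each edge, which is why lower values can slow processing.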

Example scenarios

  • Light page on dark background: The page is placed on a black or significantly darker background. The algorithm detects the strong contrast at the page edges.
  • Light page on light background: The page is on a background similar in color to the page itself, but the outline is still visible. The algorithm detects subtle edge features.
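
The difference between the two scenarios can be seen on a single row of pixels that crosses the page edge. The sketch below uses made-up grayscale values and a plain fixed threshold as a stand-in for the "Binarization" settings; it is not how Grooper binarizes images, only an illustration of why the threshold matters more when page and background are similar in brightness.

```python
def edge_contrast(scanline):
    """Largest absolute difference between neighboring pixel values."""
    return max(abs(b - a) for a, b in zip(scanline, scanline[1:]))


def binarize(scanline, threshold):
    """1 for pixels at or above the threshold, 0 otherwise."""
    return [1 if value >= threshold else 0 for value in scanline]


# Hypothetical 8-bit grayscale rows: four background pixels, then four page pixels.
dark_background = [12, 10, 14, 11, 240, 243, 241, 244]
light_background = [222, 224, 221, 223, 240, 243, 241, 244]

print(edge_contrast(dark_background), edge_contrast(light_background))

# A mid-range threshold separates page from a dark background...
print(binarize(dark_background, 128))
# ...but loses the edge entirely on a light background,
print(binarize(light_background, 128))
# where only a threshold between the two brightness levels keeps it.
print(binarize(light_background, 232))
```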

Diagnostics and tuning

When run in diagnostic mode, Extract Page generates output images showing the binarized input, detected edges, and the quadrilateral region used for extraction. These diagnostics are essential for tuning the command for your specific document types.

  • Use the "Binarized" diagnostic image to review the effect of your binarization settings.
  • Use the "Edges" diagnostic image to see which lines are detected.
  • Use the "Zoning" diagnostic image to visualize the border region and detected edges.

Supported pixel formats

Extract Page supports all common pixel formats, including Pixel8bppGrayscale, Pixel24bppBgr, and Pixel1bppIndexed. Images are automatically converted as needed for edge detection and warping.
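
As an illustration of what such a conversion involves, the sketch below turns 24-bit BGR pixel data into 8-bit grayscale using the common BT.601 luma weights. The exact conversion Grooper performs is not described here, so the weights, the function names, and the assumption of tightly packed rows (no stride padding) are illustrative only.

```python
def bgr_to_gray(pixel):
    """Convert one (blue, green, red) pixel to an 8-bit gray value."""
    b, g, r = pixel
    return round(0.114 * b + 0.587 * g + 0.299 * r)


def bgr_bytes_to_gray(data: bytes) -> bytes:
    """Convert packed BGR bytes (3 per pixel, no row padding) to gray bytes."""
    if len(data) % 3 != 0:
        raise ValueError("expected 3 bytes per pixel")
    return bytes(bgr_to_gray(data[i:i + 3]) for i in range(0, len(data), 3))


# Four sample pixels: white, black, pure red, pure blue (byte order is B, G, R).
sample = bytes([255, 255, 255,  0, 0, 0,  0, 0, 255,  255, 0, 0])
print(list(bgr_bytes_to_gray(sample)))
```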

Best practices

  • Start with the default settings and review diagnostic images to verify edge detection.
  • Adjust "Binarization" and "Border Size" to ensure page edges are clearly visible.
  • Fine-tune "Angle Precision" and "Threshold" for your specific document and image conditions.
  • Use "Cubic" interpolation for best quality unless performance is a concern.
  • Enable "Apply Edge Smoothing" if the output image shows jagged or rough edges.
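
The interpolation recommendation can be illustrated on a single row of pixels. During a warp, output pixels usually fall between input pixels, and the interpolation mode decides how the in-between value is computed. The sketch below compares nearest-neighbor, linear, and a Catmull-Rom cubic in one dimension; which cubic kernel Grooper's "Cubic" mode uses is not stated here, so Catmull-Rom is an assumption, and the function names are hypothetical.

```python
import math


def _pixel(row, index):
    """Read a pixel, clamping the index to the row bounds."""
    return row[min(max(index, 0), len(row) - 1)]


def sample_nearest(row, x):
    """Value of the closest pixel."""
    return _pixel(row, math.floor(x + 0.5))


def sample_linear(row, x):
    """Weighted average of the two neighboring pixels."""
    i = math.floor(x)
    t = x - i
    return (1 - t) * _pixel(row, i) + t * _pixel(row, i + 1)


def sample_cubic(row, x):
    """Catmull-Rom cubic through the four neighboring pixels."""
    i = math.floor(x)
    t = x - i
    p0, p1, p2, p3 = (_pixel(row, i + k) for k in (-1, 0, 1, 2))
    return 0.5 * (2 * p1
                  + (-p0 + p2) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t ** 2
                  + (-p0 + 3 * p1 - 3 * p2 + p3) * t ** 3)


row = [0, 0, 255, 255, 0, 0]  # hypothetical row with a bright stroke
for x in (1.25, 1.5, 1.75):
    print(x, sample_nearest(row, x), sample_linear(row, x), sample_cubic(row, x))
```

Nearest-neighbor sampling is the cheapest but produces stair-stepped edges; the cubic reads four pixels per sample instead of two, which is where the extra processing cost comes from.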

Integration in IP Profiles

Extract Page can be combined with other IP Commands in an IP Profile to create a complete image processing workflow. Each command is applied in sequence, with the output of one command serving as the input to the next.
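
Conceptually, this chaining is a simple fold over a list of steps. The sketch below models it in Python with three toy steps; the step functions and the run_profile helper are hypothetical stand-ins and do not correspond to actual Grooper commands or APIs.

```python
from typing import Callable, List, Sequence

Image = List[List[int]]          # toy image: rows of 8-bit gray values
Step = Callable[[Image], Image]  # a step takes an image and returns a new one


def run_profile(image: Image, steps: Sequence[Step]) -> Image:
    """Apply each step in order, feeding each output into the next step."""
    for step in steps:
        image = step(image)
    return image


def crop_border(image: Image) -> Image:
    """Toy step: drop a one-pixel border."""
    return [row[1:-1] for row in image[1:-1]]


def threshold(image: Image) -> Image:
    """Toy step: map each pixel to 0 or 255."""
    return [[255 if value >= 128 else 0 for value in row] for row in image]


def invert(image: Image) -> Image:
    """Toy step: invert gray values."""
    return [[255 - value for value in row] for row in image]


carrier = [[0, 0, 0],
           [0, 200, 0],
           [0, 0, 0]]
print(run_profile(carrier, [crop_border, threshold, invert]))
```

Because each step only sees the output of the previous one, the order of steps matters: a step that depends on the page being cropped and aligned would come after Extract Page in the sequence.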

Summary

The Extract Page IP Command is a robust solution for extracting and de-warping document pages from carrier images in Grooper. By accurately detecting page edges and correcting for perspective, it ensures that your images are clean, properly aligned, and ready for further processing.

For more information on configuring and using Extract Page, see the documentation for IP Command, IP Step, and IP Profile.