Line Detection and Line Removal

From Grooper Wiki

Line Detection and Line Removal are two closely related IP Commands in Grooper. Both are designed to identify horizontal and vertical lines in document images, such as those found in forms, tables, or pre-printed backgrounds. Both commands generate layout data describing the position and orientation of detected lines, which can be used for downstream processing. However, while Line Detection only detects and analyzes lines, Line Removal will also remove lines from the image, cleaning up visual noise that may interfere with OCR or data extraction.

What is Line Detection?

Line Detection is an IP Command that locates horizontal and vertical lines on documents. The detected line locations are stored as part of the page's layout data.

Line Detection is an IP Command in Grooper. It can be added to a step in an IP Profile or IP Group. The Line Detection command is designed to identify horizontal and vertical lines on document pages. This is essential for the extractors and features that rely on line locations. For example, the Tabular Layout table extract method uses line locations to improve how it detects row, column, and cell boundaries.

When the Line Detection command is executed, it:

  1. Optionally preprocesses the image (including binarization and font dropout) to enhance line visibility.
  2. Scans for horizontal and vertical runs of black pixels, applying configurable thresholds for length, thickness, aspect ratio, and fill.
  3. Detects lines using a combination of morphological and geometric analysis, with support for advanced features such as comb detection, speck removal, and dash sequence handling.
  4. Saves layout data describing the position and orientation of detected lines, which can be used for table extraction, form field alignment, or visual overlays.

Line Detection does not alter the original image. Its primary output is the set of detected line objects, which can be used by downstream layout analysis or extraction logic.
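
As a rough illustration of step 2 above, the sketch below scans one row of an already binarized image for runs of black pixels, bridges small gaps, and keeps only the runs long enough to count as a line. This is not Grooper's implementation. The function names and the parameters `min_run`, `max_gap`, and `min_length` are hypothetical stand-ins, loosely modeled on Minimum Run Length, Maximum Line Gap, and Minimum Line Length and expressed in pixels.

```python
from typing import List, Tuple

Run = Tuple[int, int]  # (start_x, end_x), end exclusive


def black_runs(row: List[int], min_run: int) -> List[Run]:
    """Return runs of consecutive black pixels (value 1) at least min_run long."""
    runs: List[Run] = []
    start = None
    for x, pixel in enumerate(row + [0]):  # trailing 0 closes a run at the edge
        if pixel and start is None:
            start = x
        elif not pixel and start is not None:
            if x - start >= min_run:
                runs.append((start, x))
            start = None
    return runs


def merge_runs(runs: List[Run], max_gap: int) -> List[Run]:
    """Join neighbouring runs separated by at most max_gap white pixels."""
    merged: List[Run] = []
    for start, end in runs:
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def detect_row_lines(row: List[int], min_run: int = 2,
                     max_gap: int = 1, min_length: int = 8) -> List[Run]:
    """Keep merged runs that are long enough to be treated as a line."""
    candidates = merge_runs(black_runs(row, min_run), max_gap)
    return [(s, e) for s, e in candidates if e - s >= min_length]


# Two runs (4 and 5 pixels) separated by a 1-pixel gap, plus an isolated speck.
row = [0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0]
print(detect_row_lines(row))  # prints [(1, 11)]
```

With `max_gap=0` the two runs are not joined, neither reaches `min_length`, and nothing is reported. This mirrors how tightening a gap setting can cause broken lines to be missed.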

Use cases for Line Detection

Grooper can use the layout data generated by Line Detection to:

  • Assist the Tabular Layout and Grid Layout table extract methods in identifying table boundaries and grid structures.
  • Help Tab Marking break up a document's text structure, inserting tabs where vertical lines are present.
  • Assist Paragraph Marking in determining divisions between paragraphs.
  • Generate layout data for various other features that utilize line locations, such as extractors that snap results to nearby line locations with their "Line Snap Options".
  • Provide visual overlays for reviewing detected lines in diagnostics or the UI.
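
To make the Tab Marking use case above more concrete, here is a minimal sketch of how vertical line positions could be used to insert tabs between words. It assumes a simplified model that is not taken from Grooper: each word is a hypothetical `(text, left_x, right_x)` tuple on a single text line, and `vertical_line_xs` holds the x-coordinates of detected vertical lines.

```python
from typing import List, Tuple

Word = Tuple[str, float, float]  # (text, left_x, right_x); hypothetical structure


def mark_tabs(words: List[Word], vertical_line_xs: List[float]) -> str:
    """Join words with a tab wherever a vertical line falls between them."""
    ordered = sorted(words, key=lambda w: w[1])
    parts: List[str] = []
    for i, (text, left, _right) in enumerate(ordered):
        if i > 0:
            prev_right = ordered[i - 1][2]
            # A line whose x lies in the gap between two words separates them.
            crosses_line = any(prev_right <= x <= left for x in vertical_line_xs)
            parts.append("\t" if crosses_line else " ")
        parts.append(text)
    return "".join(parts)


words = [("Invoice", 10, 60), ("No.", 65, 85), ("12345", 120, 160)]
print(repr(mark_tabs(words, [100])))  # prints 'Invoice No.\t12345'
```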

General configuration steps

To use Line Detection:

  1. Right-click an IP Profile or IP Group to add a Line Detection IP Step.
  2. Select "Add Command" then "Feature Detection" then "Line Detection".
  3. Adjust detection settings as needed.
    • The detection settings are:
      • Minimum Line Length
      • Maximum Line Thickness
      • Maximum Line Gap
      • Minimum Aspect Ratio
      • Maximum Edge Noise
      • Trim Distance
      • Minimum Run Length
      • Advanced Line Detection
      • Dash Detection
    • The default settings work well for most scenarios. Check out the About Line Detection settings section for more information on each of these properties.
  4. Adjust "Comb Removal" settings as needed.
    • Combs are short line segments connected to perpendicular lines. Think about forms that have grid lines or boxes for you to print each letter of your name. These are called "comb boxes". The vertical lines connected to horizontal lines are called "combs".
    • The Comb Removal settings are:
      • Comb Removal
      • Minimum Length
      • Minimum Fill
      • Quiet Zone Size
      • Minimum Weight
      • Maximum Speck Size
    • The default settings work well for most scenarios. Check out the About Line Detection settings section for more information on each of these properties.
  5. Adjust "Image Preprocessing" settings as needed.
    • These settings optimize line visibility, improving Line Detection's ability to detect lines. Preprocessing adjustments only occur prior to detecting lines and do not alter the final image.
    • "Binarization Settings" control how color or grayscale images are turned black and white. Line Detection must occur on a black and white image.
    • "Dropout Font Size" defines the largest font size to be dropped out during preprocessing.

  6. Adjust the advanced settings for comb detection, speck removal, and dash sequence handling as needed.
  7. Test the IP Step/IP Group/IP Profile using the "Tester" tab. Use the Diagnostics panel's images and files to visualize results.
    • Review the "Binarized" and "Preprocessed" diagnostic images to ensure lines are clearly separated from the background.
    • Review the "Dropout Mask" and "Trim Mask" diagnostics to see which regions would be affected by line removal.
    • Fine-tune the Line Detection settings as needed.
  8. The Line Detection step will execute as part of the IP Profile's execution flow.

The detected line information is stored as layout data, making it available for downstream activities such as data extraction, classification, or routing.

What is Line Removal?

Line Removal is an IP Command that locates and removes horizontal and vertical lines from documents. The detected line locations are stored as part of the page's layout data.

Line Removal is an IP Command in Grooper. It can be added to a step in an IP Profile or IP Group. Line Removal has two purposes:

  • Like Line Detection, it identifies horizontal and vertical lines in document images, generating layout data for downstream use.
  • Line Removal goes one step further by digitally erasing detected lines from the image. This can eliminate unwanted line content before OCR occurs, improving OCR accuracy and subsequent processing downstream.

When the Line Removal command is executed, it:

  1. Optionally preprocesses the image (including binarization and font dropout) to enhance line visibility.
  2. Detects lines using the same configurable detection logic as Line Detection.
  3. Generates a dropout mask to cover each detected line, optionally including a trim distance around lines to ensure clean removal.
  4. Removes the masked regions from the image according to the "Dropout Method" setting.
  5. Outputs the cleaned image with lines removed, as well as layout data describing the detected lines.

Line Removal is most useful for cleaning up forms, tables, or pre-printed backgrounds prior to OCR or data extraction, ensuring that only relevant content remains on the image.
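
Steps 3 and 4 above (building a dropout mask, then removing the masked regions) can be sketched as follows. This is an illustrative model only. Images are plain lists of 0 (white) and 1 (black), detected lines are hypothetical `(left, top, right, bottom)` rectangles, and the trim distance is approximated as simple padding around each rectangle rather than the connected-component trimming Grooper describes. A white background is assumed for the fill value.

```python
from typing import List, Tuple

Image = List[List[int]]            # 0 = white, 1 = black
Line = Tuple[int, int, int, int]   # (left, top, right, bottom), right/bottom exclusive


def build_dropout_mask(width: int, height: int,
                       lines: List[Line], trim: int = 0) -> Image:
    """Mark every pixel covered by a detected line, padded by `trim` pixels."""
    mask = [[0] * width for _ in range(height)]
    for left, top, right, bottom in lines:
        for y in range(max(0, top - trim), min(height, bottom + trim)):
            for x in range(max(0, left - trim), min(width, right + trim)):
                mask[y][x] = 1
    return mask


def fill_dropout(image: Image, mask: Image, fill_value: int = 0) -> Image:
    """Return a copy of the image with masked pixels replaced by fill_value."""
    return [
        [fill_value if m else p for p, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


image = [
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],   # a horizontal line across the full width
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
]
lines = [(0, 2, 6, 3)]     # layout data: one detected horizontal line

mask = build_dropout_mask(6, 5, lines)
cleaned = fill_dropout(image, mask)
print(cleaned[2])  # prints [0, 0, 0, 0, 0, 0]
```

The function returns both a cleaned copy and leaves `lines` intact, which parallels the command producing a cleaned image alongside layout data for the detected lines.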

Use cases for Line Removal

Line Removal's use cases include all those of Line Detection, plus:

  • Cleaning up document images before OCR to prevent lines from interfering with text recognition.
  • Removing table borders, underlines, or grid lines that are not needed for downstream processing.
  • Ensuring that only relevant content remains on the image for data extraction or export.

General configuration steps

To use Line Removal:

  1. Right-click an IP Profile or IP Group to add a Line Removal IP Step.
  2. Select "Add Command" then "Feature Removal" then "Line Removal".
  3. Adjust detection settings as needed.
    • The detection settings are:
      • Minimum Line Length
      • Maximum Line Thickness
      • Maximum Line Gap
      • Minimum Aspect Ratio
      • Maximum Edge Noise
      • Trim Distance
      • Minimum Run Length
      • Advanced Line Detection
      • Dash Detection
    • The default settings work well for most scenarios. Check out the About Line Detection settings section for more information on each of these properties.
  4. Adjust "Comb Removal" settings as needed.
    • Combs are short line segments connected to perpendicular lines. Think about forms that have grid lines or boxes for you to print each letter of your name. These are called "comb boxes". The vertical lines connected to horizontal lines are called "combs".
    • The Comb Removal settings are:
      • Comb Removal
      • Minimum Length
      • Minimum Fill
      • Quiet Zone Size
      • Minimum Weight
      • Maximum Speck Size
    • The default settings work well for most scenarios. Check out the About Line Detection settings section for more information on each of these properties.
  5. Adjust "Image Preprocessing" settings as needed.
    • These settings optimize line visibility, improving the command's ability to detect lines. Preprocessing adjustments only occur prior to detecting lines and are not themselves carried over to the output image.
    • "Binarization Settings" control how color or grayscale images are turned black and white. Line detection must occur on a black and white image.
    • "Dropout Font Size" defines the largest font size to be dropped out during preprocessing.
  6. Configure "Dropout Method" settings as needed.
    • There are two Dropout Methods to choose from:
      • Fill (Default) - Replaces lines with a solid color. This method is generally sufficient for temporarily removing lines when applied by an OCR Profile.
        • By default, the Fill Color is set to "none". When set to "none", Grooper will detect a color from the image's background and use that.
      • Inpaint - Digitally restores masked regions by estimating and filling in missing or damaged areas using advanced algorithms. When removing lines that overlap text, this method can help preserve overlapping text.
    • (Optional) The "Mask Dilation Factor" will dilate or erode the dropout mask applied to the image. Increasing this value will help ensure no stray artifacts attached to lines remain, but setting it too high may remove nearby content.
      • Setting the "Trim Distance" property will also control how much area around lines is removed.
  7. Test the IP Step/IP Group/IP Profile using the "Tester" tab. Use the Diagnostics panel's images and files to visualize results.
    • Review the "Binarized", "Preprocessed", "Dropout Mask", and "Output Image" diagnostics to ensure lines are being removed as intended.
    • Fine-tune the Line Removal settings as needed.
  8. The Line Removal step will execute as part of the IP Profile's execution flow.
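
The effect of the "Mask Dilation Factor" in step 6 can be pictured with standard morphological dilation and erosion. The sketch below is an assumption about the general technique, not Grooper's code: `adjust_mask` and its integer `factor` (a pixel radius, positive to grow the mask and negative to shrink it) are hypothetical names, and a square neighbourhood is used for simplicity.

```python
from typing import List

Mask = List[List[int]]  # 1 = pixel will be dropped out


def dilate(mask: Mask, radius: int) -> Mask:
    """Grow every masked pixel into a square neighbourhood of the given radius."""
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                    for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                        out[ny][nx] = 1
    return out


def erode(mask: Mask, radius: int) -> Mask:
    """Shrink the mask: erosion is the complement of dilating the complement.

    Pixels outside the image are treated as masked, so the mask does not
    shrink inward from the image border.
    """
    inverted = [[1 - v for v in row] for row in mask]
    grown = dilate(inverted, radius)
    return [[1 - v for v in row] for row in grown]


def adjust_mask(mask: Mask, factor: int) -> Mask:
    """Positive factor dilates, negative erodes, zero returns a copy."""
    if factor > 0:
        return dilate(mask, factor)
    if factor < 0:
        return erode(mask, -factor)
    return [row[:] for row in mask]


mask = [[0] * 5, [0] * 5, [1] * 5, [0] * 5, [0] * 5]
grown = adjust_mask(mask, 1)
print(grown[1], grown[3])  # prints [1, 1, 1, 1, 1] [1, 1, 1, 1, 1]
```

Growing the mask by one pixel also covers the rows directly above and below the line, which is why a large value can start to erase nearby text.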

Similarities and differences between Line Detection and Line Removal

Similarities

The big takeaway: Both Line Detection and Line Removal detect lines and generate layout data.

  • Both commands detect horizontal and vertical lines in document images.
  • Both generate layout data describing the position and orientation of detected lines, which can be used for downstream processing.
  • Both use the same detection technology and support the same range of pixel formats.
  • Both can be configured with the same detection properties, including thresholds for line length, thickness, aspect ratio, fill, and advanced features like comb detection and speck removal.
  • Both provide diagnostic output to assist with configuration and troubleshooting.
  • Both are included as steps in an IP Profile, allowing them to be combined with other image processing operations.

Differences

The big takeaway: Line Detection does not alter the image. Line Removal does alter the image.

  • Both Line Detection and Line Removal detect lines and store their information as layout data.
  • Only Line Removal removes detected lines from images, masking them out before further processing.
    • This is most useful for image cleanup prior to OCR or data extraction.
    • Lines will be removed from an image permanently when Line Removal is applied by the Image Processing activity.
    • Lines will be removed in-memory prior to OCR when Line Removal is applied by the Recognize activity.
  • Only Line Removal has Feature Removal properties such as "Dropout Method" and "Mask Dilation Factor".

When to use each command

  • Use Line Detection when you only need to extract line information for table extraction, field alignment, or layout analysis without altering the image.
  • Use Line Removal when you want to eliminate lines or line-like noise prior to OCR or data extraction. Line Removal still stores the detected line locations as layout data.

About Line Detection settings

Line Detection and Line Removal have many configurable properties. In the majority of cases, the default values work well. However, each property allows you to fine-tune detection for your specific documents, table structures, or form layouts.

Below is a detailed explanation of each property, including its purpose, usage, and configuration guidance. Only the properties that both Line Detection and Line Removal share are documented here.

Detection Settings properties

Minimum Line Length
Specifies the shortest allowable length for a feature to be considered a line. Accepts unit-aware values (e.g., 0.2in, 10px, 5pt), and can be set independently for horizontal and vertical lines. Increase to ignore short marks or noise; decrease to detect short lines such as underlines or checkboxes.
Maximum Line Thickness
Sets the thickest a feature can be to be considered a line. Measured perpendicular to the line's orientation. Lower values detect only fine lines; higher values allow bold or double lines. Adjust to avoid detecting boxes or filled shapes as lines.
Maximum Line Gap
Defines the largest break (in display units, e.g., 1pt) that can occur within a line while still treating it as a single, continuous line. Increase to detect dashed or broken lines; decrease to require more solid lines.
Minimum Aspect Ratio
Sets the minimum ratio of length to thickness for a feature to be classified as a line. Higher values require lines to be longer and thinner, filtering out boxes or blobs. Lower values allow shorter or thicker lines. Example: a value of 10 requires a line to be at least 10 times longer than it is thick.
Minimum Line Fill
Defines how solid a feature must be to be detected as a line. Fill percentage is the ratio of black pixels along the line's path to the total possible pixels. Set to 1.0 (100%) to detect only solid lines; lower values allow detection of dotted, dashed, or faint lines.
Maximum Edge Noise
Sets the highest percentage of non-line (noise) pixels permitted along the edge of a line for it to be considered valid. Lower values require cleaner lines; higher values allow detection of lines with more adjacent noise or touching marks.
Trim Distance
Sets the radius (in display units, e.g., 2px) around each detected line within which connected components (such as text or marks) will be trimmed or removed. Increase to remove more area around lines; decrease to preserve nearby features.
Minimum Run Length
Defines the shortest continuous sequence of black pixels (in display units) that will be considered as part of a line during detection. Increase to require longer uninterrupted runs; decrease to detect shorter lines or underlines.
Advanced Line Detection (Level)
Controls the use and precision of advanced line detection based on the Hough transform. Options include Off, Low, Medium, and High. Higher levels increase sensitivity and precision, but may increase processing time. Use for faint, broken, or skewed lines.
Angle Tolerance
Sets the allowable angular deviation from 0° (horizontal) or 90° (vertical) for detected lines. Lower values require lines to be nearly perfectly horizontal or vertical; higher values allow detection of lines with more skew. Useful for handling slightly skewed or rotated forms.
Dash Detection
Enables or disables detection of dashed lines. When enabled, the command will attempt to detect and process dashed or dotted line patterns as single lines.
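
The way several of these thresholds combine can be sketched in a few lines. The example below converts unit-aware values such as `0.2in`, `10px`, or `5pt` to pixels (using 72 points per inch) and then checks a candidate against length, thickness, aspect ratio, and fill limits. The function names, the default values, and the order of the checks are assumptions made for illustration; they are not Grooper's actual defaults or logic.

```python
import re


def to_pixels(value: str, dpi: int = 300) -> float:
    """Convert a measurement like '0.2in', '10px', or '5pt' to pixels."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*(in|px|pt)\s*", value)
    if not match:
        raise ValueError(f"unsupported measurement: {value!r}")
    number, unit = float(match.group(1)), match.group(2)
    if unit == "in":
        return number * dpi
    if unit == "pt":
        return number * dpi / 72  # 72 points per inch
    return number  # already in pixels


def is_line(length_px: float, thickness_px: float, fill: float, *,
            min_length: str = "0.2in", max_thickness: str = "4pt",
            min_aspect: float = 10, min_fill: float = 0.8,
            dpi: int = 300) -> bool:
    """Apply hypothetical length, thickness, aspect ratio, and fill limits.

    thickness_px must be greater than zero.
    """
    if length_px < to_pixels(min_length, dpi):
        return False                      # too short (Minimum Line Length)
    if thickness_px > to_pixels(max_thickness, dpi):
        return False                      # too thick (Maximum Line Thickness)
    if length_px / thickness_px < min_aspect:
        return False                      # too stubby (Minimum Aspect Ratio)
    return fill >= min_fill               # solid enough (Minimum Line Fill)


print(is_line(600, 3, 0.95))   # long, thin, solid: prints True
print(is_line(100, 15, 1.0))   # aspect ratio about 6.7: prints False
print(is_line(40, 30, 1.0))    # shorter than 0.2in at 300 dpi: prints False
```

Note that the same unit-aware value maps to a different pixel count at a different resolution, so a candidate that passes at 300 dpi may fail at 200 dpi if it is measured in pixels.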

Comb Removal properties

Comb Removal
Controls whether and how combs (short line segments connected to perpendicular lines) are detected and removed. Options include None, Single, Double, and All. Use to clean up checkboxes, grid stubs, or dense table grids.
Minimum Length
Sets the shortest length (in display units) for a line segment to be classified as a comb. Must be less than "Minimum Line Length". Adjust to match the size of checkboxes or grid stubs.
Minimum Fill
Defines how solid a rectangle must be to be classified as a comb. Fill percentage is the ratio of black pixels to the total area. Higher values require combs to be mostly solid, reducing false positives.
Quiet Zone Size
Defines a region on each side of a single-connected comb that must be relatively free of feature pixels. Increase to require more whitespace around combs, reducing false positives.
Minimum Weight
Works with the quiet zone to ensure that the area around a comb is sufficiently clear of other features. Higher values require stricter separation between combs and surrounding features.
Maximum Speck Size
Sets the largest size (in display units) for artifacts (specks) that touch the edge of a line and should be removed during comb removal. Increase to remove larger artifacts; decrease to preserve small marks.
  • Although this property is grouped with the Comb Removal settings, speck removal occurs regardless of whether Comb Removal is enabled.
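
The fill ratio described under Minimum Fill (black pixels divided by total area) and the constraint that a comb's Minimum Length must be less than Minimum Line Length can be illustrated with a small sketch. The rectangle format and the `is_comb_candidate` check are hypothetical; in particular, treating a comb as a segment shorter than the minimum line length is an assumption, and quiet zone and weight checks are left out.

```python
from typing import List, Tuple

Image = List[List[int]]            # 0 = white, 1 = black
Rect = Tuple[int, int, int, int]   # (left, top, right, bottom), right/bottom exclusive


def rect_fill(image: Image, rect: Rect) -> float:
    """Ratio of black pixels to total area inside the rectangle."""
    left, top, right, bottom = rect
    area = (right - left) * (bottom - top)
    if area <= 0:
        raise ValueError("rectangle must have a positive area")
    black = sum(image[y][x] for y in range(top, bottom) for x in range(left, right))
    return black / area


def is_comb_candidate(image: Image, rect: Rect, *, min_length: int,
                      min_line_length: int, min_fill: float) -> bool:
    """Hypothetical check: long enough for a comb, shorter than a line, mostly solid."""
    if min_length >= min_line_length:
        raise ValueError("Minimum Length must be less than Minimum Line Length")
    left, top, right, bottom = rect
    length = max(right - left, bottom - top)
    return (min_length <= length < min_line_length
            and rect_fill(image, rect) >= min_fill)


# A short vertical stub (x = 1, rows 0 to 2) attached to a horizontal line (row 3).
image = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
]
stub = (1, 0, 2, 3)
print(is_comb_candidate(image, stub, min_length=2, min_line_length=10, min_fill=0.9))
# prints True
```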

Image Preprocessing properties

Binarization Settings
Specifies the method and parameters used to convert color or grayscale images to black and white for line detection. Options include Auto, Adaptive, and Simple. Tuning binarization is essential for reliable line detection.
Dropout Font Size
Sets the maximum font size (in display units, e.g., 14pt) to be dropped out during image preprocessing. Font dropout removes text from the image before line detection, helping to prevent text from being mistaken for lines.
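
For readers unfamiliar with binarization, the sketch below shows the simplest form: a fixed global threshold applied to a grayscale image. Whether Grooper's "Simple" option works exactly this way is an assumption, and the `binarize` function and its default threshold of 128 are hypothetical. Dark pixels become 1 (black) so that the result matches the black-pixel convention used by line detection.

```python
from typing import List


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Map 0-255 grayscale values to 1 (black) below the threshold, else 0 (white)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


gray = [
    [250, 40, 245],
    [30, 35, 20],     # a dark horizontal stroke
    [240, 250, 60],
]
print(binarize(gray))  # prints [[0, 1, 0], [1, 1, 1], [0, 0, 1]]
```

A threshold that is too low can turn faint lines white before detection ever sees them, which is one reason to review the "Binarized" diagnostic image when lines are missed.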

Summary Table

Property | Purpose | Typical Usage

Detection Settings

Minimum Line Length | Shortest length for a line | Ignore short marks or detect underlines
Maximum Line Thickness | Thickest allowed line | Detect fine or bold lines
Maximum Line Gap | Largest break allowed in a line | Detect dashed/broken lines
Minimum Aspect Ratio | Length-to-thickness ratio | Filter out boxes/blobs
Minimum Line Fill | How solid a line must be | Detect faint/dotted lines
Maximum Edge Noise | Allowed noise along line edge | Handle noisy images
Trim Distance | Area trimmed around lines | Remove fragments near lines
Minimum Run Length | Shortest run of black pixels | Filter out specks/noise
Advanced Line Detection | Hough transform level | Detect skewed/faint lines
Angle Tolerance | Allowed skew angle | Handle rotated forms
Dash Detection | Detect dashed lines | Process dotted/dashed lines

Comb Removal

Comb Removal | Remove combs (short stubs) | Clean up comb boxes/grids
Minimum Length | Shortest comb length | Target checkboxes/stubs
Minimum Fill | Fill for combs | Reduce false positives
Quiet Zone Size | Whitespace around combs | Avoid misclassifying text
Minimum Weight | Quiet zone fill | Stricter comb separation
Maximum Speck Size | Remove small artifacts | Clean up noise

Image Preprocessing

Binarization Settings | Convert to black/white | Optimize detection
Dropout Font Size | Remove text before detection | Prevent text/line confusion

Configuration tips

  • Start with default values and review diagnostic images to assess detection quality.
  • Adjust properties incrementally, testing on multiple sample documents.
  • Use diagnostic outputs ("Binarized", "Preprocessed", "Dropout Mask", etc.) to visualize the effect of each property.
  • Fine-tune for your specific document set, balancing thorough line detection with preservation of important content.

When are IP Profiles executed?

An IP Profile is executed whenever Grooper needs to process an image using a defined sequence or hierarchy of image processing operations. Execution typically occurs in the following scenarios:

  • By the Image Processing activity: The Image Processing activity will apply the IP Profile and permanently alter the image.
  • By an OCR Profile: OCR Profiles configured with an IP Profile will run the IP Profile on an image prior to handing it to the OCR engine. The image will not be permanently altered.
  • By the Recognize activity's "Alternate IP" configuration: IP Profiles executed by this configuration will only execute feature detection commands (such as Line Detection) to collect layout data.
  • During a Review step: Users can manually execute an IP Profile from the Thumbnail Viewer (if configured to allow the user to do so).

Execution follows the order and logic defined in the IP Profile, including any conditional flow control or branching. Each step or group within the profile is applied in sequence, transforming the input image and producing results for each subsequent step.
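
The sequential execution model, and the difference between permanently altering an image and altering it only in memory, can be sketched as a simple pipeline. Everything here is a hypothetical model rather than Grooper's API: a step is a function that receives the current image and a shared layout dictionary, and `run_profile` always works on a copy so the caller decides whether to keep the result.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]                 # 0 = white, 1 = black
Step = Callable[[Image, Dict], Image]   # hypothetical step signature


def run_profile(image: Image, steps: List[Step]) -> Tuple[Image, Dict]:
    """Apply each step in order to a copy of the image, collecting layout data."""
    layout: Dict = {}
    current = [row[:] for row in image]
    for step in steps:
        current = step(current, layout)
    return current, layout


def detect_full_rows(image: Image, layout: Dict) -> Image:
    """Detection-style step: records rows that are entirely black, image unchanged."""
    layout["horizontal_lines"] = [y for y, row in enumerate(image) if all(row)]
    return image


def remove_full_rows(image: Image, layout: Dict) -> Image:
    """Removal-style step: whitens the rows recorded by the detection step."""
    rows = set(layout.get("horizontal_lines", []))
    return [[0] * len(row) if y in rows else row for y, row in enumerate(image)]


page = [[0, 1, 0], [1, 1, 1], [0, 0, 1]]
processed, layout = run_profile(page, [detect_full_rows, remove_full_rows])

print(layout)     # prints {'horizontal_lines': [1]}
print(processed)  # prints [[0, 1, 0], [0, 0, 0], [0, 0, 1]]
print(page)       # prints [[0, 1, 0], [1, 1, 1], [0, 0, 1]]
```

In this model, saving `processed` back over `page` corresponds to the permanent change made by the Image Processing activity, while passing `processed` to OCR and then discarding it corresponds to the temporary, in-memory cleanup performed for an OCR Profile. The layout data is available in both cases.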