Line Detection and Line Removal: Difference between revisions
Dgreenwood (talk | contribs)
Revision as of 15:15, 7 August 2025
Line Detection and Line Removal are two closely related IP Commands in Grooper. Both are designed to identify horizontal and vertical lines in document images, such as those found in forms, tables, or pre-printed backgrounds. Both commands generate layout data describing the position and orientation of detected lines, which can be used for downstream processing. However, while Line Detection only detects and analyzes lines, Line Removal will also remove lines from the image, cleaning up visual noise that may interfere with OCR or data extraction.
What is Line Detection?
Line Detection is an IP Command that locates horizontal and vertical lines on documents. The detected line locations are stored as part of the page's layout data.
Line Detection is an IP Command in Grooper. It can be added to a step in an IP Profile or IP Group. The Line Detection command is designed to identify horizontal and vertical lines on document pages. This is essential for various extractors and features which use line locations as part of their function. For example, the Tabular Layout table extract method will use line locations to improve how it detects row, column, and cell boundaries.
When the Line Detection command is executed, it:
- Optionally preprocesses the image (including binarization and font dropout) to enhance line visibility.
- Scans for horizontal and vertical runs of black pixels, applying configurable thresholds for length, thickness, aspect ratio, and fill.
- Detects lines using a combination of morphological and geometric analysis, with support for advanced features such as comb detection, speck removal, and dash sequence handling.
- Saves layout data describing the position and orientation of detected lines, which can be used for table extraction, form field alignment, or visual overlays.
Line Detection does not alter the original image. Its primary output is the set of detected line objects, which can be used by downstream layout analysis or extraction logic.
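The run-scanning idea described above can be illustrated with a minimal sketch. This is not Grooper's implementation: the function name `find_horizontal_lines` and its parameters `min_length` and `max_gap` are hypothetical stand-ins loosely modeled on the "Minimum Line Length" and "Maximum Line Gap" properties, and thickness, aspect ratio, and fill checks are left out. It assumes the image has already been binarized into rows of 1 (black) and 0 (white).

```python
from typing import List, Tuple

Image = List[List[int]]          # 1 = black pixel, 0 = white pixel
Line = Tuple[int, int, int]      # (row, start_col, end_col), end inclusive


def find_horizontal_lines(image: Image, min_length: int, max_gap: int) -> List[Line]:
    """Scan each row for runs of black pixels, bridging gaps up to max_gap.

    Illustrative only: a real detector would also check thickness,
    aspect ratio, fill, and edge noise.
    """
    lines: List[Line] = []
    for y, row in enumerate(image):
        start = None        # first black column of the current candidate
        last_black = None   # most recent black column seen
        for x, pixel in enumerate(row):
            if not pixel:
                continue
            if start is None:
                start = x
            elif x - last_black - 1 > max_gap:
                # The white gap is too wide: close the candidate, start a new one.
                if last_black - start + 1 >= min_length:
                    lines.append((y, start, last_black))
                start = x
            last_black = x
        # Close whatever candidate is still open at the end of the row.
        if start is not None and last_black - start + 1 >= min_length:
            lines.append((y, start, last_black))
    return lines
```

Vertical lines can be found with the same function by transposing the image first, for example `find_horizontal_lines([list(col) for col in zip(*image)], ...)`, in which case each result reads as (column, start_row, end_row).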
Use cases for Line Detection
Grooper can use Line Detection to:
- Identify table boundaries and grid structures for data extraction.
- Locate underlines, boxes, or separators for form field alignment.
- Provide visual overlays for reviewing detected lines in diagnostics or the UI.
- Generate layout data for downstream processing, such as table extraction or snapping results to nearby lines with an extractor's "Line Snap Options".
General configuration steps
To use Line Detection:
- Right-click an IP Profile or IP Group to add a Line Detection IP Step.
- Select "Add Command" then "Feature Detection" then "Line Detection".
- Adjust detection settings as needed.
- The detection settings are:
- Minimum Line Length
- Maximum Line Thickness
- Maximum Line Gap
- Minimum Aspect Ratio
- Maximum Edge Noise
- Trim Distance
- Minimum Run Length
- Advanced Line Detection
- Dash Detection
- Maximum Speck Size
- The default settings work well for most scenarios. Check out the About Line Detection settings section for more information on each of these properties.
- Adjust "Comb Removal" settings as needed.
- Combs are short line segments connected to perpendicular lines. Think about forms that have grid lines or boxes for you to print each letter of your name. These are called "comb boxes". The vertical lines connected to horizontal lines are called "combs".
- The Comb Removal settings are:
- Comb Removal
- Minimum Length
- Minimum Fill
- Quiet Zone Size
- Minimum Weight
- The default settings work well for most scenarios. Check out the About Line Detection settings section for more information on each of these properties.
- Adjust "Image Preprocessing" settings as needed.
- These settings optimize line visibility, improving Line Detection's ability to detect lines. Preprocessing adjustments only occur prior to detecting lines and do not alter the final image.
- "Binarization Settings" control how color or grayscale images are turned black and white. Line Detection must occur on a black and white image.
- "Dropout Font Size" defines the largest font size to be dropped out during preprocessing.
- Adjust advanced settings for comb detection, speck removal, and dash sequence handling as needed.
- Test the IP Step/IP Group/IP Profile using the "Tester" tab. Use the Diagnostics panel's images and files to visualize results.
- Review the "Binarized" and "Preprocessed" diagnostic images to ensure lines are clearly separated from the background.
- Review the "Dropout Mask" and "Trim Mask" diagnostics to see which regions would be affected by line removal.
- Fine-tune the Line Detection settings as needed.
- The Line Detection step will execute as part of the IP Profile's execution flow.
The detected line information is stored as layout data, making it available for downstream activities such as data extraction, classification, or routing.
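To make the binarization step mentioned above concrete, here is a minimal sketch of a fixed-threshold conversion from grayscale to black and white. It is an assumption for illustration only: the function `binarize` and its `threshold` parameter are hypothetical, and the actual "Binarization Settings" may use a different or adaptive method.

```python
from typing import List

GrayImage = List[List[int]]    # pixel values from 0 (black) to 255 (white)
BinaryImage = List[List[int]]  # 1 = black pixel, 0 = white pixel


def binarize(gray: GrayImage, threshold: int = 128) -> BinaryImage:
    """Mark every pixel darker than the threshold as black (1).

    A fixed global threshold is the simplest possible choice; it is
    used here only to show what "turned black and white" means.
    """
    return [[1 if value < threshold else 0 for value in row] for row in gray]
```

The output uses 1 for black so that a line detector can scan directly for runs of black pixels.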
What is Line Removal?
Line Removal is an IP Command that locates and removes horizontal and vertical lines from documents. The detected line locations are stored as part of the page's layout data.
Line Removal is an IP Command in Grooper. It can be added to a step in an IP Profile or IP Group. Line Removal has two purposes:
- Like Line Detection, it identifies horizontal and vertical lines in document images, generating layout data for downstream use.
- Line Removal goes one step further by digitally erasing detected lines from the image. This can eliminate unwanted line content before OCR occurs, improving OCR accuracy and subsequent processing downstream.
When the Line Removal command is executed, it:
- Optionally preprocesses the image (including binarization and font dropout) to enhance line visibility.
- Detects lines using the same configurable detection logic as Line Detection.
- Generates a dropout mask to cover each detected line, optionally including a trim distance around lines to ensure clean removal.
- Removes the masked regions from the image according to the "Dropout Method" setting.
- Outputs the cleaned image with lines removed, as well as layout data describing the detected lines.
Line Removal is most useful for cleaning up forms, tables, or pre-printed backgrounds prior to OCR or data extraction, ensuring that only relevant content remains on the image.
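The mask-then-remove sequence above can be sketched as follows. This is a simplified illustration, not Grooper's code: `build_dropout_mask`, `remove_lines`, and the `trim` parameter are hypothetical names, lines are assumed to be horizontal and given as (row, start_col, end_col), and removal is a plain fill with a caller-supplied value.

```python
from typing import List, Tuple

Image = List[List[int]]
Mask = List[List[bool]]
Line = Tuple[int, int, int]  # (row, start_col, end_col), end inclusive


def build_dropout_mask(height: int, width: int, lines: List[Line], trim: int = 0) -> Mask:
    """Mark every pixel on a detected line, plus `trim` pixels around it."""
    mask = [[False] * width for _ in range(height)]
    for y, x0, x1 in lines:
        for yy in range(max(0, y - trim), min(height, y + trim + 1)):
            for xx in range(max(0, x0 - trim), min(width, x1 + trim + 1)):
                mask[yy][xx] = True
    return mask


def remove_lines(image: Image, mask: Mask, fill_value: int = 0) -> Image:
    """Return a new image with masked pixels replaced; the input is untouched."""
    return [
        [fill_value if mask[y][x] else pixel for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]
```

Returning a new image rather than editing in place mirrors the distinction the article draws later: whether the change becomes permanent depends on what the caller does with the result.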
Use cases for Line Removal
Line Removal's use cases include all those of Line Detection, plus:
- Cleaning up document images before OCR to prevent lines from interfering with text recognition.
- Removing table borders, underlines, or grid lines that are not needed for downstream processing.
- Ensuring that only relevant content remains on the image for data extraction or export.
General configuration steps
To use Line Removal:
- Right-click an IP Profile or IP Group to add a Line Removal IP Step.
- Select "Add Command" then "Feature Removal" then "Line Removal".
- Adjust detection settings as needed.
- The detection settings are:
- Minimum Line Length
- Maximum Line Thickness
- Maximum Line Gap
- Minimum Aspect Ratio
- Maximum Edge Noise
- Trim Distance
- Minimum Run Length
- Advanced Line Detection
- Dash Detection
- Maximum Speck Size
- The default settings work well for most scenarios. Check out the About Line Detection settings section for more information on each of these properties.
- Adjust "Comb Removal" settings as needed.
- Combs are short line segments connected to perpendicular lines. Think about forms that have grid lines or boxes for you to print each letter of your name. These are called "comb boxes". The vertical lines connected to horizontal lines are called "combs".
- The Comb Removal settings are:
- Comb Removal
- Minimum Length
- Minimum Fill
- Quiet Zone Size
- Minimum Weight
- The default settings work well for most scenarios. Check out the About Line Detection settings section for more information on each of these properties.
- Adjust "Image Preprocessing" settings as needed.
- These settings optimize line visibility, improving Line Removal's ability to detect lines. Preprocessing adjustments are applied only while detecting lines and do not themselves alter the final image.
- "Binarization Settings" control how color or grayscale images are turned black and white. Line Detection must occur on a black and white image.
- "Dropout Font Size" defines the largest font size to be dropped out during preprocessing.
- Configure "Dropout Method" settings as needed.
- There are two Dropout Methods to choose from:
- Fill (Default) - Replaces lines with a solid color. This method is generally sufficient for temporarily removing lines when applied by an OCR Profile.
- By default, the Fill Color is set to "none". When set to "none" Grooper will detect a color from the image's background and use that.
- Inpaint - Digitally restores masked regions by estimating and filling in missing or damaged areas using advanced algorithms. When removing lines that overlap text, this method can help preserve overlapping text.
- (Optional) The "Mask Dilation Factor" will dilate or erode the dropout mask applied to the image. Increasing this value helps ensure no stray artifacts attached to lines remain, but setting it too high may remove nearby content.
- Setting the "Trim Distance" property will also control how much area around lines is removed.
- Test the IP Step/IP Group/IP Profile using the "Tester" tab. Use the Diagnostics panel's images and files to visualize results.
- Review the "Binarized", "Preprocessed", "Dropout Mask", and "Output Image" diagnostics to ensure lines are being removed as intended.
- Fine-tune the Line Removal settings as needed.
- The Line Removal step will execute as part of the IP Profile's execution flow.
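The Fill behavior and mask dilation described in the steps above can be approximated with a short sketch. Everything here is an assumption for illustration: `dilate_mask`, `estimate_background`, and `fill_masked` are hypothetical names, the background is guessed as the most common unmasked pixel value (Grooper's actual color detection is not documented here), and only dilation is shown, not erosion.

```python
from collections import Counter
from typing import List, Optional

Image = List[List[int]]   # grayscale values, 0 (black) to 255 (white)
Mask = List[List[bool]]


def dilate_mask(mask: Mask, radius: int) -> Mask:
    """Grow every masked pixel by `radius` pixels in each direction."""
    height = len(mask)
    width = len(mask[0]) if mask else 0
    out = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(height, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(width, x + radius + 1)):
                        out[yy][xx] = True
    return out


def estimate_background(image: Image, mask: Mask, default: int = 255) -> int:
    """Guess the background as the most common value outside the mask."""
    counts = Counter(
        pixel
        for y, row in enumerate(image)
        for x, pixel in enumerate(row)
        if not mask[y][x]
    )
    return counts.most_common(1)[0][0] if counts else default


def fill_masked(image: Image, mask: Mask, fill_color: Optional[int] = None) -> Image:
    """Replace masked pixels with fill_color, or an estimated background if None."""
    if fill_color is None:
        fill_color = estimate_background(image, mask)
    return [
        [fill_color if mask[y][x] else pixel for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]
```

The sketch also shows the trade-off noted above: a larger dilation radius clears more residue around a line, but it can also swallow nearby content such as text sitting just below an underline.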
Similarities and differences between Line Detection and Line Removal
Similarities
The big takeaway: Both Line Detection and Line Removal detect lines and generate layout data.
- Both commands detect horizontal and vertical lines in document images.
- Both generate layout data describing the position and orientation of detected lines, which can be used for downstream processing.
- Both use the same detection technology and support the same range of pixel formats.
- Both can be configured with the same detection properties, including thresholds for line length, thickness, aspect ratio, fill, and advanced features like comb detection and speck removal.
- Both provide diagnostic output to assist with configuration and troubleshooting.
- Both are included as steps in an IP Profile, allowing them to be combined with other image processing operations.
Differences
The big takeaway: Line Detection does not alter the image. Line Removal does alter the image.
- Both Line Detection and Line Removal detect lines and store their information as layout data.
- Only Line Removal removes detected lines from images, masking them out before further processing.
- This is most useful for image cleanup prior to OCR or data extraction.
- Lines will be removed from an image permanently when Line Removal is applied by the Image Processing activity.
- Lines will be removed in-memory prior to OCR when Line Removal is applied by the Recognize activity.
- Only Line Removal has Feature Removal properties such as "Dropout Method" and "Trim Distance".
When to use each command
- Use Line Detection when you only need to extract line information for table extraction, field alignment, or layout analysis without altering the image.
- Use Line Removal when you want to eliminate lines or to remove line-like noise prior to OCR or data extraction. Line Removal will extract line information as well.
When are IP Profiles executed?
An IP Profile is executed whenever Grooper needs to process an image using a defined sequence or hierarchy of image processing operations. Execution typically occurs in the following scenarios:
- By the Image Processing activity: The Image Processing activity will apply the IP Profile and permanently alter the image.
- By an OCR Profile: OCR Profiles configured with an IP Profile will run the IP Profile on an image prior to handing it to the OCR engine. The image will not be permanently altered.
- By the Recognize activity's "Alternate IP" configuration: IP Profiles executed by this configuration will only execute feature detection commands (such as Line Detection) to collect layout data.
- During a Review step: Users can manually execute an IP Profile from the Thumbnail Viewer (if configured to allow the user to do so).
Execution follows the order and logic defined in the IP Profile, including any conditional flow control or branching. Each step or group within the profile is applied in sequence, transforming the input image and producing results for each subsequent step.
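The sequential execution described above can be pictured with a small sketch. It is a conceptual model only, not Grooper's API: `run_profile`, `detect_rows`, and `erase_rows` are hypothetical names, the layout data is a plain dictionary, and conditional flow control and branching are omitted.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]          # 1 = black pixel, 0 = white pixel
Layout = Dict[str, List[int]]
Step = Callable[[Image, Layout], Image]


def run_profile(image: Image, steps: List[Step]) -> Tuple[Image, Layout]:
    """Apply each step in order; each step sees the previous step's output."""
    layout: Layout = {}
    current = [row[:] for row in image]  # work on a copy of the input
    for step in steps:
        current = step(current, layout)
    return current, layout


def detect_rows(image: Image, layout: Layout) -> Image:
    """Detection-style step: record fully black rows, leave the image as is."""
    layout["lines"] = [y for y, row in enumerate(image) if row and all(row)]
    return image


def erase_rows(image: Image, layout: Layout) -> Image:
    """Removal-style step: blank out rows recorded by an earlier step."""
    rows = set(layout.get("lines", []))
    return [[0] * len(row) if y in rows else row[:] for y, row in enumerate(image)]
```

In this model, a caller that saves the returned image corresponds to the permanent change made by the Image Processing activity, while a caller that passes the result to OCR and then discards it corresponds to the temporary, in-memory behavior of an OCR Profile. In both cases the collected layout data can be kept.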