2023:Line Removal (IP Command)

From Grooper Wiki
Revision as of 09:54, 27 August 2024 by Randallkinard

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

STUB

This article is a stub. It contains minimal information on the topic and should be expanded.

The Line Removal property panel

Line Removal is an IP Command that locates and removes horizontal and vertical lines from documents. The detected line locations are stored as part of the page's layout data.

This serves two main functions:

  1. To improve OCR results by removing non-text portions of the page.
  2. To save where those lines exist on the document for later use.


About

Lines on documents serve an important function. They visually break up a document, giving the viewer cues where to look.

Calling attention to something...
This is in a box. Made you look!
...dividing sections...

Topic Heading


You intuitively know what's underneath the line has to do with the topic, just because there's a line there.

...or organizing information into tables.
Name           Birthday    Favorite Animal
Peter Parker   08/27/1993  Spider
Otto Octavius  07/02/1963  Octopus
Adrian Toomes  05/02/1963  Vulture

While visually useful, lines can get in the way of good OCR results. OCR translates pixels in an image into machine-readable text. It looks at clusters of black pixels on the page, compares those clusters to examples of letters and other characters, and decides which character each cluster most likely represents.

If nothing else, lines are extra pixels OCR has to analyze. The OCR engine has to make some determination as to what those pixels are, even if it ultimately ignores them. More seriously, when a line sits close to a character on the page, the OCR engine may treat the line as part of the character and misread it. In short, removing lines can improve OCR accuracy during the Recognize activity.

However, as discussed above, lines are great visual cues. For example, table lines are often the best indicator of where one piece of data ends and the next begins. For that reason, the Line Removal IP Command also saves the positions of the lines it removes so that Grooper can use that information later.
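The two functions described above (erase the line, remember where it was) can be illustrated with a minimal sketch. This is not Grooper's implementation; the function names, the list-of-rows image format (1 for a black pixel, 0 for white), and the `min_length` threshold are all assumptions made for illustration.

```python
# Illustrative sketch only: not Grooper's implementation.
# An image is a list of rows; 1 = black pixel, 0 = white pixel.

def find_horizontal_lines(image, min_length):
    """Return (row, start_col, end_col) for each run of black pixels at least min_length long."""
    lines = []
    for y, row in enumerate(image):
        start = None
        for x, pixel in enumerate(row + [0]):  # sentinel 0 closes a run at the row's end
            if pixel == 1 and start is None:
                start = x
            elif pixel != 1 and start is not None:
                if x - start >= min_length:
                    lines.append((y, start, x - 1))
                start = None
    return lines


def remove_horizontal_lines(image, min_length):
    """Return a cleaned copy of the image plus the positions of the erased lines."""
    cleaned = [row[:] for row in image]
    lines = find_horizontal_lines(image, min_length)
    for y, start, end in lines:
        for x in range(start, end + 1):
            cleaned[y][x] = 0
    return cleaned, lines


page = [
    [0, 1, 0, 0, 1, 0],   # two isolated "character" pixels
    [1, 1, 1, 1, 1, 1],   # a ruling line across the whole row
    [0, 0, 1, 1, 0, 0],   # a run too short to count as a line
]
cleaned, lines = remove_horizontal_lines(page, min_length=5)
print(lines)    # [(1, 0, 5)]
print(cleaned)  # row 1 is now all white; rows 0 and 2 are unchanged
```

The same routine could find vertical lines by running it on the transposed image. A real line detector would also have to cope with skew, broken lines, and characters that touch a line, none of which this sketch attempts.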

Use Cases

Line Removal is useful for improving OCR results on any document containing horizontal or vertical lines. It is primarily used as part of a temporary IP Profile set on an OCR Profile. When an IP Profile is set on an OCR Profile, document images are temporarily altered before the OCR engine runs during the Recognize activity. Once OCR is finished, the temporary image is discarded and the document's image reverts to its original state.
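The idea of a temporary cleanup pass can be sketched as follows. This is a conceptual illustration only; every name here is hypothetical and none of it is a Grooper API. The "OCR engine" is a stand-in that just counts black pixels.

```python
# Illustrative sketch only; the names here are hypothetical, not Grooper APIs.

def recognize_with_temporary_cleanup(image, cleanup_steps, ocr_engine):
    """Run OCR on a cleaned-up working copy and leave the original image untouched."""
    working = [row[:] for row in image]      # temporary copy of the pixel rows
    for step in cleanup_steps:
        working = step(working)              # each step returns the image to pass on
    return ocr_engine(working)               # the working copy is discarded after this call


def blank_full_rows(image):
    """Toy cleanup step: whiten any row made up entirely of black pixels."""
    return [[0] * len(row) if row and all(row) else row for row in image]


def count_black_pixels(image):
    """Stand-in for an OCR engine: reports how many black pixels it had to analyze."""
    return sum(sum(row) for row in image)


original = [[1, 1, 1, 1], [0, 1, 0, 0]]
result = recognize_with_temporary_cleanup(original, [blank_full_rows], count_black_pixels)
print(result)    # 1
print(original)  # [[1, 1, 1, 1], [0, 1, 0, 0]]
```

The stand-in engine sees only the cleaned copy, while `original` still contains its line afterwards.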

Line Removal is well suited to documents containing tables. Not only does the command remove lines to improve OCR, it also preserves the table's visual layout by saving the positional data of those lines. The line positions are saved to a file named "LayoutData.json" on each page object. Whenever Grooper needs to know line locations (or any other layout information), it reads them from that file.

For this table... ...Line Removal removes the lines... ...and saves their positions to the page object.


The Infer Grid table extraction method is a good example of how Grooper can use line location data. Once Infer Grid is set up to detect the table's structure from its column (and/or row) headers, knowing where the table lines are lets it extract each cell's text directly from the boundaries those lines create.
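To make the idea of "cell boundaries created by the table lines" concrete, here is a minimal sketch. It is not how Infer Grid is implemented; the function names, the pixel coordinates, and the `(text, center_x, center_y)` word format are assumptions chosen for illustration.

```python
# Illustrative sketch only: not how Infer Grid is implemented.

def grid_cells(horizontal_ys, vertical_xs):
    """Build cell rectangles (left, top, right, bottom) from line coordinates, row by row."""
    ys = sorted(horizontal_ys)
    xs = sorted(vertical_xs)
    return [
        [(left, top, right, bottom) for left, right in zip(xs, xs[1:])]
        for top, bottom in zip(ys, ys[1:])
    ]


def words_in_cell(words, cell):
    """Return the text of words whose center point falls inside the cell, in input order."""
    left, top, right, bottom = cell
    return " ".join(
        text for text, x, y in words if left <= x < right and top <= y < bottom
    )


# Hypothetical line positions (in pixels) and OCR words given as (text, center_x, center_y).
cells = grid_cells(horizontal_ys=[0, 40, 80], vertical_xs=[0, 200, 320])
words = [("Peter", 30, 60), ("Parker", 90, 60), ("08/27/1993", 260, 60), ("Name", 40, 20)]

print(words_in_cell(words, cells[1][0]))  # Peter Parker
print(words_in_cell(words, cells[1][1]))  # 08/27/1993
```

Each pair of adjacent horizontal lines bounds a row and each pair of adjacent vertical lines bounds a column, so the saved line positions alone are enough to assign every word to a cell.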

Version Differences

There are no major differences to point out at this time.

Glossary

IP Command: IP Commands specify an image processing (IP) operation (such as image cleanup, format conversion or feature detection) and are used to construct IP Steps in an IP Profile. IP Commands are configured using an IP Step's Command property.

IP Profile: IP Profiles are a step-by-step list of image processing operations (IP Commands). They are used for several image processing related operations, but primarily for:

  1. Permanently enhancing an image during the Image Processing activity (usually to get rid of defects in a scanned image, such as skewing or borders).
  2. Cleaning up an image in memory during the Recognize activity to improve OCR accuracy, without altering the stored image.
  3. Computer vision operations that collect layout data (table line locations, OMR checkboxes, barcode values and more) utilized in data extraction.

Line Removal: Line Removal is an IP Command that locates and removes horizontal and vertical lines from documents. The detected line locations are stored as part of the page's layout data.

OCR Profile: OCR Profiles store configuration settings for optical character recognition (OCR). They are used by the Recognize activity to convert images of text on Batch Pages into machine-encoded text. OCR Profiles are highly configurable, allowing fine-grained control over how OCR occurs, how pre-OCR image cleanup occurs, and how Grooper's OCR Synthesis occurs. All this works toward the end goal of highly accurate OCR text data, which is used to classify documents, extract data and more.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized so that it can be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.

Recognize: Recognize is an Activity that obtains machine-readable text from Batch Pages and Batch Folders. When properly configured with an OCR Profile, Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also reference an IP Profile to collect "layout data" like lines, checkboxes, and barcodes. Other Activities then use this machine-readable text and layout data for document analysis and data extraction.