2023:Line Removal (IP Command)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

This article is a stub. It contains minimal information on the topic and should be expanded.

The Line Removal property panel

Line Removal is an IP Command that removes horizontal and vertical lines from documents. The line locations are then stored as part of the object's layout data.

This serves two main functions:

  1. To improve OCR results by removing non-text portions of the page.
  2. To save where those lines exist on the document for later use.
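The two functions above can be illustrated with a minimal sketch. This is not Grooper's implementation: it assumes a page is a small grid of 0/1 pixels and treats any sufficiently long straight run of black pixels as a line. The names find_runs, remove_lines, and min_length are hypothetical and exist only for this example.

```python
from typing import List, Tuple

Image = List[List[int]]            # 1 = black pixel, 0 = white pixel
Line = Tuple[str, int, int, int]   # (orientation, fixed index, start, end), end inclusive


def find_runs(pixels: List[int], min_length: int) -> List[Tuple[int, int]]:
    """Return (start, end) for each run of black pixels at least min_length long."""
    runs, start = [], None
    for i, value in enumerate(pixels + [0]):  # trailing 0 closes a run at the edge
        if value and start is None:
            start = i
        elif not value and start is not None:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    return runs


def remove_lines(image: Image, min_length: int) -> Tuple[Image, List[Line]]:
    """Erase long horizontal and vertical runs; return a cleaned copy and the line positions."""
    height, width = len(image), len(image[0])
    lines: List[Line] = []
    # Detect on the original image so a pixel shared by two crossing lines counts for both.
    for y in range(height):
        for x0, x1 in find_runs(image[y], min_length):
            lines.append(("horizontal", y, x0, x1))
    for x in range(width):
        column = [image[y][x] for y in range(height)]
        for y0, y1 in find_runs(column, min_length):
            lines.append(("vertical", x, y0, y1))
    # Erase the detected lines from a copy, leaving the input untouched.
    cleaned = [row[:] for row in image]
    for orientation, fixed, start, end in lines:
        for i in range(start, end + 1):
            if orientation == "horizontal":
                cleaned[fixed][i] = 0
            else:
                cleaned[i][fixed] = 0
    return cleaned, lines


# A tiny "page": a top border, a left border, and two pixels of "text".
page = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
cleaned, lines = remove_lines(page, min_length=4)
print(lines)       # the saved line positions
print(cleaned[2])  # the short "text" run survives: [0, 0, 1, 1, 0, 0]
```

The short run in the third row is left alone because it falls under min_length, while both borders are erased and their positions returned for later use.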


About

Lines on documents serve an important function. They visually break up a document, giving the viewer cues where to look.

Calling attention to something...
This is in a box. Made you look!
...dividing sections...

Topic Heading


You intuitively know that whatever is underneath the line has to do with the topic, just because there's a line there.

...or organizing information into tables.
Name            Birthday      Favorite Animal
Peter Parker    08/27/1993    Spider
Otto Octavius   07/02/1963    Octopus
Adrian Toomes   05/02/1963    Vulture

While visually useful, lines can get in the way of good OCR results. OCR translates pixels in an image into machine-readable text. It looks at clusters of black pixels on the page, compares each cluster to examples of letters and other characters, and decides which character the cluster most likely represents. If nothing else, lines are extra pixels the OCR engine has to analyze. The engine has to make some determination about what those pixels are, even if it ultimately ignores them. More importantly, when a line sits close to other characters on the page, the OCR engine may treat the line as part of a character and misread it. Long story short, line removal can help improve OCR accuracy during the Recognize activity.

As discussed above, though, lines are great visual cues. For example, table lines are often the best way to tell one piece of data from another. For that reason, the Line Removal IP Command also saves the positions of the lines it removes so that Grooper can use that information later.

Use Cases

Line Removal is useful for improving OCR results on any document containing horizontal or vertical lines. Its primary use is as part of a temporary IP Profile set on an OCR Profile. When an IP Profile is set on an OCR Profile, document images are temporarily altered before the OCR engine runs during the Recognize activity. Once OCR is finished, the temporary image is discarded and the document's image reverts to its original state.
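The temporary-image behavior described above can be pictured with a short sketch. It is only an analogy, not Grooper's API: the function recognize_with_temporary_cleanup, the step blank_first_row, and the stand-in "OCR engine" count_black are all hypothetical, and a page is assumed to be a grid of 0/1 pixels.

```python
from typing import Callable, List

Image = List[List[int]]          # 1 = black pixel, 0 = white pixel
Step = Callable[[Image], Image]  # an image processing step


def recognize_with_temporary_cleanup(
    image: Image, steps: List[Step], ocr: Callable[[Image], str]
) -> str:
    """Run the cleanup steps on a working copy, OCR the copy, then let the copy go."""
    working = [row[:] for row in image]  # the original image is never modified
    for step in steps:
        working = step(working)
    return ocr(working)  # the working copy is discarded when the function returns


def blank_first_row(img: Image) -> Image:
    """Hypothetical cleanup step: erase the top row, as if it were a border line."""
    return [[0] * len(img[0])] + [row[:] for row in img[1:]]


def count_black(img: Image) -> str:
    """Stand-in for an OCR engine: reports how many black pixels it had to look at."""
    return str(sum(sum(row) for row in img))


scan = [[1, 1, 1], [0, 1, 0]]
result = recognize_with_temporary_cleanup(scan, [blank_first_row], count_black)
print(result)  # the stand-in engine only saw the cleaned copy
print(scan)    # the stored image is unchanged
```

Working on a copy is what lets the cleanup be aggressive: anything removed for the benefit of OCR is still present in the stored image afterward.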

Line Removal is especially well suited to documents containing tables. Not only does the command remove lines to improve OCR, it also preserves the table's visual layout by saving the positional data of those lines. The line positions are saved to a file named "LayoutData.json" on each page object. Whenever Grooper needs to know line locations (or any other layout information), it reads them from that file.

For this table... ...Line Removal removes the lines... ...and saves their positions to the page object.
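As a rough picture of what saving line positions to a JSON file on a page could look like, here is a sketch using Python's json module. The real structure of "LayoutData.json" is not described in this article, so every key name below (HorizontalLines, VerticalLines, X, Y, Length) is an assumption made for illustration, as are the function names and the use of a folder to stand in for a page object.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

LineList = List[Dict[str, int]]


def save_line_positions(page_dir: Path, horizontal: LineList, vertical: LineList) -> Path:
    """Write line positions to LayoutData.json in a folder standing in for a page object."""
    # Hypothetical schema: these key names are NOT taken from Grooper.
    layout = {"HorizontalLines": horizontal, "VerticalLines": vertical}
    path = page_dir / "LayoutData.json"
    path.write_text(json.dumps(layout, indent=2), encoding="utf-8")
    return path


def load_line_positions(page_dir: Path) -> dict:
    """Read the saved layout back when a later step needs the line locations."""
    return json.loads((page_dir / "LayoutData.json").read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as tmp:
    page_dir = Path(tmp)
    save_line_positions(
        page_dir,
        horizontal=[{"X": 10, "Y": 40, "Length": 500}],
        vertical=[{"X": 10, "Y": 40, "Length": 120}],
    )
    layout = load_line_positions(page_dir)
    print(layout["HorizontalLines"])
```

The point of the sketch is the separation of concerns: the step that removes lines writes the positions once, and any later step reads the file instead of re-analyzing the image.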


The Infer Grid table extraction method is a good example of how Grooper can use line location data. Infer Grid is set up to detect the table's structure from its column (and/or row) headers. If it also knows where the table lines are, it can treat those lines as cell boundaries and simply extract the text that falls inside each cell.
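The idea of using saved line positions as cell boundaries can be sketched as follows. This is not how Infer Grid is implemented; it is a simplified illustration that assumes the table lines are given as sorted y coordinates (horizontal lines) and x coordinates (vertical lines), and that each OCR'd word comes with a single (x, y) position. The function name assign_words_to_cells and the sample coordinates are hypothetical.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Word = Tuple[str, int, int]  # (text, x, y) position of the word on the page


def assign_words_to_cells(
    h_lines: List[int], v_lines: List[int], words: List[Word]
) -> Dict[Tuple[int, int], str]:
    """Group words into (row, column) cells bounded by horizontal and vertical lines."""
    rows, cols = sorted(h_lines), sorted(v_lines)
    cells: Dict[Tuple[int, int], str] = {}
    for text, x, y in words:
        row = bisect_right(rows, y) - 1  # index of the line just above the word
        col = bisect_right(cols, x) - 1  # index of the line just left of the word
        # Skip words that fall outside the outermost lines of the table.
        if 0 <= row < len(rows) - 1 and 0 <= col < len(cols) - 1:
            key = (row, col)
            cells[key] = f"{cells[key]} {text}" if key in cells else text
    return cells


# Three horizontal lines and three vertical lines describe a 2 x 2 grid of cells.
cells = assign_words_to_cells(
    h_lines=[0, 20, 40],
    v_lines=[0, 100, 200],
    words=[
        ("Name", 10, 5),
        ("Birthday", 110, 5),
        ("Peter", 10, 25),
        ("Parker", 50, 25),
        ("08/27/1993", 110, 25),
        ("stray", 250, 25),  # right of the last vertical line, so it is ignored
    ],
)
print(cells[(1, 0)])  # Peter Parker
print(cells[(1, 1)])  # 08/27/1993
```

With the boundaries known up front, no guessing about whitespace or alignment is needed: a word belongs to whichever cell its position falls in.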

Version Differences

There are no major differences to point out at this time.