Preprocessing (Property)

From Grooper Wiki

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

The Preprocessing grouping of properties consists of settings that adjust how text is formatted and interpreted before any Data Extraction process begins. These properties help ensure that the text is in a suitable format for subsequent extraction tasks, which may involve complex regular expressions or precise data parsing.

Properties in the Preprocessing Group

Understanding Preprocessing starts with understanding the sub-properties grouped within it.

Paragraph Marking

Please see the Paragraph Marking article for a complete breakdown of this property.

Purpose

Paragraph Marking is essential for documents where data spans multiple lines or paragraphs. It transforms carriage return and line feed pairs embedded within paragraphs into spaces while preserving the pairs at paragraph ends.

Function

This property is used to identify and mark paragraph boundaries within documents.
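
The transformation described above can be sketched in Python. This is only an illustration, not Grooper's implementation: it assumes a paragraph ends at a blank line (two or more consecutive CR/LF pairs), whereas Grooper's own paragraph detection may rely on other cues. The function name mark_paragraphs is hypothetical.

```python
import re


def mark_paragraphs(text: str) -> str:
    """Replace CR/LF pairs inside paragraphs with spaces.

    Assumption (not from Grooper): a paragraph ends at a blank line,
    i.e. two or more consecutive CR/LF pairs, or at the end of the text.
    One CR/LF pair is kept at the end of every paragraph.
    """
    # Drop leading/trailing line breaks, then split on blank lines.
    paragraphs = re.split(r"(?:\r\n){2,}", text.strip("\r\n"))
    # Inside each paragraph, an embedded CR/LF pair becomes a single space.
    flattened = [p.replace("\r\n", " ") for p in paragraphs]
    # Re-join with exactly one CR/LF pair at each paragraph end.
    return "\r\n".join(flattened) + "\r\n"


sample = (
    "The quick brown\r\nfox jumps.\r\n"
    "\r\n"
    "Second paragraph\r\ncontinues here.\r\n"
)
marked = mark_paragraphs(sample)
```

In this sketch each paragraph ends up on a single line, so a regular expression can match a value that originally wrapped across lines without having to account for the line break.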

Tab Marking

Please see the Tab Marking article for a complete breakdown of this property.

Purpose

Converting wide gaps into TAB characters helps regular expressions distinguish between normal spaces and the wider gaps that typically separate data fields, improving data extraction accuracy.

Function

Converts significant horizontal white space (spaces) into TAB characters.

Paragraph and Tab Marking in Action

Before Preprocessing:

Value 1          Value 2

After Preprocessing:

  • Tab Marking would convert large horizontal spaces into TABs:
Value 1\tValue 2
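
A rough Python sketch of the same idea follows. It is not how Grooper measures gaps: the sketch assumes a "significant" gap is simply a run of three or more space characters, and both mark_tabs and MIN_GAP_SPACES are hypothetical names chosen for illustration.

```python
import re

# Hypothetical threshold: a run of this many spaces or more counts as a gap.
MIN_GAP_SPACES = 3


def mark_tabs(line: str, min_gap: int = MIN_GAP_SPACES) -> str:
    """Replace each run of at least `min_gap` spaces with one TAB character."""
    gap_pattern = re.compile(" {" + str(min_gap) + ",}")
    return gap_pattern.sub("\t", line)


marked = mark_tabs("Value 1          Value 2")

# The single space inside "Value 1" survives; only the wide gap became a TAB,
# so the fields can be separated without splitting "Value" from "1".
fields = marked.split("\t")
```

Once the gap is a TAB, a pattern such as [^\t]+ matches a whole field, including its internal single spaces.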

Vertical Tab Marking

The Vertical Tab Marking property is a text Preprocessing option designed to handle large vertical whitespace gaps within document text. It converts carriage return and line feed pairs (CR/LF pairs, which are newline characters \r\n) into vertical tab characters (\v), based on the size of the vertical gap between two lines of text.

Note: Vertical Tab Marking was added very early in Grooper's history as a means of detecting paragraphs. In modern Grooper, Paragraph Marking is a much more robust way to detect paragraphs. As a result, this property should be considered legacy and used only in very niche situations.

Purpose

To distinguish between regular new lines and significant vertical gaps that span multiple lines, aiding in accurate text extraction and data parsing.

Function

Converts newline characters to vertical tab characters if the gap between the text lines meets a specified threshold.

Vertical Gap Threshold

This is a sub-property of the Vertical Tab Marking property that is revealed when Vertical Tab Marking is enabled. It defines the minimum size of a vertical gap that will be converted to a vertical tab. The default value is 0.25 inches.

Vertical Tab Marking in Action

Consider a text document where information is spaced out over multiple lines with significant gaps.

Before Preprocessing:

Invoice Number:    12345

Date:              01/01/2023 

By default, newline characters simply represent a line break. Standard text extraction might not accurately capture the separation between "Invoice Number" and "Date" due to intervening newlines.

After Preprocessing:

Invoice Number:    12345\vDate:              01/01/2023

Here, the vertical space converts to a "\v" character, which represents a vertical tab. This allows extractors to more precisely differentiate between data fields.
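
The threshold logic can be sketched in Python as follows. This is an illustration under stated assumptions, not Grooper's code: each line is assumed to carry its top and bottom position in inches, and the gap is assumed to be measured from the bottom of one line to the top of the next. The function name and input shape are hypothetical; only the 0.25 inch default comes from the property description.

```python
from typing import List, Tuple

# Default taken from the Vertical Gap Threshold description (inches).
DEFAULT_GAP_THRESHOLD_IN = 0.25

# Hypothetical input shape: (text, top, bottom), positions in inches
# measured from the top of the page, lines ordered top to bottom.
Line = Tuple[str, float, float]


def mark_vertical_tabs(
    lines: List[Line], threshold: float = DEFAULT_GAP_THRESHOLD_IN
) -> str:
    """Join lines with a vertical tab where the gap meets the threshold,
    and with a CR/LF pair everywhere else."""
    if not lines:
        return ""
    parts = [lines[0][0]]
    for (_, _, prev_bottom), (text, top, _) in zip(lines, lines[1:]):
        gap = top - prev_bottom  # assumed definition of the vertical gap
        parts.append("\v" if gap >= threshold else "\r\n")
        parts.append(text)
    return "".join(parts)


page = [
    ("Invoice Number:    12345", 1.00, 1.15),
    ("Date:              01/01/2023", 1.50, 1.65),  # 0.35 in gap above
    ("Terms: Net 30", 1.70, 1.85),                  # 0.05 in gap above
]
marked = mark_vertical_tabs(page)
```

In this sketch the first two lines are joined by "\v" because the gap exceeds the threshold, while the closely spaced third line keeps an ordinary CR/LF pair.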

Advantages

  • Enhanced Accuracy: Improves the precision of data extraction by distinguishing between minor and major gaps within text blocks.
  • Better Structuring: Facilitates better structural representation of text, helping in scenarios where text is not uniformly spaced.

Ignore Control Characters

The Ignore Control Characters property is a Preprocessing option that enables the removal of specific control characters from the source document to facilitate cleaner and more efficient data extraction. Such control characters can often interfere with text processing and data extraction tasks.

Purpose

Improve data extraction in situations where control characters get in the way. Removing them produces cleaner text, which simplifies the extraction process and improves accuracy.

Function

Determines which control characters are ignored. Users can select the specific control characters to ignore during data extraction, including spaces, newlines (CR/LF pairs), form feed characters, and carriage returns.

Ignore Control Characters in Action

Consider a document with various control characters interspersed within the text:

Before Preprocessing:

Invoice Number:\r\n
12345\f
PO Number: \r
67890

Suppose the control characters shown above are not wanted in the text used for extraction.

After Preprocessing:

Invoice Number:
12345
PO Number:
67890

Using the Ignore Control Characters property, Grooper can transform this text by removing the specified unwanted characters, resulting in cleaner output.
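
The selective removal can be sketched in Python. The option names below (space, newline, form_feed, carriage_return) are hypothetical labels for the choices listed under Function, and the sketch assumes that "carriage return" means a lone \r that is not part of a CR/LF pair. It removes the selected characters outright and does not attempt to reproduce Grooper's exact output.

```python
import re
from typing import Iterable

# Hypothetical option names mapped to the characters they remove.
# A lone carriage return is assumed to be "\r" not followed by "\n".
IGNORABLE = {
    "space": r" ",
    "newline": r"\r\n",
    "form_feed": r"\f",
    "carriage_return": r"\r(?!\n)",
}


def ignore_control_characters(text: str, options: Iterable[str]) -> str:
    """Remove only the control characters named in `options`."""
    for name in options:
        text = re.sub(IGNORABLE[name], "", text)
    return text


raw = "Invoice Number:\r\n12345\fPO Number: \r67890"

# Ignore form feeds and lone carriage returns, but keep CR/LF line breaks.
partly_cleaned = ignore_control_characters(raw, ["form_feed", "carriage_return"])

# Additionally ignore CR/LF pairs.
fully_cleaned = ignore_control_characters(
    raw, ["newline", "form_feed", "carriage_return"]
)
```

Keeping the selection explicit mirrors the property's design: only the characters a user opts to ignore are removed, and everything else in the text is left untouched.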

Advantages

  • Reduction of Noise: In specific situations, helps in eliminating non-essential characters that disrupt data integrity.
  • Streamlined Data Parsing: Makes parsing and interpreting text data more straightforward by removing elements that do not contribute to the actual information.
  • Enhanced Accuracy: When needed, can improve the accuracy of text extraction by focusing only on relevant characters.