Preprocessing (Property)

From Grooper Wiki

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

The Preprocessing group of properties contains settings that adjust how text is formatted and interpreted before any Data Extraction process begins. These properties put the text into a form that is easier for extractors to match, which matters most when extraction relies on complex regular expressions or precise data parsing.

About

Preprocessing can be found on any text parsing Extractor Type, such as (but not limited to) List Match or Pattern Match.

Properties in the Preprocessing Group

Understanding Preprocessing starts with understanding the individual properties grouped within it.

Paragraph Marking

Please see the Paragraph Marking article for a complete breakdown of this property.

Purpose

Paragraph Marking is essential for documents where data spans multiple lines or paragraphs. It replaces the carriage return and line feed pairs embedded within a paragraph with spaces while preserving the pairs at paragraph ends.

Function

This property is used to identify and mark paragraph boundaries within documents.

Paragraph Marking in Action

Before Preprocessing:

  • Notice the \r\n pairs at the ends of the lines in this simple paragraph. What if you wanted all of this information on one line of text with no line breaks?
Lorem ipsum dolor sit amet, consectetur adipiscing elit,\r\n
sed do eiusmod tempor incididunt ut labore et dolore\r\n
magna aliqua.

After Preprocessing:

  • The \r\n pairs have been replaced with spaces, and the paragraph is now on one line of text.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
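The effect shown above can be approximated in a few lines of Python. This is only an illustrative sketch, not Grooper's implementation: the function name mark_paragraphs is hypothetical, and it assumes a paragraph ends wherever a blank line (two or more consecutive \r\n pairs) appears. Grooper's own paragraph detection may use different criteria.

```python
import re

def mark_paragraphs(text: str) -> str:
    """Join the lines inside each paragraph with spaces.

    Assumption (not from Grooper): a blank line marks a paragraph end.
    """
    # Split the text wherever two or more CR/LF pairs appear in a row.
    paragraphs = re.split(r"(?:\r\n){2,}", text)
    # Inside each paragraph, replace every remaining CR/LF pair with a space.
    joined = [
        re.sub(r"[ \t]*\r\n[ \t]*", " ", p.strip("\r\n"))
        for p in paragraphs
    ]
    # Keep a single CR/LF pair at each paragraph end.
    return "\r\n".join(joined)

sample = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit,\r\n"
    "sed do eiusmod tempor incididunt ut labore et dolore\r\n"
    "magna aliqua."
)
one_line = mark_paragraphs(sample)
```

With the text on one line, a pattern such as "elit, sed do" can match across what used to be a line break.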

Tab Marking

Please see the Tab Marking article for a complete breakdown of this property.

Purpose

Marking wide gaps with TAB characters helps regular expressions distinguish between normal spaces and the wider gaps that typically separate data fields, improving data extraction accuracy.

Function

Converts significant horizontal whitespace (wide runs of spaces) into TAB characters.

Tab Marking in Action

Before Preprocessing:

  • Notice the large whitespace gap between "Value 1" and "Value 2".
Value 1          Value 2

After Preprocessing:

  • Tab Marking converts the large horizontal gap into a TAB character:
Value 1\tValue 2
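A rough Python approximation of this conversion is shown below. It is a sketch under stated assumptions rather than Grooper's actual logic: the function name mark_tabs and the min_spaces parameter are hypothetical, and the gap is measured as a count of space characters, whereas Grooper may decide what counts as "significant" differently.

```python
import re

def mark_tabs(text: str, min_spaces: int = 3) -> str:
    """Replace each run of at least `min_spaces` spaces with one TAB.

    Assumption (not from Grooper): a gap is "significant" when it
    contains `min_spaces` or more consecutive space characters.
    """
    pattern = " {" + str(min_spaces) + ",}"
    return re.sub(pattern, "\t", text)

marked = mark_tabs("Value 1          Value 2")

# The single space inside "Value 1" is untouched, so a pattern can now
# split fields on the TAB without breaking values that contain spaces.
fields = re.split(r"\t", marked)
```

Here fields holds the two values as separate items, which would not be possible by splitting on ordinary spaces.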

Vertical Tab Marking

The Vertical Tab Marking property is a text Preprocessing option designed to handle large vertical whitespace gaps within document text. It converts carriage return and line feed pairs (CR/LF pairs, which are newline characters \r\n) into vertical tab characters (\v), based on the size of the vertical gap between two lines of text.

Note: Vertical Tab Marking was added near the very beginning of Grooper as a means of detecting paragraphs. In modern Grooper, Paragraph Marking is a much more robust way to detect paragraphs. As a result, this property should be considered legacy and used only in very niche situations.

Purpose

To distinguish between regular new lines and significant vertical gaps that span multiple lines, aiding in accurate text extraction and data parsing.

Function

Converts newline characters to vertical tab characters if the gap between the text lines meets a specified threshold.

Vertical Gap Threshold

This is a sub-property of the Vertical Tab Marking property that is revealed when Vertical Tab Marking is enabled. It defines the minimum size of a vertical gap that will be converted to a vertical tab. The default value is 0.25 inches.

Vertical Tab Marking in Action

Consider a text document where information is spaced out over multiple lines with significant gaps.

Before Preprocessing:

  • By default, newline characters simply represent a line break. Standard text extraction might not accurately capture the separation between "Invoice Number" and "Date" due to intervening newlines.
Invoice Number:    12345



Date:              01/01/2023 

After Preprocessing:

  • Here, the vertical space is converted to a "\v" character, which represents a vertical tab. This allows extractors to more precisely differentiate between data fields.
Invoice Number:    12345\vDate:              01/01/2023
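The threshold logic can be sketched in Python as follows. This is an illustration only, not Grooper's implementation. Plain text carries no page geometry, so the sketch assumes each line comes with hypothetical top and bottom positions in inches. The Line class, its fields, and mark_vertical_tabs are invented names; only the 0.25 inch default comes from the description above.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Line:
    text: str
    top: float     # hypothetical: distance from page top to line top, in inches
    bottom: float  # hypothetical: distance from page top to line bottom, in inches

def mark_vertical_tabs(lines: List[Line], threshold: float = 0.25) -> str:
    """Join lines with "\v" where the vertical gap meets the threshold,
    and with an ordinary CR/LF pair everywhere else."""
    if not lines:
        return ""
    parts = [lines[0].text]
    for prev, cur in zip(lines, lines[1:]):
        gap = cur.top - prev.bottom
        parts.append("\v" if gap >= threshold else "\r\n")
        parts.append(cur.text)
    return "".join(parts)

page = [
    Line("Invoice Number:    12345", top=1.00, bottom=1.15),
    Line("Date:              01/01/2023", top=1.75, bottom=1.90),
]
marked = mark_vertical_tabs(page)
```

In this made-up layout the gap between the two lines is about 0.6 inches, which exceeds the 0.25 inch default, so the lines are joined with a vertical tab.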

Ignore Control Characters

The Ignore Control Characters property is a Preprocessing option that enables the removal of specific control characters from the source document to facilitate cleaner and more efficient data extraction. Such control characters can often interfere with text processing and data extraction tasks.

Purpose

Removing control characters yields cleaner text in situations where those characters would otherwise interfere with extraction, which simplifies the extraction process and improves accuracy.

Function

Determines which control characters are ignored. Users can select the specific characters to ignore during data extraction, including spaces, newlines (CR/LF pairs), form feed characters, and carriage returns.

Ignore Control Characters in Action

Consider a document with various control characters interspersed within the text:

Before Preprocessing:

  • Suppose the displayed control characters (\r\n, \f, and \r) are not wanted in the text.
Invoice Number:\r\n
12345\f
PO Number: \r
67890

After Preprocessing:

  • Using the Ignore Control Characters property, Grooper can transform this text by removing the specified unwanted characters, resulting in cleaner output.
Invoice Number:
12345
PO Number:
67890
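The selective removal described above can be sketched in Python. This is a hedged illustration, not Grooper's implementation: the option names in PATTERNS and the function ignore_control_characters are hypothetical, and the sketch assumes that an ignored character is simply deleted from the text.

```python
import re

# Hypothetical option names mapped to the characters they remove.
PATTERNS = {
    "space": r" ",
    "newline": r"\r\n",               # CR/LF pair
    "form_feed": r"\f",
    "carriage_return": r"\r(?!\n)",   # a lone CR, not part of a CR/LF pair
}

def ignore_control_characters(text: str, ignore) -> str:
    """Delete every character type named in `ignore` from `text`."""
    for name in ignore:
        text = re.sub(PATTERNS[name], "", text)
    return text

sample = "Invoice Number:\r\n12345\fPO Number: \r67890"

# Remove only form feeds and lone carriage returns; CR/LF pairs stay.
cleaned = ignore_control_characters(sample, ["form_feed", "carriage_return"])
```

Because each option is applied independently, the CR/LF pair after "Invoice Number:" survives here, while the form feed and the lone carriage return are removed.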