Text Preprocessor

From Grooper Wiki

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

Grooper's "Text Preprocessor" adjusts how raw text is formatted before extraction. It manipulates control characters (such as CR/LF pairs) so that regular expression patterns can match (or ignore) structural elements such as line breaks, paragraph boundaries and tab markers.

Where do I find the Text Preprocessor?

Text Preprocessor options found in a Pattern Match extractor's property grid.

The Text Preprocessor can be enabled by various configuration items in Grooper (generally by a "Preprocessing" or "Preprocessing Options" property). This includes:

  • The Ask AI Value Extractor and its "Preprocessing" property.
  • The Pattern Match Value Extractor and its "Preprocessing" property.
  • The List Match Value Extractor and its "Preprocessing" property.
  • The Label Match Value Extractor and its "Preprocessing" property.
  • The Word Match Value Extractor and its "Preprocessing" property.
  • The Field Match Value Extractor and its "Preprocessing" property.
  • The Pattern-Based Collation Provider and its "Preprocessing Options" property.
  • The Flow Layout Provider and its "Preprocessing Options" property.
  • The following Quoting Methods using their "Preprocessing" property:
    • Extracted
    • Labeled Region
    • Semantic

Text Preprocessor features

The Text Preprocessor has four key features. Each of these can be enabled and configured in its sub-properties.

  • Paragraph Marking
    Detects paragraph boundaries and converts line breaks within paragraphs to spaces, while preserving paragraph-ending breaks. This allows extractors to match values that span multiple lines within a paragraph, without matching across paragraph boundaries.
  • Tab Marking
    Replaces large horizontal whitespace gaps with TAB characters, making it possible to distinguish between normal spaces and significant gaps in regular expressions.
  • Vertical Tab Marking
    Converts certain line breaks to vertical tab characters based on vertical spacing, enabling recognition of vertical structure in tabular or multi-column layouts.
  • Ignore Control Characters
    Removes or replaces selected control characters (such as spaces, newlines, form feeds, and carriage returns) according to the 'Ignore Control Characters' setting. This can simplify extraction in documents with inconsistent or excessive whitespace.

Paragraph Marking

Paragraph Marking detects and marks paragraph boundaries. Instead of placing line breaks (carriage return and line feed pairs) at the end of each line, it places them at the end of each paragraph.

Please see the Paragraph Marking article for a complete breakdown of this feature.

Purpose

Paragraph Marking produces a normalized text flow for unstructured documents. It makes it easier to extract values that span lines.

Function

Paragraph Marking identifies paragraph boundaries, converts the line breaks inside each paragraph to spaces, and keeps the line break that ends the paragraph.

Paragraph Marking in Action

Before Preprocessing:

Notice the \r\n pairs at the ends of the lines in this simple paragraph. What if you wanted all of this information on one line of text with no line breaks?
Lorem ipsum dolor sit amet, consectetur adipiscing elit,\r\n
sed do eiusmod tempor incididunt ut labore et dolore\r\n
magna aliqua.\r\n

After Preprocessing:

The \r\n pairs inside the paragraph have been replaced with spaces. The paragraph is now on one line of text, ending in a single \r\n pair.
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\r\n
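
The effect on pattern matching can be illustrated outside of Grooper. The following is a minimal sketch using Python's standard `re` module, not Grooper's own regex engine or implementation. The `before` and `after` strings reproduce the example text above, and the pattern is a hypothetical one chosen only for illustration.

```python
import re

# Raw text: every line ends in a CR/LF pair.
before = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit,\r\n"
    "sed do eiusmod tempor incididunt ut labore et dolore\r\n"
    "magna aliqua.\r\n"
)

# Text after Paragraph Marking: a single CR/LF pair, at the end of the paragraph.
after = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, "
    "sed do eiusmod tempor incididunt ut labore et dolore "
    "magna aliqua.\r\n"
)

# Hypothetical pattern: a phrase that wraps across two lines in the raw text.
pattern = re.compile(r"adipiscing elit, sed do eiusmod")

# No match: "\r\n" sits between "elit," and "sed" in the raw text.
match_before = pattern.search(before)

# Match: the line break inside the paragraph is now a space.
match_after = pattern.search(after)
```

Without Paragraph Marking, the same pattern would need to allow for a line break at every possible wrap point.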

Tab Marking

Tab Marking detects and inserts horizontal tab characters into the text flow, based on whitespace gaps, changes in font size or the presence of vertical lines in the Layout Data.

Please see the Tab Marking article for a complete breakdown of this property.

Purpose

This transformation helps regular expressions distinguish between normal spaces and wider gaps. Gaps in whitespace typically separate fields and field values on a page. Anchoring off tab characters in a regex pattern can greatly improve data extraction accuracy.

Function

Converts significant horizontal white space (spaces) into TAB characters. By enabling Tab Marking, the \t metacharacter can be used in regular expressions to distinguish between regular spaces and significant horizontal gaps.

  • When "Vertical Lines" are enabled in the "Detection Options", vertical lines in the Layout Data that interrupt a text segment will also be converted into TAB characters.

Tab Marking in Action

Take this example with a "Patient Name" field and an "Intake Date" field.

PATIENT NAME: JOHN DOE                 INTAKE DATE: 01/01/2019

The large gap in whitespace makes it obvious where one field ends and another begins.

Before Preprocessing:

In the raw text, the gap is rendered as an ordinary space character, no matter how wide it is on the page.
PATIENT NAME: JOHN DOE INTAKE DATE: 01/01/2019

After Preprocessing:

Tab Marking converts large horizontal gaps into TAB characters.

PATIENT NAME: JOHN DOE\tINTAKE DATE: 01/01/2019
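
To illustrate why anchoring on the tab character helps, here is a minimal sketch using Python's standard `re` module rather than Grooper itself. The two strings reproduce the before and after text above. The pattern and variable names are hypothetical.

```python
import re

# The same line of text without and with Tab Marking.
before = "PATIENT NAME: JOHN DOE INTAKE DATE: 01/01/2019"
after = "PATIENT NAME: JOHN DOE\tINTAKE DATE: 01/01/2019"

# Hypothetical pattern: capture everything after the label,
# stopping at the next tab or line break.
name_pattern = re.compile(r"PATIENT NAME: ([^\t\r\n]+)")

# Without a tab to stop at, the capture runs into the next field.
name_before = name_pattern.search(before).group(1)
# name_before == "JOHN DOE INTAKE DATE: 01/01/2019"

# With Tab Marking, the tab cleanly ends the value.
name_after = name_pattern.search(after).group(1)
# name_after == "JOHN DOE"

# The tab also works as a simple field separator.
fields = after.split("\t")
# fields == ["PATIENT NAME: JOHN DOE", "INTAKE DATE: 01/01/2019"]
```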

Vertical Tab Marking

"Vertical Tab Marking" is a Text Preprocessor feature designed to handle large vertical whitespace gaps within document text. It converts line breaks (carriage return and line feed pairs \r\n) into vertical tab characters (\v) based on the size of the vertical gap between two lines of text.

  • Note: Vertical Tab Marking was used early on in Grooper to detect paragraph boundaries. However, Paragraph Marking is a much more robust means of detecting paragraphs in modern Grooper versions.

Purpose

Large vertical gaps between lines can indicate a new section, a new table row or another logical break in a document's structure.

By enabling Vertical Tab Marking, the \v metacharacter can be used in regular expressions to distinguish between regular new lines and significant vertical gaps that span multiple lines.

Function

Converts newline characters to vertical tab characters if the gap between the text lines meets a specified threshold.

Vertical Gap Threshold

This is a sub-property of "Vertical Tab Marking". It is revealed when Vertical Tab Marking is enabled. It defines the minimum size of a vertical gap that will be converted to a vertical tab. The default value is 0.25 inches.
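
The threshold comparison can be pictured with a short Python sketch. This is not Grooper's implementation. The `TextLine` class, the `mark_vertical_tabs` function, the way the gap is measured (bottom of one line to top of the next) and the sample coordinates are all assumptions made for illustration. Only the 0.25 inch default comes from the property description above.

```python
from dataclasses import dataclass


@dataclass
class TextLine:
    """Hypothetical line of text with its vertical position on the page."""
    text: str
    top: float     # inches from the top of the page to the top of the line
    bottom: float  # inches from the top of the page to the bottom of the line


def mark_vertical_tabs(lines, threshold=0.25):
    """Join lines with a vertical tab where the gap meets the threshold,
    and with a CR/LF pair everywhere else."""
    parts = []
    for current, following in zip(lines, lines[1:]):
        gap = following.top - current.bottom  # assumed gap measurement
        parts.append(current.text)
        parts.append("\v" if gap >= threshold else "\r\n")
    if lines:
        parts.append(lines[-1].text)
    return "".join(parts)


# Made-up sample data: a large gap after the first line, a small one after the second.
lines = [
    TextLine("Invoice Number: 12345", top=1.00, bottom=1.15),
    TextLine("Date: 01/01/2023", top=1.75, bottom=1.90),
    TextLine("Terms: Net 30", top=1.95, bottom=2.10),
]

marked = mark_vertical_tabs(lines)
# marked == "Invoice Number: 12345\vDate: 01/01/2023\r\nTerms: Net 30"
```

Raising the threshold in this sketch makes fewer line breaks qualify as vertical tabs, and lowering it makes more of them qualify.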

Vertical Tab Marking in Action

Consider a text document where information is spaced out over multiple lines with significant gaps.

Invoice Number:    12345



Date:              01/01/2023 

Before Preprocessing:

By default, newline characters simply represent a line break. Standard text extraction might not accurately capture the separation between "Invoice Number" and "Date" due to intervening newlines.
Invoice Number:    12345\r\n
Date:              01/01/2023

After Preprocessing:

Here, the vertical space converts to a "\v" character, which represents a vertical tab. This allows extractors to more precisely differentiate between data fields.
Invoice Number:    12345\v
Date:              01/01/2023
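
As a rough illustration of using the \v metacharacter, the sketch below applies Python's standard `re` module to the before and after text from this example. It is not run inside Grooper, and the pattern is hypothetical.

```python
import re

# The same text without and with Vertical Tab Marking.
before = "Invoice Number:    12345\r\nDate:              01/01/2023"
after = "Invoice Number:    12345\vDate:              01/01/2023"

# Split the text into blocks at each significant vertical gap.
sections = re.split(r"\v", after)
# sections == ["Invoice Number:    12345", "Date:              01/01/2023"]

# Hypothetical pattern: only accept an invoice number that is
# immediately followed by a significant vertical gap.
pattern = re.compile(r"Invoice Number:\s+(\d+)(?=\v)")

hit_after = pattern.search(after)    # matches, group(1) is "12345"
hit_before = pattern.search(before)  # None: an ordinary line break follows
```

Note that in Python the `\s` class also matches a vertical tab, so a pattern that needs to tell the two kinds of break apart should name `\v` or `\r\n` explicitly.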

Ignore Control Characters

The "Ignore Control Characters" property specifies which control characters from the source document should be removed. This can facilitate cleaner and more efficient data extraction in certain scenarios.

Purpose

Removing control characters that are irrelevant to a given extraction task produces cleaner text, which can simplify extraction patterns and improve accuracy.

Function

Determines which control characters are removed. Users select the specific characters they want extractors to ignore.

The following can be removed:

  • Space characters
  • New Line characters (\r\n pairs between lines)
  • Form Feed characters (\f characters between pages)
  • Carriage Return characters (\r characters between lines)

Ignore Control Characters in Action

Before Preprocessing:

Consider a document with various control characters interspersed within the text:

Invoice Number:\r\n
12345\f
PO Number: \r\n
67890

After Preprocessing:

While this is an extreme example, the following would be the result when all control characters are removed.
InvoiceNumber:12345PONumber:67890
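
The result above can be reproduced with a short Python sketch. It only imitates the removal behavior and is not how Grooper implements the property. The `ignore_characters` helper is a hypothetical name, and the sketch covers removal only, not any replacement behavior.

```python
# The example text, with its control characters written as escapes.
text = "Invoice Number:\r\n12345\fPO Number: \r\n67890"


def ignore_characters(source, ignored):
    """Return the source text with every character in `ignored` removed."""
    return source.translate({ord(ch): None for ch in ignored})


# Ignore spaces, carriage returns, line feeds and form feeds.
all_removed = ignore_characters(text, " \r\n\f")
# all_removed == "InvoiceNumber:12345PONumber:67890"

# Ignore only the CR/LF pairs; spaces and the form feed are kept.
newlines_only = ignore_characters(text, "\r\n")
# newlines_only == "Invoice Number:12345\fPO Number: 67890"
```

Ignoring only some of the characters, as in the second call, is usually closer to a practical configuration than removing everything.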