Preprocessing (Property): Difference between revisions

Latest revision as of 12:33, 3 March 2025

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

The Preprocessing group of properties consists of settings that adjust how text is formatted and interpreted before any Data Extraction process begins. These properties help ensure that the text is in an optimal format for subsequent extraction tasks, which may involve complex regular expressions or precise data parsing.

About

Preprocessing can be found on any text parsing Extractor Type, such as (but not limited to) List Match or Pattern Match.

Properties in the Preprocessing Group

To understand Preprocessing, start with the sub-properties grouped within it.

Paragraph Marking

Please see the Paragraph Marking article for a complete breakdown of this property.

Purpose

Essential for documents where data spans multiple lines or paragraphs, as it transforms embedded carriage return and line feed pairs within paragraphs into spaces while preserving those pairs at paragraph ends.

Function

This property is used to identify and mark paragraph boundaries within documents.

Paragraph Marking in Action

Before Preprocessing:

Notice the \r\n pairs at the ends of the lines in this simple paragraph. What if you wanted all of this information on one line of text with no line breaks?

Lorem ipsum dolor sit amet, consectetur adipiscing elit,\r\n
sed do eiusmod tempor incididunt ut labore et dolore\r\n
magna aliqua.

After Preprocessing:

The \r\n pairs have been replaced with spaces, and the paragraph is now on one line of text.

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
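The before/after transformation above can be approximated in a few lines of Python. This is only a sketch, not Grooper's implementation: it assumes that a paragraph ends at a blank line (two or more consecutive CR/LF pairs), and the function name `mark_paragraphs` is hypothetical.

```python
import re


def mark_paragraphs(text: str) -> str:
    """Replace CR/LF pairs embedded in a paragraph with spaces.

    Assumption (not from the Grooper docs): a paragraph boundary is a run
    of two or more consecutive CR/LF pairs, which is left untouched.
    """
    # Match a CR/LF pair that is neither preceded nor followed by another
    # CR/LF pair, i.e. a line break inside a paragraph.
    return re.sub(r"(?<!\r\n)\r\n(?!\r\n)", " ", text)


sample = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit,\r\n"
    "sed do eiusmod tempor incididunt ut labore et dolore\r\n"
    "magna aliqua."
)
print(mark_paragraphs(sample))
```

With the blank-line assumption, a pattern written against the result can treat any remaining `\r\n` as a paragraph boundary.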

Tab Marking

Please see the Tab Marking article for a complete breakdown of this property.

Purpose

Helps regular expressions distinguish between normal spaces and the wider gaps that typically separate data fields, improving data extraction accuracy.

Function

Converts significant horizontal white space (spaces) into TAB characters.

Tab Marking in Action

Before Preprocessing:

Notice the large whitespace gap between "Value 1" and "Value 2".

Value 1          Value 2

After Preprocessing:

Tab Marking converts the large horizontal gap into a TAB character:

Value 1\tValue 2
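As a rough illustration of why this helps, the sketch below collapses long runs of spaces into a single TAB and then splits on it. The threshold of three or more spaces and the name `mark_tabs` are assumptions made for this example; Grooper's own rule for what counts as a "significant" gap is not reproduced here.

```python
import re


def mark_tabs(text: str, min_gap: int = 3) -> str:
    """Replace runs of at least `min_gap` spaces with one TAB character.

    `min_gap` is a hypothetical threshold chosen for illustration only.
    """
    pattern = re.compile(" {" + str(min_gap) + ",}")
    return pattern.sub("\t", text)


marked = mark_tabs("Value 1          Value 2")
# Single spaces inside each value survive, so splitting on TAB
# separates the two fields cleanly.
fields = marked.split("\t")
print(fields)
```

Because the single space inside "Value 1" is preserved, a pattern such as `[^\t]+` can capture a whole field without stopping at ordinary word spacing.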

Vertical Tab Marking

The Vertical Tab Marking property is a text Preprocessing option designed to handle large vertical whitespace gaps within document text. It converts carriage return and line feed pairs (CR/LF pairs, which are newline characters \r\n) into vertical tab characters (\v), based on the size of the vertical gap between two lines of text.

Note: Vertical Tab Marking was added very early in Grooper's history as a means of detecting paragraphs. In modern Grooper, Paragraph Marking is a much more robust way to detect paragraphs. As a result, this property should be considered legacy and used only in niche situations.

Purpose

To distinguish between regular new lines and significant vertical gaps that span multiple lines, aiding in accurate text extraction and data parsing.

Function

Converts newline characters to vertical tab characters if the gap between the text lines meets a specified threshold.

Vertical Gap Threshold

This is a sub-property of the Vertical Tab Marking property that is revealed when Vertical Tab Marking is enabled. It defines the minimum size of a vertical gap that will be converted to a vertical tab. The default value is 0.25 inches.

Vertical Tab Marking in Action

Consider a text document where information is spaced out over multiple lines with significant gaps.

Before Preprocessing:

By default, newline characters simply represent a line break. Standard text extraction might not accurately capture the separation between "Invoice Number" and "Date" due to intervening newlines.

Invoice Number:    12345



Date:              01/01/2023

After Preprocessing:

Here, the vertical space converts to a "\v" character, which represents a vertical tab. This allows extractors to more precisely differentiate between data fields.

Invoice Number:    12345\vDate:              01/01/2023
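The threshold logic can be sketched as follows. This is a simplified model, not Grooper's code: it assumes each text line carries its top and bottom position in inches, and the `Line` class and `mark_vertical_tabs` function are hypothetical names. Only the 0.25 inch default comes from the description above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    text: str
    top: float     # distance from the top of the page, in inches (assumed)
    bottom: float  # distance from the top of the page, in inches (assumed)


def mark_vertical_tabs(lines: List[Line], threshold: float = 0.25) -> str:
    """Join lines with "\\v" where the vertical gap meets the threshold,
    and with a regular CR/LF pair otherwise."""
    if not lines:
        return ""
    parts = [lines[0].text]
    for prev, cur in zip(lines, lines[1:]):
        gap = cur.top - prev.bottom
        parts.append("\v" if gap >= threshold else "\r\n")
        parts.append(cur.text)
    return "".join(parts)


page = [
    Line("Invoice Number:    12345", top=1.00, bottom=1.15),
    Line("Date:              01/01/2023", top=1.75, bottom=1.90),
]
print(repr(mark_vertical_tabs(page)))
```

In this example the gap between the two lines is 0.6 inches, so they are joined with `\v` rather than `\r\n`.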

Ignore Control Characters

The Ignore Control Characters property is a Preprocessing option that enables the removal of specific control characters from the source document to facilitate cleaner and more efficient data extraction. Such control characters can often interfere with text processing and data extraction tasks.

Purpose

Produces cleaner text for extraction in situations that call for it. Removing control characters simplifies the extraction process and can improve accuracy.

Function

Controls which control characters are ignored. Users can select the specific characters they want to ignore during data extraction, including spaces, newlines (CR/LF pairs), form feed characters, and carriage returns.

Ignore Control Characters in Action

Consider a document with various control characters interspersed within the text:

Before Preprocessing:

Suppose you do not want the control characters shown below to be present in the text.

Invoice Number:\r\n
12345\f
PO Number: \r
67890

After Preprocessing:

Using the Ignore Control Characters property, Grooper can transform this text by removing the specified unwanted characters, resulting in cleaner output.

Invoice Number:
12345
PO Number:
67890
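The selection behavior can be sketched in Python as below. The option names (`crlf`, `form_feed`, `carriage_return`, `space`) and the function name are hypothetical labels for this example, not Grooper's property values. The sketch also deletes the selected characters outright, which is an assumption about how "ignoring" them behaves.

```python
import re

# Hypothetical option names mapped to the characters they remove.
PATTERNS = {
    "crlf": r"\r\n",
    "form_feed": r"\f",
    "carriage_return": r"\r(?!\n)",  # a lone CR, not part of a CR/LF pair
    "space": r" ",
}


def ignore_control_characters(text: str, ignored: list) -> str:
    """Delete every character class named in `ignored` from `text`."""
    for name in ignored:
        text = re.sub(PATTERNS[name], "", text)
    return text


raw = "Invoice Number:\r\n12345\fPO Number: \r67890"
cleaned = ignore_control_characters(raw, ["form_feed", "carriage_return"])
print(repr(cleaned))
```

Here only the form feed and the lone carriage return are removed; the CR/LF pair after "Invoice Number:" stays in place because `crlf` was not selected.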

@@ Line 2: / Line 2: @@
 <blockquote>{{#lst:Glossary|Preprocessing}}</blockquote>
+== About ==
+'''''Preprocessing''''' can be found on any text parsing '''''Extractor Type''''', such as (but not limited to) '''''List Match''''' or '''''Pattern Match'''''.
+[[image: 2024_Preprocessing_01.png]]
 == Properties in the Preprocessing Group ==
@@ Line 7: / Line 12: @@
 <div style="padding-left: 1.5em";>
 === Paragraph Marking ===
-Please see the '''''[[Paragraph Marking (Property)|Paragraph Marking]]''''' marking article for a complete breakdown of this property.
+Please see the '''''[[Paragraph Marking]]''''' article for a complete breakdown of this property.
 <div style="padding-left: 1.5em";>
+==== Purpose ====
+Essential for documents where data spans multiple lines or paragraphs, as it transforms embedded [https://en.wikipedia.org/wiki/Carriage_return carriage return] and [https://en.wikipedia.org/wiki/Newline line feed] pairs within paragraphs into spaces while preserving those pairs at paragraph ends.
 ==== Function ====
 This property is used to identify and mark paragraph boundaries within documents.
-==== Purpose ====
+==== Paragraph Marking in Action ====
-Essential for documents where data spans multiple lines or paragraphs, as it transforms embedded [https://en.wikipedia.org/wiki/Carriage_return carriage return] and [https://en.wikipedia.org/wiki/Newline line feed] pairs within paragraphs into spaces while preserving those pairs at paragraph ends.
+<div style="padding-left: 1.5em";>
+<big>'''Before Preprocessing:'''</big>
+* Notice the \r\n pairs at the ends of the lines in this simple paragraph. What if you wanted all of this information on one line of text with no line breaks?
+<pre>
+Lorem ipsum dolor sit amet, consectetur adipiscing elit,\r\n
+sed do eiusmod tempor incididunt ut labore et dolore\r\n
+magna aliqua.
+</pre>
+<big>'''After Preprocessing:'''</big>
+* The \r\n pairs have been replaced with white-space and the paragraph is now on one line of text.
+<pre>
+Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
+</pre>
+</div>
 </div>
 === Tab Marking ===
-Please see the '''''[[Tab Marking (Property)|Tab Marking]]''''' marking article for a complete breakdown of this property.
+Please see the '''''[[Tab Marking]]''''' article for a complete breakdown of this property.
 <div style="padding-left: 1.5em";>
 ==== Purpose ====
@@ Line 24: / Line 46: @@
 Converts significant horizontal white space (spaces) into TAB characters.
-==== Paragraph and Tab Marking in Action ====
+==== Tab Marking in Action ====
 <div style="padding-left: 1.5em";>
 <big>'''Before Preprocessing''':</big>
+* Notice the large whitespace gap between "Value 1" and "Value 2".
 <pre>
-Name: John Doe
+Value 1          Value 2
-                    Address: 1234 Main St.
-City: Springfield
-Phone: (123) 456-7890
 </pre>
 <big>'''After Preprocessing''':</big>
-* '''''Paragraph Marking''''' would ensure multi-line values are accurately represented.
 * '''''Tab Marking''''' would convert large horizontal spaces into TABs:
-<pre>Name: John Doe\tAddress: 1234 Main St.\nCity: Springfield\nPhone: (123) 456-7890</pre>
+<pre>
+Value 1\tValue 2
+</pre>
 </div>
 </div>
@@ Line 45: / Line 64: @@
 === Vertical Tab Marking ===
 The '''''Vertical Tab Marking''''' property  is a text '''''Preprocessing''''' option designed to handle large vertical whitespace gaps within document text. It converts carriage return and line feed pairs (CR/LF pairs, which are newline characters <code>\r\n</code>) into vertical tab characters (<code>\v</code>), based on the size of the vertical gap between two lines of text.
+<span style="color: red";>'''Note:'''</span> '''''Vertical Tab Marking''''' is a property that was added near the very beginning of '''Grooper''' that was meant as a means to detect paragraphs. However, '''''Paragraph Marking''''' is a much more robust means of detecting paragraphs in modern '''Grooper'''. As a result, this property should be considered legacy and not really used unless in very niche situations.
 <div style="padding-left: 1.5em";>
 ==== Purpose ====
@@ Line 59: / Line 80: @@
 <div style="padding-left: 1.5em";>
 <big>'''Before Preprocessing:'''</big>
+* By default, newline characters simply represent a line break. Standard text extraction might not accurately capture the separation between "Invoice Number" and "Date" due to intervening newlines.
 <pre>
 Invoice Number:    12345
 Date:              01/01/2023
 </pre>
-By default, newline characters simply represent a line break. Standard text extraction might not accurately capture the separation between "Invoice Number" and "Date" due to intervening newlines.
 <big>'''After Preprocessing:'''</big>
+* Here, the vertical space converts to a "\v" character, which represents a vertical tab. This allows extractors to more precisely differentiate between data fields.
 <pre>
 Invoice Number:    12345\vDate:              01/01/2023
 </pre>
-Here, the vertical space converts to a "\v" character, which represents a vertical tab. This allows extractors to more precisely differentiate between data fields.
 </div>
-==== Advantages ====
-* '''Enhanced Accuracy:''' Improves the precision of data extraction by distinguishing between minor and major gaps within text blocks.
-* '''Better Structuring:''' Facilitates better structural representation of text, helping in scenarios where text is not uniformly spaced.
 </div>
 === Ignore Control Characters ===
 The '''''Ignore Control Characters''''' property is a '''''Preprocessing''''' option that enables the removal of specific control characters from the source document to facilitate cleaner and more efficient data extraction. Such control characters can often interfere with text processing and data extraction tasks.
@@ Line 89: / Line 110: @@
 <div style="padding-left: 1.5em";>
 <big>'''Before Preprocessing:'''</big>
+* Consider the possibility of not wanting the displayed control characters to be present.
 <pre>
 Invoice Number:\r\n
@@ Line 95: / Line 117: @@
 </pre>
-Consider the possibility of not wanting the displayed control characters to be present.
 <big>'''After Preprocessing:'''</big>
+* Using the '''''Ignore Control Characters''''' property, Grooper can transform this text by removing the specified unwanted characters, resulting in cleaner output.
 <pre>
 Invoice Number:
@@ Line 104: / Line 126: @@
 </pre>
-Using the '''''Ignore Control Characters''''' property, Grooper can transform this text by removing the specified unwanted characters, resulting in cleaner output.
 </div>
-==== Advantages ====
-* '''Reduction of Noise:''' In specific situations, helps in eliminating non-essential characters that disrupt data integrity.
-* '''Streamlined Data Parsing:''' Makes parsing and interpreting text data more straightforward by removing elements that do not contribute to the actual information.
-* '''Enhanced Accuracy:''' When needed, can improve the accuracy of text extraction by focusing only on relevant characters.
 </div>
 </div>