2.90:Tab Marking (Property): Difference between revisions

Revision as of 10:51, 13 August 2020

Tab Marking allows you to insert tab characters into a document's text data.

The Tab Marking property enables tab characters for regular expression pattern matching. These characters are inserted into a document's text data wherever there is a large gap of space between characters on a line.
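As a rough illustration of the idea, the sketch below joins the words on one line with a single space, or with a tab when the horizontal gap between two neighbouring words is wide. It is not Grooper's implementation: the `(text, left, right)` word tuples, the `mark_tabs` helper, and the `gap_threshold` value are all hypothetical and chosen only for this example.

```python
from typing import List, Tuple

# Hypothetical word record: (text, left_x, right_x) in arbitrary page units.
Word = Tuple[str, float, float]


def mark_tabs(words: List[Word], gap_threshold: float) -> str:
    """Join words with a space, or with a tab when the gap between them is wide."""
    if not words:
        return ""
    parts = [words[0][0]]
    for prev, cur in zip(words, words[1:]):
        gap = cur[1] - prev[2]  # distance from the previous word's right edge
        parts.append("\t" if gap > gap_threshold else " ")
        parts.append(cur[0])
    return "".join(parts)


# Made-up line with three columns, each holding a pair of numbers.
line = [
    ("12", 0, 2), ("34", 3, 5),
    ("56", 20, 22), ("78", 23, 25),
    ("90", 40, 42), ("11", 43, 45),
]
marked = mark_tabs(line, gap_threshold=5)
print(repr(marked))  # '12 34\t56 78\t90 11'
```

Only the two wide gaps between columns become tabs. The narrow gaps inside each pair stay as ordinary spaces.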

About

Normally, a space is a space is a space. Whether it separates two words, two columns, or anything else, every gap is represented by a single space character in a document's text data.

However, knowing there is a large amount of space on one or both sides of a label or value is often useful for extracting that data. The image here has three columns, each with pairs of numbers.
You can visually distinguish the numbers in the second column from the others based on the spatial context around them. The numbers in this column have a large amount of space on either side, separating them from the numbers in the other columns.
However, with default extractor settings, there is no differentiation between the spaces between words and the large spaces between columns. We call words, phrases, numbers, or other data separated by large amounts of space like this "segments". As is, it would be cumbersome to write a regex pattern to differentiate between the pairs of numbers (or other "segments" on the page).
With the Tab Marking property enabled, tab characters are inserted wherever there is a large gap between segments. This is the `<\t>` character you now see in the text. Now we have a character that a regular expression can use to match the large white space on either side of the segments in the second column. The tab characters act as anchors to help us locate what we want on a document.
We can now easily create an extractor to return just the pairs of digits in the center column. The Value Pattern is `\d+ \d+` to pick up the digits, and the Prefix and Suffix Patterns use the tab characters as anchors. With Tab Marking enabled, `\t` in a regex pattern will match the `<\t>` inserted into the text data. So we only match pairs of digits that have a tab character before and after them. For more information on how to enable and configure the Tab Marking property, visit the How To section of this article.
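The same anchoring idea can be tried outside Grooper with an ordinary regex engine. The sketch below is an approximation in Python: it stands in for the Prefix and Suffix Patterns with a lookbehind and a lookahead on `\t`, which is an assumption about how those patterns behave and not a description of Grooper's internals. The sample text is made up.

```python
import re

# Made-up text data as it might look after tab marking:
# three columns of number pairs, with tabs at the wide gaps.
text = "12 34\t56 78\t90 11\n21 43\t65 87\t9 10"

# Value Pattern: \d+ \d+
# Prefix / Suffix Patterns: \t, approximated here with lookarounds
pattern = re.compile(r"(?<=\t)\d+ \d+(?=\t)")

matches = pattern.findall(text)
print(matches)  # ['56 78', '65 87']
```

Only the center-column pairs are returned, because they are the only pairs with a tab on both sides. The first column has no tab before it and the last column has none after it.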

Use Cases

WIP

This section is a work-in-progress and may abruptly stop.

Tab Marking has many uses. Essentially, any time you need to match a text segment by anchoring it to a large amount of white space on either side, you will use tab characters in your regex pattern. However, a few uses come up especially often.
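As a small, hedged illustration of segment matching, the Python sketch below treats every run of non-tab characters as one segment and then uses a tab to tie a label to the value that follows it. The sample line and its labels are invented for the example and do not come from a real document.

```python
import re

# Invented line of tab-marked text: label, value, label, value.
line = "Invoice Number\t10045\tDue Date\t01 March 2021"

# Every run of characters between tabs is one segment.
segments = re.findall(r"[^\t\n]+", line)
print(segments)  # ['Invoice Number', '10045', 'Due Date', '01 March 2021']

# Anchor a value to its label using the tab that separates them.
due = re.search(r"Due Date\t([^\t\n]+)", line)
due_date = due.group(1) if due else None
print(due_date)  # 01 March 2021
```

Because the segment pattern excludes tabs but allows spaces, multi-word segments such as "01 March 2021" stay together.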

Segment Extraction

Structured Form Extraction

Table Extraction

How To

Enabling the Tab Marking Property


Where to Begin

Tab Marking is enabled on the Pattern properties of Data Types, Data Format and objects using an Internal or Text Pattern extractor. Long story short, any time you can get to a Pattern Editor, you can enable Tab Marking.

Notice here, our pattern is not matching anything, even though we have \t in the Prefix and Suffix Patterns. That's because we have not enabled Tab Marking yet. By default, these characters are not inserted into the text data. Since those characters aren't there, the pattern doesn't match.
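The effect is easy to reproduce in plain Python. In the sketch below the text contains only single spaces, as it would before Tab Marking is enabled. The lookarounds stand in for the Prefix and Suffix Patterns, which is an assumption made for illustration, and the sample text is made up.

```python
import re

# Made-up text data without tab marking: every gap is a single space.
unmarked = "12 34 56 78 90 11"

# With \t anchors there is nothing to anchor to, so nothing matches.
anchored = re.findall(r"(?<=\t)\d+ \d+(?=\t)", unmarked)
print(anchored)  # []

# Dropping the anchors matches every pair, not just the center column.
unanchored = re.findall(r"\d+ \d+", unmarked)
print(unanchored)  # ['12 34', '56 78', '90 11']
```

Without tab characters in the text, the pattern either matches nothing or cannot tell the columns apart.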

Enable the Tab Marking Property

Once you're on a Pattern Editor in Grooper, you can turn on Tab Marking from the "Properties" tab.

1. Navigate to the "Properties" tab.
2. Expand the Preprocessing Options property.
3. Select the Tab Marking property.
4. Change the property from Disabled to Enabled.

Verify Tab Characters Are Inserted

With Tab Marking enabled, tab characters replace single spaces in the document's text data wherever there is a long horizontal gap between characters.

We can now see that our pattern matches. With tab characters on either side of the second column segments, the \t regex in the Prefix and Suffix Patterns now matches, and the digits captured by the Value Pattern \d+ \d+ are returned.

File:Tab Marking - How To Enable 04.png

Configuring the Tab Marking Property

@@ Line 65: / Line 65: @@
 === Enabling the Tab Marking Property ===
+<tabs style="margin:20px">
+<tab name="Where to Begin" style="margin:20px">
+=== Where to Begin ===
+{|cellpadding=10 cellspacing=5
+|style="width:40%" valign=top|
+'''''Tab Marking''''' is enabled on the '''''Pattern''''' properties of '''Data Types''', '''Data Format''' and objects using an ''Internal'' or ''Text Pattern'' extractor.  Long story short, any time you can get to a '''''Pattern Editor''''', you can enable '''''Tab Marking'''''.
+Notice here, our pattern is not matching anything, even though we have <code>\t</code> in the '''''Prefix''''' and '''''Suffix Patterns'''''.  That's because we have not enabled '''''Tab Marking''''' yet.  By default, these characters are not inserted into the text data.  Since those characters aren't there, the pattern doesn't match.
+|
+[[File:Tab Marking - How To Enable 01.png]]
+|}
+</tab>
+<tab name="Enable the Tab Marking Property" style="margin:20px">
+=== Enable the Tab Marking Property ===
+{|cellpadding=10 cellspacing=5
+|style="width:40%" valign=top|
+Once you're on a '''''Pattern Editor''''' in Grooper, you can turn on '''''Tab Marking''''' with the "Properties" tab.
+# Navigate to the "Properties" tab.
+# Expand the '''''Preprocessing Options''''' property.
+# Select the '''''Tab Marking''''' property.
+# Change the property from ''Disabled'' to ''Enabled''.
+|
+[[File:Tab Marking - How To Enable 02.png]]
+|}
+</tab>
+<tab name="Verify the Results">
+=== Verify Tab Characters Are Inserted ===
+{|cellpadding=10 cellspacing=5
+|style="width:40%" valign=top|
+With '''''Tab Marking''''' enabled, tab characters replace single spaces in the document's text data wherever there is a long horizontal gap between characters.
+We can see now, our pattern matches.  With tab characters on either side of the second column segments, the <code>\t</code> regex in the '''''Prefix''''' and '''''Suffix Pattern''''' now matches, and the digits captured by the '''''Value Pattern''''' <code>\d+ \d+</code> are returned.
+|
+[[File:Tab Marking - How To Enable 04.png]]
+|}
+</tab>
+</tabs>
 === Configuring the Tab Marking Property ===