Tab Marking

From Grooper Wiki

Tab Marking allows you to insert tab characters into a document's text data.

The Tab Marking property enables tab characters for regular expression pattern matching. These characters are inserted into a document's text data wherever there is a large gap of space between characters on a line. Space characters are converted to tab characters based on the width of the gap between text characters.


About

Normally, a space is a space is a space. Whether it is a space between words, a space between columns, or any other gap between characters, it is represented by a single space character in a document's text data.

However, knowing there's a large amount of space on one or both sides of a label or value is often useful information for how to extract that data. The image here has three columns, each with pairs of numbers.

Tab Marking 01.png

You can visually differentiate the numbers in the second column from the others based on the spatial context around them. The numbers in this column have a large amount of space on either side, separating them from the numbers in the other columns.

Tab Marking 02.png

However, with default extractor settings, there's no differentiation between the spaces between words and large spaces between the columns. We call words, phrases, numbers or other data separated by large amounts of space like this "segments".

As is, it would be cumbersome to write a regex pattern to differentiate between the pairs of numbers (or other "segments" on the page).

Tab Marking 03.png

With the Tab Marking property enabled, tab characters are inserted wherever there is a large gap between segments. This is the <\t> character you now see in the text.
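The idea can be sketched in a few lines of Python. This is only an illustration of gap-based marking, not Grooper's actual implementation: the word coordinates, the `mark_tabs` helper, and the `gap_threshold` value are all hypothetical.

```python
from typing import List, Tuple

# Each word is (text, left_x, right_x) in arbitrary page units.
# The coordinates and the threshold are made-up illustration values.
Word = Tuple[str, float, float]

def mark_tabs(words: List[Word], gap_threshold: float) -> str:
    """Join the words of one line, using a tab wherever the horizontal gap is wide."""
    if not words:
        return ""
    parts = [words[0][0]]
    for prev, curr in zip(words, words[1:]):
        # Distance from the previous word's right edge to this word's left edge.
        gap = curr[1] - prev[2]
        parts.append("\t" if gap > gap_threshold else " ")
        parts.append(curr[0])
    return "".join(parts)

line = [
    ("12", 0, 10), ("34", 14, 24),       # first column
    ("56", 80, 90), ("78", 94, 104),     # second column
    ("90", 160, 170), ("11", 174, 184),  # third column
]
marked = mark_tabs(line, gap_threshold=20)
print(repr(marked))  # '12 34\t56 78\t90 11'
```

Narrow gaps inside a segment stay as single spaces, while the wide gaps between columns become tab characters.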

Now we have a character a regular expression can use to match the large white space on either side of the segments in the second column. The tab characters can act as anchors to help us locate what we want on a document.

Tab Marking 04.png

We can now easily create an extractor to return just the pairs of digits in the center column.

For the Value Pattern we have \d+ \d+ to pick up the digits, and the Prefix and Suffix Patterns are using the tab characters as anchors. With Tab Marking enabled, \t in a regex pattern will match the <\t> inserted into the text data. So, we are only matching pairs of digits, with a tab character before and after them.
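Outside of Grooper, the same matching logic can be approximated with Python's `re` module. In this sketch, a lookbehind and a lookahead stand in for the Prefix and Suffix Patterns; that mapping is an assumption made for illustration, and the sample text is hypothetical.

```python
import re

# Hypothetical text data after Tab Marking: tabs separate the three columns.
text = "12 34\t56 78\t90 11\n21 43\t65 87\t9 1"

# Value Pattern: \d+ \d+
# Prefix and Suffix Patterns: \t, approximated here with lookbehind and lookahead.
pattern = re.compile(r"(?<=\t)\d+ \d+(?=\t)")

matches = pattern.findall(text)
print(matches)  # ['56 78', '65 87']
```

Only the center-column pairs are returned, because they are the only pairs with a tab character both before and after them.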

For more information on how to enable and configure the Tab Marking property, visit the How To section of this article.


Note: In the text data, the tab character is represented by <\t>. As far as regex is concerned, \t is the way to match this character, not <\t>. You may also see several spaces after the tab character in the text data. These spaces are there to make the text data more readable, but they are not actually part of the text data. You only need to match the tab character.
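The difference between the displayed marker and the underlying character can be illustrated with a small sketch. The `render_for_display` function below is hypothetical and only mimics the kind of rendering described in the note; it is not how Grooper draws its text view.

```python
import re

def render_for_display(text_data: str) -> str:
    """Show each tab as a visible <\\t> marker plus padding spaces (display only)."""
    return text_data.replace("\t", "<\\t>   ")

raw = "12 34\t56 78"               # what the pattern actually runs against
shown = render_for_display(raw)    # what a viewer might show on screen

print(shown)                                # 12 34<\t>   56 78
print(re.search(r"\t", raw) is not None)    # True: \t matches the real tab
print(re.search(r"\t", shown) is not None)  # False: the marker is only for display
```

The pattern should target the single tab character in the underlying text, not the visible marker or the padding spaces.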

Tab Marking 05.png

Use Cases

WIP This section is a work-in-progress and may abruptly stop.

Tab Marking has many uses. Essentially, any time you need to match a text segment by anchoring it to a large amount of white space on either side, you will use tab characters in your regex pattern to do so. However, tab characters come up most often in a few common scenarios.

Segment Extraction

Structured Form Extraction

Table Extraction

How To

Enable Tab Marking

Where to Begin

Tab Marking is enabled on the Pattern properties of Data Types, Data Formats, and objects using an Internal or Text Pattern extractor. Long story short, any time you can get to a Pattern Editor, you can enable Tab Marking.

Notice that our pattern is not matching anything here, even though we have \t in the Prefix and Suffix Patterns. That's because we have not enabled Tab Marking yet. By default, these characters are not inserted into the text data. Since those characters aren't there, the pattern doesn't match.
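The same behavior can be reproduced in a short Python sketch. As before, the lookbehind and lookahead are an assumed stand-in for the Prefix and Suffix Patterns, and the sample text is hypothetical.

```python
import re

# Hypothetical text data with Tab Marking disabled: every gap is a single space.
unmarked = "12 34 56 78 90 11"

anchored = re.compile(r"(?<=\t)\d+ \d+(?=\t)")
print(anchored.findall(unmarked))  # []

# Without the tab anchors, the Value Pattern alone cannot single out one column.
unanchored = re.compile(r"\d+ \d+")
print(unanchored.findall(unmarked))  # ['12 34', '56 78', '90 11']
```

With no tab characters in the text, the anchored pattern has nothing to anchor to, and the unanchored pattern returns every pair on the line.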

Tab Marking - How To Enable 01.png

Enable the Tab Marking Property

Once you're in a Pattern Editor in Grooper, you can turn on Tab Marking from the "Properties" tab.

  1. Navigate to the "Properties" tab.
  2. Expand the Preprocessing Options property.
  3. Select the Tab Marking property.
  4. Change the property from Disabled to Enabled.

Tab Marking - How To Enable 02.png

Verify Tab Characters Are Inserted

With Tab Marking enabled, tab characters replace single spaces in the document's text data wherever there is a long horizontal gap between characters.

We can now see that our pattern matches. With tab characters on either side of the second column segments, the \t regex in the Prefix and Suffix Patterns now matches, and the digits captured by the Value Pattern \d+ \d+ are returned.


Note: In the text data, the tab character is represented by <\t>. As far as regex is concerned, \t is the way to match this character, not <\t>. You may also see several spaces after the tab character in the text data. These spaces are there to make the text data more readable, but they are not actually part of the text data. You only need to match the tab character.

Tab Marking - How To Enable 03.png

Configure Tab Marking