2024:Tab Marking (Property)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Tab Marking allows you to insert tab characters into a document's text data.

The Tab Marking property enables tab characters for regular expression pattern matching. These characters are inserted into a document's text data wherever there is a large gap of space between characters on a line. Space characters are converted to tab characters based on the width of the gap between text characters.

You may download and import the file below into your own Grooper environment (version 2024). This contains a Batch with the example document(s) discussed in this article and a Project containing a Content Model configured according to its instructions.

About

Normally, a space is a space is a space. Whether it separates two words, two columns, or anything else, the gap is represented by a single space character in a document's text data.

However, knowing there is a large amount of space on one or both sides of a label or value is often useful information for extracting that data. The image here has three columns, each with pairs of numbers.

You can visually differentiate the numbers in the second column from the others based on the spatial context around them. The numbers in this column have a large amount of space on either side, separating them from the numbers in the other columns.

However, with default extractor settings, there's no differentiation between the spaces between words and large spaces between the columns. We call words, phrases, numbers or other data separated by large amounts of space like this "segments".

As is, it would be cumbersome to write a regex pattern to differentiate between the pairs of numbers (or other "segments" on the page).

With the Tab Marking property enabled, tab characters are inserted wherever there is a large gap between segments. In the text view, each inserted tab character is displayed as an arrow icon.
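The idea can be sketched in a few lines of Python. This is only an illustration of the concept, not Grooper's implementation: the mark_tabs function, the (text, left, right) word tuples, and the min_gap threshold are all hypothetical names and units chosen for the example.

```python
def mark_tabs(words, min_gap):
    """Join words into one line, using a tab wherever the gap is wide.

    words   -- list of (text, left, right) tuples in reading order
               (hypothetical layout data, arbitrary units)
    min_gap -- hypothetical threshold; gaps at least this wide become tabs
    """
    if not words:
        return ""
    parts = [words[0][0]]
    # Walk over each pair of neighboring words and measure the gap between them.
    for (_, _, prev_right), (text, left, _) in zip(words, words[1:]):
        gap = left - prev_right
        parts.append("\t" if gap >= min_gap else " ")
        parts.append(text)
    return "".join(parts)


# Three columns, each holding a pair of numbers.
words = [
    ("12", 0, 2), ("34", 3, 5),
    ("56", 15, 17), ("78", 18, 20),
    ("90", 30, 32), ("12", 33, 35),
]
line = mark_tabs(words, min_gap=5)
print(repr(line))  # '12 34\t56 78\t90 12'
```

The narrow gaps inside each pair stay as spaces, while the wide gaps between columns become tab characters.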

Now we have a character that a regular expression can use to match the large white space on either side of the segments in the second column. The tab characters act as anchors to help us locate what we want on a document.

We can now easily create an extractor to return just the pairs of digits in the center column.

For the Value Pattern we have \d+ \d+ to pick up the digits, and the Prefix and Suffix Patterns use the tab characters as anchors. With Tab Marking enabled, \t in a regex pattern will match the tab characters inserted into the text data. So, we are only matching pairs of digits that have a tab character before and after them.
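To try the same idea outside Grooper, the three patterns can be approximated with Python's re module. This sketch assumes the Prefix and Suffix Patterns behave like a lookbehind and a lookahead (they anchor the match but are not returned); that is an assumption for illustration, and build_pattern is a hypothetical helper, not part of Grooper.

```python
import re


def build_pattern(value, prefix="", suffix=""):
    """Combine value, prefix, and suffix patterns into one regex.

    Assumption: the prefix and suffix only anchor the match, so they are
    modeled as lookbehind and lookahead. Python requires a fixed-width
    lookbehind, which a single \\t satisfies.
    """
    lookbehind = f"(?<={prefix})" if prefix else ""
    lookahead = f"(?={suffix})" if suffix else ""
    return re.compile(lookbehind + value + lookahead)


# Hypothetical line of text after Tab Marking: tabs separate the columns.
marked = "12 34\t56 78\t90 12"

pattern = build_pattern(r"\d+ \d+", prefix=r"\t", suffix=r"\t")
print(pattern.findall(marked))  # ['56 78']
```

Only the center pair is returned, because it is the only pair with a tab character on both sides.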

For more information on how to enable and configure the Tab Marking property, visit the How To section of this article.


Note: In the "Processed Text" view of the Document Viewer, the tab character is represented by an arrow icon. As far as regex is concerned, \t is the way to match this character, not the icon. You may also see several spaces after the tab character in the viewer. These spaces are there to make the text more readable, but are not actually part of the text data. You only need to match the tab character.

The uses for Tab Marking are potentially endless. Essentially, any time you need to match a text segment by anchoring it to a large amount of white space on either side, you will use tab characters in your regex pattern to do so.

How To

Enable Tab Marking

Where to Begin

Tab Marking is enabled within the Preprocessing grouping of properties. Preprocessing is used all over Grooper, but is most easily associated with Extractor Types like Pattern Match or List Match.

  1. Right-click on your Project, or a folder within the Project.
  2. From the pop-out menu select "Add > Value Reader".
  3. In the "Add" window name the Value Reader.
  4. Click the "Execute" button.


  1. With the Value Reader created, click the drop-down for the Extractor property.
  2. From the drop-down menu, select Pattern Match.


  1. Once set to an Extractor Type, click the ellipsis button on the Extractor property to open the "Extractor" window.


  1. In the "Extractor" window insert the following into the Value Pattern field:
\d+ \d+
  2. Insert the following into the Prefix Pattern field:
\t
  3. Insert the following into the Suffix Pattern field:
\t

Notice here, our pattern is not matching anything, even though we have \t in the Prefix and Suffix Patterns. That's because we have not enabled Tab Marking yet. By default, these characters are not inserted into the text data. Since those characters aren't there, the pattern doesn't match.
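The same behavior can be reproduced with a short Python check. The sample text is hypothetical, and the lookbehind and lookahead stand in for the Prefix and Suffix Patterns as an assumption made for illustration.

```python
import re

# Tab-anchored pattern: digits pairs with a tab before and after.
pattern = re.compile(r"(?<=\t)\d+ \d+(?=\t)")

# Hypothetical text data without Tab Marking: every gap is a single space.
unmarked = "12 34 56 78 90 12"

print("\t" in unmarked)          # False
print(pattern.search(unmarked))  # None
```

With no tab characters in the text, there is nothing for \t to anchor to, so the search finds no match.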


Enable the Tab Marking Property

  1. In the "Extractor" window click on the "Properties" tab.
  2. Enable the "Auto Extract" toggle to see immediate extraction results.
  3. Ensure an appropriate Document from an appropriate Batch is selected.
  4. Click the drop-down arrow for the Preprocessing property to expand its sub-properties.
  5. Click the check box for the Tab Marking sub-property to enable Tab Marking.
  6. The Preprocessing property will update to show enabled features.
  7. Results will be displayed in the Document Viewer and Results List.

Check "Processed Text" View

With Tab Marking enabled, tab characters replace single spaces in the document's text data wherever there is a long horizontal gap between characters.

Note: In the "Processed Text" view of the Document Viewer, the tab character is represented by an arrow icon. As far as regex is concerned, \t is the way to match this character, not the icon. You may also see several spaces after the tab character in the viewer. These spaces are there to make the text more readable, but are not actually part of the text data. You only need to match the tab character.

  1. In the "Extractor" window, click on the "Viewer" dropdown in the top-right of the Document Viewer.
  2. In the drop-down menu choose the "Processed Text" option.
  3. The Document Viewer will be updated to display the "Processed Text".
    • The goal of this view is to display special characters to represent features from the Preprocessing property and its sub-properties. In this case, an arrow icon will be used in place of tab \t characters.
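The difference between what is displayed and what is stored can be mimicked in Python. This is a rough sketch and not how the Document Viewer is implemented: show_tabs, the arrow character, and the three padding spaces are hypothetical choices made for the example.

```python
def show_tabs(text, marker="\u2192", padding="   "):
    """Return a display version of text with each tab made visible.

    The marker and padding exist only in the returned display string;
    the original text still holds a single tab character per gap.
    """
    return text.replace("\t", marker + padding)


stored = "12 34\t56 78"       # hypothetical text data with one tab
displayed = show_tabs(stored)

print(displayed)              # 12 34→   56 78
print(stored.count("\t"))     # 1
```

A pattern should be written against the stored text, so a single \t is enough to match the gap regardless of how many spaces appear after the arrow in the display.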

Configure Tab Marking

With Tab Marking enabled, you will gain access to its sub-properties, which are accessed by clicking the drop-down arrow to the left of the Tab Marking property. Each of these properties is configured by entering the appropriate type of value into the property's field, or by using a check box to enable it. They are as follows:

  • Minimum Tab Width
This property defines the minimum width of a whitespace gap that should be converted into a tab \t character. Specifically, any whitespace gap with a width equal to or greater than the specified value will be converted to a tab \t character.
To use this property effectively, set a specific measurement for the minimum tab width. If a horizontal gap is at least as wide as this measurement, it will be considered a tab; otherwise, it will remain a space. Adjusting this value can help in breaking up text segments more accurately, especially when dealing with documents that have varying font sizes and spacing.
  • Character Size Ratio
This property determines the minimum width of a whitespace gap to convert into a tab character, expressed as a percentage of the text height. Specifically, any whitespace gap with a width equal to or greater than the product of the text height and the character size ratio will be converted to tab characters.
This property is particularly useful when dealing with documents that have varying font sizes. It allows for a more dynamic approach to tab marking by considering the height of the text preceding the whitespace. For example, if the character size ratio is set to 200%, a horizontal gap must be at least twice the height of the preceding text to be considered a tab. This helps in accurately segmenting text data by accounting for different font sizes, ensuring that only significant gaps are converted to tabs.
  • Font Size Threshold
This property specifies the percentage change in font size that should trigger the insertion of a TAB character. Essentially, if the font size changes by the specified percentage or more, a TAB character will be inserted at that point.
This property is useful when processing documents where font size changes are significant indicators of data separation. For example, in a document where different sections or fields are marked by changes in font size, setting a "Font Size Threshold" allows Grooper to insert tab \t characters at these points, aiding in the segmentation of text data. This can be particularly helpful in distinguishing between different data fields or sections that are visually separated by font size changes.
  • Detection Options
    • Vertical Lines
This property enables the detection of vertical lines that interrupt a text segment. When this option is enabled, Grooper inserts tab characters at each position where a vertical line intersects a line of text. Note that the vertical lines must already have been detected via an IP profile with a Line Removal (or Line Detection) IP Step used during the Recognize activity, which creates line data in the layout data of a page.
For example, consider a document with a table where vertical lines separate the columns, and the values of those columns are close together in physical dimension on the document. By enabling vertical line detection, Grooper can insert tab characters at these intersections, allowing extractors to more accurately segment and extract data from each column. This feature is beneficial for processing densely populated documents where vertical lines are used to organize information.
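The four rules described above can each be written as a small test. The Python below is a sketch of the arithmetic only. The function names, the units, the choice to measure a font size change relative to the preceding size, and the treatment of each rule in isolation are all assumptions made for illustration; how Grooper combines these settings internally is not covered here.

```python
def meets_minimum_tab_width(gap_width, minimum_tab_width):
    """Minimum Tab Width: the gap must be at least the configured width."""
    return gap_width >= minimum_tab_width


def meets_character_size_ratio(gap_width, text_height, ratio):
    """Character Size Ratio: the gap must be at least text height x ratio.

    ratio is a fraction, so 200% is written as 2.0.
    """
    return gap_width >= text_height * ratio


def meets_font_size_threshold(previous_size, next_size, threshold):
    """Font Size Threshold: the relative size change must reach the threshold.

    Assumption: the change is measured against the preceding font size,
    and both increases and decreases count.
    """
    change = abs(next_size - previous_size) / previous_size
    return change >= threshold


def vertical_line_interrupts(previous_right, next_left, line_x):
    """Vertical Lines: a detected line sits between two neighboring words."""
    return previous_right <= line_x <= next_left


# Hypothetical measurements in arbitrary units.
print(meets_minimum_tab_width(gap_width=30, minimum_tab_width=25))         # True
print(meets_character_size_ratio(gap_width=20, text_height=10, ratio=2.0)) # True
print(meets_character_size_ratio(gap_width=15, text_height=10, ratio=2.0)) # False
print(meets_font_size_threshold(previous_size=10, next_size=15, threshold=0.5))  # True
print(vertical_line_interrupts(previous_right=40, next_left=44, line_x=42))      # True
```

The third call shows the 200% example from the Character Size Ratio description: with text 10 units high, a 15 unit gap is too narrow, while a 20 unit gap qualifies.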