2023.1:Multi-Column (Collation Provider)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.



Multi-Column is a Collation Provider option for Data Type extractors. Multi-Column combines multiple columns on a page into a single column for extraction.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

Sometimes you will run into documents whose text is divided into columns.


Grooper cannot automatically determine whether a page is divided into columns or is just continuous text. We need to tell Grooper to expect multiple columns using the Multi-Column Collation Provider.

BE AWARE: The Multi-Column provider has its limitations.

Multi-Column is an older Collation Provider that never got wide use or adoption in Grooper. As such, it is underdeveloped compared to the rest of Grooper's Collation Providers.

BE AWARE: The Multi-Column provider can only collect two (2) columns.

Multi-Column collation only works for two-column layouts. It is not suited for three or more columns of text.

How To

This section covers the basics of setting up the Multi-Column Collation Provider. After you select Multi-Column, many more options appear under the Collation property. You can adjust them to improve your results beyond what is discussed here.

Set Up the Provider

  1. The page in our Batch has two columns. The text in the first column continues in the second.
  2. Create a Data Type.


  3. Set the Local Extractor for the Data Type. In this example we are setting it to a Pattern Match.


  4. In our example we have configured our Pattern Match with the regex pattern [^\r\n\t\f]+ to collect all lines of text on the page.
    • You need to turn on Tab Marking for this pattern to work.
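
To see why the pattern depends on Tab Marking, here is a minimal Python sketch of how [^\r\n\t\f]+ behaves on plain text. It runs outside Grooper and only illustrates the regex itself. The sample text is made up, and it assumes Tab Marking represents the wide gap between the two columns as a tab character.

```python
import re

# Hypothetical page text. The "\t" stands in for the wide gap between the
# two columns, which Tab Marking is assumed to flag with a tab character.
page_text = (
    "The quick brown fox\tover the lazy dog.\r\n"
    "jumps high\tThe end.\r\n"
)

# The same pattern used in the Pattern Match above: one or more characters
# that are not a carriage return, line feed, tab, or form feed.
LINE_SEGMENT = re.compile(r"[^\r\n\t\f]+")

# With tabs present, every line splits into one match per column.
segments = LINE_SEGMENT.findall(page_text)

# Without tabs (gap read as an ordinary space), each line is a single
# match that runs straight across both columns.
untabbed = LINE_SEGMENT.findall(page_text.replace("\t", " "))

print(segments)
print(untabbed)
```

With the tab in place, each line yields two separate matches, one per column. Without it, the pattern has no way to stop at the column gap.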


Turning on Tab Marking

  1. Click on the "Properties" tab.
  2. Open up the Preprocessing options.
  3. Click the check box to the right of Tab Marking to enable the property.


Setting the Provider

  1. Set the Collation property to Multi-Column.
  2. It may look like the whole page is being extracted straight across, but Grooper is now collecting the individual columns.
  3. Click the Inspection icon located to the bottom right of the Document Viewer.


  4. In the "Text Value" tab below the Document Viewer on the Inspection page, you can now see that Grooper collects the text in the first column before it collects the text in the second column.
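
The reading order described above can be illustrated with a short Python sketch. This is not Grooper's implementation, which works from the page layout rather than plain strings. The function name collate_two_columns is hypothetical, and the sketch assumes each line carries at most one tab separating the left column from the right.

```python
def collate_two_columns(page_text):
    """Return line segments in reading order: left column, then right.

    Illustrative only. Assumes one tab per line marks the column gap.
    """
    left, right = [], []
    for line in page_text.splitlines():
        # Split once at the first tab: text before it belongs to the
        # left column, text after it to the right column.
        left_part, _, right_part = line.partition("\t")
        if left_part.strip():
            left.append(left_part.strip())
        if right_part.strip():
            right.append(right_part.strip())
    # All of the first column is emitted before any of the second.
    return left + right


# Hypothetical two-column page.
sample = (
    "Line one of column A\tLine one of column B\n"
    "Line two of column A\tLine two of column B\n"
)
ordered = collate_two_columns(sample)
print("\n".join(ordered))
```

Like the provider itself, the sketch handles only two columns. A third column would need a different approach.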