2023.1:Multi-Column (Collation Provider): Difference between revisions

From Grooper Wiki
{{AutoVersion}}




<blockquote>
<blockquote>
|
|
You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more '''Batches''' of sample documents.  The second contains one or more '''Projects''' with resources used in examples throughout this article.
* [[Media:2023.1 Wiki Multi-Column-(Collation-Provider) Batches.zip]]
* [[Media:2023.1 Wiki Multi-Column-(Collation-Provider) Batch.zip]]
* [[Media:2023.1 Wiki Multi-Column-(Collation-Provider) Project.zip]]
|}
|}


== About ==
Sometimes you might run into documents with text that is divided up into columns:
[[File:2023.1 Multi-Column-(Collation-Provider) 01 About 01.png|center|500px]]
Grooper cannot determine on its own whether a page is divided into columns or is just continuous text. We need to tell Grooper to expect multiple columns using the ''Multi-Column'' '''''Collation Provider'''''.
{|class="attn-box"
|
&#9888;
|
'''''BE AWARE: The Multi-Column provider has its limitations.'''''
Multi-Column is an older Collation Provider that never saw wide adoption in Grooper. As such, it is underdeveloped compared to the rest of Grooper's Collation Providers.
'''''BE AWARE: The Multi-Column provider can only collect two (2) columns.'''''
Multi-Column collation only works for two-column layouts. It is not suited for three or more columns of text.
|}
== How To ==
This section covers the basics of setting up the ''Multi-Column'' '''''Collation Provider'''''. After you select ''Multi-Column'', many more options appear under the '''''Collation''''' property. Adjusting them can improve your results beyond what is covered here.
=== Set Up the Provider ===
# The page in our [[image:GrooperIcon_Batch.png]]'''Batch''' has two columns. The text in the first column continues in the second.
# Create a '''Data Type'''.
[[File:2023.1 Multi-Column-(Collation-Provider) 02 01 Setting-the-Collation 01.png]]
#<li value=3> Set the '''''Local Extractor''''' for the '''Data Type'''. In this example we are setting it to a ''Pattern Match''.
[[File:2023.1 Multi-Column-(Collation-Provider) 02 01 Setting-the-Collation 02.png]]
#<li value=4> In our example we have configured our ''Pattern Match'' with the regex pattern <code>[^\r\n\t\f]+</code> to collect all lines of text on the page.
#* You need to turn on '''''Tab Marking''''' for this pattern to work.
[[File:2023.1 Multi-Column-(Collation-Provider) 02 01 Setting-the-Collation 03.png]]
<big>'''Turning on Tab Marking'''</big>
# Click on the "Properties" tab.
# Open up the '''''Preprocessing''''' options.
# Click the check box to the right of '''''Tab Marking''''' to enable the property.
[[File:2023.1 Multi-Column-(Collation-Provider) 02 01 Setting-the-Collation 04.png]]
<big>'''Setting the Provider'''</big>
# Set the '''''Collation''''' property to ''Multi-Column''.
# It may look like the whole page is being extracted straight across, but Grooper is now collecting the individual columns.
# Click the Inspection icon located to the bottom right of the Document Viewer.
[[File:2023.1 Multi-Column-(Collation-Provider) 02 01 Setting-the-Collation 05.png]]
#<li value=4> In the "Text Value" tab below the Document Viewer on the Inspection page, you can now see that Grooper collects the text in the first column before the text in the second column.
[[File:2023.1 Multi-Column-(Collation-Provider) 02 01 Setting-the-Collation 06.png]]

Latest revision as of 13:58, 23 April 2025

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.



Multi-Column is a Collation Provider option for Data Type extractors. Multi-Column combines multiple columns on a page into a single column for extraction.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

Sometimes you might run into documents with text that is divided up into columns:


Grooper cannot determine on its own whether a page is divided into columns or is just continuous text. We need to tell Grooper to expect multiple columns using the Multi-Column Collation Provider.

BE AWARE: The Multi-Column provider has its limitations.

Multi-Column is an older Collation Provider that never saw wide adoption in Grooper. As such, it is underdeveloped compared to the rest of Grooper's Collation Providers.

BE AWARE: The Multi-Column provider can only collect two (2) columns.

Multi-Column collation only works for two-column layouts. It is not suited for three or more columns of text.

How To

This section covers the basics of setting up the Multi-Column Collation Provider. After you select Multi-Column, many more options appear under the Collation property. Adjusting them can improve your results beyond what is covered here.

Set Up the Provider

  1. The page in our Batch has two columns. The text in the first column continues in the second.
  2. Create a Data Type.


  3. Set the Local Extractor for the Data Type. In this example we are setting it to a Pattern Match.


  4. In our example we have configured our Pattern Match with the regex pattern [^\r\n\t\f]+ to collect all lines of text on the page.
    • You need to turn on Tab Marking for this pattern to work.
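
To see why Tab Marking matters for this pattern, here is a minimal Python sketch (not Grooper code). It assumes that Tab Marking places a tab character wherever a wide horizontal gap separates text on the same line. The sample page text and the variable names are made up for illustration.

```python
import re

# The pattern from the article: one or more characters that are not
# a carriage return, line feed, tab, or form feed.
SEGMENT = re.compile(r"[^\r\n\t\f]+")

# Hypothetical page text. Each "\t" stands in for the tab that Tab Marking
# is ASSUMED to insert at the gap between the two columns.
page_with_tabs = (
    "The quick brown\tjumps over the\r\n"
    "fox in column one\tlazy dog in column two\r\n"
)

# With tabs present, each physical line splits into one match per column.
segments = SEGMENT.findall(page_with_tabs)

# Without tabs (simulated here by swapping them for spaces), each physical
# line is returned as a single match that runs straight across both columns.
page_without_tabs = page_with_tabs.replace("\t", "  ")
rows = SEGMENT.findall(page_without_tabs)

for s in segments:
    print("segment:", s)
for r in rows:
    print("row:    ", r)
```

With the tab in place, the pattern returns four separate column segments. Without it, the same pattern returns two rows that each mix text from both columns.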


Turning on Tab Marking

  1. Click on the "Properties" tab.
  2. Open up the Preprocessing options.
  3. Click the check box to the right of Tab Marking to enable the property.


Setting the Provider

  1. Set the Collation property to Multi-Column.
  2. It may look like the whole page is being extracted straight across, but Grooper is now collecting the individual columns.
  3. Click the Inspection icon located to the bottom right of the Document Viewer.


  4. In the "Text Value" tab below the Document Viewer on the Inspection page, you can now see that Grooper collects the text in the first column before the text in the second column.
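
The reading order described above can be sketched in Python. This is only an illustration of the idea, not how Grooper implements Multi-Column collation. It assumes each text segment carries hypothetical x and y coordinates (origin at the top left of the page) and that the two columns sit on either side of the page's vertical midline.

```python
from typing import List, Tuple

# (x, y, text): hypothetical left edge, top edge, and content of a segment.
Segment = Tuple[float, float, str]


def two_column_order(segments: List[Segment], page_width: float) -> List[str]:
    """Return segment text with the left column first, then the right column.

    ASSUMPTION: a simple midline split decides which column a segment is in.
    """
    midline = page_width / 2
    left = [s for s in segments if s[0] < midline]
    right = [s for s in segments if s[0] >= midline]
    # Within each column, read from top to bottom.
    left.sort(key=lambda s: s[1])
    right.sort(key=lambda s: s[1])
    return [s[2] for s in left + right]


# Segments listed the way a straight-across read would encounter them.
straight_across: List[Segment] = [
    (0.5, 1.0, "A1"), (4.5, 1.0, "B1"),
    (0.5, 2.0, "A2"), (4.5, 2.0, "B2"),
]

ordered = two_column_order(straight_across, page_width=8.5)
print(ordered)
```

A fixed midline split like this only makes sense for two columns, which mirrors the two-column limitation noted earlier in this article.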