2023.1:Multi-Column (Collation Provider): Difference between revisions

Revision as of 09:52, 27 August 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

Multi-Column is a Collation Provider option for Data Type extractors. Multi-Column combines multiple columns on a page into a single column for extraction.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

Sometimes you might run into documents with text that is divided up into columns:

Grooper cannot determine on its own whether a page is divided into columns or is just continuous text. We need to tell Grooper to expect multiple columns using the Multi-Column Collation Provider.

How To

We are going to go over the basics of setting up the Multi-Column Collation Provider. There are many options located under the Collation property after selecting Multi-Column that you can adjust to improve your results beyond what we will discuss here.

Set Up the Provider

The page in our Batch has two columns. The text in the first column continues in the second.
Create a Data Type.

Set the Local Extractor for the Data Type. In this example we are setting it to a Pattern Match.

In our example we have configured our Pattern Match with the regex pattern [^\r\n\t\f]+ to collect all lines of text on the page.
- You need to turn on Tab Marking for this pattern to work.
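
To illustrate why Tab Marking matters for this pattern, here is a minimal sketch in Python. Grooper is configured through its UI and does not use Python's `re` module, so treat this only as an approximation of the pattern's behavior. The sample text, and the assumption that Tab Marking places a tab character in the gap between the two columns, are hypothetical.

```python
import re

# The pattern from the article: one or more characters that are not a
# carriage return, line feed, tab, or form feed.
LINE_SEGMENT = re.compile(r"[^\r\n\t\f]+")

# Hypothetical page text with a tab marking the gap between columns
# (assumed to resemble what Tab Marking produces).
sample_with_tabs = "Left line 1\tRight line 1\r\nLeft line 2\tRight line 2"

# The same hypothetical text with plain spaces in the gap instead.
sample_without_tabs = "Left line 1   Right line 1\r\nLeft line 2   Right line 2"

segments_with_tabs = LINE_SEGMENT.findall(sample_with_tabs)
segments_without_tabs = LINE_SEGMENT.findall(sample_without_tabs)

print(segments_with_tabs)     # four matches, one per column segment
print(segments_without_tabs)  # two matches, each spanning both columns
```

With tabs in the text, each match stops at the column gap, so every line yields one result per column. Without them, a single match runs straight across both columns.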

Turning on Tab Marking

Click on the "Properties" tab.
Open up the Preprocessing options.
Click the check box to the right of Tab Marking to enable the property.

Setting the Provider

Set the Collation property to Multi-Column.
It may look like the whole page is being extracted straight across, but Grooper is now collecting the individual columns.
Click the Inspection icon located to the bottom right of the Document Viewer.

Now you can see, in the "Text Value" tab below the Document Viewer on the Inspection page, that the text in the first column is collected first before Grooper collects the second column.
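
The column-first reading order described above can be sketched in Python. This is not Grooper's actual algorithm. The `Segment` type, its `left` and `top` coordinates, and the rule of splitting the page at its horizontal midpoint are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """A hypothetical extractor match with its position on the page."""
    text: str
    left: float  # assumed horizontal position of the match
    top: float   # assumed vertical position of the match


def collate_two_columns(segments: List[Segment], page_width: float) -> str:
    """Order matches column by column instead of straight across the page.

    Assumes exactly two columns split at the page midpoint.
    """
    midpoint = page_width / 2

    def reading_order(segment: Segment):
        column = 0 if segment.left < midpoint else 1
        # Sort by column first, then top to bottom within the column.
        return (column, segment.top)

    ordered = sorted(segments, key=reading_order)
    return "\n".join(segment.text for segment in ordered)


# Matches as they would be found reading straight across the page.
matches = [
    Segment("Left 1", left=0.5, top=1.0),
    Segment("Right 1", left=4.5, top=1.0),
    Segment("Left 2", left=0.5, top=1.5),
    Segment("Right 2", left=4.5, top=1.5),
]

collated = collate_two_columns(matches, page_width=8.5)
print(collated)
```

Sorting on the pair (column, top) is what moves every first-column line ahead of the second column, which mirrors the order shown in the "Text Value" tab.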

Glossary

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift, and manipulate results further.

Document Viewer: The Grooper Document Viewer is the portal to your documents. It is the UI that allows you to see a Batch Folder's (or a Batch Page's) image, text content, and more.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Multi-Column: Multi-Column is a Collation Provider option for Data Type extractors. Multi-Column combines multiple columns on a page into a single column for extraction.

Pattern Match: Pattern Match is a Value Extractor that extracts values from a document that match a specified regular expression, providing data collection following a known format or pattern.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects, such as Content Models, Batch Processes, and profile objects, are stored. This makes resources easier to manage and save, and it simplifies how node references are made in a Grooper Repository.

Tab Marking: Tab Marking allows you to insert tab characters into a document's text data.

@@ Line 14: / Line 14: @@
 * [[Media:2023.1 Wiki Multi-Column-(Collation-Provider) Project.zip]]
 |}
-== Glossary ==
-<u><big>'''Batch'''</big></u>: {{#lst:Glossary|Batch}}
-<u><big>'''Collation Provider'''</big></u>: {{#lst:Glossary|Collation Provider}}
-<u><big>'''Data Type'''</big></u>: {{#lst:Glossary|Data Type}}
-<u><big>'''Document Viewer'''</big></u>: {{#lst:Glossary|Document Viewer}}
-<u><big>'''Extract'''</big></u>: {{#lst:Glossary|Extract}}
-<u><big>'''Multi-Column'''</big></u>: {{#lst:Glossary|Multi-Column}}
-<u><big>'''Pattern Match'''</big></u>: {{#lst:Glossary|Pattern Match}}
-<u><big>'''Project'''</big></u>: {{#lst:Glossary|Project}}
-<u><big>'''Tab Marking'''</big></u>: {{#lst:Glossary|Tab Marking}}
 == About ==
@@ Line 86: / Line 67: @@
 [[File:2023.1 Multi-Column-(Collation-Provider) 02 01 Setting-the-Collation 06.png]]
+== Glossary ==
+<u><big>'''Batch'''</big></u>: {{#lst:Glossary|Batch}}
+<u><big>'''Collation Provider'''</big></u>: {{#lst:Glossary|Collation Provider}}
+<u><big>'''Data Type'''</big></u>: {{#lst:Glossary|Data Type}}
+<u><big>'''Document Viewer'''</big></u>: {{#lst:Glossary|Document Viewer}}
+<u><big>'''Extract'''</big></u>: {{#lst:Glossary|Extract}}
+<u><big>'''Multi-Column'''</big></u>: {{#lst:Glossary|Multi-Column}}
+<u><big>'''Pattern Match'''</big></u>: {{#lst:Glossary|Pattern Match}}
+<u><big>'''Project'''</big></u>: {{#lst:Glossary|Project}}
+<u><big>'''Tab Marking'''</big></u>: {{#lst:Glossary|Tab Marking}}