2023.1:Split (Collation Provider): Difference between revisions

Revision as of 10:22, 27 August 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

Split is a Collation Provider option for Data Type extractors. Split separates a data instance at each match returned by the Data Type. The results are used as anchor points to "split" text into one or more smaller parts.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

The Split Collation Provider is a tool used to divide up a document into smaller sections. This allows you to extract text from a smaller section rather than the whole page.

The Provider splits the page based on what the Data Type is extracting and the configured Split Position property. There are four different positions to consider:

Begin - Grooper will start at a header label or pattern match you specify and collect all text that comes after it as a single result (including the label or pattern) until it comes across the next instance of the label/pattern or reaches the end of the document.
End - Grooper will start at a header label or pattern match you specify and collect all text that comes before it as a single result (including the label or pattern) until it finds another match for the header/pattern or reaches the beginning of the document.
Between - Grooper will collect all text that comes between two label or pattern matches that you specify. In this case, the label/pattern is not included in the result (here "BETWEEN LABEL OR PATTERN" is being used as the pattern match).
Around - Grooper will collect all text that occurs "around" the label or pattern match you specify, but will exclude the label/pattern from the result.
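To make the four positions concrete, here is a minimal Python sketch that imitates them with regular expressions. It is not Grooper code and does not call any Grooper API. The function name `split_text`, the sample text, and the whitespace trimming are assumptions made for illustration, and how Grooper treats text outside the outermost matches is inferred from the descriptions above.

```python
import re

def split_text(text, pattern, position="Begin"):
    """Plain-Python imitation of the four Split Position options (hypothetical helper)."""
    matches = list(re.finditer(pattern, text))
    if not matches:
        return []
    starts = [m.start() for m in matches]
    ends = [m.end() for m in matches]
    if position == "Begin":
        # Each part runs from a match up to the next match (or the end of the text).
        bounds = zip(starts, starts[1:] + [len(text)])
    elif position == "End":
        # Each part runs from the previous match (or the start of the text) through a match.
        bounds = zip([0] + ends[:-1], ends)
    elif position == "Between":
        # Only text enclosed by two consecutive matches; the matches are excluded.
        bounds = zip(ends[:-1], starts[1:])
    elif position == "Around":
        # Text on either side of every match; the matches are excluded.
        bounds = zip([0] + ends, starts + [len(text)])
    else:
        raise ValueError(f"unknown position: {position}")
    # Trimming whitespace and dropping empty parts is an assumption of this sketch.
    parts = [text[a:b].strip() for a, b in bounds]
    return [p for p in parts if p]

sample = "intro HEADER one HEADER two"
for position in ("Begin", "End", "Between", "Around"):
    print(position, split_text(sample, r"HEADER", position))
# Begin ['HEADER one', 'HEADER two']
# End ['intro HEADER', 'one HEADER']
# Between ['one']
# Around ['intro', 'one', 'two']
```

The same two matches produce a different result set for each position, which is the choice the Split Position property controls.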

How To

Begin

Start with a Data Type with a child Data Type or Value Reader.
Configure the child object with an extractor to collect the text where you want your extraction to begin.
In our example below, the extractor is returning three results on the page.

Navigate to the parent Data Type.
By default the Collation property is set to Individual.
Only the headers in our example are currently being returned.

Change the Collation property to Split.
The Split Position property defaults to Begin.
Now all of the text beginning at each header is returned as a single result, running until Grooper finds the next extracted header label (or reaches the end of the document). In our example, we have three headers, so we get three results.
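The Begin behavior can be sketched outside Grooper in a few lines of Python. This is only an analogy: the document text, the `SECTION [A-Z]` header pattern, and the `split_begin` helper are hypothetical stand-ins for the child extractor and the parent Data Type, and the sketch assumes text before the first header is not returned.

```python
import re

# Hypothetical page text with three headers, mirroring the three results in the example.
document = (
    "Cover page text\n"
    "SECTION A\nfirst body\n"
    "SECTION B\nsecond body\n"
    "SECTION C\nthird body\n"
)
header = re.compile(r"SECTION [A-Z]")  # stands in for the child extractor

def split_begin(text, pattern):
    # Each result starts at a header and stops just before the next header,
    # or at the end of the text for the last header.
    starts = [m.start() for m in pattern.finditer(text)]
    stops = starts[1:] + [len(text)]
    return [text[a:b].strip() for a, b in zip(starts, stops)]

begin_results = split_begin(document, header)
for result in begin_results:
    print(repr(result))
```

Three header matches yield three results, each beginning with its own header.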

End

Start with a Data Type with a child Data Type or Value Reader.
Configure the child object with an extractor to collect the text where you want your extraction to end.
In our example below, the extractor is returning three results on the page.

Navigate to the parent Data Type.
Change the Collation property to Split.

Change the Split Position property to End.
Now Grooper will look at the returned values and, for each one, collect everything before it until it runs into another extracted header or the beginning of the document.
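As a rough illustration of the End position, the Python sketch below mirrors the idea with a regular expression. Nothing here is Grooper's own implementation: the sample text, the `END OF SECTION [A-Z]` pattern, and the `split_end` helper are hypothetical, and the sketch assumes text after the last match is not returned.

```python
import re

# Hypothetical page text where each section closes with a label.
document = (
    "first body\nEND OF SECTION A\n"
    "second body\nEND OF SECTION B\n"
    "third body\nEND OF SECTION C\n"
    "trailing notes\n"
)
closing_label = re.compile(r"END OF SECTION [A-Z]")  # stands in for the child extractor

def split_end(text, pattern):
    # Each result stops at the end of a match and starts just after the
    # previous match, or at the beginning of the text for the first match.
    stops = [m.end() for m in pattern.finditer(text)]
    starts = [0] + stops[:-1]
    return [text[a:b].strip() for a, b in zip(starts, stops)]

end_results = split_end(document, closing_label)
for result in end_results:
    print(repr(result))
```

Each result ends with its label, the mirror image of the Begin position.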

Between

Start with a Data Type with a child Data Type or Value Reader.
Configure the child object with an extractor to collect the text where you want your extraction to begin and end.
In our example below, the extractor is returning two results on the page.

Navigate to the parent Data Type.
Change the Collation property to Split.

Change the Split Position property to Between.
Now Grooper will look at the returned values and collect everything between the two headers.
You can click on the inspection icon in the bottom right corner of the Document Viewer.

When the Inspector window pops up, we can see the whole of what is being extracted on the "Text Value" tab located under the Document Viewer.
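For comparison, the Between position can be approximated in plain Python as follows. This is a sketch under stated assumptions, not Grooper code: the sample text, the `(?:BEGIN|END) TERMS` pattern, and the `split_between` helper are invented for illustration.

```python
import re

# Hypothetical page text with two markers, matching the two results in the example.
document = (
    "preamble\n"
    "BEGIN TERMS\n"
    "clause one\nclause two\n"
    "END TERMS\n"
    "signature block\n"
)
marker = re.compile(r"(?:BEGIN|END) TERMS")  # stands in for the child extractor

def split_between(text, pattern):
    # Keep only the text enclosed by each pair of consecutive matches;
    # the matches themselves are left out.
    matches = list(pattern.finditer(text))
    return [text[a.end():b.start()].strip() for a, b in zip(matches, matches[1:])]

between_results = split_between(document, marker)
print(between_results)  # ['clause one\nclause two']
```

Two matches produce a single result, and neither marker appears in it.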

Around

Start with a Data Type with a child Data Type or Value Reader.
Configure the child object with an extractor to collect the text you want to extract around.
The extractor is only returning one result from the page.

Navigate to the parent Data Type.
Change the Collation property to Split.

Change the Split Position property to Around.
Now all text located around the label will be returned, but the label itself will not.
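The Around position behaves much like splitting a string on a delimiter, which Python's `re` module can illustrate. The sample text and the `TOTAL DUE` label below are hypothetical, and returning the text on each side of the label as a separate result is an assumption of this sketch rather than confirmed Grooper behavior.

```python
import re

# Hypothetical page text with a single label, as in the example above.
document = "text above the label\nTOTAL DUE\ntext below the label\n"
label = re.compile(r"TOTAL DUE")  # stands in for the child extractor

# Pattern.split removes every match and returns the text on either side of it.
around_results = [part.strip() for part in label.split(document) if part.strip()]
print(around_results)  # ['text above the label', 'text below the label']
```

The label is gone from the output, and everything surrounding it is kept.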

Glossary

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift, and manipulate results further.

Document Viewer: The Grooper Document Viewer is the portal to your documents. It is the UI that allows you to see a Batch Folder's (or a Batch Page's) image, text content, and more.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects, such as Content Models, Batch Processes, and profile objects, are stored. This makes resources easier to manage and save, and simplifies how node references are made in a Grooper Repository.

Split: Split is a Collation Provider option for Data Type extractors. Split separates a data instance at each match returned by the Data Type. The results are used as anchor points to "split" text into one or more smaller parts.

Value Reader: Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions.)

Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.

@@ Line 13: / Line 13: @@
 * [[Media:2023.1 Wiki Split-(Collation-Provider) Project.zip]]
 |}
-== Glossary ==
-<u><big>'''Batch'''</big></u>: {{#lst:Glossary|Batch}}
-<u><big>'''Collation Provider'''</big></u>: {{#lst:Glossary|Collation Provider}}
-<u><big>'''Data Type'''</big></u>: {{#lst:Glossary|Data Type}}
-<u><big>'''Document Viewer'''</big></u>: {{#lst:Glossary|Document Viewer}}
-<u><big>'''Project'''</big></u>: {{#lst:Glossary|Project}}
-<u><big>'''Split'''</big></u>: {{#lst:Glossary|Split}}
-<u><big>'''Value Reader'''</big></u>: {{#lst:Glossary|Value Reader}}
 == About ==
@@ Line 149: / Line 134: @@
 [[File:2023.1 Split-(Collation-Provider) 02 04 Around 03.png]]
+== Glossary ==
+<u><big>'''Batch'''</big></u>: {{#lst:Glossary|Batch}}
+<u><big>'''Collation Provider'''</big></u>: {{#lst:Glossary|Collation Provider}}
+<u><big>'''Data Type'''</big></u>: {{#lst:Glossary|Data Type}}
+<u><big>'''Document Viewer'''</big></u>: {{#lst:Glossary|Document Viewer}}
+<u><big>'''Project'''</big></u>: {{#lst:Glossary|Project}}
+<u><big>'''Split'''</big></u>: {{#lst:Glossary|Split}}
+<u><big>'''Value Reader'''</big></u>: {{#lst:Glossary|Value Reader}}