2023.1:Split (Collation Provider): Difference between revisions

Revision as of 08:05, 24 April 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Split is one of many Collation Providers you can use in Grooper to combine or organize extracted data based on the data's layout relationship. It is used to divide up a page into smaller sections, allowing you to extract from those sections rather than the whole page.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

The Split Collation Provider is a tool used to divide up a document into smaller sections. This allows you to extract text from a smaller section rather than the whole page.

The Provider splits the page based on what the Data Type is extracting and the configured Split Position property. There are four different positions to consider:

Begin

End

Between

Around

How To

Begin

Start with a Data Type with a child Data Type or Value Reader.
Configure the child object with an extractor to collect the text where you want your extraction to begin.
In our example below, the extractor is returning three results on the page.

Navigate to the parent Data Type.
By default the Collation property is set to Individual.
Only the headers in our example are currently being returned.

Change the Collation property to Split.
The Split Position property defaults to Begin.
Now each result contains all of the text from an extracted header up to the next extracted header (or the end of the document). Our example has three headers, so we get three results.
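The Begin behavior described above can be sketched outside of Grooper in a few lines of Python. This is only an illustration of the idea, not Grooper's implementation: the function name `split_begin`, the sample text, and the use of a regular expression to stand in for the child extractor are all assumptions made for this sketch.

```python
import re


def split_begin(text: str, pattern: str) -> list[str]:
    """Return one segment per match, running from the start of that match
    to the start of the next match (or the end of the text)."""
    # Hypothetical stand-in for the child extractor: a plain regex.
    starts = [m.start() for m in re.finditer(pattern, text)]
    # Each segment ends where the next one begins; the last ends at the text's end.
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


page = "Intro\nHEADER A\nalpha\nHEADER B\nbeta\nHEADER C\ngamma"
sections = split_begin(page, r"HEADER \w")
```

With three header matches, the sketch yields three segments, and any text before the first header is not part of any segment.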

End

Start with a Data Type with a child Data Type or Value Reader.
Configure the child object with an extractor to collect the text where you want your extraction to end.
In our example below, the extractor is returning three results on the page.

Navigate to the parent Data Type.
Change the Collation property to Split.

Change the Split Position property to End.
Now, for each returned value, Grooper collects everything before it, back to the previous extracted header or the beginning of the document.
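As a rough mental model, the End position mirrors Begin: each match closes a segment instead of opening one. The Python sketch below illustrates this under stated assumptions. The name `split_end`, the sample text, and the regex standing in for the child extractor are hypothetical, and including the matched text at the end of each segment is an assumption rather than something this article confirms.

```python
import re


def split_end(text: str, pattern: str) -> list[str]:
    """Return one segment per match, running from the end of the previous
    match (or the start of the text) through the end of this match."""
    # Hypothetical stand-in for the child extractor: a plain regex.
    ends = [m.end() for m in re.finditer(pattern, text)]
    # Assumption: the matched text itself is kept at the end of its segment.
    starts = [0] + ends[:-1]
    return [text[s:e] for s, e in zip(starts, ends)]


page = "alpha\nTOTAL 1\nbeta\nTOTAL 2\ntrailing"
sections = split_end(page, r"TOTAL \d")
```

Text after the last match ("trailing" in the sample) does not belong to any segment in this sketch.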

Between

Start with a Data Type with a child Data Type or Value Reader.
Configure the child object with an extractor to collect the text where you want your extraction to begin and end.
In our example below, the extractor is returning two results on the page.

Navigate to the parent Data Type.
Change the Collation property to Split.

Change the Split Position property to Between.
Now Grooper will look at the returned values and collect everything between the two headers.
You can click on the inspection icon in the bottom right corner of the Document Viewer.

When the Inspector window pops up, we can see the whole of what is being extracted on the "Text Value" tab located under the Document Viewer.
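The Between position can be pictured as keeping only the text that sits between consecutive matches. The following Python sketch shows that idea. It is not Grooper's code: `split_between`, the sample text, and the regex used in place of the child extractor are hypothetical, and excluding the two matched labels from the result is an assumption of this sketch.

```python
import re


def split_between(text: str, pattern: str) -> list[str]:
    """Return the text lying between each pair of consecutive matches."""
    # Hypothetical stand-in for the child extractor: a plain regex.
    matches = list(re.finditer(pattern, text))
    # Assumption: the matched labels themselves are left out of the result.
    return [text[a.end():b.start()] for a, b in zip(matches, matches[1:])]


page = "preamble\nSTART\nbody line 1\nbody line 2\nSTOP\nfooter"
sections = split_between(page, r"START|STOP")
```

With two matches, as in the example above, the sketch produces a single segment containing everything between them.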

Around

Start with a Data Type with a child Data Type or Value Reader.
Configure the child object with an extractor to collect the text you want to extract around.
The extractor is only returning one result from the page.

Navigate to the parent Data Type.
Change the Collation property to Split.

Change the Split Position property to Around.
Now all text located around the label will be returned, but the label itself will not.
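One way to picture the Around position is as a split that discards the matched label and keeps the text on either side of it. The Python sketch below illustrates this. The name `split_around`, the sample text, and the regex standing in for the child extractor are hypothetical. Returning the surrounding text as separate pieces is also an assumption, since this article does not say how Grooper groups the surrounding text into results.

```python
import re


def split_around(text: str, pattern: str) -> list[str]:
    """Return the pieces of text on either side of each match,
    leaving out the matched text itself."""
    pieces = []
    cursor = 0
    # Hypothetical stand-in for the child extractor: a plain regex.
    for m in re.finditer(pattern, text):
        pieces.append(text[cursor:m.start()])  # text before this match
        cursor = m.end()                       # skip over the label
    pieces.append(text[cursor:])               # text after the last match
    return pieces


page = "before the label LABEL after the label"
pieces = split_around(page, r"LABEL")
```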

@@ Line 10: / Line 10: @@
 |}
-<blockquote>{{#lst:Glossary|Split}}</blockquote>
+<blockquote>
+'''''Split''''' is one of many '''''Collation Providers''''' you can use in Grooper to combine or organize extracted data based on the data's layout relationship. It is used to divide up a page into smaller sections, allowing you to extract from those sections rather than the whole page.
+</blockquote>
 {|class="download-box"
@@ Line 27: / Line 29: @@
 The '''''Provider''''' splits the page based on what the '''Data Type''' is extracting and the configured '''''Split Position''''' property. There are four different positions to consider:
+{| style="padding: 10px"
+|
 * Begin
+[[File:2023.1 Split-(Collation-Provider) 01 Examples 01.png]]
+| style="padding: 50px" |
 * End
+[[File:2023.1 Split-(Collation-Provider) 01 Examples 02.png]]
+|-
+|
 * Between
+[[File:2023.1 Split-(Collation-Provider) 01 Examples 03.png]]
+| style="padding: 50px" |
 * Around
+[[File:2023.1 Split-(Collation-Provider) 01 Examples 04.png]]
+|}
 == How To ==
 === Begin ===
+# Start with a '''Data Type''' with a child '''Data Type''' or '''Value Reader'''.
+# Configure the child object with an extractor to collect the text where you want your extraction to begin.
+# In our example below, the extractor is returning three results on the page.
+[[File:2023.1 Split-(Collation-Provider) 02 01 Begin 01.png]]
+#<li value=4> Navigate to the parent '''Data Type'''.
+# By default the '''''Collation''''' property is set to ''Individual''.
+# Only the headers in our example are currently being returned.
+[[File:2023.1 Split-(Collation-Provider) 02 01 Begin 02.png]]
+#<li value=7> Change the '''''Collation''''' property to ''Split''.
+# The '''''Split Position''''' property defaults to ''Begin''.
+# Now all of the text beginning at the header is returned as a single result until it finds another extracted header label (or reaches the end of the document). In our example, we have three headers and so we get three results.
+[[File:2023.1 Split-(Collation-Provider) 02 01 Begin 03.png]]
 === End ===
+# Start with a '''Data Type''' with a child '''Data Type''' or '''Value Reader'''.
+# Configure the child object with an extractor to collect the text where you want your extraction to end.
+# In our example below, the extractor is returning three results on the page.
+[[File:2023.1 Split-(Collation-Provider) 02 02 End 01.png]]
+#<li value=4> Navigate to the parent '''Data Type'''.
+# Change the '''''Collation''''' property to ''Split''.
+[[File:2023.1 Split-(Collation-Provider) 02 02 End 02.png]]
+#<li value=6> Change the '''''Split Position''''' property to ''End''.
+# Now Grooper will look at the returned values and collect everything before it until it runs into another extracted header or the beginning of the document.
+[[File:2023.1 Split-(Collation-Provider) 02 02 End 03.png]]
 === Between ===
+# Start with a '''Data Type''' with a child '''Data Type''' or '''Value Reader'''.
+# Configure the child object with an extractor to collect the text where you want your extraction to begin and end.
+# In our example below, the extractor is returning two results on the page.
+[[File:2023.1 Split-(Collation-Provider) 02 03 Between 01.png]]
+#<li value=4> Navigate to the parent '''Data Type'''.
+# Change the '''''Collation''''' property to ''Split''.
+[[File:2023.1 Split-(Collation-Provider) 02 03 Between 02.png]]
+#<li value=6> Change the '''''Split Position''''' property to ''Between''.
+# Now Grooper will look at the returned values and collect everything between the two headers.
+# You can click on the inspection icon in the bottom right corner of the Document Viewer.
+[[File:2023.1 Split-(Collation-Provider) 02 03 Between 03.png]]
+#<li value=9> When the Inspector window pops up, we can see the whole of what is being extracted on the "Text Value" tab located under the Document Viewer.
+[[File:2023.1 Split-(Collation-Provider) 02 03 Between 04.png]]
 === Around ===
+# Start with a '''Data Type''' with a child '''Data Type''' or '''Value Reader'''.
+# Configure the child object with an extractor to collect the text you want to extract around.
+# The extractor is only returning one result from the page.
+[[File:2023.1 Split-(Collation-Provider) 02 04 Around 01.png]]
+#<li value=4> Navigate to the parent '''Data Type'''.
+# Change the '''''Collation''''' property to ''Split''.
+[[File:2023.1 Split-(Collation-Provider) 02 04 Around 02.png]]
+#<li value=6> Change the '''''Split Position''''' property to ''Around''.
+# Now all text located around the label will be returned, but the label itself will not.
+[[File:2023.1 Split-(Collation-Provider) 02 04 Around 03.png]]