2023.1:Split (Collation Provider)

From Grooper Wiki

Revision as of 09:31, 24 April 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Split is one of many Collation Providers you can use in Grooper to combine or organize extracted data based on the data's layout relationship. It is used to divide up a page into smaller sections, allowing you to extract from those sections rather than the whole page.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

The Split Collation Provider is a tool used to divide up a document into smaller sections. This allows you to extract text from a smaller section rather than the whole page.

The Provider splits the page at the matches returned by the Data Type's extractor, according to the configured Split Position property. There are four positions to consider:

  • Begin - Grooper will start at a header label or pattern match you specify and collect all text that comes after it as a single result (including the label or pattern) until it comes across the next instance of the label/pattern or reaches the end of the document.
  • End - Grooper will start at a header label or pattern match you specify and collect all text that comes before it as a single result (including the label or pattern) until it finds another match for the header/pattern or reaches the beginning of the document.

  • Between - Grooper will collect all text that comes between two label or pattern matches that you specify. In this case, the label/pattern is not included in the result (here "BETWEEN LABEL OR PATTERN" is being used as the pattern match).
  • Around - Grooper will collect all text that occurs "around" the label or pattern match you specify, but will exclude the label/pattern from the result.

How To

Begin

  1. Start with a Data Type with a child Data Type or Value Reader.
  2. Configure the child object with an extractor to collect the text where you want your extraction to begin.
  3. In our example below, the extractor is returning three results on the page.


  1. Navigate to the parent Data Type.
  2. By default the Collation property is set to Individual.
  3. Only the headers in our example are currently being returned.


  1. Change the Collation property to Split.
  2. The Split Position property defaults to Begin.
  3. Now all of the text beginning at each header is returned as a single result, ending at the next extracted header label (or at the end of the document). In our example, we have three headers, so we get three results.
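As a rough mental model of the difference between Individual and Split with Begin, the behavior described above can be sketched in Python. This is not Grooper's implementation: the function name `split_begin`, the sample page text, and the `SECTION \w` pattern are all hypothetical, and the whitespace trimming is an assumption made for readability.

```python
import re

def split_begin(text: str, pattern: str) -> list[str]:
    """Hypothetical model of Split Position = Begin.

    Each result runs from the start of one match up to the start of the
    next match, or to the end of the text for the last match.
    """
    starts = [m.start() for m in re.finditer(pattern, text)]
    # Pair every match start with the following one; the last section
    # ends at the end of the text.
    ends = starts[1:] + [len(text)]
    # Trimming whitespace is an assumption, not documented Grooper behavior.
    return [text[s:e].strip() for s, e in zip(starts, ends)]

# Hypothetical page with three header labels.
page = "cover note\nSECTION A\nalpha\nSECTION B\nbeta\nSECTION C\ngamma"

# Individual collation: only the header matches themselves.
individual = re.findall(r"SECTION \w", page)
# Split collation with Begin: each header plus the text that follows it.
sections = split_begin(page, r"SECTION \w")

print(individual)  # ['SECTION A', 'SECTION B', 'SECTION C']
print(sections)    # ['SECTION A\nalpha', 'SECTION B\nbeta', 'SECTION C\ngamma']
```

Note that in this sketch the text before the first header ("cover note") does not appear in any result, which matches the description of Begin starting at a header match.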


End

  1. Start with a Data Type with a child Data Type or Value Reader.
  2. Configure the child object with an extractor to collect the text where you want your extraction to end.
  3. In our example below, the extractor is returning three results on the page.


  1. Navigate to the parent Data Type.
  2. Change the Collation property to Split.


  1. Change the Split Position property to End.
  2. Now Grooper will look at each returned value and collect everything before it (including the value itself) until it runs into another extracted header or reaches the beginning of the document.
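The End position can be pictured the same way, as a mirror image of Begin. The Python sketch below is only an illustration under assumptions: `split_end`, the sample text, and the `TOTAL \w` pattern are hypothetical names, and it is not how Grooper is implemented internally.

```python
import re

def split_end(text: str, pattern: str) -> list[str]:
    """Hypothetical model of Split Position = End.

    Each result runs from the end of the previous match (or the beginning
    of the text) through the end of the current match.
    """
    ends = [m.end() for m in re.finditer(pattern, text)]
    # The first section starts at the beginning of the text; every later
    # section starts where the previous match ended.
    starts = [0] + ends[:-1]
    # Trimming whitespace is an assumption made for readability.
    return [text[s:e].strip() for s, e in zip(starts, ends)]

# Hypothetical page where each block closes with a label.
page = "line 1\nline 2\nTOTAL A\nline 3\nTOTAL B\ntrailing note"

results = split_end(page, r"TOTAL \w")
print(results)  # ['line 1\nline 2\nTOTAL A', 'line 3\nTOTAL B']
```

In this model the text after the last match ("trailing note") belongs to no result, since End only collects what comes before a match.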


Between

  1. Start with a Data Type with a child Data Type or Value Reader.
  2. Configure the child object with an extractor to collect the text where you want your extraction to begin and end.
  3. In our example below, the extractor is returning two results on the page.


  1. Navigate to the parent Data Type.
  2. Change the Collation property to Split.


  1. Change the Split Position property to Between.
  2. Now Grooper will look at the returned values and collect everything between the two headers.
  3. You can click on the inspection icon in the bottom right corner of the Document Viewer.


  1. When the Inspector window pops up, we can see the whole of what is being extracted on the "Text Value" tab located under the Document Viewer.
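For Between, a minimal Python sketch of the idea follows. It assumes that each result is the text lying between two consecutive matches, with the matches themselves excluded; `split_between` and the sample text are hypothetical and do not reflect Grooper's internal code.

```python
import re

def split_between(text: str, pattern: str) -> list[str]:
    """Hypothetical model of Split Position = Between.

    Each result is the text between two consecutive matches. The matched
    label or pattern is not part of the result.
    """
    matches = list(re.finditer(pattern, text))
    # Pair each match with the one after it and keep only the gap.
    return [
        text[a.end():b.start()].strip()  # trimming is an assumption
        for a, b in zip(matches, matches[1:])
    ]

# Hypothetical page with the label appearing twice.
page = (
    "intro\n"
    "BETWEEN LABEL OR PATTERN\n"
    "body line 1\nbody line 2\n"
    "BETWEEN LABEL OR PATTERN\n"
    "outro"
)

results = split_between(page, r"BETWEEN LABEL OR PATTERN")
print(results)  # ['body line 1\nbody line 2']
```

With two matches this sketch yields a single result, and with fewer than two matches it yields none, because there is no pair of matches to collect between.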


Around

  1. Start with a Data Type with a child Data Type or Value Reader.
  2. Configure the child object with an extractor to collect the text you want to extract around.
  3. The extractor is only returning one result from the page.


  1. Navigate to the parent Data Type.
  2. Change the Collation property to Split.


  1. Change the Split Position property to Around.
  2. Now all text located around the label will be returned, but the label itself will not.
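Around can be sketched in Python as cutting the text at each match and discarding the match. Several details here are assumptions rather than documented Grooper behavior: that the text before and after the label comes back as separate results, that empty pieces are dropped, and that whitespace is trimmed. The name `split_around` and the sample text are hypothetical.

```python
import re

def split_around(text: str, pattern: str) -> list[str]:
    """Hypothetical model of Split Position = Around.

    The text is cut at every match and the matches are discarded,
    leaving the text that surrounds them.
    """
    pieces = []
    cursor = 0
    for m in re.finditer(pattern, text):
        pieces.append(text[cursor:m.start()])  # text before this match
        cursor = m.end()                       # skip over the match itself
    pieces.append(text[cursor:])               # text after the last match
    # Dropping empty pieces and trimming whitespace are assumptions.
    return [p.strip() for p in pieces if p.strip()]

# Hypothetical page with a single label in the middle.
page = "text above\nAROUND LABEL\ntext below"

results = split_around(page, r"AROUND LABEL")
print(results)  # ['text above', 'text below']
```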