2023.1:Split (Collation Provider)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

Split is a Collation Provider option for Data Type extractors. Split divides a data instance at each match returned by the Data Type. The matches act as anchor points that "split" the text into one or more smaller parts.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with the resources used in the examples throughout this article.

About

The Split Collation Provider divides a document into smaller sections. This allows you to extract text from one section at a time rather than from the whole page.

The Provider splits the page based on the matches the Data Type returns and on the configured Split Position property. There are four positions to choose from:

  • Begin - Each result begins at a header label or pattern match you specify. Grooper collects the match and all text that comes after it as a single result, stopping at the next instance of the label/pattern or at the end of the document.
  • End - Each result ends at a header label or pattern match you specify. Grooper collects the match and all text that comes before it as a single result, stopping at the previous instance of the label/pattern or at the beginning of the document.

  • Between - Grooper collects all text that falls between two label or pattern matches you specify. The label/pattern itself is not included in the result (in this article's example, "BETWEEN LABEL OR PATTERN" is used as the pattern match).
  • Around - Grooper collects the text that occurs "around" (before and after) the label or pattern match you specify, and excludes the label/pattern itself from the result.
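
The four positions can be compared side by side with a small sketch. This is only an illustration of the behavior described above on a plain string, not Grooper's implementation: the helper name split_text, the sample text, and the pattern are hypothetical, and details such as whitespace trimming or the handling of empty results are assumptions.

```python
import re

def split_text(text, pattern, position):
    """Emulate the four Split Position options on a plain string.

    Hypothetical helper for illustration only; this is not Grooper's API.
    """
    matches = list(re.finditer(pattern, text))
    if not matches:
        return []
    starts = [m.start() for m in matches]
    ends = [m.end() for m in matches]
    if position == "Begin":
        # Each result starts at a match and runs to the next match (or end of text).
        bounds = zip(starts, starts[1:] + [len(text)])
    elif position == "End":
        # Each result ends with a match and starts after the previous match
        # (or at the beginning of the text).
        bounds = zip([0] + ends[:-1], ends)
    elif position == "Between":
        # Only text lying between two consecutive matches; matches are excluded.
        bounds = zip(ends[:-1], starts[1:])
    elif position == "Around":
        # Text before, between, and after the matches; matches are excluded.
        bounds = zip([0] + ends, starts + [len(text)])
    else:
        raise ValueError(f"unknown split position: {position}")
    return [text[a:b] for a, b in bounds]

sample = "intro #A one #B two #C three"
for position in ("Begin", "End", "Between", "Around"):
    print(position, split_text(sample, r"#[A-Z]", position))
# Begin   ['#A one ', '#B two ', '#C three']
# End     ['intro #A', ' one #B', ' two #C']
# Between [' one ', ' two ']
# Around  ['intro ', ' one ', ' two ', ' three']
```

Note that in this sketch Begin drops the text before the first match and End drops the text after the last match, which follows from the definitions above.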

How To

Begin

  1. Start with a Data Type with a child Data Type or Value Reader.
  2. Configure the child object with an extractor to collect the text where you want your extraction to begin.
  3. In our example below, the extractor is returning three results on the page.


  1. Navigate to the parent Data Type.
  2. By default the Collation property is set to Individual.
  3. Only the headers in our example are currently being returned.


  1. Change the Collation property to Split.
  2. The Split Position property defaults to Begin.
  3. Now all of the text beginning at each header is returned as a single result that runs until the next extracted header label (or the end of the document). In our example, we have three headers and so we get three results.
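
The change from Individual to Split with a Begin position can be mimicked outside Grooper. The sketch below is only an approximation under stated assumptions: split_begin, the sample page, and the header pattern are hypothetical stand-ins for the child extractor and document in the example.

```python
import re

def split_begin(text, header_pattern):
    # Hypothetical stand-in for Collation = Split, Split Position = Begin.
    starts = [m.start() for m in re.finditer(header_pattern, text)]
    stops = starts[1:] + [len(text)]  # next header, or end of the document
    return [text[a:b] for a, b in zip(starts, stops)]

page = "Cover note\nHEADER 1\nalpha\nHEADER 2\nbeta\nHEADER 3\ngamma\n"

# Roughly what Individual collation returns: just the three header matches.
headers = re.findall(r"HEADER \d", page)

# With Begin: three results, each starting at a header.
sections = split_begin(page, r"HEADER \d")
for section in sections:
    print(repr(section))
# 'HEADER 1\nalpha\n'
# 'HEADER 2\nbeta\n'
# 'HEADER 3\ngamma\n'
```

The "Cover note" line appears before the first header, so it does not belong to any result in this sketch.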


End

  1. Start with a Data Type with a child Data Type or Value Reader.
  2. Configure the child object with an extractor to collect the text where you want your extraction to end.
  3. In our example below, the extractor is returning three results on the page.


  1. Navigate to the parent Data Type.
  2. Change the Collation property to Split.


  1. Change the Split Position property to End.
  2. Now Grooper will look at the returned values and, for each one, collect everything before it until it runs into another extracted header or the beginning of the document.
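
The End position mirrors Begin: each result finishes with a match instead of starting with one. A minimal sketch, assuming a hypothetical helper split_end and made-up sample text (not Grooper's API or the article's sample document):

```python
import re

def split_end(text, label_pattern):
    # Hypothetical stand-in for Collation = Split, Split Position = End.
    ends = [m.end() for m in re.finditer(label_pattern, text)]
    begins = [0] + ends[:-1]  # beginning of the document, or end of the previous match
    return [text[a:b] for a, b in zip(begins, ends)]

page = "alpha\nEND 1\nbeta\nEND 2\ngamma\nEND 3\ntrailing text\n"
results = split_end(page, r"END \d")
for result in results:
    print(repr(result))
# 'alpha\nEND 1'
# '\nbeta\nEND 2'
# '\ngamma\nEND 3'
```

Text after the last match ("trailing text") is not followed by a match, so it is left out of every result in this sketch.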


Between

  1. Start with a Data Type with a child Data Type or Value Reader.
  2. Configure the child object with an extractor to collect the text where you want your extraction to begin and end.
  3. In our example below, the extractor is returning two results on the page.


  1. Navigate to the parent Data Type.
  2. Change the Collation property to Split.


  1. Change the Split Position property to Between.
  2. Now Grooper will look at the returned values and collect everything between the two headers.
  3. To review the full result, click the inspection icon in the bottom right corner of the Document Viewer.


  1. When the Inspector window opens, the full extracted text is shown on the "Text Value" tab located under the Document Viewer.
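
With two matches on the page, Between yields a single result: the text enclosed by them. The sketch below illustrates that outcome on a plain string; split_between and the sample page are hypothetical, and the label text is borrowed from this article's example.

```python
import re

def split_between(text, label_pattern):
    # Hypothetical stand-in for Collation = Split, Split Position = Between.
    matches = list(re.finditer(label_pattern, text))
    # Pair each match with the next one and keep only the text in between.
    return [text[a.end():b.start()] for a, b in zip(matches, matches[1:])]

label = "BETWEEN LABEL OR PATTERN"
page = f"preamble\n{label}\nkeep this\nand this\n{label}\nfooter\n"

results = split_between(page, re.escape(label))
print(results)
# ['\nkeep this\nand this\n']
```

Neither label appears in the result, and the text outside the two labels ("preamble" and "footer") is dropped.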


Around

  1. Start with a Data Type with a child Data Type or Value Reader.
  2. Configure the child object with an extractor to collect the text you want to extract around.
  3. The extractor is only returning one result from the page.


  1. Navigate to the parent Data Type.
  2. Change the Collation property to Split.


  1. Change the Split Position property to Around.
  2. Now all text located around the label will be returned, but the label itself will not.
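
On a plain string, the closest everyday analogue to Around is a regular-expression split. The sketch below assumes that the text on each side of the label comes back as its own result; that detail, the helper name split_around, and the sample text are assumptions rather than documented Grooper behavior.

```python
import re

def split_around(text, label_pattern):
    # Hypothetical stand-in for Collation = Split, Split Position = Around.
    # re.split drops the matched label and returns the surrounding pieces.
    # The pattern should not contain capture groups, because re.split
    # would then include the captured text in its output.
    return re.split(label_pattern, text)

page = "text above\nLABEL\ntext below\n"
results = split_around(page, r"LABEL")
print(results)
# ['text above\n', '\ntext below\n']
```

One match on the page produces two pieces here, one from each side of the label.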