Pattern-Based Separation (Separation Provider)

From Grooper Wiki

This article was migrated from an older version and has not been updated for the current version of Grooper.

This tag will be removed upon article review and update.

This article is about the current version of Grooper.

Note that some content may still need to be updated.


Pattern-Based Separation is a Separation Provider that creates a new document folder every time a value returned by a defined pattern is encountered on a page.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

The Pattern-Based Separation Provider separates documents based on whether or not a defined pattern returns a value from a page in your Batch.

A Data Extractor is used to find a value on a page. When the extractor returns a result on a page, that page is placed in a new folder, creating a new document. If the extractor does not return a result on the following page, that page is added behind the previous page in the same folder. When the extractor next returns a result on a subsequent page (even if it is the same result as on the previous page), that page is placed in a new folder, creating another new document.
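The folder-building behavior described above can be sketched in a few lines of Python. This is only an illustration of the logic, not Grooper code: the `separate` function, the `has_match` callback, and the sample page text are all hypothetical. The article does not say what happens to pages that come before the first result, so this sketch simply sets them aside in an `unassigned` list.

```python
from typing import Callable, List, Tuple


def separate(
    pages: List[str], has_match: Callable[[str], bool]
) -> Tuple[List[List[int]], List[int]]:
    """Group page indexes into documents; every extractor hit starts a new one."""
    documents: List[List[int]] = []
    unassigned: List[int] = []  # assumption: pages seen before the first hit
    for index, text in enumerate(pages):
        if has_match(text):
            documents.append([index])      # hit: start a new folder
        elif documents:
            documents[-1].append(index)    # no hit: append to the current folder
        else:
            unassigned.append(index)       # no folder exists yet
    return documents, unassigned


# Hypothetical page text; the "extractor" here is a simple substring check.
pages = ["Invoice\nline items", "totals", "Invoice\nmore items", "Invoice", "appendix"]
docs, loose = separate(pages, lambda text: "Invoice" in text)
print(docs)   # [[0, 1], [2], [3, 4]]
print(loose)  # []
```

Note that pages 2 and 3 each start their own folder even though the extractor returns the same value on both.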



How To

Pattern-Based Separation is achieved through setting up an extractor that will return results ONLY on the pages where you wish separation to occur. One of the simplest ways to do this is to use a document title for the extractor, but any result on the page will work. In the simple example below, we will walk you through how to set up the Pattern-Based Separation Provider using a List Match.

Simple Example

  1. Add a Separate Batch Process Step to your Batch Process.
  2. Set the Provider property to Pattern-Based Separation.


  1. When the "Provider" window pops up, click the hamburger icon next to the Value Extractor property to access the drop-down menu and select a value extractor.
  2. For this tutorial we are going to use a List Match, but you can use any value extractor you wish.


  1. Once your extractor is selected, click the ellipsis button to the right of the property.


  1. When the "Value Extractor" window pops up, configure your extractor. Here we have entered in the titles located on the first page of each document.
  2. Click "OK" in the top right of the window when you are finished configuring your extractor.


  1. Click "OK" on the "Provider" window to save your changes.


  1. Click over to the "Activity Tester" tab to test separation.
  2. Select the Batch Folder in the Batch Viewer.
  3. Click the play button in the top right corner of the Batch Viewer to test.


  1. After the test runs, you can see that Grooper has created folders and separated the documents appropriately.

Practical Example

In this next example, we are going to walk through the steps of setting up the Pattern-Based Separation Provider again, but using a more practical real-world example. We will also find that when we test separation, we run into a slight issue.

  1. For this tutorial we have started by following the same steps as in the previous simple example. For our List Match, we have entered titles that can be found on the first page of each document.


  1. With the Separation Provider set, click on the "Activity Tester" tab to test separation.
  2. Select the Batch Folder in the Batch Viewer containing the pages you want to separate.
  3. Then click the play button in the top right corner of the Batch Viewer to test separation.


  1. At first glance, it may look like Grooper did a good job of separating the Batch.


  1. Upon closer inspection, we see that the second page of the W-4 document was incorrectly separated out.
  2. This is because the title of the document also appears on the second page of the W-4.


  1. Undo separation by selecting all of the folders in the Batch at the level you want to remove, then right-click the selection.
  2. Hover over "Foldering".
  3. Click on "Remove Level".


  1. Click "EXECUTE" to apply changes.

In the next section we will discuss how to fix the issue with our W-4 document using an Exclusion Extractor.


Exclusion Extractor

The Exclusion Extractor works in a simple way, but the concept can be difficult to grasp at first.

In our practical example, the Value Extractor for our Pattern-Based Separation Provider returned a result on two separate pages of the same document. In the tutorial below, we are going to set an Exclusion Extractor to return a value from the second page. If a page returns a value from both the Value Extractor and the Exclusion Extractor, Grooper ignores that page as a separation point, so no new folder is started there.
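As a rough sketch of this rule, a page starts a new document only when the Value Extractor matches and the Exclusion Extractor does not. The Python below is an illustration under assumptions, not Grooper's implementation: the `VALUE` pattern stands in for the List Match, the page text is invented, and the `starts_document` helper is hypothetical. The exclusion pattern is the one used later in this tutorial.

```python
import re

# Hypothetical stand-in for the List Match of document titles.
VALUE = re.compile(r"Form W-4")
# Exclusion pattern from this tutorial (parentheses escaped to match literally).
EXCLUSION = re.compile(r"Form W-4 \(2015\) Page 2")


def starts_document(page_text: str) -> bool:
    """True only when the value pattern hits and the exclusion pattern does not."""
    return bool(VALUE.search(page_text)) and not EXCLUSION.search(page_text)


# Invented page text for illustration.
pages = [
    "Form W-4 (2015)\nEmployee's Withholding Allowance Certificate",
    "Form W-4 (2015) Page 2\nDeductions and Adjustments Worksheet",
    "Additional notes",
]
flags = [starts_document(page) for page in pages]
print(flags)  # [True, False, False]
```

The second page matches both patterns, so it is not treated as a separation point and stays behind the first page.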

  1. Back in our Provider window, we have access to an Exclusion Extractor property. We're going to set this to a Pattern Match and use this to tell Grooper to exclude any pages from separation that return something for both the Value Extractor and the Exclusion Extractor.


  1. After you click the ellipsis button for the Pattern Match, the "Exclusion Extractor" window pops up. Select the second page of the W-4 document from the TEST BATCH window.
  2. In the top right corner of the Document Viewer, click the drop down to change to the Text View. You should then be able to see what text was recognized by Grooper on the page.
  3. At the top of the document we can see the title "Form W-4 (2015)" followed immediately by "Page 2". We can use this to help improve our separation.


  1. In the screenshot below we have added a Value Pattern that returns the title and "Page 2" from the document. The pattern is: Form W-4 \(2015\) Page 2
  2. Click "OK" to apply the changes to the Exclusion Extractor.
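The backslashes in the pattern matter because parentheses normally define a group in a regular expression rather than matching literal characters. The short Python check below illustrates the difference. Python's `re` module is used only as a stand-in here; Grooper's own pattern engine may differ in details, and the sample text is an assumption based on the title shown in the Text View.

```python
import re

text = "Form W-4 (2015) Page 2"  # assumed OCR text from the top of page 2

# Unescaped: "(2015)" is a capture group, so the pattern expects "Form W-4 2015 Page 2".
unescaped = re.compile(r"Form W-4 (2015) Page 2")
# Escaped: "\(" and "\)" match the literal parentheses on the page.
escaped = re.compile(r"Form W-4 \(2015\) Page 2")

print(unescaped.search(text))         # None
print(escaped.search(text).group(0))  # Form W-4 (2015) Page 2
```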


  1. Click "OK" to save changes.


  1. Now we can test our separation again by going to the "Activity Tester" tab again.
  2. Highlight the Batch Folder containing the pages you wish to separate.
  3. Click on the play button in the top right corner of the TEST BATCH window to test separation.


  1. Now the W-4 is separating appropriately, with both pages in a single folder.

Narrowing Results

It is often not enough to just list the document titles in a List Match and then add an Exclusion Extractor. Sometimes you need to refine your extractors; otherwise, you could get unwanted results. In this section we're going to show just a few ways to improve your extraction to make sure Grooper only separates at the first page of a document.

  1. First, we are going to take a look at a Value Reader to learn techniques to narrow down extractor results.
  2. Like in the previous examples, we're going to be using a List Match. Click the ellipsis button to the right of the Extractor property after selecting List Match.

Using Prefix Patterns: New Line/Beginning of String

  1. In this example we already have several Local Entries collecting the titles of documents.
  2. We're going to go through the Batch and refine our extractor based on what results we get.
  3. The title on the first document we're going to look at seems to be returning just fine so we don't need to make any changes.


  1. However, we are getting results on page 2 of the same document. We do not want this as a result because we do not want separation to occur here.


  1. By entering a Prefix Pattern (located under Local Entries), we can narrow down our results. With \n|^ entered as the prefix, Grooper now only returns results that are preceded by a new line or that start at the beginning of the string.
  2. Now we are not returning any results on the second page.


  1. If we go back to the first page of the document, we can see that we are still returning the title of the document because it starts at the beginning of string.
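The effect of the \n|^ prefix can be approximated with an ordinary regular expression. In the Python sketch below, the prefix is written as a non-capturing group in front of the title and the title itself is captured. This only illustrates the idea; it is not how Grooper stores Prefix Patterns, and the title and page text are hypothetical.

```python
import re

TITLE = "Employment Application"  # hypothetical Local Entry

# Prefix (\n or start of string) followed by the captured title.
pattern = re.compile(r"(?:\n|^)(" + re.escape(TITLE) + r")")

page_1 = "Employment Application\nName: Jane Doe"            # title at start of text
page_2 = "Page 2 of the Employment Application\nReferences"  # title in mid-line
page_3 = "ACME Corp\nEmployment Application\nName: Jane Doe" # title after a new line

for page in (page_1, page_2, page_3):
    match = pattern.search(page)
    print(match.group(1) if match else None)
# Employment Application
# None
# Employment Application
```

Without the `re.MULTILINE` flag, `^` in Python matches only at the very start of the text, which mirrors the "beginning of string" case described above.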

Using Prefix Patterns: Tabs

  1. Here we are not getting a result for the title despite the words being included in the Local Entries.
  2. That is because something comes before the title. It is not preceded by a new line or the beginning of string.


  1. We have added a tab character to the Prefix Pattern, making it [\n\t]|^, to capture the tab before the text starts (remember to enable tabs under the "Properties" tab).
  2. Now the title is being returned just like we expect.
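To see why the tab makes a difference, the sketch below compares the two prefixes on a line where the title follows a tab. As before, this is a plain Python approximation with a hypothetical title and page text, not Grooper's own matching logic.

```python
import re

TITLE = "Purchase Order"               # hypothetical Local Entry
page = "ACME Corp\tPurchase Order\nPO Number: 1001"  # title preceded by a tab

newline_only = re.compile(r"(?:\n|^)(" + re.escape(TITLE) + r")")
with_tab = re.compile(r"(?:[\n\t]|^)(" + re.escape(TITLE) + r")")

print(newline_only.search(page))       # None
print(with_tab.search(page).group(1))  # Purchase Order
```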

Region Settings

  1. On the next document we have an issue where we are returning a title located at the bottom of the second page of the document.


  1. Click over to the "Properties" tab.
  2. Click on the ellipsis icon to the right of the Result Filter property.


  1. We are going to limit the area of extraction by setting a region. Click on the ellipsis icon to the right of the Region property.


  1. When the "Region" window pops up, click on the marquee tool icon located at the top of the Document Viewer directly right of the zoom icon.
  2. In the Document Viewer, draw a box around the area where you want the extractor to look for values. In this case we only want Grooper looking at the top half of each page.
  3. The coordinates of the drawn box will update in the top left of the window. Feel free to manually adjust the numbers.
  4. Click "OK" when finished.


  1. Now the text at the bottom of the page should not be returned.
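Conceptually, the region acts as a filter on where a result sits on the page. The Python sketch below illustrates that idea with invented data: the `Hit` and `Region` classes, the use of fractions of page height as coordinates, and the sample hits are all assumptions made for the example and do not reflect Grooper's actual coordinate units or object model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    text: str
    top: float     # top edge, as a fraction of page height (assumed unit)
    bottom: float  # bottom edge, as a fraction of page height


@dataclass
class Region:
    top: float
    bottom: float

    def contains(self, hit: Hit) -> bool:
        """A hit is kept only if it lies entirely inside the region."""
        return self.top <= hit.top and hit.bottom <= self.bottom


top_half = Region(top=0.0, bottom=0.5)
hits: List[Hit] = [
    Hit("Purchase Order", top=0.05, bottom=0.08),  # title at the top of the page
    Hit("Purchase Order", top=0.92, bottom=0.95),  # same words in a footer
]
kept = [hit for hit in hits if top_half.contains(hit)]
print([hit.top for hit in kept])  # [0.05]
```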

Vertical Wrap

  1. On this page we are only getting part of the title despite having the full title as part of our Local Entries list.
    • Grooper would still separate here since we are getting a result, but let's assume that we want to get the full title.


  1. Click over to the "Properties" tab.
  2. Enable Vertical Wrap by clicking the check box to the right of the property.
  3. Now the full text of the title should be returned.
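The problem Vertical Wrap addresses can be illustrated with a title that breaks across two lines. The Python sketch below is only an analogy, assuming a hypothetical title and page text: it lets any run of whitespace, including a line break, stand between the words of the title. It does not describe how Grooper implements the Vertical Wrap property.

```python
import re

TITLE = "Employee's Withholding Allowance Certificate"  # hypothetical Local Entry
page = "Employee's Withholding\nAllowance Certificate\nForm W-4"  # title wraps


def wrapped_pattern(title: str) -> re.Pattern:
    """Allow any whitespace (spaces or line breaks) between the title's words."""
    return re.compile(r"\s+".join(re.escape(word) for word in title.split()))


print(TITLE in page)                               # False: the line break gets in the way
print(wrapped_pattern(TITLE).search(page) is not None)  # True
```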