EPI Separation (Separation Provider)

EPI Separation is a Separation Provider that uses embedded page information ("EPI") to separate loose pages into Document Folders. A Data Extractor finds the page number in the text of each page, and Grooper uses those numbers to separate the pages.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

For this Separation Provider, an Extractor is used to find page numbers from the text on a page. The extractor's regular expression (regex) pattern must capture the page number in a group named "PageNo". If the page number is formatted as Page X of Y (for example, Page 1 of 3), the pattern must also capture the total in a second group named "PageCount".

For example, given the text "Page 1 of 3", the pattern Page (?<PageNo>\d+) of (?<PageCount>\d+) captures "1" as PageNo and "3" as PageCount.
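The named groups can be tried outside Grooper with a short Python sketch. This is only an illustration: Python's re module writes named groups as (?P<name>...), so the pattern is adapted to that syntax, and the function name read_page_info is hypothetical.

```python
import re

# Python spells named groups (?P<name>...); the article's pattern uses the
# (?<name>...) form, so only that syntax is adapted here.
PAGE_OF_PATTERN = re.compile(r"Page (?P<PageNo>\d+) of (?P<PageCount>\d+)")


def read_page_info(text):
    """Return (page_no, page_count) from the first match, or None on a miss."""
    match = PAGE_OF_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group("PageNo")), int(match.group("PageCount"))


print(read_page_info("Invoice summary   Page 1 of 3"))
print(read_page_info("No page marker on this page"))
```

A page with no match returns None here, which corresponds to the extractor failing to produce a result.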



If the value of PageNo is 1, a new folder is created. Each subsequent page is added to that folder as long as its PageNo value follows in sequence.

If the page is out of sequence (or the extractor fails to produce a result), it is left as a loose page. If this is not the desired result, the Miss Disposition property can be configured to append all subsequent loose pages to the previous Document Folder.
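The rules above can be sketched in Python. This is a simplified model rather than Grooper's implementation: the function name separate and the append_misses flag (a stand-in for the Miss Disposition property) are hypothetical, and the sketch assumes that once the sequence breaks, every page up to the next PageNo of 1 counts as a miss.

```python
def separate(page_numbers, append_misses=False):
    """Group page indexes into folders from per-page PageNo values.

    page_numbers: one entry per page, the extracted PageNo or None on a miss.
    append_misses: hypothetical stand-in for the Miss Disposition property.
    Returns (folders, loose_pages), both holding 0-based page indexes.
    """
    folders = []
    loose = []
    expected = None  # PageNo that would continue the currently open folder
    for index, page_no in enumerate(page_numbers):
        if page_no == 1:
            # A PageNo of 1 always starts a new folder.
            folders.append([index])
            expected = 2
        elif expected is not None and page_no == expected:
            # The page follows in sequence, so it joins the open folder.
            folders[-1].append(index)
            expected += 1
        elif append_misses and folders:
            # Miss, but configured to append to the previous folder.
            folders[-1].append(index)
            expected = None
        else:
            # Miss: out of sequence or no extraction result.
            loose.append(index)
            expected = None
    return folders, loose


print(separate([1, 2, 3, 1, 2]))
print(separate([1, 2, None, 1]))
print(separate([1, 2, None, 1], append_misses=True))
```

In the second call the third page has no result and stays loose; in the third call the same page is appended to the first folder instead.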

How To

Setting the Provider

  1. Add a Separate Batch Process Step to a Batch Process.
  2. Set the Provider property to EPI Separation.
  3. Set the Value Extractor. In this example we are setting the Value Extractor to a Reference and then referencing a Data Type.


Using Capture Groups

The Value Extractor needs to be set to something that will return the page numbers needed for EPI Separation. For Grooper to understand that a number should be used for EPI Separation, the extracted number must be contained in a specifically named Capture Group.

  • A capture group named PageNo holds the page number itself, which is the value used for EPI Separation.
  • A capture group named PageCount indicates the total number of pages in a document, if that number is present on a page.

Without being in an appropriately named Capture Group, Grooper won't know what to do with the extracted information.

FYI

While typing out your regex pattern, pressing Ctrl+G will insert a capture group into your pattern. Then all you have to do is enter in a name for your capture group.

The PageNo Capture Group

  1. The Data Type in this example has multiple Value Readers as children. Each Value Reader is configured with a Pattern Match to return the page number format for each page in this Batch.
  2. The regex pattern Page:? (?<PageNo>\d+) is collecting the page numbers from the first document in the Batch. The number we want to use for separation is contained within the Capture Group "PageNo".
  3. "Page: 1" is being returned from the first page of the Batch.


The PageCount Capture Group

When you have page numbers on a document that indicate the total number of pages (such as "page 1 of 4"), you can use the PageCount capture group name to return that value.

You might ask yourself why we would want to collect that information. There are times you might have a page with poor OCR or perhaps the page number is missing altogether. Using the PageCount capture group helps Grooper understand how many pages to expect in a document, so even if there are OCR errors, it has a higher probability of still separating appropriately.

In the example below, we show you how to use the PageCount capture group.

  1. The second Value Reader child of the Data Type in this example is configured to return a different page number format.
  2. The regex pattern Page:? (?<PageNo>\d+) of (?<PageCount>\d+) collects two numbers.
    • The actual page number is contained within the "PageNo" Capture Group.
    • The total number of pages in the document is contained within the "PageCount" Capture Group.
  3. "Page: 1 of 4" is returned with this regex pattern.
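To illustrate why PageCount helps when a page number cannot be read, here is a minimal Python sketch. It is not Grooper's actual algorithm: the function name separate_with_counts and the rule it applies are assumptions made for illustration. An unreadable page is kept in the open folder only while the PageCount from that folder's first page says more pages are still expected.

```python
def separate_with_counts(pages):
    """Group page indexes into folders using PageNo and PageCount.

    pages: one entry per page, a (page_no, page_count) tuple or None
    when nothing could be extracted from the page.
    Returns (folders, loose_pages), both holding 0-based page indexes.
    """
    folders = []
    loose = []
    expected_no = None  # PageNo that would continue the open folder
    page_count = 0      # PageCount reported by the open folder's first page
    for index, info in enumerate(pages):
        if info is not None and info[0] == 1:
            # First page of a document: open a folder and remember its length.
            folders.append([index])
            expected_no, page_count = 2, info[1]
            continue
        still_open = expected_no is not None and expected_no <= page_count
        in_sequence = info is not None and info[0] == expected_no
        if still_open and (info is None or in_sequence):
            # Either the next number was read, or the read failed but the
            # document is known to have more pages.
            folders[-1].append(index)
            expected_no += 1
        else:
            loose.append(index)
            expected_no = None
    return folders, loose


# The third page lost its page number, but "of 4" keeps it in the folder.
print(separate_with_counts([(1, 4), (2, 4), None, (4, 4), (1, 2), (2, 2)]))
```

Without the PageCount value, the unreadable third page would break the sequence and the pages after it would be handled as misses.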


Using Lexicons

In the example below, we have a document in the Batch where the page number is spelled out (one, two, three, etc.) instead of written as a numeric value (1, 2, 3, etc.). Grooper cannot interpret the written words as numbers on its own. However, we can use a Lexicon to translate the words to numeric values.

Looking at the Lexicon

  1. In the example below, our Value Reader is collecting the page numbers that are spelled out rather than numeric values.
  2. We still use the same regex capture group of "PageNo" for the number, but we need to change the words to something Grooper can understand.


  1. The NumberWords Lexicon in the node tree has been configured to help translate these words to numbers. Select the Lexicon.
  2. In the right panel, we can see that each word is matched up with its corresponding numerical value in a list using the syntax one=1. This Lexicon can be used to translate a word like "one" to a numerical value like "1", which Grooper can use to separate the pages.
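The word-to-number translation can be pictured with a small Python sketch. It is an illustration only: it assumes the Lexicon holds one word=number entry per line, and the sample entries, the word-matching pattern, and the function names are hypothetical rather than taken from the sample Project.

```python
import re

# Hypothetical Lexicon content in the one=1 syntax, one entry per line.
LEXICON_TEXT = """\
one=1
two=2
three=3
four=4
"""

# Hypothetical pattern for a spelled-out page number, written with
# Python's (?P<name>...) named-group syntax.
WORD_PAGE_PATTERN = re.compile(r"Page:? (?P<PageNo>[A-Za-z]+)")


def parse_lexicon(text):
    """Build a lowercase word -> value lookup from word=value lines."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        word, value = line.split("=", 1)
        entries[word.strip().lower()] = value.strip()
    return entries


def translated_page_no(text, lexicon):
    """Return the numeric PageNo for a spelled-out page number, or None."""
    match = WORD_PAGE_PATTERN.search(text)
    if match is None:
        return None
    value = lexicon.get(match.group("PageNo").lower())
    return int(value) if value is not None else None


lexicon = parse_lexicon(LEXICON_TEXT)
print(translated_page_no("Page: Two", lexicon))
print(translated_page_no("Page: eleven", lexicon))
```

A word that is missing from the lookup, such as "eleven" here, produces no page number, so the Lexicon needs an entry for every word that can appear.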


Applying the Lexicon

  1. Go back to the Value Reader in the node tree.
  2. Click on the "Tester" tab.
  3. Click on the "Properties" tab.
  4. Click the ellipsis icon to the right of the Group Options property under the "LOOKUP" category.


  1. On the left side of the "Group Options" window that pops up, select the relevant "Group Name" if not already selected.
  2. Open up the Vocabulary property.
  3. Click on the ellipsis icon to the right of the Included Lexicons property.


  1. Navigate through the folders in the "Included Lexicons" window that pops up and click the check box next to the Lexicon you wish to apply.
  2. Click "OK" at the top right of the window to apply the Lexicon.


  1. Finally, open up the Lookup Options property under "LOOKUP SETTINGS".
  2. Click the checkbox next to the Translate property.
    • This is required for Grooper to translate the text to numerical values.
  3. Click "OK" in the top right of the window to apply changes.