2023.1:EPI Separation (Separation Provider)

From Grooper Wiki
Revision as of 08:01, 5 April 2024 by Rpatton (talk | contribs) (icons // via Wikitext Extension for VSCode)

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

The EPI Separation provider uses embedded page information ("EPI") to separate loose pages into document folders. A Data Extractor is used to find page numbers from the text on a page and Grooper uses this information to separate the pages.

About

For this Separation Provider, an Extractor is used to find page numbers from the text on a page. The extractor's regular expression (regex) pattern must capture the page number in a group named "PageNo". If the page number is formatted as Page X of Y (for example, Page 1 of 3), a second group named "PageCount" must capture the total page count.

The pattern Page (?<PageNo>\d+) of (?<PageCount>\d+) would group the "1" and "3" of our earlier example properly.
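The pattern can be tried outside Grooper as a quick sanity check. The following is a minimal sketch in Python, not Grooper code. Python's re module writes named groups as (?P<name>...) rather than the (?<name>...) form shown above, and the sample text is made up for illustration.

```python
import re

# Same pattern as above, rewritten in Python's named-group syntax.
PAGE_PATTERN = re.compile(r"Page (?P<PageNo>\d+) of (?P<PageCount>\d+)")

# Hypothetical page text; only the "Page 1 of 3" part matters here.
match = PAGE_PATTERN.search("Invoice 1001    Page 1 of 3")

page_no = int(match.group("PageNo"))        # the "1"
page_count = int(match.group("PageCount"))  # the "3"
print(page_no, page_count)
```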



If the value of PageNo is 1, a new folder is created. Each subsequent page is added to that folder as long as its PageNo value follows in sequence.

If a page is out of sequence (or the extractor fails to produce a result), it is left as a loose page. If this is not the desired result, the Miss Disposition property can be configured to append all subsequent loose pages to the previous Document Folder.
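The separation rules described above can be sketched in a few lines of Python. This is an illustration of the described behavior only, not Grooper's implementation. The function name separate and the append_misses flag (standing in for the Miss Disposition property) are hypothetical. The sketch also assumes that once a page misses, the sequence stays broken until the next page with a PageNo of 1, and that a miss occurring before any folder exists stays loose.

```python
from typing import List, Optional, Tuple


def separate(
    page_numbers: List[Optional[int]], append_misses: bool = False
) -> Tuple[List[List[int]], List[int]]:
    """Group page indexes into folders from their extracted PageNo values.

    page_numbers[i] is the PageNo found on page i, or None when the
    extractor returned nothing. Returns (folders, loose_pages), both
    holding page indexes.
    """
    folders: List[List[int]] = []
    loose: List[int] = []
    expected: Optional[int] = None  # PageNo that would continue the current folder

    for index, page_no in enumerate(page_numbers):
        if page_no == 1:
            # A PageNo of 1 always starts a new folder.
            folders.append([index])
            expected = 2
        elif expected is not None and page_no == expected:
            # The page follows in sequence, so it joins the current folder.
            folders[-1].append(index)
            expected += 1
        else:
            # Out of sequence or no result: a "miss".
            expected = None  # assumption: sequence stays broken until the next 1
            if append_misses and folders:
                folders[-1].append(index)
            else:
                loose.append(index)
    return folders, loose


pages = [1, 2, 3, None, 1, 2, 5]
print(separate(pages))                      # ([[0, 1, 2], [4, 5]], [3, 6])
print(separate(pages, append_misses=True))  # ([[0, 1, 2, 3], [4, 5, 6]], [])
```

In the first call, the page with no result and the page numbered 5 are left loose. In the second call, both are appended to the folder that precedes them.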

How To

Setting the Provider

  1. Add a Separate Batch Process Step to a Batch Process.
  2. Set the Provider property to EPI Separation.
  3. Set the Value Extractor. In this example we are setting the Value Extractor to a Reference and then referencing a Data Type.


Using Capture Groups

The Value Extractor must return the page numbers needed for EPI Separation. For Grooper to recognize an extracted number as the one to use for EPI Separation, the number must be captured in a specifically named Capture Group.

  • A capture group named PageNo holds the actual page number, which is the number used for EPI Separation.
  • A capture group named PageCount indicates the total number of pages in a document, if that number is present on a page.

Without being in an appropriately named Capture Group, Grooper won't know what to do with the extracted information.

  1. The Data Type in this example has multiple Value Readers as children. Each Value Reader is configured with a Pattern Match to return the page number format for each page in this Batch.
  2. The regex pattern Page:? (?<PageNo>\d+) is collecting the page numbers from the first document in the Batch. The number we want to use for separation is contained within the Capture Group "PageNo".
  3. "Page: 1" is being returned from the first page of the Batch.


  1. The second Value Reader child of the Data Type in this example is configured to return a different page number format.
  2. The regex pattern Page:? (?<PageNo>\d+) of (?<PageCount>\d+) collects two numbers.
    • The actual page number is contained within the "PageNo" Capture Group.
    • The total number of pages in the document is contained within the "PageCount" Capture Group.
  3. "Page: 1 of 4" is returned with this regex pattern.
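To see how the two page number formats in this example can be handled side by side, here is a minimal Python sketch. It is not Grooper code and does not reproduce how a Data Type combines results from its Value Reader children. The function name read_page_info is hypothetical, the patterns use Python's (?P<name>...) named-group syntax, and trying the more specific "X of Y" pattern first is an assumption of this sketch.

```python
import re
from typing import Optional, Tuple

# The two page number formats from this example, in Python's named-group syntax.
# The "X of Y" pattern is listed first so it wins when both would match
# (an assumption of this sketch, not documented Grooper behavior).
PATTERNS = [
    re.compile(r"Page:? (?P<PageNo>\d+) of (?P<PageCount>\d+)"),
    re.compile(r"Page:? (?P<PageNo>\d+)"),
]


def read_page_info(text: str) -> Optional[Tuple[int, Optional[int]]]:
    """Return (PageNo, PageCount) from page text, or None if nothing matches.

    PageCount is None when the matching pattern has no PageCount group.
    """
    for pattern in PATTERNS:
        match = pattern.search(text)
        if match:
            groups = match.groupdict()
            page_no = int(groups["PageNo"])
            count_text = groups.get("PageCount")
            page_count = int(count_text) if count_text is not None else None
            return page_no, page_count
    return None


print(read_page_info("Page: 1"))       # (1, None)
print(read_page_info("Page: 1 of 4"))  # (1, 4)
```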


Using Lexicons