2023.1:EPI Separation (Separation Provider): Difference between revisions

From Grooper Wiki
draft // via Wikitext Extension for VSCode
icons // via Wikitext Extension for VSCode
Line 16: Line 16:
== About ==
== About ==


For this '''''Separation Proivder''''', a [[Data Extractor]] is used to find page numbers from the text on a page. The extractor must define the page number as group "PageNo" in the regular expression (regex) pattern. If the page number is formatted as Page X of Y (Page 1 of 3) then a second group must be defined as "PageCount" in its regular expression pattern.  
For this '''''[[Separation Proivder]]''''', an Extractor is used to find page numbers from the text on a page. The extractor must define the page number as group "PageNo" in the regular expression (regex) pattern. If the page number is formatted as Page X of Y (Page 1 of 3) then a second group must be defined as "PageCount" in its regular expression pattern.  


The pattern <code>Page (?<PageNo>\d+) of (?<PageCount>\d+)</code> would group the "1" and "3" of our earlier example properly.
The pattern <code>Page (?<PageNo>\d+) of (?<PageCount>\d+)</code> would group the "1" and "3" of our earlier example properly.
Line 28: Line 28:
If the value of PageNo is 1, a new folder is created.  As long as each subsequent page's PageNo value follows in sequence, they are included in the folder.   
If the value of PageNo is 1, a new folder is created.  As long as each subsequent page's PageNo value follows in sequence, they are included in the folder.   


If the page is out of sequence (or the extractor fails to produce a result), it is left as a loose page. If this is not the desired result the '''''Miss Disposition''''' property can be configured to append all subsequent loose pages that are not separated to the previous '''Document Folder'''.  
If the page is out of sequence (or the extractor fails to produce a result), it is left as a loose page. If this is not the desired result, the '''''Miss Disposition''''' property can be configured to append all subsequent loose pages to the previous '''Document Folder'''.  


== How To ==
== How To ==
Line 34: Line 34:
=== Setting the Provider ===
=== Setting the Provider ===


# Add a '''''Separate''''' '''Batch Process Step''' to a '''Batch Process'''.  
# Add a '''''[[Separate (Activity)|Separate]]''''' [[image:GrooperIcon_BatchProcessStep.png]]'''Batch Process Step''' to a [[image:GrooperIcon_BatchProcess.png]]'''[[Batch Process]]'''.  
# Set the '''''Provider''''' property to ''EPI Separation''.  
# Set the '''''Provider''''' property to ''EPI Separation''.  
# Set the '''''Value Extractor'''''. In this example we are setting the '''''Value Extractor''''' to a ''Reference'' and then referencing a '''Data Type'''.  
# Set the '''''Value Extractor'''''. In this example we are setting the '''''Value Extractor''''' to a ''Reference'' and then referencing a [[image:GrooperIcon_DataType.png]]'''Data Type'''.  


[[File:2023.1 EPI-Separation 02 01 Setting-the-Provider 01.png]]
[[File:2023.1 EPI-Separation 02 01 Setting-the-Provider 01.png]]
Line 48: Line 48:
Without being in an appropriately named Capture Group, Grooper won't know what to do with the extracted information.
Without being in an appropriately named Capture Group, Grooper won't know what to do with the extracted information.


# The '''Data Type''' in this example has multiple '''Value Readers''' as children. Each '''Value Reader''' is configured with a '''''Pattern Match''''' to return the page number format for each page in this '''Batch'''.  
# The '''Data Type''' in this example has multiple [[image:GrooperIcon_ValueReader.png]]'''Value Readers''' as children. Each '''Value Reader''' is configured with a '''''Pattern Match''''' to return the page number format for each page in this [[image:GrooperIcon_Batch.png]]'''Batch'''.  
# The regex pattern <code>Page:? (?<pageNo>/d+)</code> is collecting the page numbers from the first document in the '''Batch'''. The number we want to use for separation is contained within the Capture Group "PageNo".  
# The regex pattern <code>Page:? (?<pageNo>/d+)</code> is collecting the page numbers from the first document in the '''Batch'''. The number we want to use for separation is contained within the Capture Group "PageNo".  
# "Page: 1" is being returned from the first page of the '''Batch'''.
# "Page: 1" is being returned from the first page of the '''Batch'''.
Line 55: Line 55:




# The second '''Value Reader''' child of the '''Document Type''' in this example is configured to return a different page number format.  
# The second '''Value Reader''' child of the '''Data Type''' in this example is configured to return a different page number format.  
# The regex pattern <code>Page:? (?<PageNo>\d+) of (?<PageCount>\d+)</code> collects two numbers.  
# The regex pattern <code>Page:? (?<PageNo>\d+) of (?<PageCount>\d+)</code> collects two numbers.  
#* The actual page number is contained within the "PageNo" Capture Group.
#* The actual page number is contained within the "PageNo" Capture Group.

Revision as of 08:01, 5 April 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

The EPI Separation provider uses embedded page information ("EPI") to separate loose pages into document folders. A Data Extractor is used to find page numbers from the text on a page and Grooper uses this information to separate the pages.

About

For this Separation Provider, an Extractor is used to find page numbers from the text on a page. The extractor must define the page number as group "PageNo" in the regular expression (regex) pattern. If the page number is formatted as Page X of Y (Page 1 of 3), then a second group must be defined as "PageCount" in its regular expression pattern.

The pattern Page (?<PageNo>\d+) of (?<PageCount>\d+) would group the "1" and "3" of our earlier example properly.
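
To see what the two named groups capture, the pattern can be tried outside Grooper. The following is a minimal Python sketch, not Grooper code. Python spells named groups as (?P<name>...), so the pattern is adapted from the (?<name>...) form shown above, and the function name read_page_info is hypothetical.

```python
import re

# Same pattern as above, adapted to Python's (?P<name>...) named-group syntax.
PAGE_X_OF_Y = re.compile(r"Page (?P<PageNo>\d+) of (?P<PageCount>\d+)")


def read_page_info(text):
    """Return (PageNo, PageCount) as integers, or None if the text has no match."""
    match = PAGE_X_OF_Y.search(text)
    if match is None:
        return None
    return int(match.group("PageNo")), int(match.group("PageCount"))


print(read_page_info("Invoice 1001   Page 1 of 3"))  # prints (1, 3)
```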



If the value of PageNo is 1, a new folder is created. Each subsequent page is included in that folder as long as its PageNo value follows in sequence.

If the page is out of sequence (or the extractor fails to produce a result), it is left as a loose page. If this is not the desired result, the Miss Disposition property can be configured to append all subsequent loose pages to the previous Document Folder.
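
The separation rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Grooper's implementation: the function name separate and the append_misses flag (standing in for the Miss Disposition property) are hypothetical, and the sketch assumes that a miss ends the current sequence when misses are left loose.

```python
def separate(page_numbers, append_misses=False):
    """Group page indexes into folders from per-page PageNo values.

    page_numbers: one entry per page; None means the extractor found nothing.
    Returns (folders, loose), both holding zero-based page indexes.
    """
    folders = []     # each folder is a list of page indexes
    loose = []       # pages that were not placed in any folder
    expected = None  # PageNo that would continue the current folder

    for index, page_no in enumerate(page_numbers):
        if page_no == 1:
            folders.append([index])    # PageNo 1 starts a new folder
            expected = 2
        elif page_no is not None and page_no == expected:
            folders[-1].append(index)  # next number in sequence
            expected += 1
        elif append_misses and folders:
            folders[-1].append(index)  # assumed "append to previous folder" behavior
        else:
            loose.append(index)        # out of sequence or no result
            expected = None            # assumption: a miss ends the sequence
    return folders, loose


pages = [1, 2, 3, 1, None, 2]
print(separate(pages))                      # ([[0, 1, 2], [3]], [4, 5])
print(separate(pages, append_misses=True))  # ([[0, 1, 2], [3, 4, 5]], [])
```

In the first call, the page with no extractor result and the page after it stay loose. In the second call, both are appended to the folder that was opened by the preceding PageNo of 1.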

How To

Setting the Provider

  1. Add a Separate Batch Process Step to a Batch Process.
  2. Set the Provider property to EPI Separation.
  3. Set the Value Extractor. In this example we are setting the Value Extractor to a Reference and then referencing a Data Type.


Using Capture Groups

The Value Extractor must be set to something that returns the page numbers needed for EPI Separation. For Grooper to recognize which number to use for EPI Separation, the extracted number must be contained in a specifically named Capture Group.

  • A capture group named PageNo holds the actual page number, which is the number used for EPI Separation.
  • A capture group named PageCount indicates the total number of pages in a document, if that number is present on a page.

If the extracted number is not in an appropriately named Capture Group, Grooper won't know what to do with it.

  1. The Data Type in this example has multiple Value Readers as children. Each Value Reader is configured with a Pattern Match to return the page number format for each page in this Batch.
  2. The regex pattern Page:? (?<PageNo>\d+) collects the page numbers from the first document in the Batch. The number we want to use for separation is contained within the Capture Group "PageNo".
  3. "Page: 1" is being returned from the first page of the Batch.


  1. The second Value Reader child of the Data Type in this example is configured to return a different page number format.
  2. The regex pattern Page:? (?<PageNo>\d+) of (?<PageCount>\d+) collects two numbers.
    • The actual page number is contained within the "PageNo" Capture Group.
    • The total number of pages in the document is contained within the "PageCount" Capture Group.
  3. "Page: 1 of 4" is returned with this regex pattern.
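
The idea of one Data Type holding several Value Readers, each matching a different page number format, can be approximated outside Grooper as a list of patterns tried in order. This Python sketch is illustrative only: READERS and extract_page_info are hypothetical names, the patterns are adapted to Python's (?P<name>...) syntax, and trying the more specific pattern first is a choice made for this sketch, not a statement about how Grooper orders or merges results.

```python
import re

# Hypothetical stand-ins for the two Value Readers described above.
# The "X of Y" pattern comes first so that PageCount is not dropped.
READERS = [
    re.compile(r"Page:? (?P<PageNo>\d+) of (?P<PageCount>\d+)"),
    re.compile(r"Page:? (?P<PageNo>\d+)"),
]


def extract_page_info(text):
    """Return the named groups of the first matching pattern as integers."""
    for reader in READERS:
        match = reader.search(text)
        if match:
            return {name: int(value) for name, value in match.groupdict().items()}
    return None  # no pattern matched, so the page has no usable page number


print(extract_page_info("Page: 1"))       # {'PageNo': 1}
print(extract_page_info("Page: 1 of 4"))  # {'PageNo': 1, 'PageCount': 4}
```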


Using Lexicons