2021:Split Pages (Activity): Difference between revisions
Dgreenwood (talk | contribs) Created page with "<blockquote style="font-size:125%"> '''Split Pages''' is an activity that will split a multi-page PDF or TIF document into individual pages. </blockquote> When applied to a '..." |
Dgreenwood (talk | contribs) |
||
| Line 40: | Line 40: | ||
# To apply activities that ''require'' '''Batch Page''' objects to function. | # To apply activities that ''require'' '''Batch Page''' objects to function. | ||
#* | #* Namely the '''Image Processing''' and '''Separate''' activities. | ||
# To increase compute efficiency. | # To increase compute efficiency. | ||
#* A '''Batch Folder''' is a single object, which can be processed by a single processing thread. If you split out the attached document's pages, each page becomes its own object in the '''Batch'''. Each page can also only be processed by a single thread, but with multiple page objects now present, multiple threads can now be used to process the document (one for each page). | #* A '''Batch Folder''' is a single object, which can be processed by a single processing thread. If you split out the attached document's pages, each page becomes its own object in the '''Batch'''. Each page can also only be processed by a single thread, but with multiple page objects now present, multiple threads can now be used to process the document (one for each page). | ||
=== Splitting Pages for Specific Activities === | |||
Certain Grooper activities ''require'' '''Batch Page''' objects by design. | |||
* The '''Separate''' activity separates loose pages into folders. If there are no '''Batch Page''' objects, there is nothing to separate. | |||
* The '''Image Processing''' activity applies an '''IP Profile''' to alter a page's image, cleaning it up before OCR runs during the '''Recognize''' activity. In all but the narrowest of use cases, the '''Image Processing''' activity must process '''Batch Page''' objects, not '''Batch Folders'''. If there are no '''Batch Page''' objects, there is nothing for the '''IP Profile''' to clean up. | |||
{|cellpadding=10 cellspacing=5 | |||
|valign=top style="width:40%"| | |||
<br> | |||
For example, one common situation occurs when importing PDFs that are "packet documents". Contained within the single multi-page file are multiple different documents. | |||
This PDF is an application packet for a scholarship program. Within this six-page PDF, there are actually five separate documents: | |||
* The application form | |||
* A proposal summary | |||
* A resume | |||
* An essay | |||
* A recommendation letter | |||
We can't separate this multipage PDF into these component documents without its pages split out. So, first we need to run the '''Split Pages''' activity to add page objects we can manipulate. Then, we can run the '''Separate''' activity to separate those pages into folders. | |||
|valign=top| | |||
[[File:Split-pages-about-04.png]] | |||
|- | |||
|valign=top| | |||
<br> | |||
'''Split Pages''' adds child pages to the folder, one '''Batch Page''' for each page in the PDF attached to the '''Batch Folder'''. | |||
|valign=top| | |||
[[File:Split-pages-about-05.png]] | |||
|- | |||
|valign=top| | |||
<br> | |||
With pages present, the '''Separate''' activity now has objects it can manipulate, establishing folder separation points and placing pages in sub-folders. | |||
|valign=top| | |||
[[File:Split-pages-about-06.png]] | |||
|} | |||
=== Splitting Pages to Increase Efficiency === | |||
== Bursting and Rendering PDFs == | == Bursting and Rendering PDFs == | ||
Revision as of 14:13, 25 April 2022
Split Pages is an activity that will split a multi-page PDF or TIF document into individual pages.
When applied to a Batch Folder with an attached PDF or TIF file, the Split Pages activity creates one Batch Page object for each page in the file. Each Batch Page is created as a child of the Batch Folder.
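The parent and child relationship described above can be pictured with a minimal sketch. This is not Grooper's API: the `BatchFolder` and `BatchPage` classes, their fields, and the `split_pages` function are hypothetical stand-ins invented for illustration, and the file name and page count are made up.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BatchPage:
    """Hypothetical stand-in for a Batch Page object."""
    page_number: int  # 1-based position within the attached file


@dataclass
class BatchFolder:
    """Hypothetical stand-in for a Batch Folder with an attached file."""
    attached_file: str
    page_count: int  # number of pages in the attached PDF or TIF
    children: List[BatchPage] = field(default_factory=list)


def split_pages(folder: BatchFolder) -> List[BatchPage]:
    """Create one child BatchPage for each page of the attached file."""
    folder.children = [BatchPage(n) for n in range(1, folder.page_count + 1)]
    return folder.children


# Illustrative values only: a folder with a six-page PDF attached.
folder = BatchFolder(attached_file="packet.pdf", page_count=6)
pages = split_pages(folder)
print(len(pages), "page objects created under", folder.attached_file)
```

Before the call the folder has no children; afterwards it holds one page object per page of the attached file.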
About
Split Pages is often a critical component of a Batch Process where documents are imported into new Batches from a digital source (as opposed to scanned paper documents). When a digital file is imported into Grooper, two things happen:
We can also process the document at this point. We can apply Grooper activities at the folder level, to this Batch Folder, by setting a Batch Process Step's Scope property to Folder. An activity running at the folder level can manipulate the content of the attached file. For example, if we ran the Recognize activity at the folder level, it would obtain text data from the attached PDF file.
Why Split Pages?
There are two reasons to use the Split Pages activity to split out pages from a multipage document.
- To apply activities that require Batch Page objects to function.
- Namely the Image Processing and Separate activities.
- To increase compute efficiency.
- A Batch Folder is a single object, which can be processed by a single processing thread. If you split out the attached document's pages, each page becomes its own object in the Batch. Each page can also only be processed by a single thread, but with multiple page objects now present, multiple threads can now be used to process the document (one for each page).
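The efficiency argument can be sketched with a generic thread pool. This is an illustration of the one-thread-per-object idea only, not Grooper's actual scheduler: `process_object` and `run_activity` are hypothetical names, the per-object work is a placeholder, and the thread count is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def process_object(name: str) -> str:
    """Placeholder for per-object work such as OCR (hypothetical)."""
    return f"{name}: done"


def run_activity(objects: List[str], max_threads: int = 4) -> List[str]:
    """Hand each object to exactly one worker thread and collect the results."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(process_object, objects))


# Unsplit: the folder is the only object, so only one thread has work to do.
folder_results = run_activity(["folder"])

# Split: six page objects, so up to max_threads pages are processed at once.
page_results = run_activity([f"page {n}" for n in range(1, 7)])

print(len(folder_results), "work item before splitting")
print(len(page_results), "work items after splitting")
```

With a single folder object, extra threads sit idle no matter how many are available. Once the pages are split out, the same pool has several independent work items to distribute.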
Splitting Pages for Specific Activities
Certain Grooper activities require Batch Page objects by design.
- The Separate activity separates loose pages into folders. If there are no Batch Page objects, there is nothing to separate.
- The Image Processing activity applies an IP Profile to alter a page's image, cleaning it up before OCR runs during the Recognize activity. In all but the narrowest of use cases, the Image Processing activity must process Batch Page objects, not Batch Folders. If there are no Batch Page objects, there is nothing for the IP Profile to clean up.
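To see why separation depends on page objects, consider a minimal sketch of grouping loose pages into folders. The `separate` function and its `first_pages` parameter are hypothetical and do not reflect how the Separate activity actually finds separation points. The page boundaries used here are assumed purely for illustration.

```python
from typing import List


def separate(page_numbers: List[int], first_pages: List[int]) -> List[List[int]]:
    """Group loose pages into folders, starting a new folder at each first page."""
    starts = set(first_pages)
    folders: List[List[int]] = []
    for page in page_numbers:
        # Open a new folder at a separation point, or for the very first page.
        if page in starts or not folders:
            folders.append([])
        folders[-1].append(page)
    return folders


# Assumed example: six loose pages, with documents starting on pages 1, 2, 3, 5, and 6.
folders = separate([1, 2, 3, 4, 5, 6], first_pages=[1, 2, 3, 5, 6])
print(folders)

# With no page objects at all, there is nothing to group.
print(separate([], first_pages=[1]))
```

The second call returns an empty list, which mirrors the point above: without page objects, there is nothing to separate.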