2021:Split Pages (Activity)
Split Pages is an activity that will split a multi-page PDF or TIF document into individual pages.
When applied to a Batch Folder with an attached PDF or TIF file, the Split Pages activity creates a Batch Page object for each page in the file. These Batch Pages are created as children of the Batch Folder.
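The parent/child relationship described above can be pictured with a small Python model. This is only an illustrative sketch: `BatchFolder`, `BatchPage`, `attached_page_count`, and `split_pages` are hypothetical names invented for this example and are not Grooper's actual object model or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BatchPage:
    """Hypothetical stand-in for a Batch Page (not the real Grooper class)."""
    page_number: int


@dataclass
class BatchFolder:
    """Hypothetical stand-in for a Batch Folder with an attached multi-page file."""
    name: str
    attached_page_count: int
    children: List[BatchPage] = field(default_factory=list)


def split_pages(folder: BatchFolder) -> List[BatchPage]:
    """Create one child page object per page of the folder's attached file."""
    folder.children = [
        BatchPage(page_number=n)
        for n in range(1, folder.attached_page_count + 1)
    ]
    return folder.children


folder = BatchFolder(name="invoice.pdf", attached_page_count=3)
split_pages(folder)
print([page.page_number for page in folder.children])  # prints [1, 2, 3]
```

Before the split, the folder has an attached file but no children. Afterwards, it has one child object per page, and those children are what page-level activities operate on.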
About
We can also process this document at this point. We can apply Grooper activities at the folder level to this Batch Folder by setting a Batch Process Step's Scope property to Folder. An activity running at the folder level can manipulate the content in the attached file. For example, if we ran the Recognize activity at the folder level, it would obtain text data from the attached PDF file.
Why Split Pages?
There are two reasons to use the Split Pages activity to split out pages from a multi-page document.
- To apply activities that require Batch Page objects to function.
  - Namely, the Image Processing and Separate activities.
- To increase compute efficiency.
  - A Batch Folder is a single object, which can be processed by only a single processing thread. If you split out the attached document's pages, each page becomes its own object in the Batch. Each page can also be processed by only a single thread, but with multiple page objects now present, multiple threads can work on the document at once (up to one per page).
Splitting Pages for Specific Activities
Certain Grooper activities require Batch Page objects by design.
- The Separate activity separates loose pages into folders. If there are no Batch Page objects, there's nothing to separate.
- The Image Processing activity applies an IP Profile to alter a page's image in order to clean it up before OCR processing during the Recognize activity. In all but the narrowest of use cases, the Image Processing activity must process Batch Page objects, not Batch Folders. If there are no Batch Page objects, there's nothing for the IP Profile to clean up.
So, first we need to run the Split Pages activity to add page objects we can manipulate. Then, we can run the Separate activity to separate those pages into folders.
Splitting Pages to Increase Efficiency
Often a Split Pages activity is added as one of the first steps in the Batch Process to increase processing efficiency. Why? To take full advantage of your system's multithreaded processing capabilities.
Many activities can run on the folder or page level and produce the same end result. For example, the Recognize activity will perform OCR and/or native text extraction to obtain text data for a document. This activity can run either on the Batch Folder or Batch Page level.
When running at the Batch Folder level, it will obtain text data from the attached PDF or TIF file. When dealing with large multi-page files, you can encounter a processing bottleneck, leading to increased processing times, if you're only running activities at the folder level.
If you apply the Recognize activity at the folder level, each thread will process a single Batch Folder.
If you apply the Recognize activity at the page level, each thread will process a single Batch Page.
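The difference between the two scopes can be illustrated with a rough scheduling model. The sketch below is not Grooper code. It assumes a hypothetical 100-page document, a made-up cost of one second per page, and eight processing threads, and it hands each work item to whichever thread frees up first.

```python
import heapq


def makespan(task_durations, threads):
    """Greedy scheduling: each task goes to the thread that becomes free first.

    Returns the time at which the last thread finishes.
    """
    free_at = [0.0] * threads  # time at which each thread is next available
    heapq.heapify(free_at)
    for duration in task_durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Hypothetical figures chosen purely for illustration.
PAGES = 100
SECONDS_PER_PAGE = 1.0
THREADS = 8

# Folder scope: the whole document is one work item handled by one thread.
folder_level = makespan([PAGES * SECONDS_PER_PAGE], THREADS)

# Page scope: every page is its own work item, spread across all threads.
page_level = makespan([SECONDS_PER_PAGE] * PAGES, THREADS)

print(folder_level, page_level)  # prints 100.0 13.0
```

In this simplified model, seven of the eight threads sit idle at the folder level, which is the bottleneck described above. Real timings will depend on your hardware, the activity, and the documents themselves.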
When should I use Split Pages to increase efficiency?
You will get the greatest benefit from the Split Pages activity if you are processing larger multipage PDF or TIF files.
- The more pages there are in the source file, generally the greater the efficiency reward.
- Furthermore, the more processing that can be done at the page level, the more efficiency you'll gain from splitting pages. For example, it will be advantageous to use Split Pages to make a Recognize step in your Batch Process more efficient. It will be even more advantageous if you have both a Recognize step and an Image Processing step in your Batch Process, as both of those activities can be executed at the page level.
You will receive less benefit if your imported files are small, one- or two-page PDF or TIF files.
- In that case, the processing effort to split the pages may not be offset by the increased parallelism in subsequent steps in your Batch Process.
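The trade-off above can be sketched as a back-of-the-envelope estimate. Everything in this example is an assumption made for illustration: the function names, the per-page processing cost, the per-page split cost, and the thread count are invented, and the model ignores storage and scheduling overhead.

```python
import math


def time_without_split(docs, pages_per_doc, threads, sec_per_page):
    """Folder scope: one work item per document, so one thread handles all its pages."""
    rounds = math.ceil(docs / threads)
    return rounds * pages_per_doc * sec_per_page


def time_with_split(docs, pages_per_doc, threads, sec_per_page, split_sec_per_page):
    """Split first (one work item per document), then process one work item per page."""
    split_time = math.ceil(docs / threads) * pages_per_doc * split_sec_per_page
    page_rounds = math.ceil(docs * pages_per_doc / threads)
    return split_time + page_rounds * sec_per_page


# Hypothetical costs: 1 second to process a page, 0.125 seconds to split one out.
THREADS, SEC_PER_PAGE, SPLIT_SEC_PER_PAGE = 8, 1.0, 0.125

# One large 200-page file: splitting pays off.
print(time_without_split(1, 200, THREADS, SEC_PER_PAGE))                    # prints 200.0
print(time_with_split(1, 200, THREADS, SEC_PER_PAGE, SPLIT_SEC_PER_PAGE))   # prints 50.0

# 1,000 two-page files: folder-level work already keeps every thread busy.
print(time_without_split(1000, 2, THREADS, SEC_PER_PAGE))                   # prints 250.0
print(time_with_split(1000, 2, THREADS, SEC_PER_PAGE, SPLIT_SEC_PER_PAGE))  # prints 281.25
```

With many small files, there are already enough folder-level work items to occupy every thread, so in this model the split step only adds cost.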
Are there any drawbacks to the Split Pages activity?
Be aware that the Split Pages activity will necessarily eat into your Grooper file store's storage space. By creating a new page object, you're effectively making a copy of the PDF or TIF file's page. That page's image must be stored somewhere (your Grooper Repository's file store location).
- If your Grooper file store is severely limited in size, you may not have the storage space necessary to split out the pages of several large files.
- Also, keep in mind that SSD storage is faster than HDD storage. You may experience latency if your Grooper file store sits on a slower storage medium.







