Split Pages (Activity)




Multi-page PDF and TIF files come into Grooper as single files, each attached to its own Batch Folder. Split Pages is an Activity that creates a child Batch Page for each page in the PDF or TIF. This allows Grooper to process and handle these pages as individual objects.

You may download and import the file below into your own Grooper environment (version 2023). There is a Batch with the example document discussed in this tutorial.

About


Split Pages is often a critical component of a Batch Process where documents are imported into new Batches from a digital source (as opposed to scanned paper documents). When a digital file is imported into Grooper, two things happen:

  1. A Batch Folder object is created in the Batch.
  2. The digital file is attached to the Batch Folder.


At this point, the document's content is accessible at the folder level only.

  • For example, we can select this folder and we can navigate through the pages in the attached multipage PDF using the page navigator in a Document Viewer.

We can also process this document at this point. We can apply Grooper activities at the folder level, to this Batch Folder (by setting a Batch Process Step's Scope property to Folder). An activity running on the folder level can manipulate the content in the attached file. For example, if we ran the Recognize activity at the folder level, it would obtain text data from the attached PDF file.


The Split Pages activity allows us to process the document's content at the page level.

  1. When Split Pages is applied to a Batch Folder it will create child Batch Page objects from an attached PDF or TIF file.
    • One Batch Page for each page in the multipage PDF or TIF.
  2. Now that we have individual objects in the Batch for each page in the PDF or TIF, we can then select and process each page individually.

Why Split Pages?

There are two reasons to use the Split Pages activity to split out pages from a multipage document.

  1. To apply activities that require Batch Page objects to function.
    • Namely the Image Processing and Separate activities.
  2. To increase compute efficiency.
    • A Batch Folder is a single object, which can be processed by a single processing thread. If you split out the attached document's pages, each page becomes its own object in the Batch. Each page can also only be processed by a single thread, but with multiple page objects now present, multiple threads can now be used to process the document (one for each page).

Splitting Pages for Specific Activities

Certain Grooper activities require Batch Page objects by design.

  • The Separate activity separates loose pages into folders.
    • If there are no Batch Page objects, there's nothing to separate.
  • The Image Processing activity applies an IP Profile to mutate a page's image in order to clean it up before OCR processing during the Recognize activity.
    • In all but the narrowest of use cases, the Image Processing activity must process Batch Page objects, not Batch Folders. If there are no Batch Page objects, there's nothing for the IP Profile to clean up.


For example, one common situation occurs when importing PDFs that are "packet documents". Contained within the single multi-page file are multiple different documents.


This PDF is an application packet for a scholarship program. Within this six page PDF, there are actually five separate documents:

  • The application form
  • A proposal summary
  • A resume
  • An essay
  • A recommendation letter


Ultimately, we need to run the Separate activity to establish separation points where Grooper can place loose pages into inserted document folders. However, we can't separate this multipage PDF into these component documents without its pages split out. You can't separate pages if you don't have pages to separate!

So, first we need to run the Split Pages activity to add page objects we can manipulate. Then, we can run the Separate activity to separate those pages into folders.


Split Pages adds child pages to the folder, one Batch Page for each page in the PDF attached to the Batch Folder.


With pages present, the Separate activity now has objects it can manipulate, establishing folder separation points and placing pages in sub-folders.

Splitting Pages to Increase Efficiency

Often a Split Pages activity is added as one of the first steps in the Batch Process to increase processing efficiency. Why? To take full advantage of your system's multithreaded processing capabilities.

Many activities can run on the folder or page level and produce the same end result. For example, the Recognize activity will perform OCR and/or native text extraction to obtain text data for a document. This activity can run either on the Batch Folder or Batch Page level.

When running at the Batch Folder level, it will obtain text data from the attached PDF or TIF file. When dealing with large multi-page files, you can encounter a processing bottleneck, leading to increased processing times, if you're only running activities at the folder level.


Imagine you're running Recognize and you have six processing threads available. Each thread can process one object in the Batch.

If you apply the activity at the folder level, each thread will process a single Batch Folder.

  • So, one thread works to Recognize the document folder, leaving the remaining five with nothing to do.


However, if you first apply the Split Pages activity, you will have one page object for each page in the attached PDF or TIF. In this case, the attached PDF was six pages. So, we have six child Batch Pages subsequent activities can process.

If you apply the Recognize activity at the page level, each thread will process a single Batch Page.

  • All six threads are now utilized.
  • Instead of just a single thread doing all the work, it's spread across six, vastly reducing the time it takes to process the single parent document.
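
The six-thread example above can be sketched as a small, idealized timing model. This is only an illustration: the per-page cost is a made-up number, the function names are hypothetical, and the model ignores scheduling and storage overhead. It is not a Grooper API.

```python
import math

def passes_needed(num_objects: int, threads: int) -> int:
    """Number of rounds needed when each thread handles one object at a time."""
    if num_objects < 0 or threads < 1:
        raise ValueError("num_objects must be >= 0 and threads must be >= 1")
    return math.ceil(num_objects / threads)

def elapsed_ms(num_objects: int, threads: int, ms_per_object: int) -> int:
    """Idealized wall-clock time, ignoring scheduling and storage overhead."""
    return passes_needed(num_objects, threads) * ms_per_object

PAGES = 6
THREADS = 6
MS_PER_PAGE = 2000  # hypothetical Recognize cost for a single page

# Folder level: one object (the folder) carries the cost of all six pages.
folder_level_ms = elapsed_ms(1, THREADS, PAGES * MS_PER_PAGE)
# Page level: six objects, each carrying the cost of one page.
page_level_ms = elapsed_ms(PAGES, THREADS, MS_PER_PAGE)

print(folder_level_ms, page_level_ms)
```

Under these assumptions the folder-level run is bound by one thread doing six pages of work, while the page-level run finishes in the time it takes to process one page.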

When should I use Split Pages to increase efficiency?

You will get the greatest benefit from the Split Pages activity if you are processing larger multipage PDF or TIF files.

  • The more pages there are in the source file, generally the greater the efficiency reward.
  • Furthermore, the more processing able to be done at the page level, the more efficiency you'll reap from splitting pages. For example, it will be advantageous for you to use Split Pages to make a Recognize step in your Batch Process more efficient. It will be even more advantageous if you have both a Recognize step and an Image Processing step in your Batch Process, as both those activities can be executed at the page level.

You will receive less benefit if your imported files are small, one to two page PDFs or TIF files.

  • In that case the processing effort to split the pages may not be made up by increased parallelism in subsequent steps in your Batch Process.

Are there any computing drawbacks to the Split Pages activity?

Be aware the Split Pages activity will necessarily eat into your Grooper file store's storage space. By creating a new page object, you're effectively making a copy of the PDF or TIF's page. That page's image must be stored somewhere (your Grooper Repository's file store location).

  • If your Grooper file store is severely limited in size, you may not have the storage space necessary to split out the pages of several large digital files.
  • Also, keep in mind SSD storage is faster than HDD storage. You may experience latency if your Grooper file store's working storage is itself a slower storage medium.

Bursting and Rendering PDFs

There are a few different ways to split pages from PDF documents in Grooper. Being aware of how the page objects are created can greatly increase your processing efficiency in many ways.

First, you must be aware of the differences between "image-based" PDF pages and "text-based" (also called "true" or "native-text") PDF pages.

Image-Based

An image-based PDF page's visible content is defined by a single image. For example, this could be a scanned paper page saved to a PDF page. This is the realm of raster graphics. A mosaic of pixels constructs the image presented to the reader on a computer screen.

  • This includes "Single Image" as well as "Searchable" PDF pages.
    • "Single Image" pages are totally defined by one, single image on the page.
    • "Searchable" pages are those with invisible text overlaid on a single image.

Text-Based

A text-based PDF page's visible content is defined by digitally authored content. For example, a PDF form created from scratch in Adobe Acrobat or a Word file printed to a PDF format would both be text-based PDF pages. This is the realm of vector graphics. Text and graphics are rendered visually by mathematical formulas.

  • This includes "Text Only" as well as "Mixed" PDF pages.
    • "Text Only" pages are totally defined by vector graphics.
    • "Mixed" pages are those with a combination of vector graphics and raster graphics. Part of the page may be digital text, but another may be an image. For example, a digital form with an embedded image of a company's logo would be a Mixed PDF page.

Depending on how the Split Pages activity is configured, one of three things is going to happen:

  1. All page objects will be created as PDF pages.
  2. All page objects will be created as JPEG images.
  3. Page objects will be conditionally created as PDF pages or JPEG images.
    • Text-based pages will be created as PDF pages.
    • Image-based pages will be created as JPEG images.

You will use the PDF Options properties to achieve your desired result. How you configure these properties will determine whether the page object is created as a PDF page or a JPEG image.

Depending on your goal, you will enable or disable a combination of the following properties.

  1. PDF Page Extraction
    • This property determines whether or not split page objects are created as PDFs.
    • Setting this property to Auto, Selective or Standard will enable PDF extraction.
      • All text-based PDF pages will be split as PDF pages.
      • Image-based PDF pages will be split as JPEG page objects if Image Bursting is Enabled, or PDF page objects if Image Bursting is Disabled.
    • Setting this property to Disabled will force all PDF pages to be split as JPEG pages.
  2. Image Bursting
    • When PDF Page Extraction is enabled, this property determines how image-based PDF pages are generated as JPEG images.
    • The page's whole image is extracted (or "burst") from the PDF. This process reconstructs the image pixel-by-pixel. It is an unaltered copy, extracting the raw image resource with no alterations to the pixel values or original resolution.
  3. Rendering
    • When PDF Page Extraction is Disabled, this property determines how all PDF pages are generated as JPEG images.
    • Rendering an image differs from bursting in that a new image is drawn on the new page at a specified resolution.

Configuration Examples


Imagine you've imported a PDF file to a Batch.

The PDF is attached to the Batch Folder and we want to split out its pages.


The PDF has five pages. Some of them are text-based. Others are image-based.

How you configure the PDF Page Extraction, Image Bursting, and Rendering properties will determine whether or not the text-based and image-based pages are split as PDF or JPEG page objects.

Default Settings - Bursting Images

The Split Pages activity's default settings presume you want to extract text-based PDF pages and burst image-based PDF pages.

  1. PDF Page Extraction is set to Auto
    • FYI: Choosing Selective or Standard also enables PDF page extraction.
      • Auto uses Selective mode if two or more pages share a resource dictionary (fonts, bitmaps, images and other resources needed to draw the page). This will copy only the resources used for drawing operations.
        • This is more efficient for poorly formed PDFs where all pages use a single, shared resource dictionary, instead of using a unique dictionary for each individual page, listing only the resources used for each page.
      • Otherwise, Auto will use the Standard mode, which copies all resources listed in the page's resource dictionary.
      • In most cases, Auto ensures the most efficient method is used.
  2. Image Bursting is set to Enabled.
  3. Rendering is set to Enabled.
    • FYI: Even though the property reads as Enabled, it's not actually doing anything in this case. With PDF Page Extraction enabled and Image Bursting enabled, Rendering is effectively disabled whether or not the property is set to Enabled or Disabled.
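
The rule Auto follows when choosing between Selective and Standard can be sketched as below. This is a minimal illustration of the rule as stated above, not Grooper's implementation: the function name and the idea of passing one identifier per page (naming the resource dictionary that page uses) are assumptions made for the example.

```python
from collections import Counter
from typing import Hashable, Iterable

def resolve_auto_mode(resource_dict_ids: Iterable[Hashable]) -> str:
    """Pick the extraction mode Auto would use.

    resource_dict_ids holds one hypothetical identifier per page, naming
    the resource dictionary that page draws from.
    """
    counts = Counter(resource_dict_ids)
    # "Two or more pages share a resource dictionary" means some id repeats.
    shared = any(n >= 2 for n in counts.values())
    return "Selective" if shared else "Standard"

# A poorly formed PDF: every page points at the same shared dictionary.
print(resolve_auto_mode(["shared", "shared", "shared"]))
# A well formed PDF: each page has its own dictionary.
print(resolve_auto_mode(["p1", "p2", "p3"]))
```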


With this configuration, page objects will be conditionally created as PDF pages or JPEG images.

  • Image-based pages will be split as JPEG images.
  • Text-based pages will be split as PDF pages.


Grooper will create a Batch Page for each page in the attached PDF. For text-based pages in the attached PDF, the Batch Page will be a single-page PDF copy of the corresponding page in the original PDF. For image-based pages in the attached PDF, the Batch Page will be a JPEG image copied directly from the corresponding page in the original PDF.

Please be aware there is one exception to this rule.

A "Multi-Image" PDF page is an image-based PDF page whose visible content is defined by multiple images, but otherwise has no digitally authored content. Linearized PDF pages are typically multi-image, which allows for faster load times when viewing PDFs over the internet.

Split Pages' internal logic looks for a single image making up a PDF page's content when determining if it's image-based. Therefore, Split Pages will not consider "Multi-Image" PDF pages to be image-based and will split them as PDF pages (not JPEG images).

However, "Multi-Image" PDF pages' images can be burst using the Rasterize command. For more information, see the "Using the Rasterize Execute Command" section of this article.

Splitting All Pages as PDFs - PDF Page Extraction ONLY

In certain cases, you may want to force PDF page extraction for all PDF pages, regardless of whether they are text-based or image-based.

  • For example, in order to make efficient use of the Rasterize command.

You will need to disable Image Bursting and Rendering to do this. See below for this Split Pages configuration.

  1. PDF Page Extraction is enabled.
  2. Image Bursting is set to Disabled.
  3. Rendering is set to Disabled.
    • Technically speaking, if Rendering is set to Enabled, rendering will still be ignored, effectively disabling it. Enabling PDF Page Extraction supersedes Rendering.
    • However, it is still considered best practice to set this property to Disabled.


With this configuration, all page objects will be created as PDF pages.


Grooper will create a Batch Page for each page in the attached PDF. Each Batch Page will be a single-page PDF copy of the corresponding page in the attached PDF.

Splitting All Pages as JPEG images - Rendering

In certain cases, you may want to force all PDF pages to be rendered as images, regardless of whether they are text-based or image-based.

  • For example, in order to split damaged PDFs which otherwise cannot be split using normal methods. These PDFs may be able to be displayed, but not split using PDF page extraction. Rendering the pages as images will at least allow the page objects to be processed as images.

You can disable PDF Page Extraction and Image Bursting to do this. See below for this Split Pages configuration.

  1. PDF Page Extraction is set to Disabled.
  2. Image Bursting is set to Disabled.
  3. Rendering is set to Enabled.


FYI

If Image Bursting is set to Enabled, you will get a similar result in that all split pages will be created as JPEG images. However, there is a difference.

Bursting makes a perfect copy of the image resource in an image-based PDF page, using its native resolution. Rendering re-draws a new image at a resolution you, the user, define (200 dpi by default).

Therefore, with Image Bursting disabled, Rendering allows you to normalize the resolution for all pages created. Disabling Image Bursting also allows you to define (and thus normalize) the color format and color depth for each page as well by configuring those settings in the Rendering properties.


With this configuration, all page objects will be created as JPEG pages.


Grooper will create a Batch Page for each page in the attached PDF. Each Batch Page will be a JPEG image re-drawn from the corresponding page in the original PDF. The JPEG image will be drawn using the resolution, color format, and color depth defined in the Rendering settings.
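
The three configurations above, together with the "Multi-Image" exception, can be summarized as a small decision function. This is a sketch of the behavior as this article describes it, not Grooper code: the enum, function name, and parameters are hypothetical. The Rendering property is left out of the signature because, per the text above, it only affects how JPEGs are drawn when PDF Page Extraction is Disabled, not whether a page ends up as a PDF or a JPEG. The case where PDF Page Extraction, Image Bursting, and Rendering are all disabled is not described in this article, so the sketch simply follows the stated rule that Disabled extraction yields JPEG pages.

```python
from enum import Enum

class PageKind(Enum):
    TEXT_BASED = "text-based"     # "Text Only" or "Mixed" pages
    IMAGE_BASED = "image-based"   # "Single Image" or "Searchable" pages
    MULTI_IMAGE = "multi-image"   # several images, no authored content

EXTRACTION_ON = {"Auto", "Selective", "Standard"}

def split_page_format(kind: PageKind, pdf_page_extraction: str,
                      image_bursting: bool) -> str:
    """Return "PDF" or "JPEG" for a page split by Split Pages."""
    if pdf_page_extraction == "Disabled":
        # Extraction off: every page becomes a JPEG image.
        return "JPEG"
    if pdf_page_extraction not in EXTRACTION_ON:
        raise ValueError(f"unknown mode: {pdf_page_extraction!r}")
    # Extraction on: only single-image pages are eligible for bursting.
    # Multi-image pages are not treated as image-based by Split Pages.
    if kind is PageKind.IMAGE_BASED and image_bursting:
        return "JPEG"
    return "PDF"

# Default settings: Auto extraction with Image Bursting enabled.
for kind in PageKind:
    print(kind.value, "->", split_page_format(kind, "Auto", True))
```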

Why do I care?

Why should you care whether or not the page is split as a PDF or JPEG image? There are several reasons. Some are practical, relating to necessary processing requirements. Others relate to computing efficiency.

Practical Considerations

  1. I have text-based PDFs with purely native, digitally encoded text.
    • Text-based PDFs already have text data embedded in the page. Whereas images must be OCR'd to get text data.
    • If your documents are text-based, you more likely than not just want to extract the raw native text data from the PDF. There's no reason to OCR, if your documents already have good, machine readable text embedded in them.
    • Split as PDF pages to perform native text extraction during Recognize.
      • Ensure PDF Page Extraction is enabled.
  2. My PDF documents have a mix of text-based and image-based pages.
    • This is what Bursting is for. Bursting a PDF will create PDF page objects for text-based pages and JPEG page objects for image-based pages.
      • Ensure PDF Page Extraction is enabled and Image Bursting is Enabled.
  3. I have image-based PDF pages that require substantial permanent image processing cleanup.
    • The Image Processing activity performs permanent image cleanup on a digital image. With few exceptions, it is preferable for Image Processing to process a JPEG image, not a PDF page (see the "Image Processing Considerations" section for more details).
    • Split image-based PDF pages as JPEG images to clean up pages with the Image Processing activity.
      • If PDF Page Extraction is enabled, ensure Image Bursting is Enabled.
      • If PDF Page Extraction is Disabled, ensure Rendering is Enabled.
  4. I have "Searchable" PDF pages and I want Grooper to re-OCR the image to get new text data.
    • You want to end up with a JPEG page object in this case.
    • This is also what Bursting is for. Searchable PDFs are considered image-based. Bursting these pages will generate a JPEG image for the page, ensuring the Recognize activity will perform OCR.
      • If PDF Page Extraction is enabled, ensure Image Bursting is Enabled.
      • If PDF Page Extraction is disabled, ensure Rendering is Enabled.
  5. I have "Searchable" PDF pages and I do not want Grooper to re-OCR the images. I want the embedded text overlay extracted.
    • You want to end up with a PDF page object in this case.
    • You MUST disable Bursting in this case. Disabling Bursting will ensure a PDF page object is created, with the embedded text data still present. This will allow the Recognize activity to extract the embedded text data.
      • Ensure PDF Page Extraction is enabled and Image Bursting is Disabled.
  6. I have text-based PDFs with malformed or semi-corrupted encoded text. Or, the PDF is damaged to the point it can't be split using normal PDF page extraction.
    • You may need to end up with all JPEG page objects in this case.
    • Sometimes text data can be corrupted. While a page appears readable to a human, the encoded text data may be a bunch of gibberish. Furthermore, some PDF authors will intentionally obfuscate text characters by swapping character glyphs as a kind of watermarking. For example, a printed "A" on the page may be encoded as a "£".
    • In both cases, while the printed characters look fine visually, the underlying text data is inaccurate. If the issue is bad enough, you may need to OCR the pages instead of extracting the embedded text data to get the most accurate text data from the document. Splitting text-based PDF pages as JPEG images will ensure Recognize will OCR the pages (Since they are JPEGs and not PDFs, there's no native text to extract. So Recognize is forced to use OCR).
      • Ensure PDF Page Extraction is Disabled and Rendering is Enabled.
  7. I have "Mixed" PDF pages that have a combination of native and printed text. I need text data for both.
    • You want to end up with a PDF page object in this case.
    • This is more of a Recognize issue than a Split Pages issue. The Recognize activity is designed to extract native text segments from a PDF and perform OCR from the image-based portions, even when both are present on the same page. You just need to make sure a PDF page object is created, not a JPEG object. The PDF page will have both the native digital text and the image embedded in its resources.
      • Ensure PDF Page Extraction is enabled.
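
The seven scenarios above can be collected into a lookup table, which makes it easy to check a proposed configuration against the recommendation. This is only a sketch: the scenario keys, the `Settings` type, and the `matches` helper are hypothetical names, and `None` marks a property for which the list above states no requirement.

```python
from typing import NamedTuple, Optional

class Settings(NamedTuple):
    pdf_page_extraction: bool              # True = Auto, Selective or Standard
    image_bursting: Optional[bool] = None  # None = no requirement stated
    rendering: Optional[bool] = None       # None = no requirement stated

BURST_OR_RENDER = [
    Settings(True, image_bursting=True),
    Settings(False, rendering=True),
]

RECOMMENDED = {
    "native_text": [Settings(True)],                                # 1
    "mixed_text_and_image_pages": [Settings(True, image_bursting=True)],  # 2
    "image_cleanup": BURST_OR_RENDER,                               # 3
    "searchable_reocr": BURST_OR_RENDER,                            # 4
    "searchable_keep_text": [Settings(True, image_bursting=False)],  # 5
    "corrupt_text_or_damaged": [Settings(False, rendering=True)],   # 6
    "mixed_page_content": [Settings(True)],                         # 7
}

def matches(scenario: str, extraction: bool, bursting: bool,
            rendering: bool) -> bool:
    """True if the given settings satisfy a recommendation for the scenario."""
    for rec in RECOMMENDED[scenario]:
        if rec.pdf_page_extraction != extraction:
            continue
        if rec.image_bursting is not None and rec.image_bursting != bursting:
            continue
        if rec.rendering is not None and rec.rendering != rendering:
            continue
        return True
    return False

# Keeping the embedded text of Searchable pages requires bursting to be off.
print(matches("searchable_keep_text", True, False, False))
print(matches("searchable_keep_text", True, True, False))
```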

Efficiency Considerations

If you have PDFs with predominately text-based pages, the default Split Pages configuration should be just fine. However, if you have large, multipage PDFs with image-based pages, you may want to consider using the Rasterize command to increase your processing efficiency.

In the next section, we will discuss using Rasterize to increase processing efficiency when splitting large multipage image-based PDFs.

Using the Rasterize Execute Command

Generating image content takes time, because your computer must perform the operations required to produce each image. Whether images are extracted from PDF pages using Image Bursting or re-drawn using Rendering, there will be a performance cost of some sort. The cost is higher the more pages there are in the document. The Rasterize command allows this process to run multithreaded, increasing efficiency.

Instead of image bursting or rendering during the Split Pages activity, you will split all pages in the parent document as PDF pages. Then, you will burst or render the pages using the Execute activity and the Rasterize command.

  • So, Split Pages does the PDF page extraction. Then, Execute (using the Rasterize command) does the image bursting or rendering.

More or less, you're using two activities, Split Pages and Execute (Rasterize), to achieve the same end result as Split Pages alone. How is this more efficient? Multithreading!


The Split Pages activity always runs on the folder level. That means only one of your computer's processing threads can work to split pages out of a PDF attached to a Batch Folder.

Imagine you have six processing threads. Each thread can process one object in the Batch, in this case a Batch Folder.

With large image-based PDF files with a high number of pages, bursting or rendering during Split Pages can take a significant amount of processing time. This can cause a bottleneck in your document process while threads are stuck working on splitting these large PDF files for each document folder in the Batch.

This is because both image bursting and rendering take the most time to complete. Your processing threads are simply stuck doing the laborious job of either copying out the page's image (in the case of image bursting) or redrawing the page (in the case of rendering).


On the other hand, splitting out pages with PDF page extraction is less processing intensive and takes less time.

With PDF pages split out, the processing intensive portion of the whole thing (image bursting or rendering) can be offloaded to the Rasterize command using the Execute activity. Now that we have page objects, we can run the Execute activity on the page level and use the Rasterize command to perform image bursting or rendering.

Instead of just a single thread processing the document using Split Pages alone, multiple threads are used to burst or render images for each page object.
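
The trade-off can be illustrated with a rough timing model for a single document. Everything here is an assumption made for the example: the per-page costs are invented, the function names are hypothetical, and the model ignores scheduling overhead and file store I/O (which is one reason small documents may see little or no gain in practice).

```python
import math

def split_only_ms(pages: int, raster_ms: int) -> int:
    """Split Pages bursts or renders every page itself, on one thread."""
    return pages * raster_ms

def split_then_rasterize_ms(pages: int, threads: int,
                            extract_ms: int, raster_ms: int) -> int:
    """Split Pages only extracts PDF pages; Rasterize then runs per page."""
    split_phase = pages * extract_ms                       # folder level, one thread
    raster_phase = math.ceil(pages / threads) * raster_ms  # page level, parallel
    return split_phase + raster_phase

PAGES = 120
THREADS = 6
EXTRACT_MS = 50   # hypothetical cost to extract one PDF page
RASTER_MS = 500   # hypothetical cost to burst or render one page

print(split_only_ms(PAGES, RASTER_MS))
print(split_then_rasterize_ms(PAGES, THREADS, EXTRACT_MS, RASTER_MS))
```

With these invented numbers, the two-step approach pays a small single-threaded extraction cost up front and then spreads the expensive bursting or rendering work across all six threads.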

How To

Using the Rasterize command is literally a two step process in a Batch Process.

  1. Add a Split Pages step with Image Bursting and Rendering disabled.
  2. Add an Execute activity with the Batch Page > Rasterize command added.
    • Then enable Image Bursting if you want to burst images from image-based PDF pages.
    • Or, enable Rendering if you want to render all PDF pages as JPEG images.

Step 1: Split Pages

First, you will add a Split Pages step to your Batch Process. This should be placed in your Batch Process wherever it makes logical sense according to your organization's Batch Process design. Typically, this will be one of the first steps in the Batch Process.

  1. Add a step to the Batch Process.
  2. For the Activity Type property, select Split Pages.
    • Hint: This is found in the "Document Transforms" heading in the dropdown selector.
  3. Ensure the PDF Page Extraction property is enabled.
  4. Set the Image Bursting property to Disabled.
  5. Set the Rendering property to Disabled.


This configuration setting will ensure all PDF pages are split as PDF page objects.

Step 2: Execute (Rasterize)

Next, we will implement the Rasterize command. To do this, we will need to add an Execute step to the Batch Process.

  1. Add a step to the Batch Process.
  2. For the Activity Type property, select Execute.
    • Hint: This is found in the "Utilities" heading in the dropdown selector.
  3. Set the Scope property to Page.
    • DO NOT FORGET THIS STEP. The Execute activity's Scope defaults to Folder. The Rasterize command is designed to operate on the page level. If you don't change the processing scope to the page level, it won't do anything to the items in your Batch at all.
  4. Press the ellipsis button at the end to add a new command.
  5. This will bring up the "Commands" list collection editor window.
  6. Press the Add button.
  7. This will add a new command to the Execute activity's command list.
  8. Click the drop-down for the Command property.
  9. In the drop-down menu select Batch Page - Rasterize.
Configure Rasterize to Render Images

If you want the Rasterize command to render images, you will enable the Rendering property. This will generate a JPEG image for every page with the resolution, color format, and color depth settings defined in the Rendering sub-properties.

  1. Set the Bursting property to Disabled.
  2. Set the Rendering property to Enabled.
  3. Press OK when finished.

Configure Rasterize to Burst Images

If you want the Rasterize command to burst images, you will leave the default settings configured. This will leave text-based PDF pages alone, but will burst out the images from image-based PDFs, resulting in a JPEG image for the Batch Page.

  1. Ensure the Bursting property is Enabled.
  2. Ensure the Rendering property is Disabled.
  3. Press OK when finished.
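
The two steps above can be written down as plain data and checked for the mistakes this section warns about, most importantly a forgotten Scope setting. Grooper is configured through its own interface, so this is purely an illustrative checklist: the dictionary keys and the `check_rasterize_process` helper are hypothetical and do not correspond to any Grooper API or export format. The "exactly one of Bursting or Rendering" check reflects the two Rasterize configurations shown above.

```python
from typing import Dict, List

def check_rasterize_process(steps: List[Dict]) -> List[str]:
    """Return a list of problems found in a Split Pages + Execute pair."""
    split = next((s for s in steps if s.get("activity") == "Split Pages"), None)
    execute = next((s for s in steps if s.get("activity") == "Execute"), None)
    if split is None or execute is None:
        return ["process needs a Split Pages step and an Execute step"]

    problems = []
    if steps.index(split) > steps.index(execute):
        problems.append("Split Pages must run before Execute")
    if split.get("pdf_page_extraction") == "Disabled":
        problems.append("Split Pages must have PDF Page Extraction enabled")
    if split.get("image_bursting") or split.get("rendering"):
        problems.append("disable Image Bursting and Rendering on Split Pages")
    # The Execute activity's Scope defaults to Folder.
    if execute.get("scope", "Folder") != "Page":
        problems.append("Execute Scope must be Page")
    if bool(execute.get("bursting")) == bool(execute.get("rendering")):
        problems.append("enable exactly one of Bursting or Rendering on Rasterize")
    return problems

process = [
    {"activity": "Split Pages", "pdf_page_extraction": "Auto",
     "image_bursting": False, "rendering": False},
    {"activity": "Execute", "scope": "Page",
     "command": "Batch Page - Rasterize",
     "bursting": True, "rendering": False},
]
print(check_rasterize_process(process))
```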

Is Rasterize for Me?

When Rasterize is for you:

  1. If your PDF documents are large, multipage files with image-based pages (more than, say, 50 pages), you will benefit from running Rasterize.
  2. If you are using Rasterize to render images (as opposed to burst images), you will receive the most benefit from running Rasterize.
    • FYI: Rendering takes more time than image bursting because the image is literally re-drawn instead of directly copied.
  3. If your PDF documents contain "Multi-Image" PDF pages and need to have their images burst, you must use Rasterize to do so.
    • "Multi-Image" PDF pages will be split as PDF pages when Split Pages runs with Image Bursting enabled, but will be split as JPEG images when Rasterize runs with Image Bursting enabled.

In these cases it is advised to use the Split Pages activity followed by the Execute (Rasterize) activity.

When Rasterize is NOT for you:

  1. If your PDF documents are 100% text-based pages, you will not benefit from running Rasterize.
  2. If your PDF documents are image-based but small (less than, say, 50 pages), you will receive little (and possibly no) benefit running Rasterize.
  3. If your PDF documents have mixed pages, where some are image-based and others are text-based, but only a few image-based pages per document, you will receive little to no benefit running Rasterize.

In these cases it is advised to only use the Split Pages activity.
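
The guidance in the two lists above can be condensed into a rule of thumb. This is a loose sketch, not a recommendation engine: the function name and parameters are hypothetical, the 50-page figure is the rough "say, 50" threshold mentioned above, and applying it to the count of image-based pages is an interpretation made for this example.

```python
def rasterize_recommended(image_pages: int,
                          multi_image_needs_burst: bool = False,
                          large_threshold: int = 50) -> bool:
    """Rule of thumb for adding an Execute (Rasterize) step.

    image_pages: image-based pages per document (0 for 100% text-based).
    multi_image_needs_burst: True if "Multi-Image" pages must be burst,
        which Split Pages alone will not do.
    large_threshold: the rough "say, 50" page figure; an assumption.
    """
    if multi_image_needs_burst:
        return True
    return image_pages > large_threshold

print(rasterize_recommended(0))      # 100% text-based document
print(rasterize_recommended(30))     # small image-based document
print(rasterize_recommended(200))    # large image-based document
print(rasterize_recommended(0, multi_image_needs_burst=True))
```

Because rendering costs more per page than bursting, a process that renders may benefit at lower page counts than this threshold suggests.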

What if I'm not sure if Rasterize is for me, or it seems some of my documents coming in should use it and others won't benefit from it?

We often recommend using Rasterize as a "standard" Batch Process configuration. Even if it doesn't dramatically help your processing efficiency, it typically will not dramatically reduce it either. If you are processing a decent amount of image-based PDF pages, you should at least try out using Rasterize to speed up the split operation. If it helps, great! If it's a net neutral, no harm, no foul. If it's increasing your processing times, simply remove the Execute (Rasterize) activity from your Batch Process and reconfigure your Split Pages activity as needed.