2021:Split Pages (Activity): Difference between revisions
Dgreenwood (talk | contribs) |
Line 187: | Line 187:
{|cellpadding=10 cellspacing=5
|valign=top style="width:40%"|
You will use the '''''PDF Options''''' properties to achieve your desired result. How you configure these properties will determine whether the page object is created as a PDF page or a JPEG image.
Line 194: | Line 193:
#* This property determines whether or not split page objects are created as PDFs.
#* Setting this property to ''Auto'', ''Selective'' or ''Standard'' will enable PDF extraction.
#** All ''text-based'' PDF pages will be split as PDF pages.
#** ''Image-based'' PDF pages will be split as JPEG page objects if '''''Image Bursting''''' is ''Enabled'', or PDF page objects if '''''Image Bursting''''' is ''Disabled''.
#* Setting this property to ''Disabled'' will force all PDF pages to be split as JPEG pages.
# '''''Image Bursting'''''
#* When '''''PDF Page Extraction''''' is enabled, this property determines how ''image-based'' PDF pages are generated as JPEG images.
#* The page's whole image is extracted (or "burst") from the PDF. This process reconstructs the image pixel-by-pixel. It is an unaltered copy, extracting the raw image resource with no alterations to the pixel values or original resolution.
# '''''Rendering'''''
#* When '''''PDF Page Extraction''''' is ''Disabled'', this property determines how ''all'' PDF pages are generated as JPEG images.
#* Rendering an image differs from bursting in that a ''new'' image is drawn on the new page at a specified resolution.
|
[[File:Split-pages-about-09.png]]
|}
Line 221: | Line 223:
[[File:Split-pages-pre-split-page-types-graphic.png|left|496px]]
|}
<tabs style="margin:20px">
<tab name = "Default Settings - Bursting Images" style="margin:20px">
=== Default Settings - Bursting Images ===
</tab>
<tab name = "Splitting All Pages as PDFs - PDF Page Extraction ONLY" style="margin:20px">
=== Splitting All Pages as PDFs - PDF Page Extraction ONLY ===
</tab>
<tab name = "Splitting All Pages as JPEG images - Rendering" style="margin:20px">
=== Splitting All Pages as JPEG images - Rendering ===
</tab>
</tabs>
=== Why do I care? ===
Revision as of 10:52, 28 April 2022
Split Pages is an activity that will split a multi-page PDF or TIF document into individual pages.
When applied to a Batch Folder with an attached PDF or TIF file, the Split Pages activity creates a Batch Page object for each page in the file. These page objects are created as children of the Batch Folder.
About
We can also process this document at this point. We can apply Grooper activities at the folder level, to this Batch Folder (by setting a Batch Process Step's Scope property to Folder). An activity running on the folder level can manipulate the content in the attached file. For example, if we ran the Recognize activity at the folder level, it would obtain text data from the attached PDF file.
Why Split Pages?
There are two reasons to use the Split Pages activity to split out pages from a multipage document.
- To apply activities that require Batch Page objects to function.
  - Namely the Image Processing and Separate activities.
- To increase compute efficiency.
  - A Batch Folder is a single object, which can be processed by a single processing thread. If you split out the attached document's pages, each page becomes its own object in the Batch. Each page can also only be processed by a single thread, but with multiple page objects now present, multiple threads can now be used to process the document (one for each page).
Splitting Pages for Specific Activities
Certain Grooper activities require Batch Page objects by design.
- The Separate activity separates loose pages into folders. If there are no Batch Page objects, there's nothing to separate.
- The Image Processing activity applies an IP Profile to alter a page's image in order to clean it up before OCR processing during the Recognize activity. In all but the narrowest of use cases, the Image Processing activity must process Batch Page objects, not Batch Folders. If there are no Batch Page objects, there's nothing for the IP Profile to clean up.
So, first we need to run the Split Pages activity to add page objects we can manipulate. Then, we can run the Separate activity to separate those pages into folders.
Splitting Pages to Increase Efficiency
Often a Split Pages activity is added as one of the first steps in the Batch Process to increase processing efficiency. Why? To take full advantage of your system's multithreaded processing capabilities.
Many activities can run on the folder or page level and produce the same end result. For example, the Recognize activity will perform OCR and/or native text extraction to obtain text data for a document. This activity can run either on the Batch Folder or Batch Page level.
When running at the Batch Folder level, it will obtain text data from the attached PDF or TIF file. When dealing with large multi-page files, you can encounter a processing bottleneck, leading to increased processing times, if you're only running activities at the folder level.
If you apply the Recognize activity at the folder level, each thread will process a single Batch Folder.

If you apply the Recognize activity at the page level, each thread will process a single Batch Page.
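The difference between folder-level and page-level processing can be sketched with a rough scheduling estimate. This is an illustration only, not how Grooper schedules work internally: the function names, the one-second-per-page cost, and the eight-thread pool are all assumptions made for the example.

```python
import heapq


def estimated_wall_time(task_costs, thread_count):
    """Greedy estimate: each task goes to whichever thread frees up first."""
    free_at = [0.0] * thread_count  # time at which each thread becomes idle
    heapq.heapify(free_at)
    for cost in sorted(task_costs, reverse=True):
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + cost)
    return max(free_at)


def folder_level_tasks(page_counts, seconds_per_page):
    # One task per Batch Folder: a single thread works through every page.
    return [pages * seconds_per_page for pages in page_counts]


def page_level_tasks(page_counts, seconds_per_page):
    # One task per Batch Page: pages can be spread across threads.
    return [seconds_per_page] * sum(page_counts)


# Hypothetical batch: one Batch Folder holding a 100-page PDF, 8 threads.
page_counts = [100]
folder_time = estimated_wall_time(folder_level_tasks(page_counts, 1.0), 8)
page_time = estimated_wall_time(page_level_tasks(page_counts, 1.0), 8)
```

Under these assumed numbers the folder-level run is bound to one thread for all 100 pages, while the page-level run spreads the same work over all eight threads.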
When should I use Split Pages to increase efficiency?
You will get the greatest benefit from the Split Pages activity if you are processing larger multipage PDF or TIF files.
- Generally, the more pages there are in the source file, the greater the efficiency gain.
- Furthermore, the more processing that can be done at the page level, the more efficiency you will gain from splitting pages. For example, it will be advantageous for you to use Split Pages to make a Recognize step in your Batch Process more efficient. It will be even more advantageous if you have both a Recognize step and an Image Processing step in your Batch Process, as both those activities can be executed at the page level.
You will receive less benefit if your imported files are small, one- to two-page PDF or TIF files.
- In that case, the processing effort to split the pages may not be offset by the increased parallelism in subsequent steps of your Batch Process.
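The trade-off above can be put into rough arithmetic. The sketch below is a simplified model, not a measurement of Grooper: it assumes the split itself runs serially at a fixed cost per page, and every number and function name is hypothetical.

```python
import math


def folder_level_seconds(page_count, seconds_per_page):
    # A single thread processes the whole attached file.
    return page_count * seconds_per_page


def split_then_page_level_seconds(page_count, seconds_per_page,
                                  split_seconds_per_page, threads):
    # Assumption: splitting is serial; later work runs in parallel "rounds".
    split_cost = page_count * split_seconds_per_page
    rounds = math.ceil(page_count / threads)
    return split_cost + rounds * seconds_per_page


def splitting_pays_off(page_count, seconds_per_page,
                       split_seconds_per_page, threads):
    split_time = split_then_page_level_seconds(
        page_count, seconds_per_page, split_seconds_per_page, threads)
    return split_time < folder_level_seconds(page_count, seconds_per_page)


small_file = splitting_pays_off(2, 1.0, 0.75, 8)    # two-page file
large_file = splitting_pays_off(200, 1.0, 0.75, 8)  # 200-page file
```

With these assumed costs, the two-page file is slower after splitting while the 200-page file is faster. Real break-even points depend on your hardware and on which activities follow Split Pages.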
Are there any drawbacks to the Split Pages activity?
Be aware the Split Pages activity will necessarily eat into your Grooper file store's storage space. By creating a new page object, you're effectively making a copy of the PDF or TIF's page. That page's image must be stored somewhere (your Grooper Repository's file store location).
- If your Grooper file store is severely limited in size, you may not have the storage space necessary to split out the pages of several large files.
- Also, keep in mind SSD storage is faster than HDD storage. You may experience latency if your Grooper file store's working storage is itself a slower storage medium.
Bursting and Rendering PDFs
There are a few different ways to split pages from PDF documents in Grooper. Being aware of how the page objects are created can greatly increase your processing efficiency in many ways.
First, you must be aware of the differences between "image-based" PDF pages and "text-based" (also called "true" or "native-text") PDF pages.
An image-based PDF page's visible content is defined by a single image. For example, this could be a scanned paper page saved to a PDF page. This is the realm of raster graphics. A mosaic of pixels constructs the image presented to the reader on a computer screen.

A text-based PDF page's visible content is defined by digitally authored content. For example, a PDF form created from scratch in Adobe Acrobat or a Word file printed to a PDF format would both be text-based PDF pages. This is the realm of vector graphics. Text and graphics are rendered visually by mathematical formulas.
Depending on how the Split Pages activity is configured, one of three things is going to happen:
1. All page objects will be created as PDF pages.
2. All page objects will be created as JPEG images.
3. Page objects will be conditionally created as PDF pages or JPEG images.
   - Text-based pages will be created as PDF pages.
You will use the PDF Options properties to achieve your desired result. How you configure these properties will determine whether the page object is created as a PDF page or a JPEG image. Depending on your goal, you will enable or disable a combination of the following properties.
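The way these properties combine can be summarized as a small decision function. This is a sketch of the behavior described on this page, not Grooper code: the function name, the `page_kind` values, and the string settings are hypothetical stand-ins, and the Rendering property (which controls how JPEGs are drawn when PDF Page Extraction is Disabled) is not modeled.

```python
EXTRACTION_MODES = {"Auto", "Selective", "Standard", "Disabled"}


def split_page_format(page_kind, pdf_page_extraction, image_bursting):
    """Return "PDF" or "JPEG" for a page split from a PDF.

    page_kind: "text" for text-based pages, "image" for image-based pages.
    pdf_page_extraction: "Auto", "Selective", "Standard" or "Disabled".
    image_bursting: True if Image Bursting is Enabled.
    """
    if page_kind not in {"text", "image"}:
        raise ValueError(f"unknown page kind: {page_kind!r}")
    if pdf_page_extraction not in EXTRACTION_MODES:
        raise ValueError(f"unknown setting: {pdf_page_extraction!r}")

    if pdf_page_extraction == "Disabled":
        return "JPEG"  # every page becomes a JPEG image
    if page_kind == "text":
        return "PDF"   # text-based pages stay PDF pages
    # Image-based pages are burst to JPEG only when Image Bursting is on.
    return "JPEG" if image_bursting else "PDF"


mixed_document = [
    split_page_format("text", "Auto", True),
    split_page_format("image", "Auto", True),
]
```

For a document with one text-based and one image-based page, extraction enabled plus Image Bursting yields one PDF page object and one JPEG page object.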
Default Settings - Bursting Images
Splitting All Pages as PDFs - PDF Page Extraction ONLY
Splitting All Pages as JPEG images - Rendering
Why do I care?
Why should you care whether or not the page is split as a PDF or JPEG image? There are several reasons. Some are practical, relating to necessary processing requirements. Others relate to computing efficiency.
Practical Considerations
- I have text-based PDFs with purely native, digitally encoded text.
  - Text-based PDFs already have text data embedded in the page, whereas images must be OCR'd to get text data.
  - If your documents are text-based, you most likely just want to extract the raw native text data from the PDF. There's no reason to OCR if your documents already have good, machine-readable text embedded in them.
  - Split as PDF pages to perform native text extraction during Recognize.
  - Ensure PDF Page Extraction is enabled.
- My PDF documents have a mix of text-based and image-based pages.
  - This is what Bursting is for. Bursting a PDF will create PDF page objects for text-based pages and JPEG page objects for image-based pages.
  - Ensure PDF Page Extraction is enabled and Image Bursting is Enabled.
- I have image-based PDF pages that require substantial permanent image processing cleanup.
  - The Image Processing activity performs permanent image cleanup on a digital image. With few exceptions, it is preferable for Image Processing to process a JPEG image, not a PDF page (see the Image Processing Considerations section for more details).
  - Split image-based PDF pages as JPEG images to clean up pages with the Image Processing activity.
  - If PDF Page Extraction is enabled, ensure Image Bursting is Enabled.
  - If PDF Page Extraction is Disabled, ensure Rendering is Enabled.
- I have "Searchable" PDF pages and I want Grooper to re-OCR the image to get new text data.
  - You want to end up with a JPEG page object in this case.
  - This is also what Bursting is for. Searchable PDFs are considered image-based. Bursting these pages will generate a JPEG image for the page, ensuring the Recognize activity will perform OCR.
  - If PDF Page Extraction is enabled, ensure Image Bursting is Enabled.
  - If PDF Page Extraction is Disabled, ensure Rendering is Enabled.
- I have "Searchable" PDF pages and I do not want Grooper to re-OCR the images. I want the embedded text overlay extracted.
  - You want to end up with a PDF page object in this case.
  - You MUST disable Bursting in this case. Disabling Bursting will ensure a PDF page object is created, with the embedded text data still present. This will allow the Recognize activity to extract the embedded text data.
  - Ensure PDF Page Extraction is enabled and Image Bursting is Disabled.
- I have text-based PDFs with malformed or semi-corrupted encoded text. Or, the PDF is so severely corrupted or malformed that it can't be split using normal PDF page extraction.
  - You may need to end up with all JPEG page objects in this case.
  - Sometimes text data can be corrupted. While a page appears readable to a human, the encoded text data may be a bunch of gibberish. Furthermore, some PDF authors will intentionally obfuscate text characters by swapping character glyphs as a kind of watermarking. For example, a printed "A" on the page may be encoded as a "£".
  - In both cases, while the printed characters look fine visually, the underlying text data is inaccurate. If the issue is bad enough, you may need to OCR the pages instead of extracting the embedded text data to get the most accurate text data from the document. Splitting text-based PDF pages as JPEG images will ensure Recognize will OCR the pages. Since they are JPEGs and not PDFs, there's no native text to extract, so Recognize is forced to use OCR.
  - Ensure PDF Page Extraction is Disabled and Rendering is Enabled.
- I have "Mixed" PDF pages that have a combination of native and printed text. I need text data for both.
  - You want to end up with a PDF page object in this case.
  - This is more of a Recognize issue than a Split Pages issue. The Recognize activity is designed to extract native text segments from a PDF and perform OCR on the image-based portions, even when both are present on the same page. You just need to make sure a PDF page object is created, not a JPEG object. The PDF page will have both the native digital text and the image embedded in its resources.
  - Ensure PDF Page Extraction is enabled.
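The scenarios above can be collected into a quick lookup. This is only a summary aid: the `PdfOptions` class, the scenario keys, and the constant names are hypothetical and are not part of Grooper. `None` means the list above does not specify that property for the scenario.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PdfOptions:
    pdf_page_extraction: bool             # True = Auto, Selective or Standard
    image_bursting: Optional[bool] = None  # None = not specified above
    rendering: Optional[bool] = None       # None = not specified above


EXTRACT = PdfOptions(pdf_page_extraction=True)
BURST = PdfOptions(pdf_page_extraction=True, image_bursting=True)
KEEP_PDF = PdfOptions(pdf_page_extraction=True, image_bursting=False)
RENDER = PdfOptions(pdf_page_extraction=False, rendering=True)

# Each scenario maps to the acceptable configurations listed above.
RECOMMENDED = {
    "native_text": [EXTRACT],
    "mixed_page_types": [BURST],
    "needs_image_cleanup": [BURST, RENDER],
    "searchable_re_ocr": [BURST, RENDER],
    "searchable_keep_embedded_text": [KEEP_PDF],
    "corrupted_or_obfuscated_text": [RENDER],
    "native_and_printed_on_same_page": [EXTRACT],
}


def recommended_options(scenario):
    """Look up the configurations suggested for a scenario key."""
    return RECOMMENDED[scenario]
```

For example, `recommended_options("searchable_re_ocr")` returns both the bursting and the rendering configuration, since either one produces a JPEG page object for Recognize to OCR.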
Efficiency Considerations
If you have PDFs with predominately text-based pages, the default Split Pages configuration should be just fine. However, if you have large, multipage PDFs with image-based pages, you may want to consider using the Rasterize command to increase your processing efficiency.
In the next section, we will discuss using Rasterize to increase processing efficiency when splitting large multipage image-based PDFs.
Using the Rasterize Execute Command
Generating image content takes time, because your computer must perform the drawing operations required to produce each image. Whether images are extracted from PDF pages using Image Bursting or re-drawn using Rendering, there will be a performance cost of some sort. The more pages there are in the document, the higher the cost. The Rasterize command allows this process to run multithreaded, increasing efficiency.
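The general idea of spreading per-page image generation across threads can be sketched as follows. This is not the Rasterize command's implementation: `render_page` is a hypothetical stand-in for the real drawing work, and the worker count and file naming are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def render_page(page_number):
    """Stand-in for the per-page drawing work; returns an image name."""
    return f"page-{page_number:04d}.jpg"


def rasterize(page_count, max_workers=4):
    # Each page is an independent task, so a pool of workers can share them.
    # pool.map returns results in page order, whichever thread finishes first.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(render_page, range(1, page_count + 1)))


image_names = rasterize(3)
```

Because every page is generated independently, adding pages adds tasks to the pool rather than lengthening a single serial job.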















