PDF Page Types (Concept)
PDF pages can be one of several PDF Page Types. A page type describes the kind of content in a PDF page, which tells Grooper how certain Activities should process it. For example, "single image" pages are OCR'd by the Recognize activity, whereas "text only" pages have their native text extracted by Recognize.
A PDF page can be one of seven page types:
- Text Only
- Mixed
- Single Image
- Multi Image
- Searchable
- Vector Only
- Overflow
Certain Grooper activities will behave differently depending on the PDF page type.
- Split Pages: With Bursting enabled, page types comprised of a single image (Single Image and Searchable) will be split into JPEG pages instead of PDF pages. This allows Grooper to treat them as full images for image processing and OCR.
- Recognize: With Native Text Extraction enabled, native text is extracted (only if the page has native text, as determined by the page type). With an OCR Profile referenced, image content will be run through OCR.
- Image Processing: The Image Processing activity only alters PDF pages that are "image based" (Single Image or Searchable). For these pages, it first converts the page into an image and then applies the IP Profile.
- Export/Merge: BE AWARE! Grooper can ONLY make text searchable PDFs from pages comprised of a single image. If there are digital elements like native text or vector graphics embedded in a PDF page, there is no way to insert a text searchable overlay into the page. Single Image is the ONLY PDF Page Type from which Grooper can create a text searchable PDF page during the Export or Merge activities.
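The activity rules above can be summarized as a small lookup. The following Python is only an illustrative sketch: `PageType` and the three helper functions are hypothetical names, not part of Grooper's API, and the logic simply restates the Split Pages, Image Processing, and Export/Merge bullets.

```python
from enum import Enum


class PageType(Enum):
    """Hypothetical enum mirroring the seven page types listed above."""
    TEXT_ONLY = "Text Only"
    MIXED = "Mixed"
    SINGLE_IMAGE = "Single Image"
    MULTI_IMAGE = "Multi Image"
    SEARCHABLE = "Searchable"
    VECTOR_ONLY = "Vector Only"
    OVERFLOW = "Overflow"


# "Image based" page types: pages built on exactly one raster image.
IMAGE_BASED = frozenset({PageType.SINGLE_IMAGE, PageType.SEARCHABLE})


def split_pages_bursts(page_type: PageType) -> bool:
    """True if Split Pages (with Bursting enabled) splits the page out as an image."""
    return page_type in IMAGE_BASED


def image_processing_alters(page_type: PageType) -> bool:
    """True if the Image Processing activity converts and alters the page."""
    return page_type in IMAGE_BASED


def export_can_add_text_overlay(page_type: PageType) -> bool:
    """True if Export/Merge can build a text searchable PDF page from it."""
    return page_type is PageType.SINGLE_IMAGE
```

For example, `split_pages_bursts(PageType.SEARCHABLE)` is true, while `export_can_add_text_overlay(PageType.SEARCHABLE)` is false, which reflects the warning in the Export/Merge bullet.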
Page Types
The ZIP file linked below contains example PDFs. Each one has pages that are one of the listed page types.
- File:PDF Page Type Examples.zip
- BE AWARE: This is NOT a Grooper ZIP. It's just a regular ZIP file with PDF files inside it.
Text Only
- The page is purely digital. It is made of natively authored digital text and any drawn objects are vector-based.
- Does Split Pages "burst" it? No
- Does Recognize extract text? Yes. Any native text segments are extracted.
- Does Recognize OCR text? No. There is no image content to recognize for this type of page.
Mixed
- The page contains a mixture of native text, vector drawings, and images.
- Does Split Pages "burst" it? No
- Does Recognize extract text? Yes. For native text only. If present, invisible text is not extracted.
- Does Recognize OCR text? Yes. For image portions only.
Single Image
- The page is comprised of a single image and nothing else (No digital text or vector drawings).
- Does Split Pages "burst" it? Yes
- Does Recognize extract text? No. No native text is present.
- Does Recognize OCR text? Yes. The whole page is run through the OCR Profile.
Multi Image
- The page is comprised of multiple images and nothing else. For example, some PDF generation software breaks a single image into multiple images for better file compression, or stitches multiple images together on a single page. A PDF page is Multi Image only when no digital text or vector graphics are present (it would be "Mixed" when digital text or vector graphics are present).
- Does Split Pages "burst" it? No. The Bursting property can only burst single images from a PDF (i.e. Single Image or Searchable PDF pages).
- Does Recognize extract text? No. No native text is present.
- Does Recognize OCR text? Yes. Each image is run through the OCR Profile.
Searchable
- The page is comprised of a single image which is supplemented by non-visible text.
- Does Split Pages "burst" it? Yes
- Does Recognize extract text?
  - Not by default. The invisible text overlaid on the image is not considered native text and is not extracted.
  - Recognize will extract text if the Keep Searchable Text property is enabled.
- Does Recognize OCR text?
  - Yes by default. Recognize treats the page as a "Single Image" PDF page.
  - Recognize will not OCR if the Keep Searchable Text property is enabled. It will extract the non-visible text instead.
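The effect of Keep Searchable Text on a Searchable page can be sketched as a simple toggle. This is a hedged illustration only: `recognize_searchable_page` and its return keys are hypothetical names, not Grooper's API, and the sketch assumes an OCR Profile is referenced on the Recognize activity.

```python
def recognize_searchable_page(keep_searchable_text: bool = False) -> dict:
    """Return which Recognize steps would run on a Searchable page.

    Assumption: an OCR Profile is referenced, so OCR is available.
    """
    if keep_searchable_text:
        # The invisible text layer is extracted and OCR is skipped.
        return {"extract_text": True, "run_ocr": False}
    # Default: the page is treated like a Single Image page and OCR'd;
    # the invisible text layer is ignored.
    return {"extract_text": False, "run_ocr": True}
```

The two outcomes are mutually exclusive in this sketch: the page is either OCR'd or has its existing invisible text extracted, never both.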
Vector Only
- The page contains vector drawn objects only. No text is present.
- Does Split Pages "burst" it? No
- Does Recognize extract text? No. No native text is present.
- Does Recognize OCR text? No. No image content is present (Vector graphics are not raster images that can be OCR'd).
Overflow
- The drawing operators are too numerous to analyze. These pages may be malformed, or may be purposefully made in ways that make them difficult to process. Often these pages use vector graphics to draw text instead of embedding digital text, which prevents Native Text Extraction.
- Does Split Pages "burst" it? No
- Does Recognize extract text? No
- Does Recognize OCR text? No
- The only way for Grooper to process Overflow page types is to convert them to an image by either disabling PDF Page Extraction in the Split Pages activity or running a Rasterize command on each page.
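The page type descriptions above amount to a decision tree over what a page contains. The following Python is a minimal sketch of that tree, not Grooper's actual detection logic: `PageContent`, its fields, and `classify_page` are hypothetical names introduced for illustration. Two cases the descriptions do not cover are handled by assumption: invisible text is only considered on single-image pages, and a page with no content at all raises an error.

```python
from dataclasses import dataclass


@dataclass
class PageContent:
    """Hypothetical summary of what a PDF page contains."""
    image_count: int = 0
    has_native_text: bool = False       # visible, natively authored text
    has_vector_graphics: bool = False
    has_invisible_text: bool = False    # non-visible text overlay
    too_many_operators: bool = False    # drawing operators too numerous to analyze


def classify_page(page: PageContent) -> str:
    """Map page content to one of the seven page type names."""
    if page.too_many_operators:
        return "Overflow"
    has_digital_content = page.has_native_text or page.has_vector_graphics
    if page.image_count > 0 and has_digital_content:
        return "Mixed"
    if page.has_native_text:
        # Any drawn objects on a Text Only page are vector-based.
        return "Text Only"
    if page.has_vector_graphics:
        return "Vector Only"
    if page.image_count == 1:
        return "Searchable" if page.has_invisible_text else "Single Image"
    if page.image_count > 1:
        # Assumption: invisible text is ignored here; the source does not
        # describe multi-image pages with a text overlay.
        return "Multi Image"
    # Assumption: an empty page is outside the seven described types.
    raise ValueError("page has no content to classify")
```

For example, `classify_page(PageContent(image_count=2, has_vector_graphics=True))` returns `"Mixed"`, since any digital content alongside images rules out Multi Image.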