PDF Page Types (Concept)
Every page in a PDF has a "page type". The page type describes the kind of content on the page and informs Grooper how certain Activities should process it. For example, "Single Image" pages are OCR'd by the Recognize activity, whereas "Text Only" pages have their native text extracted.
A PDF page can be one of 7 page types:
- Text Only
- Mixed
- Single Image
- Multi Image
- Searchable
- Vector Only
- Overflow
Certain Grooper activities will behave differently depending on the PDF page type.
- Split Pages: With Bursting enabled, page types comprised of a single image will be split into JPEG pages instead of PDF pages. This allows Grooper to treat them as full images for image processing and OCR.
- Recognize: With Native Text Extraction enabled, native text is extracted (only if the page has native text, as determined by the page type). With an OCR Profile referenced, image content will be run through OCR.
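The per-type answers documented on this page can be summarized as a lookup table. The following is a minimal Python sketch, not Grooper code: the names PageBehavior, PAGE_TYPE_BEHAVIOR and recognize_actions are hypothetical and exist only for illustration. The OCR behavior for "Overflow" pages is not stated on this page, so it is recorded as None (unknown) rather than guessed.

```python
from typing import NamedTuple, Optional


class PageBehavior(NamedTuple):
    """How the activities treat one page type (hypothetical structure)."""
    burst: bool                 # Split Pages with Bursting enabled
    extract_native_text: bool   # Recognize with Native Text Extraction enabled
    ocr: Optional[bool]         # Recognize with an OCR Profile; None = not documented


# Values transcribed from the per-type descriptions on this page.
PAGE_TYPE_BEHAVIOR = {
    "Text Only":    PageBehavior(burst=False, extract_native_text=True,  ocr=False),
    "Mixed":        PageBehavior(burst=False, extract_native_text=True,  ocr=True),
    "Single Image": PageBehavior(burst=True,  extract_native_text=False, ocr=True),
    "Multi Image":  PageBehavior(burst=False, extract_native_text=False, ocr=True),
    "Searchable":   PageBehavior(burst=True,  extract_native_text=False, ocr=True),
    "Vector Only":  PageBehavior(burst=False, extract_native_text=False, ocr=False),
    "Overflow":     PageBehavior(burst=False, extract_native_text=False, ocr=None),
}


def recognize_actions(page_type, native_text_extraction=True, has_ocr_profile=True):
    """List what Recognize would do for a page type under the given settings."""
    behavior = PAGE_TYPE_BEHAVIOR[page_type]
    actions = []
    if native_text_extraction and behavior.extract_native_text:
        actions.append("extract native text")
    if has_ocr_profile and behavior.ocr:  # None (unknown) is treated as "no action"
        actions.append("OCR")
    return actions
```

For example, recognize_actions("Mixed") lists both native text extraction and OCR, while recognize_actions("Searchable") lists OCR only.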
Page Types
Text Only
- The page is purely digital. It is made of natively authored digital text and any drawn objects are vector-based.
- Does Split Pages "burst" it? No
- Does Recognize extract text? Yes. Any native text segments are extracted.
- Does Recognize OCR text? No. There is no image content to recognize for this type of page.
Mixed
- The page contains a mixture of native text, vector drawings, and images.
- Does Split Pages "burst" it? No
- Does Recognize extract text? Yes. For native text only. If present, invisible text is not extracted.
- Does Recognize OCR text? Yes. For image portions only.
Single Image
- The page is comprised of a single image and nothing else (no digital text or vector drawings).
- Does Split Pages "burst" it? Yes
- Does Recognize extract text? No. No native text is present.
- Does Recognize OCR text? Yes. The whole page is run through the OCR Profile.
Multi Image
- The page is comprised of multiple images and nothing else. For example, some PDF generation software breaks a single image into multiple images for better file compression, or stitches multiple images together on a single page. A PDF page is "Multi Image" only when no digital text or vector graphics are present (it would be "Mixed" if digital text or vector graphics were present).
- Does Split Pages "burst" it? No. The Bursting property can only burst single images from a PDF (i.e. Single Image or Searchable PDF pages).
- Does Recognize extract text? No. No native text is present.
- Does Recognize OCR text? Yes. Each image is run through the OCR Profile.
Searchable
- The page is comprised of a single image that is supplemented by invisible (non-visible) text.
- Does Split Pages "burst" it? Yes
- Does Recognize extract text? No. The invisible text overlaid on the image is not considered native text and is not extracted.
- Does Recognize OCR text? Yes. Recognize treats the page as a "Single Image" PDF page.
Vector Only
- The page contains vector drawn objects only. No text is present.
- Does Split Pages "burst" it? No
- Does Recognize extract text? No. No native text is present.
- Does Recognize OCR text? No. No image content is present (Vector graphics are not raster images that can be OCR'd).
Overflow
- The page has too many drawing operators to analyze. Such pages may be malformed, or may be purposefully built in ways that make them difficult to process. Often these pages use vector graphics to draw text instead of embedding digital text, which prevents Native Text Extraction.
- Does Split Pages "burst" it? No
- Does Recognize extract text? No
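The definitions above imply a decision order for telling the page types apart. The sketch below is one way to express it in Python, under stated assumptions: PageContent and classify_page are hypothetical names, the input flags are assumed to come from some prior analysis of the page, and this is not how Grooper itself determines page types. Combinations this page does not describe (a blank page, or invisible text over multiple images) are handled by assumption and marked in the comments.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageContent:
    """Hypothetical summary of what was found on one PDF page."""
    image_count: int = 0
    has_native_text: bool = False      # visible, natively authored digital text
    has_invisible_text: bool = False   # non-visible text layer over an image
    has_vector_graphics: bool = False
    too_many_operators: bool = False   # drawing operators too numerous to analyze


def classify_page(content: PageContent) -> Optional[str]:
    """Map page content to one of the 7 page types described above."""
    if content.too_many_operators:
        return "Overflow"
    if content.image_count == 0:
        if content.has_native_text:
            return "Text Only"         # vector drawings are allowed alongside text
        if content.has_vector_graphics:
            return "Vector Only"
        return None                    # assumption: blank pages are not covered here
    if content.has_native_text or content.has_vector_graphics:
        return "Mixed"
    if content.image_count == 1:
        return "Searchable" if content.has_invisible_text else "Single Image"
    # Assumption: invisible text over multiple images is not described; ignore it.
    return "Multi Image"
```

Checking "Overflow" first reflects that such a page cannot be analyzed further; the remaining checks go from pages with no images, to pages mixing images with other content, to image-only pages.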