PDF Page Types (Concept)

From Grooper Wiki
Revision as of 13:03, 15 November 2023 by Dgreenwood

A PDF page can be one of seven page types:

  • Text Only
  • Mixed
  • Single Image
  • Multi Image
  • Searchable
  • Vector Only
  • Overflow
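To make the distinctions between these types concrete, the following is a minimal Python sketch of how a page could be classified from a summary of its contents, based on the definitions in the Page Types section of this article. It is not Grooper's actual detection logic. `PageContent`, its fields, and `classify_page` are hypothetical names invented for illustration, and the handling of combinations the article does not spell out (for example, images plus vector drawings with no text) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class PageContent:
    """Hypothetical summary of what a PDF page contains (not a Grooper type)."""
    visible_text: bool = False        # natively authored, visible digital text
    invisible_text: bool = False      # non-visible text layered over an image
    vector_graphics: bool = False     # vector-drawn objects
    image_count: int = 0              # number of raster images on the page
    too_many_operators: bool = False  # drawing operators too numerous to analyze


def classify_page(page: PageContent) -> str:
    """Return one of the seven page type names described in this article."""
    if page.too_many_operators:
        return "Overflow"

    if page.image_count == 0:
        if page.visible_text:
            return "Text Only"      # native text, optionally with vector drawings
        if page.vector_graphics:
            return "Vector Only"
        return "Unknown"            # e.g. a blank page; not covered by the article

    # From here on the page has at least one image.
    if page.visible_text or page.vector_graphics:
        # Assumption: any image combined with native text or vectors is "Mixed".
        return "Mixed"
    if page.invisible_text:
        # Assumption: applies to multi-image pages as well as single-image ones.
        return "Searchable"
    return "Single Image" if page.image_count == 1 else "Multi Image"
```

For example, `classify_page(PageContent(image_count=1, invisible_text=True))` returns `"Searchable"`, while the same page without the invisible text layer is classified as `"Single Image"`.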

Certain Grooper activities will behave differently depending on the PDF page type.

  • Split Pages: With Bursting enabled, image-based page types (Single Image, Multi Image, and Searchable) are split into JPEG pages instead of PDF pages. This allows Grooper to treat them as full images for image processing and OCR.
  • Recognize: With Native Text Extraction enabled, native text is extracted, but only if the page type indicates that the page has native text. With an OCR Profile referenced, image content is run through OCR.
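The per-type answers given in the Page Types section below can be collected into a single lookup table. The following Python sketch does that. It is only a summary aid, not part of Grooper: `Behavior`, `BEHAVIOR`, and `burst_types` are hypothetical names. The OCR value for Overflow is left as `None` because this section does not state it.

```python
from typing import NamedTuple, Optional


class Behavior(NamedTuple):
    """How Split Pages and Recognize treat a page type (hypothetical structure)."""
    burst: bool           # Split Pages with Bursting splits it into a JPEG page
    native_text: bool     # Recognize extracts native text
    ocr: Optional[bool]   # Recognize runs OCR; None means "not stated here"


# Values transcribed from the Page Types section of this article.
BEHAVIOR = {
    "Text Only":    Behavior(burst=False, native_text=True,  ocr=False),
    "Mixed":        Behavior(burst=False, native_text=True,  ocr=True),
    "Single Image": Behavior(burst=True,  native_text=False, ocr=True),
    "Multi Image":  Behavior(burst=True,  native_text=False, ocr=True),
    "Searchable":   Behavior(burst=True,  native_text=False, ocr=True),
    "Vector Only":  Behavior(burst=False, native_text=False, ocr=False),
    "Overflow":     Behavior(burst=False, native_text=False, ocr=None),
}


def burst_types() -> list:
    """Return the page types that Split Pages bursts, in alphabetical order."""
    return sorted(name for name, b in BEHAVIOR.items() if b.burst)
```

Calling `burst_types()` returns the three image-based types: Multi Image, Searchable, and Single Image.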

Page Types

Text Only

  • The page is purely digital. It is made of natively authored digital text and any drawn objects are vector-based.
  • Does Split Pages "burst" it? No
  • Does Recognize extract text? Yes. Any native text segments are extracted.
  • Does Recognize OCR text? No. There is no image content to recognize for this type of page.

Mixed

  • The page contains a mixture of native text, vector drawings, and images.
  • Does Split Pages "burst" it? No
  • Does Recognize extract text? Yes, for native text only. Invisible text, if present, is not extracted.
  • Does Recognize OCR text? Yes, for image portions only.

Single Image

  • The page consists of a single image and nothing else (no digital text or vector drawings).
  • Does Split Pages "burst" it? Yes
  • Does Recognize extract text? No. No native text is present.
  • Does Recognize OCR text? Yes. The whole page is run through the OCR Profile.

Multi Image

  • The page consists of multiple images and nothing else. For example, linearized PDF pages are made up of multiple images so they can load more quickly when viewed over the internet. Linearized PDF pages are "Multi Image" when no digital text or vector graphics are present. When digital text or vector graphics are present, they are "Searchable" or "Mixed" instead.
  • Does Split Pages "burst" it? Yes
  • Does Recognize extract text? No. No native text is present.
  • Does Recognize OCR text? Yes. Each image is run through the OCR Profile.

Searchable

  • The page consists of a single image supplemented by non-visible text.
  • Does Split Pages "burst" it? Yes
  • Does Recognize extract text? No. The invisible text overlaid on the image is not considered native text and is not extracted.
  • Does Recognize OCR text? Yes. Recognize treats the page as a "Single Image" PDF page.

Vector Only

  • The page contains vector drawn objects only. No text is present.
  • Does Split Pages "burst" it? No
  • Does Recognize extract text? No. No native text is present.
  • Does Recognize OCR text? No. No image content is present (Vector graphics are not raster images that can be OCR'd).

Overflow

  • The drawing operators are too numerous to analyze. These pages may be malformed, or purposefully constructed in ways that make them difficult to process. Often they use vector graphics to draw text instead of embedding digital text, which prevents Native Text Extraction.
  • Does Split Pages "burst" it? No
  • Does Recognize extract text? No