PDF Page Types (Concept)

From Grooper Wiki
Revision as of 15:27, 28 August 2024 by Dgreenwood (talk | contribs)

A PDF page can be one of several PDF Page Types. A page type describes the kind of content on a PDF page, which tells Grooper how certain Activities should process it. For example, "Single Image" pages are OCR'd by the Recognize activity, whereas "Text Only" pages have their native text extracted by Recognize.

A PDF page can be one of seven page types:

  • Text Only
  • Mixed
  • Single Image
  • Multi Image
  • Searchable
  • Vector Only
  • Overflow
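
As a rough illustration of how these seven types relate to page content, the definitions given under "Page Types" later in this article can be sketched as a small classifier. This is not Grooper's implementation: the `PageContent` fields, the `classify` function, and the `MAX_OPERATORS` cut-off are all hypothetical names and values chosen for the example. Cases the article does not describe (such as a blank page) return `None` rather than a guess.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cut-off. The article only says Overflow pages have drawing
# operators "too numerous to analyze"; it does not give a number.
MAX_OPERATORS = 100_000


@dataclass
class PageContent:
    """Hypothetical summary of what was found on one PDF page."""
    image_count: int = 0
    has_native_text: bool = False     # visible, natively authored text
    has_invisible_text: bool = False  # non-visible text layer
    has_vectors: bool = False         # vector-drawn objects
    operator_count: int = 0           # number of drawing operators


def classify(page: PageContent) -> Optional[str]:
    """Return a page type name, or None for cases the article does not cover."""
    if page.operator_count > MAX_OPERATORS:
        return "Overflow"
    if page.image_count == 0:
        if page.has_native_text:
            return "Text Only"    # vector drawings may also be present
        if page.has_vectors:
            return "Vector Only"
        return None               # e.g. a blank page: not described
    # From here on the page has at least one image.
    if page.has_native_text or page.has_vectors:
        return "Mixed"
    if page.image_count == 1:
        return "Searchable" if page.has_invisible_text else "Single Image"
    if page.has_invisible_text:
        return None               # several images plus hidden text: not described
    return "Multi Image"
```

For example, `classify(PageContent(image_count=1, has_invisible_text=True))` yields "Searchable", while the same page without the hidden text layer yields "Single Image".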

Certain Grooper activities will behave differently depending on the PDF page type.

  • Split Pages: With Bursting enabled, pages made up of a single image (Single Image or Searchable) will be split into JPEG pages instead of PDF pages. This allows Grooper to treat them as full images for image processing and OCR.
  • Recognize: With Native Text Extraction enabled, native text is extracted (only if the page has native text, as determined by the page type). With an OCR Profile referenced, image content will be run through OCR.
  • Image Processing: The Image Processing activity will only alter PDF pages if they are "image based" pages (Single Image or Searchable).
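
The Split Pages and Image Processing rules above both hinge on whether a page is "image based". A minimal sketch of that check follows, assuming page types are passed around as plain strings. The function names and the `IMAGE_BASED` set are hypothetical and are not part of any Grooper API; they only restate the rules in this list.

```python
# Page types the article calls "image based".
IMAGE_BASED = {"Single Image", "Searchable"}


def split_output_format(page_type: str, bursting: bool) -> str:
    """Format of the child page Split Pages would produce, per the rule above."""
    if bursting and page_type in IMAGE_BASED:
        return "JPEG"   # burst: the single image becomes the page
    return "PDF"        # otherwise the page stays a PDF page


def image_processing_alters(page_type: str) -> bool:
    """True if the Image Processing activity would alter this PDF page."""
    return page_type in IMAGE_BASED
```

With Bursting enabled, `split_output_format("Searchable", True)` gives "JPEG", while a "Multi Image" page still gives "PDF" because only single images can be burst.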

Page Types

The ZIP file linked below contains example PDFs. Each one has pages that are one of the listed page types.


Text Only

  • The page is purely digital. It is made of natively authored digital text and any drawn objects are vector-based.
  • Does Split Pages "burst" it? No
  • Does Recognize extract text? Yes. Any native text segments are extracted.
  • Does Recognize OCR text? No. There is no image content to recognize for this type of page.

Mixed

  • The page contains a mixture of native text, vector drawings, and images.
  • Does Split Pages "burst" it? No
  • Does Recognize extract text? Yes. For native text only. If present, invisible text is not extracted.
  • Does Recognize OCR text? Yes. For image portions only.

Single Image

  • The page consists of a single image and nothing else (no digital text or vector drawings).
  • Does Split Pages "burst" it? Yes
  • Does Recognize extract text? No. No native text is present.
  • Does Recognize OCR text? Yes. The whole page is run through the OCR Profile.

Multi Image

  • The page consists of multiple images and nothing else. For example, some PDF generation software will break up a single image into multiple images for better file compression, or it may stitch together multiple images on a single page. PDF pages are Multi Image only when no digital text or vector graphics are present (they would be "Mixed" if digital text or vector graphics were present).
  • Does Split Pages "burst" it? No. The Bursting property can only burst single images from a PDF (i.e. Single Image or Searchable PDF pages).
  • Does Recognize extract text? No. No native text is present.
  • Does Recognize OCR text? Yes. Each image is run through the OCR Profile.

Searchable

  • The page is comprised of a single image which is supplemented by non-visible text.
  • Does Split Pages "burst" it? Yes
  • Does Recognize extract text? No. The invisible text overlaid on the image is not considered native text and is not extracted.
  • Does Recognize OCR text? Yes. Recognize treats the page as a "Single Image" PDF page.

Vector Only

  • The page contains vector drawn objects only. No text is present.
  • Does Split Pages "burst" it? No
  • Does Recognize extract text? No. No native text is present.
  • Does Recognize OCR text? No. No image content is present (Vector graphics are not raster images that can be OCR'd).

Overflow

  • The drawing operators are too numerous to analyze. These pages can be malformed, or purposefully built in ways that make them difficult to process. Often they use vector graphics to draw text instead of embedding digital text, which prevents Native Text Extraction.
  • Does Split Pages "burst" it? No
  • Does Recognize extract text? No
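
The answers listed for each page type above can be collected into a single lookup table. The sketch below is only a restatement of this article in code, not Grooper's internal logic. The `Behavior` tuple, the `BEHAVIOR` table, and `recognize_steps` are hypothetical names. The article does not say whether Overflow pages are OCR'd, so that entry is left as `None`, and the helper plans no OCR step for it purely as a choice of this example.

```python
from typing import List, NamedTuple, Optional


class Behavior(NamedTuple):
    burst: bool           # Does Split Pages "burst" it?
    extract: bool         # Does Recognize extract native text?
    ocr: Optional[bool]   # Does Recognize OCR it? None = not stated


BEHAVIOR = {
    "Text Only":    Behavior(False, True,  False),
    "Mixed":        Behavior(False, True,  True),   # OCR for image portions only
    "Single Image": Behavior(True,  False, True),
    "Multi Image":  Behavior(False, False, True),   # each image is OCR'd
    "Searchable":   Behavior(True,  False, True),   # treated like Single Image
    "Vector Only":  Behavior(False, False, False),
    "Overflow":     Behavior(False, False, None),   # OCR answer not given
}


def recognize_steps(page_type: str,
                    native_text_extraction: bool,
                    has_ocr_profile: bool) -> List[str]:
    """List the steps Recognize would take for a page, per the table above."""
    b = BEHAVIOR[page_type]
    steps = []
    if native_text_extraction and b.extract:
        steps.append("extract native text")
    # A None (unstated) OCR answer is falsy, so no OCR step is planned for it.
    if has_ocr_profile and b.ocr:
        steps.append("OCR")
    return steps
```

For a Mixed page with both Native Text Extraction enabled and an OCR Profile referenced, `recognize_steps("Mixed", True, True)` lists both steps; for a Searchable page it lists only "OCR", since the invisible text is not treated as native text.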

Glossary

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

OCR Profile: OCR Profiles store configuration settings for optical character recognition (OCR). They are used by the Recognize activity to convert images of text on Batch Pages into machine-encoded text. OCR Profiles are highly configurable, allowing fine-grained control over how OCR occurs, how pre-OCR image cleanup occurs, and how Grooper's OCR Synthesis occurs. All of this works toward the end goal of highly accurate OCR text data, which is used to classify documents, extract data, and more.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized so that it can be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.

PDF Page Types: A PDF page can be one of several PDF Page Types. A page type describes the kind of content on a PDF page, which tells Grooper how certain Activities should process it. For example, "Single Image" pages are OCR'd by the Recognize activity, whereas "Text Only" pages have their native text extracted by Recognize.

Recognize: Recognize is an Activity that obtains machine-readable text from Batch Pages and Batch Folders. When properly configured with an OCR Profile, Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also reference an IP Profile to collect "layout data" like lines, checkboxes, and barcodes. Other Activities then use this machine-readable text and layout data for document analysis and data extraction.

Split Pages: Multi-page PDF and TIF files come into Grooper as files attached to single Batch Folders. Split Pages is an Activity that creates child Batch Pages for each page in the PDF or TIF. This allows Grooper to process and handle these pages as individual objects.

Split: Split is a Collation Provider option for Data Type extractors. Split separates a data instance at each match returned by the Data Type. The results are used as anchor points to "split" text into one or more smaller parts.