Batch Page (Node Type)

From Grooper Wiki
Revision as of 11:56, 30 July 2025 by Dgreenwood (talk | contribs)

This article is about the current version of Grooper (2025). Note that some content may still need to be updated.

Batch Page nodes represent individual pages within a Batch. Batch Pages are created in one of two ways: (1) when images are scanned into a Batch using the Scan Viewer, or (2) when pages are split from a PDF or TIFF file using the Split Pages activity.

  • Batch Pages are frequently referred to simply as "pages".

Overview

A Batch Page in Grooper represents a single scanned or imported page within a Batch. It is the fundamental building block for document processing, serving as the digital equivalent of a physical page. Each Batch Page contains the image, text, and metadata for one page, and is managed within the hierarchical structure of a Batch.

What is a Batch Page?

A Batch Page is an object that encapsulates all information about a single page, including:

  • The page image (scanned or imported)
  • OCR text and layout data (if recognized)
  • Image renditions (color, grayscale, binary, thumbnail)
  • Metadata such as resolution, size, and file format
  • Flags and annotations for review or exception handling

Batch Pages are always children of a Batch Folder or the root Batch itself. They are created during import, scanning, or splitting of multi-page files, and are the starting point for all downstream processing in Grooper.
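
As a rough illustration of the list above, the contents of a page can be pictured as a small record. This is a minimal sketch in Python, not Grooper's actual object model: the class name, field names, and default values are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class BatchPage:
    """Illustrative stand-in for a Batch Page (hypothetical, not Grooper's API)."""
    image_file: str                                  # scanned or imported page image
    ocr_text: Optional[str] = None                   # None until the page is recognized
    renditions: Dict[str, str] = field(default_factory=dict)  # e.g. {"thumbnail": "..."}
    dpi: int = 300                                   # resolution metadata (assumed default)
    flagged: bool = False                            # set when a reviewer marks the page
    flag_reason: str = ""

    def flag(self, reason: str) -> None:
        """Mark the page for review or exception handling."""
        self.flagged = True
        self.flag_reason = reason

    @property
    def is_recognized(self) -> bool:
        """True once OCR text has been stored for the page."""
        return self.ocr_text is not None


page = BatchPage(image_file="scan_0001.tif")
page.flag("skewed image")
```

Keeping OCR text optional reflects the point that text and layout data exist only after recognition has run.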

Purpose and usage

Batch Pages are used to:

  • Store and manage the digital representation of each page in a Batch
  • Provide the source material for document separation, classification, extraction, and review
  • Enable granular control over image processing, OCR, and data extraction at the page level
  • Support review and quality control workflows, including flagging and annotation

Users interact with Batch Pages when reviewing images, correcting OCR, flagging issues, or performing page-level operations such as rotation, cleanup, or deletion.

Batch Page hierarchy

Batch Pages exist within the hierarchical structure of a Batch:

Batch
  Batch Folder
    Batch Page
    Batch Page
  Batch Folder
    Batch Page
    Batch Page

This structure allows Grooper to organize pages into logical groups (folders or documents), supporting complex workflows such as separation, classification, and extraction.
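
The hierarchy can be pictured as an ordinary tree that is walked depth-first to reach every page. The sketch below is a hypothetical Python model; `Node`, `iter_pages`, and the folder and page names are assumptions for illustration and do not correspond to Grooper classes.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Node:
    kind: str                                   # "Batch", "Batch Folder", or "Batch Page"
    name: str
    children: List["Node"] = field(default_factory=list)


def iter_pages(node: Node, path: Tuple[str, ...] = ()) -> Iterator[str]:
    """Yield a slash-separated path for every Batch Page, depth-first."""
    here = path + (node.name,)
    if node.kind == "Batch Page":
        yield "/".join(here)
    for child in node.children:
        yield from iter_pages(child, here)


batch = Node("Batch", "Batch", [
    Node("Batch Folder", "Folder 1", [
        Node("Batch Page", "Page 1"),
        Node("Batch Page", "Page 2"),
    ]),
    Node("Batch Folder", "Folder 2", [
        Node("Batch Page", "Page 3"),
        Node("Batch Page", "Page 4"),
    ]),
])

paths = list(iter_pages(batch))
```

Because pages are leaves, the same walk works whether a page sits directly under the root Batch or inside a folder.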

Document separation

Document separation is the process of grouping loose Batch Pages into documents (Batch Folders) based on configurable logic. This is a critical step in transforming a collection of scanned pages into structured, searchable documents.

Separation is typically performed by the Separate activity.

  • The activity analyzes the sequence of Batch Pages and creates new Batch Folders (documents) according to rules defined by a Separation Provider.
  • Common separation methods include detecting barcodes, patch codes, blank pages, or using extractors to identify key text markers indicating separation points.
    • Separation is often performed after OCR recognition, so that text and layout data are available.
  • Separate can be configured to run at the Batch level (all loose pages) or at the folder level (pages within each Batch Folder).
  • Separation can be fully automated. However, a Review step allows human operators to adjust document boundaries before finalization.
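
The core idea of marker-based separation can be sketched in a few lines. This is a simplified Python illustration, not the Separate activity or a Separation Provider: the `separate` function, the `starts_document` predicate, and the "INVOICE" marker are assumptions made for the example, and each page is reduced to its OCR text.

```python
from typing import Callable, List


def separate(pages: List[str],
             starts_document: Callable[[str], bool]) -> List[List[str]]:
    """Group an ordered list of page texts into documents.

    A new document begins whenever starts_document(page) is true.
    Pages that precede the first marker are kept together in a leading document.
    """
    documents: List[List[str]] = []
    for text in pages:
        if starts_document(text) or not documents:
            documents.append([])
        documents[-1].append(text)
    return documents


loose_pages = [
    "INVOICE 1001 page 1",
    "continued",
    "INVOICE 1002 page 1",
    "continued",
]
docs = separate(loose_pages, lambda text: text.startswith("INVOICE"))
```

The predicate stands in for whatever rule is configured, such as a barcode hit, a patch code, or a blank-page test. Because the example predicate inspects text, it also shows why separation is often run after OCR.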

Example: Before and after separation

Before separation, a Batch may contain only loose pages:

Batch
  Page 1
  Page 2
  Page 3
  Page 4


After separation, the pages are grouped into documents (Batch Folders):

Batch
  Document 1
    Page 1
    Page 2
  Document 2
    Page 3
    Page 4

Split Pages activity

The Split Pages activity divides multi-page files (such as PDFs or TIFFs) into individual Batch Pages, which enables page-level processing, parallelization, and granular document management. Splitting is often the first step after import, ensuring that each page is represented as a separate Batch Page object.

Key features of Split Pages

  • Supports splitting PDF, TIFF, and other multi-page image formats.
  • Can apply filters to select which pages to split (using the "Page Filter" property).
  • Allows configuration of page limits, overwrite behavior, and whether to remove the original file after splitting.
  • Supports advanced PDF options, such as page extraction mode, compression, and bookmark replication.
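
To make the filter and page-limit options concrete, the sketch below models splitting as a loop over the pages of a file. It is a Python illustration only: `split_pages`, `page_filter`, and `page_limit` are hypothetical names, and the exact semantics of Grooper's "Page Filter" property and page limits are not described here, so the behavior shown (filter first, then stop at the limit) is an assumption.

```python
from typing import Callable, Dict, List, Optional


def split_pages(file_pages: List[str],
                page_filter: Optional[Callable[[int], bool]] = None,
                page_limit: Optional[int] = None) -> List[Dict[str, object]]:
    """Return one record per selected page of a multi-page file.

    page_filter receives the 1-based page number and returns True to keep it.
    page_limit caps how many pages are produced (assumed semantics).
    """
    selected: List[Dict[str, object]] = []
    for number, content in enumerate(file_pages, start=1):
        if page_filter is not None and not page_filter(number):
            continue  # page excluded by the filter
        if page_limit is not None and len(selected) >= page_limit:
            break     # limit reached; stop splitting
        selected.append({"page_number": number, "content": content})
    return selected


pdf = ["cover", "body A", "body B", "appendix"]
pages = split_pages(pdf, page_filter=lambda n: n > 1, page_limit=2)
```

In this example the cover page is skipped by the filter and the appendix is dropped by the limit, leaving pages 2 and 3.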

Typical workflow

1. Import a multi-page file into a Batch.
2. Run the Split Pages activity to create individual Batch Pages for each page in the file.
3. Proceed with document separation, classification, extraction, and review.

Summary

Batch Pages are the core unit of content in Grooper, enabling granular control over document processing. They are created from scanned pages or by applying Split Pages to imported files. They are managed within Batches and Batch Folders, and serve as the foundation for separation, classification, and data extraction. The Separate activity automates the transformation of loose pages into structured documents, supporting efficient and accurate document-centric workflows.