Batch Processing Basics

From Grooper Wiki
WIP: This article is a work in progress. It may abruptly stop in the middle of a section and/or contain inaccurate information.

What is Batch Processing?

A batch is the fundamental container of documents in Grooper. Batch processing is everything done to your documents: getting them into Grooper, reading their text, classifying them, getting information off them, and getting them out of Grooper. Every company and every project has its own document processing needs. Grooper's tools for building batch processing instructions are highly customizable and configurable to meet these needs. This article gives a broad overview of what batch processing looks like in Grooper.

Anatomy of a Batch

Before we get into processing documents, let's examine how a batch is structured in Grooper. There are three components to a batch:

  1. The Batch itself
  2. Batch Folders
  3. Batch Pages

Below is an extremely simple Batch. We are viewing it in Grooper Dashboard, using the "Batch Viewer" tab.

At the top level is the Batch, named here "Simple Batch". Contained inside are Batch Folders and Batch Pages. Here, it is just a collection of .TIFF Batch Page images of invoices. We can create a hierarchical structure by separating pages into folders. In the example below, one Batch Folder was created for every Batch Page.
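To make the hierarchy concrete, here is a minimal sketch in Python of a Batch containing Batch Folders and Batch Pages. This is not Grooper's API or object model. The class names, the `one_folder_per_page` helper, and the file names are all hypothetical and only illustrate the containment relationship described above.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class BatchPage:
    """A single page image, e.g. one TIFF file (hypothetical model)."""
    image_name: str


@dataclass
class BatchFolder:
    """A container that may hold pages and nested folders (hypothetical model)."""
    name: str
    children: List[Union["BatchFolder", BatchPage]] = field(default_factory=list)


@dataclass
class Batch:
    """The top-level container (hypothetical model)."""
    name: str
    children: List[Union[BatchFolder, BatchPage]] = field(default_factory=list)


def one_folder_per_page(batch: Batch) -> Batch:
    """Return a new batch where every loose top-level page sits in its own folder."""
    wrapped = []
    for i, child in enumerate(batch.children, start=1):
        if isinstance(child, BatchPage):
            wrapped.append(BatchFolder(name=f"Folder {i}", children=[child]))
        else:
            wrapped.append(child)  # already a folder, keep as-is
    return Batch(name=batch.name, children=wrapped)


def count_pages(node) -> int:
    """Count pages at any depth below a batch or folder."""
    if isinstance(node, BatchPage):
        return 1
    return sum(count_pages(child) for child in node.children)


# Illustrative file names only.
simple = Batch("Simple Batch", [BatchPage("invoice_1.tif"), BatchPage("invoice_2.tif")])
foldered = one_folder_per_page(simple)
```

Because a folder can hold other folders, the same model also covers batches with multiple folder levels.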

Multiple folder levels can be created in a batch depending on how complicated your documents are.  Speaking of documents, what does Grooper consider a document in a Batch?

What is a document?

At the most basic level, a document is just a collection of pages. How these pages get grouped together is what makes them a document. A W-4 from the IRS can be identified by the form number "W-4". A contract to lease oil and gas rights is identified by the contract's content, from beginning to middle to end. An invoice is a list of purchased items from a certain vendor. If we saw these three documents in a stack, we would intuitively group the pages belonging to each document and separate them from each other.

[Figures: IRS Form W-4 | Oil and Gas Lease Agreement | Invoice]

How does Grooper distinguish between individual pages and documents? Batch Pages and Batch Folders. If documents are just collections of pages, Batch Folders are the containers distinguishing one collection of pages from another. Even if the document is only one page, it will be represented in Grooper as a Batch Folder containing a single Batch Page.

But documents are more than just a bunch of papers in folders. A document is truly the information inside it. Understanding a document's content is critical to telling one document from another. And once you can tell documents apart, the specific information contained within an individual document is also important.

[Figures: Loose pages in a batch. | Organized into document folders and classified according to their content. | Data extracted from an individual document.]

In Grooper, you will create a logical system to ingest loose pages, read their text data, turn them into organized documents, get relevant information from them, and send the documents and their data on their way. 

There are two parts to this. First, all the logic required to represent a document set's content needs to be created, including how to separate documents, classify them, and extract data from them. Once that is done, a workflow is created that uses this logic to take raw pages and process their content step by step. This is done by creating and executing a Batch Process.
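The split between "logic" and "workflow" can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Grooper implements a Batch Process. The keyword rules, the invoice-number pattern, the step functions, and the sample text are all hypothetical, and the separation step is omitted for brevity.

```python
import re
from typing import Callable, Dict, List

# Part 1: the logic describing the document set (hypothetical rules).
CLASSIFY_KEYWORDS = {"Form W-4": "IRS Form W-4", "Invoice": "Invoice"}
INVOICE_NUMBER = re.compile(r"Invoice\s*#\s*(\d+)")


def classify(doc: Dict) -> Dict:
    """Assign a document type from the first keyword found in the text."""
    doc["type"] = next(
        (label for key, label in CLASSIFY_KEYWORDS.items() if key in doc["text"]),
        "Unknown",
    )
    return doc


def extract(doc: Dict) -> Dict:
    """Pull an invoice number out of documents classified as invoices."""
    match = INVOICE_NUMBER.search(doc["text"]) if doc["type"] == "Invoice" else None
    doc["data"] = {"invoice_number": match.group(1)} if match else {}
    return doc


# Part 2: the workflow is an ordered list of steps that apply that logic.
STEPS: List[Callable[[Dict], Dict]] = [classify, extract]


def run_process(docs: List[Dict], steps: List[Callable[[Dict], Dict]] = STEPS) -> List[Dict]:
    """Run every step, in order, over copies of the input documents."""
    for step in steps:
        docs = [step(dict(doc)) for doc in docs]
    return docs


# Illustrative input: text that would already have been read from the pages.
processed = run_process([
    {"text": "Invoice # 1042 Total due"},
    {"text": "Form W-4 Employee's Withholding"},
])
```

Keeping the rules separate from the ordered step list mirrors the two parts described above: the rules can change without touching the order in which the steps run.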

What is a Batch Process?