2023:Separation (Concept)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Separation is the process of taking an unorganized Batch of loose Batch Pages and organizing them into documents represented by Batch Folders in Grooper. This is done so Grooper can later assign a Document Type to each document folder in a process known as "classification".

Pages are organized into document folders during the Separate activity. There are a variety of methods to separate pages into documents during this activity, including (but not limited to) the use of printed control sheets, defined page lengths, and extractible text content. The specific separation method is determined by the Separation Provider and its configuration used during the Separate activity. You may also save and re-use a Separation Provider's configuration settings by creating a Separation Profile.

About

Imagine you have a big stack of paper pages. You need to organize these pages into certain kinds of documents: HR documents, accounts payable documents, accounts receivable documents, and so on. Before you can even get to the point of determining which document is which, you have to ask yourself a few questions. Is this stack of papers one huge document? Is each page its own document? How many documents are in this stack?

At what point does one document end and another begin?

Separation goes through a stack of pages, one by one, and determines where a document begins and where it ends (most often where the next document begins). Is there some kind of cover page for each document? Is there something like a title or a page number indicating the first page? Are all documents the same page length? Once you can answer these kinds of questions, you know where one document ends and another begins, and you can distinguish between the loose pages and the documents they compose.

Grooper's document separation (via the Separate activity and Separation Providers) answers this question and automates its answer. Grooper operates much the same way in terms of analyzing loose pages and figuring out where one document starts and another begins. How these beginning and ending points are established, understood and executed is determined by which Separation Provider is used and how it is configured. Once that logic is established and configured, separation can be automated by the Separate activity.

Batch Basics - What is a document anyway?

A Batch is the fundamental unit of document processing in Grooper. It is functionally two things:

  1. A container for folders and pages.
  2. A list of processing instructions to do something with those folders and pages.

As such, all Batches consist of three things:

  1. The Batch itself.
  2. A root Batch Folder.
  3. A Batch Process named Null Process—this is automatically added with all newly created Batches.
    • Batch Processes can also be created within a Project.

  1. The root Batch Folder houses Batch Page and Batch Folder objects.
    • As the top level (or root) of the Batch's Batch Folder hierarchy, containing all child Batch Folders and their contents, the root Batch Folder is often referred to as simply the "Batch".
      • However, technically the Batch and root Batch Folder are two separate objects in Grooper.
    • Alternatively, the root Batch Folder may be referred to as the "Batch Folder" and its child Batch Folders as simply "folders".
      • However, from a technical standpoint, they are all Batch Folder objects.


  1. The Batch Process is a step-by-step set of configurable processing instructions composed of Batch Steps, each of which performs a different Activity in Grooper.
    • Each Batch Step will be named after the Grooper Activity it executes by default.


As far as Grooper is concerned, a "document" is a Batch Folder object with one or more Batch Page objects as its children.

As the method for organizing pages into documents, separation is the activity of inserting Batch Folders before a Batch Page that is the logical first page of a document and moving subsequent Batch Pages into that Batch Folder until the next Batch Page matches the logical requirements for a first page of a document.

Once all loose Batch Pages are successfully placed into Batch Folders, Grooper has created documents out of pages, and the Batch is officially separated.
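To make the definitions above concrete, here is a minimal Python sketch of the folder-and-page hierarchy. It is only an illustration: the `BatchPage` and `BatchFolder` classes and the `is_document` and `is_separated` helpers are hypothetical stand-ins invented for this example, not Grooper's actual object model or API.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class BatchPage:
    """Hypothetical stand-in for a Batch Page (a single page)."""
    name: str


@dataclass
class BatchFolder:
    """Hypothetical stand-in for a Batch Folder; children are pages or folders."""
    name: str
    children: List[Union["BatchFolder", BatchPage]] = field(default_factory=list)


def is_document(folder: BatchFolder) -> bool:
    """A 'document' is a folder with at least one page as a direct child."""
    return any(isinstance(child, BatchPage) for child in folder.children)


def is_separated(root: BatchFolder) -> bool:
    """Treat a Batch as separated once no loose pages sit directly in the root."""
    return not any(isinstance(child, BatchPage) for child in root.children)


# A freshly scanned Batch: three loose pages directly under the root folder.
loose = BatchFolder("root", [BatchPage("p1"), BatchPage("p2"), BatchPage("p3")])

# The same pages after separation: two document folders, no loose pages.
separated = BatchFolder("root", [
    BatchFolder("doc 1", [BatchPage("p1"), BatchPage("p2")]),
    BatchFolder("doc 2", [BatchPage("p3")]),
])
```

In this sketch, `is_separated(loose)` is false and `is_separated(separated)` is true, which mirrors the point that a Batch is separated only once every loose page has been placed in a folder.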

FYI

For brevity's sake, names for Grooper objects in a Batch generally get shorthand terms associated with them.

The term "page" is used interchangeably with Batch Page.

The terms "document" or "document folder" are used interchangeably with a Batch Folder that contains Batch Pages.

The term "folder" is used interchangeably with any other Batch Folder in the root Batch Folder.

Just because a document is separated does not mean it is classified. A Batch Folder is not classified until it has been assigned a Document Type of a Content Model. While some Separation Providers allow you to both separate and classify documents at the same time, not all of them do. Separation and classification should be considered two distinct (but related) things.

Do I Need To Separate?

Remember, a document, from Grooper's perspective, is a folder with pages in it. In order for Grooper to automate document classification and eventually data extraction, pages must first be organized into Batch Folders.

Does that mean every Batch Process must include a Separate activity step?

No. Not necessarily.

Depending on the circumstances, you may not need to apply separation. There are situations where you must separate a Batch and situations where you do not need to. It all depends on how content comes into Grooper.

Documents come into Grooper in one of two ways. Either...

  1. They are scanned in using a scanner.
    • In the case of scanned documents, the content comes into a new Batch as Batch Pages. Pages are not folders. In this case, you must always separate the Batch Pages into Batch Folders.
  2. They are imported from an external storage platform.
    • In the case of imported documents, the content comes into a new Batch as Batch Folders. This is more situational. Sometimes you will need to apply separation and sometimes you will not, depending on the imported files.

If the files imported into Grooper are "discrete" or "individual" documents, there is no need to separate. Separation is all about creating Batch Folders from pages, but imported files already come into the Batch as Batch Folders. If each file corresponds to one document, the documents are already separated upon import, and the Batch Process does not need a Separate activity step.

However, if the imported files are "packet" documents, you will need to separate. In the case of packet files, there are multiple documents in each file. However, Grooper will import the file as a single Batch Folder in a Batch. Grooper doesn't know there's more than one document in the file until you "tell" it. You would "tell" Grooper this through separation.

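The decision tree for separation comes down to two questions:

  1. Was the content scanned? If so, separate.
  2. Was the content imported? If each file is a single discrete document, no separation is needed. If a file is a packet of multiple documents, separate.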
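As a hedged illustration, the choice described in this section can be written as a small Python function. The `Source` enum, the `needs_separation` function, and the `one_document_per_file` parameter are names made up for this sketch; nothing here is a Grooper setting or API.

```python
from enum import Enum


class Source(Enum):
    """How content arrived in the Batch (hypothetical labels for this sketch)."""
    SCANNED = "scanned"
    IMPORTED = "imported"


def needs_separation(source: Source, one_document_per_file: bool = True) -> bool:
    """Return True when a Separate step is needed.

    Scanned content arrives as loose pages, so it always needs separation.
    Imported content arrives as folders, so it needs separation only when a
    file is a "packet" holding more than one document.
    """
    if source is Source.SCANNED:
        return True
    return not one_document_per_file
```

For example, `needs_separation(Source.IMPORTED, one_document_per_file=False)` returns `True` for packet files, while the same call with discrete files returns `False`.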

Separation Providers

Separation Providers establish the logic used to create "separation points" or "binding points" between loose pages. There are many methods to separate pages into document folders in Grooper. Each Separation Provider has its own criteria for determining where these separation points occur within a Batch. However, the basic operation is the same for all of them.

  1. Determine what page is the first page of a document.
    • This is the "separation point" or "binding point".
    • Generally, the first page in a Batch is the first separation point.
  2. Insert a Batch Folder into the Batch.
  3. Move pages into that folder until another first page of a document is encountered.
  4. Insert a new Batch Folder into the Batch.
    • This is the next "separation point" or "binding point".
  5. Move pages into that folder until another first page of a document is encountered.
  6. Repeat until the end of the Batch.
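The steps above can be sketched as a short Python loop. This is a simplified model under stated assumptions, not how Grooper implements the Separate activity: pages are represented as plain strings of page text, folders as lists, and the `separate` function and its `is_first_page` callback are hypothetical names standing in for whatever logic a Separation Provider supplies.

```python
from typing import Callable, List, Sequence

# A first-page test receives the page's position and its text (assumption:
# real providers may look at images, barcodes, or other page data instead).
FirstPageTest = Callable[[int, str], bool]


def separate(pages: Sequence[str], is_first_page: FirstPageTest) -> List[List[str]]:
    """Group loose pages into folders, opening a new folder at each first page."""
    folders: List[List[str]] = []
    for index, text in enumerate(pages):
        # The first page of the Batch always opens the first folder.
        if index == 0 or is_first_page(index, text):
            folders.append([])
        # Every page goes into the most recently opened folder.
        folders[-1].append(text)
    return folders


pages = ["Invoice 1001", "line items", "Invoice 1002", "line items", "terms"]

# Separation point = a page whose text starts with a title.
by_title = separate(pages, lambda index, text: text.startswith("Invoice"))

# Separation point = every second page (fixed page length).
by_count = separate(pages, lambda index, text: index % 2 == 0)
```

Here `by_title` yields two folders (two and three pages) and `by_count` yields three folders (two, two, and one page). Only the first-page test changes between the two runs, which is the sense in which the basic operation is the same for every provider.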



The Separation Provider is selected and configured using the Provider property of the Separate activity or a Separation Profile.

In a Batch Process, you will set the Separation Provider using the Provider property of a Separate step.

  1. Select a Batch Process.
  2. Add a Batch Step and assign it the Separate activity type (or select the Separate step in the Batch Process if already present).
  3. Use the Provider property to select a Separation Provider.

A Separation Profile is a way to configure a Separation Provider and save it to an object that can be reused multiple times in multiple Batch Processes. Instead of configuring on the Separate step itself, you can reference a Separation Profile with those configurations already set. Either way, separation's configuration is the same. Separation Profiles just allow you to save these settings outside of a single Batch Process.

  1. You can add a Separation Profile by right-clicking a Project, selecting "Add", followed by "Separation Profile".
  2. Select a Separation Profile.
  3. Use the Provider property to select a Separation Provider.

Provider Types

There are eight total Separation Providers.

  • Control Sheet Separation - New folders are created using Grooper Control Sheets.
  • Event-Based Separation - The Batch is separated using one or more "Separation Events". Each Separation Event triggers the creation of a new folder. The events are as follows:
    • Blank Page - A blank page will trigger a new folder.
    • Barcode - A scanned barcode will trigger a new folder.
    • Content Type - This Separation Event uses Lexical or Visual training examples to trigger folder creation. Whenever a page confidently matches a trained example document's first page, a new folder is created.
    • Page Count - This is for fixed page separation. A new folder is created each time a set number of pages has been reached.
    • Shape - A new folder is created every time a "shape feature" is detected. Shape features are detected using a Shape Detection IP Command from an IP Profile.
  • Pattern-Based Separation - Folder creation is determined by an extractor. If the extractor returns a result on a page, a new folder is created. Subsequent pages are placed in that folder until another page produces a result.
  • Change in Value Separation - This provider is similar to Pattern-Based Separation in that an extractor also determines folder creation. However, folders are only created when the extractor's result changes.
  • EPI Separation - Separation occurs using embedded page information (EPI) supplied by an extractor. This provider is helpful for separating documents whose page numbers are extractable.
  • ESP Auto Separation - ESP automatic separation performs document separation with multiple operations working together, using Lexical training examples in a Content Model, the Separation properties of Document Types, embedded page information, and merging designated "attachment" Document Types to "host" Document Types.
    • Furthermore, since ESP Auto Separation uses a Content Model's training data (as well as classification rules set on its Document Types), it both separates and classifies documents during the Separate activity.
  • Multi Separator - Performs separation using multiple separation providers.
  • Undo Separation - The anti-separator! As its name implies this provider "undoes" separation, removing all Batch Folders in a Batch or Batch Folder level in the folder hierarchy, leaving only loose pages.
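The difference between Pattern-Based Separation and Change in Value Separation is easiest to see side by side. The sketch below is an illustration only: the `INVOICE_NO` regular expression, the `extract` helper, and both functions are hypothetical, pages are modeled as strings of page text, and the handling of pages with no extractor result in `change_in_value` (they stay in the current folder) is an assumption rather than documented Grooper behavior.

```python
import re
from typing import List, Optional

# Hypothetical extractor: returns the invoice number found on a page, if any.
INVOICE_NO = re.compile(r"Invoice\s+No\.?\s*(\d+)")


def extract(text: str) -> Optional[str]:
    match = INVOICE_NO.search(text)
    return match.group(1) if match else None


def pattern_based(pages: List[str]) -> List[List[str]]:
    """Open a new folder whenever the extractor returns any result."""
    folders: List[List[str]] = []
    for text in pages:
        if not folders or extract(text) is not None:
            folders.append([])
        folders[-1].append(text)
    return folders


def change_in_value(pages: List[str]) -> List[List[str]]:
    """Open a new folder only when the extracted value differs from the last one."""
    folders: List[List[str]] = []
    previous: Optional[str] = None
    for text in pages:
        value = extract(text)
        if not folders or (value is not None and value != previous):
            folders.append([])
        if value is not None:
            previous = value
        # Assumption: a page with no result stays in the current folder.
        folders[-1].append(text)
    return folders


pages = [
    "Invoice No. 17 page A",
    "Invoice No. 17 page B",
    "Invoice No. 18 page A",
    "no number here",
]
```

With these sample pages, `pattern_based(pages)` produces three folders because the first three pages each return a result, while `change_in_value(pages)` produces two folders because the value only changes once, from 17 to 18.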

Real Time vs Lexical Providers

There are two different categories these Separation Providers can be placed in:

  • Real Time
  • Lexical

The main distinction between the two is that the "Lexical" providers require machine-readable text data. They use data extractors (which rely on regular expression pattern matching) to determine the separation points in a Batch. For scanned page images, OCR obtains this data. Digital documents, such as PDFs, have machine-readable text encoded in the file, but it needs to be extracted in a form Grooper can use. Either way, the documents need to be conditioned with a Recognize step in a Batch Process to obtain this text data.

The "Real Time" providers do not require text data in order to separate documents. They use visual page information or fixed page numbers to find the separation points in a Batch. This means these providers can separate documents in real time during scanning. Since no extra document conditioning is required, there is no need for a Separate step in a Batch Process.

  1. You can add a Scan step to your Batch Process
    • The step is named "Review" by default, but it lets you perform scanning. In this example, the step has been renamed "Scan". You can do the same when adding the Batch Step.
  2. Select Scan View in the Views property.
    • A Scanner Profile must be set up. You can either reference a folder where user-selectable scanner profiles are stored, or specify a specific Scanner Profile.
  3. To perform separation, select the Separation Profile.
    • Make sure to select a Separation Profile that can be applied in real time during scanning.

As long as the Separation Provider used is a Real Time provider, the documents will separate as they are scanned in. Folders will be inserted according to the Separation Profile's configuration, for example one that uses the Control Sheet Separation provider.

  • Note: This does not mean you can't use Real Time Separation Providers in a Separate step. You just have the option of performing separation during scanning using them.

The following Separation Providers are "Real Time" providers:

The following Separation Providers are "Lexical" providers: