ESP Auto Separation (Separation Provider)

WIP

This article is a work-in-progress. It may be unfinished. It may abruptly stop in the middle of a section. It may contain inaccurate information.

ESP Auto Separation is one of Grooper's Separation Providers used for document separation. It uses several aspects of a document to determine where one document ends and the next begins in a Batch of loose pages, including classification data, the document's pagination structure, extracted page numbers, and rules for merging one Document Type with another. ESP Auto Separation is also one of the few Separation Providers that separates and classifies documents at the same time, during the Separate activity.

ESP Auto Separation (often referred to simply as ESP) is generally seen as the most effort-intensive Separation Provider. It is highly configurable, and not all of that configuration is done on the Separate step or a Separation Profile; most of its functionality is actually determined by the associated Content Model's configuration. However, it is often the solution for the most complicated separation and classification challenges. ESP is extremely useful for document sets with a variety of structured, semi-structured, and unstructured documents.

About

There are four main components to ESP Auto Separation. Two of them are "core" functionalities, which are critical to understanding where the provider establishes separation points in a Batch. Two of them are "optional" functionalities, which can change the normal separation logic somewhat.

Core Functionality

Classification Data - Where's the first page of a document?
- The basic idea behind ESP's core functionality is finding the first page of a document. If you know where the first page of a document is, you know where it starts. Where does it stop? At the next first page of a document. The first page is found using trained examples of a Document Type with the Lexical classification method (or, in some configurations, the Positive Extractor property of the Document Type).
Pagination - What do the rest of the pages look like?
- Once you've found the first page of a document, the subsequent pages are going to look very different from document to document. Some documents are highly structured: each page looks more or less the same, just with different data entered in each field. Most fillable forms are an example. Some are less structured: certain parts of the document are consistent, but its composition or length may change drastically from one document to another. Invoices are an example. The header of a company's invoice looks more or less the same from document to document, and you'll always find the invoice date, invoice number, and other information in the same place. But the line items vary depending on what was ordered, so the page length varies too. Some are highly unstructured, using sentences and paragraphs to detail information rather than structured form fields. Letters and contracts are examples.
- The Pagination property of a Document Type will alter how the ESP Auto Separation provider considers including subsequent pages behind the detected first page of a document in the Batch Folder. Each option behaves differently to account for this variety in document pagination structure.
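The first-page idea described above can be sketched in a few lines of Python. This is only an illustration of the concept, not Grooper's implementation: the `Page` class, its fields, and the `separate_by_first_page` function are hypothetical names, and the sketch ignores the Pagination setting entirely by appending every non-first page to the current folder.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    """A loose page with a hypothetical classification result."""
    number: int                # position in the Batch (1-based)
    doc_type: Optional[str]    # Document Type the page matched, if any
    is_first_page: bool        # True if it matched a trained first page


def separate_by_first_page(pages: List[Page]) -> List[dict]:
    """Start a new folder at each detected first page; append everything else."""
    folders: List[dict] = []
    for page in pages:
        if page.is_first_page or not folders:
            # A first page (or the very first page of the Batch) opens a folder.
            folders.append({"doc_type": page.doc_type, "pages": [page.number]})
        else:
            # Any other page is placed behind the most recent first page.
            folders[-1]["pages"].append(page.number)
    return folders


batch = [
    Page(1, "Invoice", True),
    Page(2, None, False),
    Page(3, "Letter", True),
    Page(4, None, False),
    Page(5, None, False),
]

for folder in separate_by_first_page(batch):
    print(folder["doc_type"], folder["pages"])
```

In this toy Batch, pages 1 and 2 form an "Invoice" folder and pages 3 through 5 form a "Letter" folder. In ESP itself, the Pagination property changes how the non-first pages are treated.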

Optional Functionality

The EPI Extractor - Using page numbers to help separation decision making.
- If page numbers are present on a document, and Grooper can locate those page numbers using an extractor, ESP Auto Separation can use these numbers to help make decisions about where separation can occur. If ten pages in a row have a page number on them, you know that document should be ten pages long. With an EPI Extractor set on the Content Model, ESP will also know to take page numbers into consideration when establishing Batch Folders in the Batch.
Attachment Rules - Appending or prepending one Document Type to another.
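The page-number idea described for the EPI Extractor above can be illustrated with a small sketch. It assumes an extractor has already returned a page number (or nothing) for each page. The function name and the rule that unnumbered pages stay with the current group are assumptions made for illustration, not documented Grooper behavior.

```python
from typing import List, Optional


def group_by_page_numbers(page_numbers: List[Optional[int]]) -> List[List[int]]:
    """Group Batch positions (1-based) into runs of consecutive page numbers.

    page_numbers[i] is the page number extracted from the page at position
    i + 1, or None when no page number was found.
    """
    groups: List[List[int]] = []
    expected = None  # page number that would continue the current run
    for position, value in enumerate(page_numbers, start=1):
        continues_run = value is not None and value == expected
        if groups and (continues_run or value is None):
            # Numbering continues, or there is no number to contradict it.
            groups[-1].append(position)
        else:
            # Numbering restarted or jumped: treat it as a new document.
            groups.append([position])
        expected = value + 1 if value is not None else None
    return groups


# Pages numbered 1, 2, 3, then 1, 2, then one page with no number.
print(group_by_page_numbers([1, 2, 3, 1, 2, None]))
```

Here the numbering restarts at the fourth page, so the sketch proposes two documents: positions 1 to 3 and positions 4 to 6.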

Document Types should be set up in a Content Model complete with trained examples and/or rules for classification defined. The Separation properties of the Document Types are critically important to ESP Auto Separation.

This is where you define the document's pagination, such as Structured, Fixed, Extended, or Unstructured. This setting affects how classified pages are foldered into documents.

Optionally, ESP allows you to create rules to combine "Attachment Type" documents with classified documents. Attachment Types are Document Types which are appended or prepended to other Document Types during separation.
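As a rough sketch of the attachment idea, the following Python merges "Attachment Type" folders into their neighbors. Everything here is hypothetical: the rule names "append" and "prepend", the reading that an appended attachment joins the preceding folder while a prepended one joins the following folder, and the example Document Type names are assumptions made for illustration only.

```python
from typing import Dict, List, Tuple


def apply_attachment_rules(
    folders: List[Tuple[str, List[int]]],
    rules: Dict[str, str],
) -> List[Tuple[str, List[int]]]:
    """Merge attachment folders into neighboring folders.

    folders: (doc_type, page_positions) in Batch order.
    rules:   attachment doc_type -> "append" (join the previous folder)
             or "prepend" (join the next folder).
    """
    merged: List[list] = []
    pending: List[int] = []  # pages waiting to be prepended to the next folder
    for doc_type, pages in folders:
        rule = rules.get(doc_type)
        if rule == "append" and merged:
            merged[-1][1].extend(pages)
        elif rule == "prepend":
            pending.extend(pages)
        else:
            merged.append([doc_type, pending + list(pages)])
            pending = []
    if pending:
        # Nothing followed the prepended pages; keep them in their own folder.
        merged.append(["Unattached", pending])
    return [(doc_type, pages) for doc_type, pages in merged]


folders = [
    ("Cover Sheet", [1]),
    ("Invoice", [2, 3]),
    ("Remittance Slip", [4]),
    ("Letter", [5]),
]
rules = {"Cover Sheet": "prepend", "Remittance Slip": "append"}
print(apply_attachment_rules(folders, rules))
```

With these made-up rules, the cover sheet and remittance slip are both absorbed into the "Invoice" folder, leaving two folders in the Batch.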

Separation Properties

Once you create a Document Type

Training Scope

Adjusting the Training Scope can improve the accuracy and performance of ESP Auto Separation by focusing on what matters when separating and classifying documents with Unstructured pagination. For example, the Normal mode creates a single Form Type and divides trained examples into "First", "Middle", and "Last" pages. From one document to the next, the most meaningful features are often found on the first and last pages, with more variance on the pages in between. This differs from the previous approach, which created an individual Form Type for each trained example, each with its own "Page X of X" Page Type objects. Normal mode instead unifies all trained examples into a single Form Type, making the training and classification of these documents simpler and more efficient. The FirstLast mode assumes meaningful features for classification are found only on the first and last pages, with the middle pages containing no information needed to make a separation or classification decision. With this mode enabled, only the first and last pages of trained examples and their associated features are saved. This can improve processing time by removing all middle-page features from consideration. The FirstOnly mode narrows the scope even further by storing features from only the first page of trained documents.

Writer's Note: While these properties are available on the Structured, Fixed, and Extended Pagination types, the Training Scopes are almost entirely pertinent to the Unstructured Pagination type. If you use them on anything besides an Unstructured Document Type, you may see that they are always trained according to the "Normal" scope.

Normal - This is the classic version of capturing training features for a document type.

This mode differs slightly for Unstructured Pagination. Instead of trained middle pages (i.e. Page 2 of 4 and Page 3 of 4) being created as individual Page Type objects of their multipage Form Type parents, they are created as "Middle" Page Type objects of a single Form Type. Furthermore, last pages (i.e. 4 of 4 or 7 of 7) are combined into a single "Last" Page Type object.
This cleans up the Node Tree by creating a single unstructured Form Type for documents of various page lengths, instead of many. This will speed up runtime classification and can potentially yield more accurate results.

FirstLast - This is handy for training only the first and last page of a Document Type. It lowers the feature training requirements, improves speed, and allows middle pages between the first and last page to be combined with the first and last.

Note: This approach assumes the middle pages will not be classified as confident first pages of a Document Type. In ESP Auto Separation, when a page is classified as the first page of a document, it is a high-priority indicator of the start of a new document.

FirstOnly - If someone has used "Document Titles" extraction with a Positive Extractor for separation in the past, consider this property an upgrade to that approach.

A common approach to document classification is to write a positive rule extractor locating the document's title on the first page of the document. However, this approach can break down when an image is poor quality, causing the Document Type's title extractor to miss the title.
The FirstOnly mode can allow the continued capture of features in titles combined with other trained features on the page. That way, if the title is missed, the separation engine can fall back on other features from trained first pages of the Document Type.
This approach is highly similar to the Extended Pagination in concept, where the first page is used as the separation point and all pages after are included in the document folder until another confident first page is found. This approach differs in that only the first page is trained, eliminating the weighting data from the secondary pages. This can improve runtime classification speed greatly.
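The difference between the three Training Scopes can be summarized with a short sketch showing which pages of a trained example would contribute features under each mode. The enum, function names, and data shapes are hypothetical; only the "First", "Middle", and "Last" roles and the three mode names come from the descriptions above.

```python
from enum import Enum
from typing import List, Tuple


class TrainingScope(Enum):
    NORMAL = "Normal"
    FIRST_LAST = "FirstLast"
    FIRST_ONLY = "FirstOnly"


# Page roles whose features are kept under each scope.
ROLES_KEPT = {
    TrainingScope.NORMAL: {"First", "Middle", "Last"},
    TrainingScope.FIRST_LAST: {"First", "Last"},
    TrainingScope.FIRST_ONLY: {"First"},
}


def page_role(page_number: int, page_count: int) -> str:
    """Label a page of a trained example as First, Middle, or Last."""
    if page_number == 1:
        return "First"
    if page_number == page_count:
        return "Last"
    return "Middle"


def pages_kept(scope: TrainingScope, page_count: int) -> List[Tuple[int, str]]:
    """Return (page_number, role) for pages whose features would be saved."""
    kept = []
    for number in range(1, page_count + 1):
        role = page_role(number, page_count)
        if role in ROLES_KEPT[scope]:
            kept.append((number, role))
    return kept


for scope in TrainingScope:
    print(scope.value, pages_kept(scope, 4))
```

For a four-page trained example, Normal keeps all four pages (two of them as "Middle"), FirstLast keeps pages 1 and 4, and FirstOnly keeps page 1 alone.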

Repeating Last Page

Use this property for any Document Type that may or will have a duplicate last page.
For example, a contract's signature page may be copied and distributed to the involved parties, then signed and returned to be stored with the contract.
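A minimal sketch of the repeating-last-page idea follows, assuming each page has already been labeled "First", "Middle", or "Last" by classification. The function name and labels are hypothetical, and placing an unabsorbed copy in its own folder when the property is off is purely an illustrative choice; the source does not say what Grooper does in that case.

```python
from typing import List


def folder_pages(page_labels: List[str], repeating_last_page: bool) -> List[List[int]]:
    """Group Batch positions (1-based) into folders from per-page labels."""
    folders: List[List[int]] = []
    previous = None
    for position, label in enumerate(page_labels, start=1):
        is_repeat = label == "Last" and previous == "Last"
        if label == "First" or not folders:
            folders.append([position])
        elif is_repeat and not repeating_last_page:
            # Illustration only: the extra copy is left outside the document.
            folders.append([position])
        else:
            # Normal pages, and repeated last pages when the property is on.
            folders[-1].append(position)
        previous = label
    return folders


labels = ["First", "Middle", "Last", "Last", "First", "Last"]
print(folder_pages(labels, repeating_last_page=True))
print(folder_pages(labels, repeating_last_page=False))
```

With the property on, the duplicate signature page at position 4 stays with the first document. With it off, this sketch leaves the copy on its own.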

Secondary Page Extractor

False-positive classification frequently happens on pages other than the first page of a document. In other words, it happens on middle pages and last pages, known as secondary pages.
This property allows you to use an extractor to configure rules for when to attach a secondary page to a particular Document Type whose first page has already been identified during ESP Auto Separation.
Use on multi-page Document Types using Unstructured Pagination. You can configure a Secondary Page Extractor on any To take advantage of this, configure a Positive Extractor in the Classification section and configure the Secondary Page Extractor to identify the second page. Cheating’s allowed in Grooper.
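The pairing of a Positive Extractor with a Secondary Page Extractor can be pictured with two regular expressions. This is a sketch under stated assumptions: the patterns, the sample text, and the `separate` function are invented for illustration, and giving an unrecognized page its own folder is a simplification rather than Grooper's behavior.

```python
import re
from typing import List

# Hypothetical stand-in for a Positive Extractor: the title on the first page.
FIRST_PAGE = re.compile(r"MASTER SERVICE AGREEMENT")
# Hypothetical stand-in for a Secondary Page Extractor: a running header.
SECONDARY_PAGE = re.compile(r"Agreement No\. \d+ - continued")


def separate(pages_text: List[str]) -> List[List[int]]:
    """Group Batch positions (1-based) using first-page and secondary-page rules."""
    folders: List[List[int]] = []
    for position, text in enumerate(pages_text, start=1):
        if FIRST_PAGE.search(text):
            folders.append([position])        # start of a new document
        elif folders and SECONDARY_PAGE.search(text):
            folders[-1].append(position)      # attach to the page above it
        else:
            folders.append([position])        # unrecognized page (simplified)
    return folders


pages = [
    "MASTER SERVICE AGREEMENT between the parties",
    "Agreement No. 17 - continued: terms and conditions",
    "Agreement No. 17 - continued: signatures",
    "Unrelated memo",
]
print(separate(pages))
```

The second and third pages match the secondary-page rule and are attached behind the first page, while the unrelated memo is not.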

Examples

Separation will vary wildly between document types. Here are some real-world configurations using the new separation options. Some Classification Features are configured in the images. It’s common for Classification and Separation to be configured simultaneously. The explanations for each image consider classification and separation for runtime operations.

Example Number One

- Positive Extractor - Rule-Based Classification will occur and fall back on Features-Based Classification.
- Pagination - This document can have any number of pages.
- Prioritize EPI - Respects EPI if it is present; otherwise relies on EPI training from Features training.
- Secondary Page Extractor - Determines whether the page is something other than the first or last page of the document. If this extractor has a result, Separation appends this page to the page above it and understands the next page will be another secondary page, a last page, or the start of a new document.

The Content Model has an EPI Extractor configured (not shown here).

Example Number Two

- Allow Training - No lexical features (as set on the Content Model) are considered for classification or separation.
- Positive Extractor - Separation will only occur when the Positive Extractor has a result. Rule-Based Classification only.
- Pagination - Automatically appends a specified number of pages using training.
- Repeating Last Page - With Structured Pagination and this property set to True, repeating last pages will be appended. Separation will separate using page-for-page trained samples only. Copies of the last page, such as signed pages from multiple parties, may exist and will be appended.

Example Number Three

- Allow Training - Lexical features (as set on the Content Model) are considered for classification and separation.
- Positive Extractor - Rule-Based Classification will occur and fall back on Features-Based Classification.
- Negative Extractor - A result from this extractor will exclude this Document Type as a classification option.
- Pagination - This document can have any number of pages.
- Training Scope - The features on page 1 will be the only features saved in training. If Rule-Based Classification fails, only the first page's features of a trained sample are used. When separation detects page 1 of this Document Type, all subsequent pages will be appended until another recognized Document Type is identified.

The Content Model has an EPI Extractor configured (not shown here) and creates a hybrid of Rule-Based and Feature-Based Classification.

Example Number Four

- Allow Training - Lexical features (as set on the Content Model) are considered for classification and separation.
- Positive Extractor - Rule-Based Classification will occur and fall back on Features-Based Classification.
- Negative Extractor - A result from this extractor will exclude this Document Type as a classification option.
- Pagination - This document can have any number of pages.
- Prioritize EPI - Respects EPI if it is present; otherwise relies on EPI training from Features training.
- Secondary Page Extractor - Determines whether the page is something other than the first or last page of the document. If this extractor has a result, Separation appends this page to the page above it and understands the next page will be another secondary page, a last page, or the start of a new document.

The Content Model has an EPI Extractor configured (not shown here) and creates a hybrid of Rule-Based and Feature-Based Classification.

Example Number Five

- Allow Training - Lexical features (as set on the Content Model) are considered for classification and separation.
- Positive Extractor - Rule-Based Classification will occur and fall back on Features-Based Classification.
- Pagination - This document can have any number of pages.
- Secondary Page Extractor - Determines whether the page is something other than the first or last page of the document. If this extractor has a result, Separation appends this page to the page above it and understands the next page will be another secondary page, a last page, or the start of a new document.

The Content Model has an EPI Extractor configured (not shown here) and creates a hybrid of Rule-Based and Feature-Based Classification.