ESP Auto Separation (Separation Provider)

From Grooper Wiki
WIP This article is a work-in-progress. It may be unfinished. It may abruptly stop in the middle of a section. It may contain inaccurate information.


ESP Auto Separation is one of Grooper's Separation Providers used for document separation. It leverages several different aspects of documents to determine where one document starts and the next begins in a Batch of loose pages, including classification data, the document's pagination structure, extracted page numbers, and rules for merging one Document Type with another. ESP Auto Separation is also one of the few Separation Providers that both separates and classifies documents at the same time, during the Separate activity.

ESP Auto Separation (often referred to simply as ESP) is often seen as the most effort-intensive Separation Provider. It is highly configurable, and not all of that configuration is done on the Separate step or a Separation Profile: most of its functionality is determined by the associated Content Model's configuration. However, it is often the solution for the most complicated separation and classification challenges. ESP is extremely useful for document sets with a variety of structured, semi-structured, and unstructured documents.

About

There are four main components to ESP Auto Separation. Two of them are "core" functionalities, which are critical to understanding where the provider establishes separation points in a Batch. Two of them are "optional" functionalities, which can change the normal separation logic somewhat.

Core Functionality

  1. Classification Data - Where's the first page of a document?
    • The basic idea behind ESP's core functionality is finding the first page of a document. If you know where the first page of a document is, you know where it starts. Where does it stop? At the next first page of a document. The first page is found using trained examples of a Document Type and the Lexical classification method (or, in some configurations, the Positive Extractor property of the Document Type).
      • Document Types are set up in a Content Model. Document examples are trained using the "Classification Testing" tab. Training data is stored as one or more Form Type objects as children of the trained document's Document Type.
  2. Pagination - What do the rest of the pages look like?
    • Once you've found the first page of a document, the subsequent pages are going to look very different from document to document.
      • Some documents are highly structured where each page looks more or less the same, just with different data entered in each field. For example, most fillable forms.
      • Some are less structured, where there might be some consistency in certain parts of the document but its composition or length may change drastically from one document to another. For example, invoices. The header of a company's invoice looks more or less the same from document to document. You'll always find the invoice date, invoice number, and other information in the same place. But the line items are going to be very different depending on what was ordered, so the page length varies.
      • Some are highly unstructured, using sentences and paragraphs to detail information rather than structured form fields. For example, letters and contracts.
    • The Pagination property of a Document Type will alter how the ESP Auto Separation provider considers including subsequent pages behind the detected first page of a document in the Batch Folder. Each option behaves differently to account for this variety in document pagination structure.
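The first-page idea described above can be sketched in a few lines of Python. This is only an illustration of the concept, not Grooper's implementation: the `Page` tuple, the `split_on_first_pages` function, and the sample Batch are hypothetical names invented for this example, and the sketch ignores the Pagination property entirely (every non-first page simply joins the open folder).

```python
from typing import List, Optional, Tuple

# Hypothetical page record: (page_id, first_page_of).
# first_page_of is the Document Type whose trained first page this page
# matched, or None if the page did not match any first page.
Page = Tuple[str, Optional[str]]


def split_on_first_pages(pages: List[Page]) -> List[dict]:
    """Start a new folder at every first-page match; other pages join the open folder."""
    folders: List[dict] = []
    for page_id, first_page_of in pages:
        if first_page_of is not None:
            # A first-page match is a separation point.
            folders.append({"doc_type": first_page_of, "pages": [page_id]})
        elif folders:
            folders[-1]["pages"].append(page_id)
        # Pages seen before any first-page match stay loose in this sketch.
    return folders


batch = [("p1", "Invoice"), ("p2", None), ("p3", "Letter"), ("p4", None), ("p5", None)]
folders = split_on_first_pages(batch)
```

Here `folders` holds an "Invoice" folder with two pages and a "Letter" folder with three. In ESP itself, whether a non-first page is actually included depends on the Document Type's Pagination setting.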

Optional Functionality

  1. The EPI Extractor - Using page numbers to help separation decision making.
    • If page numbers are present on a document, and Grooper can locate those page numbers using an extractor, ESP Auto Separation can use these numbers to help make decisions about where separation can occur. If ten pages in a row have a page number on them, you know that document should be ten pages long. With an EPI Extractor set on the Content Model, ESP will also know to take page numbers into consideration when establishing Batch Folders in the Batch.
    • In some cases, extracted page numbers can also override the provider's normal separation logic. When secondary pages are misclassified as first pages, a single document ends up split across multiple folders. You can prioritize page numbers over first-page classification results so the page number sequence is respected instead.
  2. Attachment Rules - Appending or prepending one Document Type to another.
    • Sometimes, it can be useful to classify a Document Type as a document that should be merged with another document. The Attachment Rules settings of the ESP Auto Separation provider can append or prepend a Document Type to one or more Document Types if it comes before or after. For example, contracts often have "Exhibits" after them that should be attached to the contract. If you have an "Exhibit" Document Type coming after a "Contract" Document Type in a Batch, you can set up Attachment Rules to append the exhibit to the contract, resulting in a single document.
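The contract and exhibit example can be sketched as a merge pass over folders that have already been separated. This is a hedged illustration only: `apply_append_rules`, the `append_to` mapping, and the folder dictionaries are hypothetical and do not reflect how Attachment Rules are stored or evaluated in Grooper. Only the "append" direction is shown.

```python
from typing import Dict, List, Set


def apply_append_rules(folders: List[dict], append_to: Dict[str, Set[str]]) -> List[dict]:
    """Merge a folder into the preceding one when a rule says its type appends to that type."""
    merged: List[dict] = []
    for folder in folders:
        targets = append_to.get(folder["doc_type"], set())
        if merged and merged[-1]["doc_type"] in targets:
            # The preceding document accepts this type as an attachment.
            merged[-1]["pages"].extend(folder["pages"])
        else:
            merged.append({"doc_type": folder["doc_type"], "pages": list(folder["pages"])})
    return merged


# Hypothetical rule: an "Exhibit" is appended to a "Contract" that comes before it.
rules = {"Exhibit": {"Contract"}}
separated = [
    {"doc_type": "Contract", "pages": ["p1", "p2"]},
    {"doc_type": "Exhibit", "pages": ["p3"]},
    {"doc_type": "Exhibit", "pages": ["p4"]},
    {"doc_type": "Invoice", "pages": ["p5"]},
]
result = apply_append_rules(separated, rules)
```

Both exhibits end up inside the contract folder, and the invoice is left alone. A prepend rule would work the same way in the other direction, merging a folder into the one that follows it.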

Separation Properties of a Document Type

The Separation properties of the Document Types in a Content Model are critically important to ESP Auto Separation. These properties drastically impact how ESP Auto Separation folders Batch Pages into document Batch Folders in a Batch.

FYI These properties only apply to ESP Auto Separation. No other Separation Provider makes use of these properties. If you are not using ESP Auto Separation to separate loose pages into document folders, you can completely ignore this set of properties.

Pagination

First and foremost, this is where you define the Pagination property of a document. This can be one of four options:

  • Structured
  • Fixed
  • Extended
  • Unstructured

Each option impacts how secondary pages are foldered after the initial separation point is established. The separation point is established by matching the first page of the Document Type or, more specifically, a "Page 1" Page Type of a Form Type of the Document Type.

Form Types and Page Types

To understand these Pagination types, it's important to understand the Form Type object. This is an often overlooked object in Grooper. The basic function of a Form Type is to store the training data for trained example documents. When training documents using the Lexical classification method, the TF-IDF weighting values of text features are stored on a Form Type, as well as a copy of the trained document's page images (if nothing else, so you can visually inspect what document you trained!).
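For readers unfamiliar with TF-IDF, the following is a minimal sketch of the textbook calculation. The article does not state the exact weighting formula Grooper uses, so treat this as an assumption-laden illustration of the general technique. The `tf_idf` function and the two sample "pages" are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    n = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights


# Two hypothetical trained pages, already split into word features.
pages = [
    "invoice number invoice date total".split(),
    "lease agreement lessor lessee date".split(),
]
weights = tf_idf(pages)
```

The word "date" appears on both pages, so it receives a weight of zero and does nothing to tell the two apart. The word "invoice" appears twice on one page and never on the other, so it receives the highest weight on that page.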


For standard document classification, the Classify activity runs on the Batch Folder level. The entire document's text features are compared to the entire text features of trained example documents.


ESP Auto Separation works a little differently. Yes, classification data is used to establish separation points. However, instead of performing folder level classification, it performs page level classification.

This is where a Form Type's child objects come into play: the Page Type. The Page Type objects are critical for making separation decisions during ESP Auto Separation.


Most importantly, for all Pagination types, are the trained examples of a "Page 1" Page Type. ESP will look to the training data on these objects to establish the primary separation points. Whenever a loose page matches a "Page 1" Page Type, the ESP Auto Separation provider will create a new Batch Folder for the new document.


The different Pagination types determine if the subsequent secondary pages are appended to the document folder. Each one makes the determination a little differently, allowing you to alter the document separation depending on the Document Type's pagination structure.


Some Pagination types require the secondary pages to match the trained example Page Types more explicitly than others. Some are more forgiving to variable length documents or unconfident secondary page matches.

Structured

If a Document Type has a Structured pagination, documents should match a trained example page-for-page during ESP Auto Separation.

This is applicable for documents whose form or structure is always set. They will always have the same number of pages as the trained example (saved as a Document Type's Form Type). Furthermore, each page should look more or less the same from document to document. All first pages should look the same. All second pages should look the same. And so on.

These typically are standardized forms. For example, government forms such as a W-4 withholding form. At least for a given year, all those forms will be the same number of pages and structure. Aside from the specific information filled in each field, one W-4 is no different from the other.

FYI While the document should match page-for-page using the Structured Pagination type, Grooper does build in some "wiggle room". You will find cases during ESP Auto Separation where a Structured Document Type will include a page (sometimes even an extra page) that does not confidently match one of the secondary Page Type examples. This is to allow better document separation when confronted with poor OCR results, resulting in poor classification matches. Or, to allow for some variance in document structure where additional pages may or may not be present.

However, the Structured Pagination type should not include pages that confidently match "Page 1" examples as secondary pages. "Page 1" matches are generally "fixed points" for folder creation during ESP Auto Separation. In almost all cases, when a page matches a "Page 1" example of a Document Type, a new folder will be created.

Fixed

The Fixed Pagination option separates documents according to a set number of pages during ESP Auto Separation. Fixed paginated documents can be considered a special variety of structured documents. Documents with a Structured pagination are expected to match (more or less) page for page. So, the first page should match a trained first page, the second a trained second page, and so on.

Fixed pagination differs in two ways.

  1. The page count is always the same (or fixed), specified by the Page Count property.
    • For the Structured pagination, trained examples are expected to match page-for-page according to the page length of each example. One trained example may be four pages, resulting in a Form Type with four Page Types. One may be seven. The Structured pagination allows for variance, as long as the pages match one or the other Form Type.
    • Fixed pagination is used for documents expected to be the same number of pages every time. If a company's job application form is always five pages long, a "Job Application" Document Type should be created with a Fixed pagination and a page count of "5".
  2. Only the first page needs to match to create a new Batch Folder and folder the subsequent pages.
    • Once a positive first page is matched, the remaining number of pages in the page count will be included in the folder (unless one of those pages positively matches a first page of the same or another Document Type, in which case a new Batch Folder is created).
      • Even if the secondary pages are classified as secondary pages of another Document Type (or not classified at all), as long as they are not the first page of a Document Type, they will be included in the Batch Folder.
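The two Fixed pagination rules above can be sketched as follows. This is an illustration under stated assumptions, not Grooper's code: `folder_fixed`, the `Page` tuple, and the sample Batch are hypothetical, and the sketch says nothing about what happens to a page left over after the page count is reached ("p6" below).

```python
from typing import List, Optional, Tuple

# Hypothetical page record: (page_id, Document Type matched as a first page, or None).
Page = Tuple[str, Optional[str]]


def folder_fixed(pages: List[Page], start: int, page_count: int) -> List[str]:
    """Collect up to page_count pages from start, stopping early at another first-page match."""
    folder = [pages[start][0]]
    for page_id, first_page_of in pages[start + 1 : start + page_count]:
        if first_page_of is not None:
            break  # a first-page match starts a new folder instead
        # Secondary pages are included no matter how they classified.
        folder.append(page_id)
    return folder


batch = [
    ("p1", "Job Application"), ("p2", None), ("p3", None), ("p4", None), ("p5", None),
    ("p6", None),
    ("p7", "Job Application"), ("p8", None), ("p9", "Invoice"),
]
full = folder_fixed(batch, start=0, page_count=5)
cut_short = folder_fixed(batch, start=6, page_count=5)
```

The first call fills all five pages. The second stops after two pages because "p9" matches the first page of another Document Type.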

Extended

Extended pagination can aid separation of documents of variable length during ESP Auto Separation. Once a page matches the trained first page of an Extended paginated Document Type, a new Batch Folder is created. Subsequent Batch Pages are placed in that folder until one classifies as a first page of a Document Type, at which point a new Batch Folder is created. This can result in a single-page or multiple-page document.

Unstructured

Document Types with an Unstructured pagination are used to separate documents with an unstructured form and variable length during ESP Auto Separation. Unstructured documents differ from structured ones in that the semantic information, while present, isn't always found in the same location or presented the same way across multiple documents. They use sentences and paragraphs rather than fixed or semi-fixed fields to convey and record information.

Contracts are an example of unstructured documents. While there are similarities across types of contracts (an Oil and Gas Lease Agreement for example), how and where that information is presented varies from one contract to another. There may be different clauses in the contract in different locations from one to another or removed altogether. One contract may be one page, one may be three pages, one may be twenty, but all are the same kind of contract.

During separation, a new folder will be created if it matches the first page of trained example documents. The document will extend to include all pages meeting the minimum confidence similarity for the middle and last pages of the trained examples of the Document Type.

The Unstructured Pagination type creates Form Type objects and their child Page Type objects very differently from the other Pagination types. Instead of a Form Type for each trained example for each page length, each with its own "Page 1", "Page 2", etc. Page Type objects, a single Form Type with three Page Types is created: "First", "Middle", and "Last". For unstructured Document Types, what usually makes them unique is found on the first and last pages. The middle pages tend to blur from document to document. For example, Page 2 might look like Page 7 of another document, but their Page 1 and last pages will look more or less the same. Taking advantage of this characteristic of unstructured documents, we have found that placing all example pages into three categories ("First", "Middle", and "Last") of a single Form Type improves both classification results and processing efficiency.

  • Note: This functionality is changed somewhat by the Training Scope property, to be discussed later.
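The Unstructured separation behavior described above (start at a first-page match, then keep pages that meet a minimum similarity to the trained middle and last pages) can be sketched as below. Everything here is hypothetical: the `extend_unstructured` function, the page dictionaries, the similarity scores, and the 0.6 threshold are invented for illustration. Stopping at the first page that falls below the threshold is also an assumption of this sketch, not a documented rule.

```python
from typing import Dict, List


def extend_unstructured(pages: List[Dict], start: int, min_confidence: float) -> List[str]:
    """From a first-page match at start, include following pages whose
    similarity to the trained Middle/Last pages meets min_confidence."""
    folder = [pages[start]["id"]]
    for page in pages[start + 1:]:
        if page["is_first_page"]:
            break  # a new document starts here
        if page["secondary_similarity"] < min_confidence:
            break  # assumption: the document ends at the first weak match
        folder.append(page["id"])
    return folder


# secondary_similarity: hypothetical best similarity to the Document Type's
# "Middle" or "Last" Page Type training.
batch = [
    {"id": "p1", "is_first_page": True, "secondary_similarity": 0.0},
    {"id": "p2", "is_first_page": False, "secondary_similarity": 0.82},
    {"id": "p3", "is_first_page": False, "secondary_similarity": 0.74},
    {"id": "p4", "is_first_page": False, "secondary_similarity": 0.31},
]
document = extend_unstructured(batch, start=0, min_confidence=0.6)
```

With these made-up scores, the document covers "p1" through "p3" and "p4" is left out.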

**Prioritize EPI**

**Training Scope**

**Secondary Page Extractor**

Other Separation Properties

Training Scope

Adjusting the Training Scope provides benefits to the accuracy and performance of ESP Auto Separation by focusing on what is important when it comes time to separate and classify Unstructured paginated documents. For example, the Normal mode will create a single Form Type and divide trained examples into "First", "Middle" and "Last" pages. From individual document to individual document, often the most meaningful features composing them are found on the first and last pages, and there can be more variance on the pages in between. This is different from the previous approach, which created individual Form Types for each trained example, each with its own "Page X of X" Page Type objects. This unifies all trained examples into a single Form Type, making the training and classification of these documents ultimately simpler and more efficient. The FirstLast mode assumes meaningful features for classification are only found on the first and last pages, with the middle pages containing no information needed to make a separation or classification decision. With this mode enabled, only trained examples of the first and last page and their associated features will be saved. This can improve processing time by removing all the features in the middle pages from consideration. The FirstOnly mode narrows this scope even further by only storing features from the first page of trained documents.

Writer's Note: While these properties are available on Structured, Fixed, and Extended Pagination types, these Training Scopes are almost entirely pertinent to the Unstructured Pagination type. If you use these Training Scopes on anything besides an Unstructured Document Type, you may find the documents are always trained according to the "Normal" scope.

Normal - This is the classic version of capturing training features for a document type.

  • This mode differs slightly for Unstructured Pagination. Instead of trained middle pages (i.e. Page 2 of 4 and Page 3 of 4) being created as individual Page Type objects of their multipage Form Type parents, they are created as "Middle" Page Type objects of a single Form Type. Furthermore, last pages (i.e. 4 of 4 or 7 of 7) are combined into a single "Last" Page Type object.
  • This cleans up the Node Tree by creating a single unstructured Form Type for documents of various page lengths, instead of many. This will speed up runtime classification and can potentially yield more accurate results.

FirstLast - This is handy for training only the first and last page of a document type. It lowers the feature training requirements, improves speed, and allows middle pages between the first and last page to be combined with the first and last.

  • Note: This approach assumes the middle pages will not be classified as confident first pages of a Document Type. For ESP Auto Separation, a page classified as the first page of a document is a high-priority indicator of the start of a new document.

FirstOnly - If someone has used "Document Titles" extraction with a Positive Extractor for separation in the past, consider this property an upgrade to that approach.

  • A common approach to document classification is to write a positive rule extractor locating the document's title on the first page of the document. However, this approach can break down when an image is poor quality, causing the Document Type's title extractor to miss the title.
  • The FirstOnly mode can allow the continued capture of features in titles combined with other trained features on the page. That way, if the title is missed, the separation engine can fall back on other features from trained first pages of the Document Type.
  • This approach is highly similar to the Extended Pagination in concept, where the first page is used as the separation point and all pages after are included in the document folder until another confident first page is found. This approach differs in that only the first page is trained, eliminating the weighting data from the secondary pages. This can improve runtime classification speed greatly.
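The three Training Scope modes can be summarized with a small sketch showing which pages of a trained example contribute features under each mode. This is a hedged illustration only: `bucket_pages` and the page labels are hypothetical, and the handling of a single-page example (stored as "First" only) is an assumption, since the article does not cover that case.

```python
from typing import Dict, List


def bucket_pages(example: List[str], scope: str = "Normal") -> Dict[str, List[str]]:
    """Sort one trained example's pages into First/Middle/Last and keep
    only the buckets the given Training Scope retains."""
    buckets = {"First": example[:1], "Middle": [], "Last": []}
    if len(example) > 1:
        buckets["Middle"] = example[1:-1]
        buckets["Last"] = example[-1:]
    if scope == "FirstLast":
        buckets["Middle"] = []  # middle-page features are not stored
    elif scope == "FirstOnly":
        buckets["Middle"] = []
        buckets["Last"] = []  # only first-page features are stored
    elif scope != "Normal":
        raise ValueError(f"unknown Training Scope: {scope}")
    return buckets


example = ["pg1", "pg2", "pg3", "pg4"]  # a hypothetical four-page trained contract
normal = bucket_pages(example, "Normal")
first_last = bucket_pages(example, "FirstLast")
first_only = bucket_pages(example, "FirstOnly")
```

Under Normal, all four pages are kept across the three buckets. FirstLast drops "pg2" and "pg3", and FirstOnly keeps just "pg1". Fewer stored pages means fewer features to compare at runtime, which is where the speed benefit described above comes from.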

Repeating Last Page

  • Use this feature for contracts whose signature pages are copied, distributed to the involved parties, signed, and returned to be stored with the contract document.
  • It also applies to any Document Type that may or will have a duplicate last page.

Secondary Page Extractor

  • False-positive classification frequently happens on pages besides the first page of a document. In other words, middle pages and last pages, or secondary pages.
  • This property allows you to use an extractor to configure rules when to attach a secondary page to a particular Document Type whose first page has already been identified during ESP Auto Separation.
  • Use this on multi-page Document Types using Unstructured Pagination. You can configure a Secondary Page Extractor on any... To take advantage of this, configure a Positive Extractor in the Classification section and configure the Secondary Page Extractor to identify the second page. Cheating’s allowed in Grooper.