ESP Auto Separation

From Grooper Wiki

ESP Auto Separation is one of Grooper's Separation Providers used for document separation. It leverages several different aspects of documents to determine where one document starts and the next begins in a Batch of loose pages, including classification data, the document's pagination structure, extracted page numbers, and rules for merging one Document Type with another. ESP Auto Separation is also one of the few Separation Providers that both separates and classifies documents at the same time, during the Separate activity.

ESP Auto Separation (often referred to simply as ESP) is often seen as the most effort-intensive Separation Provider. It is highly configurable, and not all of that configuration is done on the Separate step or a Separation Profile; most of its functionality is actually determined by the associated Content Model's configuration. However, it is often the solution for the most complicated separation and classification challenges. ESP is extremely useful for separating document sets with a variety of structured, semi-structured, and unstructured documents.

About

There are four main components to ESP Auto Separation. Two of them are "core" functionalities, which are critical to understanding where the provider establishes separation points in a Batch. Two of them are "optional" functionalities, which can change the normal separation logic somewhat.

Core Functionality

  1. Classification Data - Where's the first page of a document?
    • The basic idea behind ESP's core functionality is finding the first page of a document. If you know where the first page of a document is, you know where it starts. Where does it stop? At the next first page of a document. How ESP finds the first page is determined through trained examples of a Document Type using the Lexical classification method (or, in some configurations, the Positive Extractor property of the Document Type).
      • Document Types are set up in a Content Model. Document examples are trained using the "Classification Testing" tab. Training data is stored as one or more Form Type objects as children of the trained document's Document Type.
  2. Pagination - What do the rest of the pages look like?
    • Once you've found the first page of a document, the subsequent pages are going to look very different from document to document.
      • Some documents are highly structured where each page looks more or less the same, just with different data entered in each field. For example, most fillable forms.
      • Some are less structured, where there might be some consistency in certain parts of the document but its composition or length may change drastically from one document to another. For example, invoices. The header of a company's invoice looks more or less the same from document to document. You'll always find the invoice date, invoice number, and other information in the same place. But the line items are going to be very different depending on what was ordered, so the page length varies.
      • Some are highly unstructured, using sentences and paragraphs to detail information rather than structured form fields. For example, letters and contracts.
    • The Pagination property of a Document Type will alter how the ESP Auto Separation provider considers including subsequent pages behind the detected first page of a document in the Batch Folder. Each option behaves differently to account for this variety in document pagination structure.

Optional Functionality

  1. The EPI Extractor - Using page numbers to help separation decision making.
    • If page numbers are present on a document, and Grooper can locate those page numbers using an extractor, ESP Auto Separation can use these numbers to help make decisions about where separation can occur. If ten pages in a row have a page number on them, you know that document should be ten pages long. With an EPI Extractor set on the Content Model, ESP will also know to take page numbers into consideration when establishing Batch Folders in the Batch.
    • In some cases, extracted page numbers can be used to override the provider's normal separation logic as well when secondary pages are misclassified as first pages, resulting in multiple folders for a single document. You can prioritize page numbers over first page classification results and respect the page number sequence instead.
  2. Attachment Rules - Appending or prepending one Document Type to another.
    • Sometimes, it can be useful to classify a Document Type as a document that should be merged with another document. The Attachment Rules settings of the ESP Auto Separation provider can append or prepend a Document Type to one or more Document Types if it comes before or after. For example, contracts often have "Exhibits" after them that should be attached to the contract. If you have an "Exhibit" Document Type coming after a "Contract" Document Type in a Batch, you can set up Attachment Rules to append the exhibit to the contract, resulting in a single document.
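
The append case described above can be pictured as a pass over folders that have already been separated and classified. The following is only a minimal sketch of that idea, not Grooper's implementation: the `Folder` class, the `apply_append_rules` function, and the dictionary used to represent the rules are hypothetical names chosen for illustration, and the sketch assumes that several consecutive attachments all merge into the same parent.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class Folder:
    doc_type: str        # Document Type assigned during separation
    pages: List[int]     # indexes of the Batch Pages in the folder

def apply_append_rules(folders: List[Folder],
                       append_rules: Dict[str, Set[str]]) -> List[Folder]:
    """Merge a folder into the one before it when a rule allows it.

    append_rules maps an attachment type to the types it may follow,
    e.g. {"Exhibit": {"Contract"}} (a hypothetical representation).
    """
    merged: List[Folder] = []
    for folder in folders:
        allowed = append_rules.get(folder.doc_type, set())
        if merged and merged[-1].doc_type in allowed:
            merged[-1].pages.extend(folder.pages)    # attach to the parent
        else:
            merged.append(Folder(folder.doc_type, list(folder.pages)))
    return merged

separated = [Folder("Contract", [0, 1, 2]), Folder("Exhibit", [3]),
             Folder("Exhibit", [4, 5]), Folder("Invoice", [6])]
result = apply_append_rules(separated, {"Exhibit": {"Contract"}})
print([(f.doc_type, f.pages) for f in result])
# [('Contract', [0, 1, 2, 3, 4, 5]), ('Invoice', [6])]
```

A prepend rule could be modeled the same way by looking at the folder that follows instead of the one that precedes.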

Separation Properties of a Document Type

The Separation properties of the Document Types in a Content Model are critically important to ESP Auto Separation. These properties drastically impact how ESP Auto Separation folders Batch Pages into document Batch Folders in a Batch.

FYI These properties only apply to ESP Auto Separation. No other Separation Provider makes use of these properties. If you are not using ESP Auto Separation to separate loose pages into document folders, you can completely ignore this set of properties.

Esp-auto-separation-about-01.png

Pagination

First and foremost, this is where you define the Pagination property of a document. This can be one of four options:

  • Structured
  • Fixed
  • Extended
  • Unstructured

Each option impacts how secondary pages are foldered after the initial separation point is established (by matching the first page of the Document Type or, more specifically, a "Page 1" Page Type of a Form Type of the Document Type).

Esp-auto-separation-about-02.png

Form Types and Page Types

To understand these Pagination types, it's important to understand the Form Type object. This is an often overlooked object in Grooper. The basic function of a Form Type is to store the training data for trained example documents. When training documents using the Lexical Classification Method, the TF-IDF weighting values of text features are stored on a Form Type, as well as a copy of the trained document's page images (if nothing else, so you can visually inspect what document you trained!).


For standard document classification, the Classify activity runs on the Batch Folder level. The entire document's text features are compared to the entire text features of trained example documents.


ESP Auto Separation works a little differently. Yes, classification data is used to establish separation points. However, instead of performing folder level classification, it performs page level classification.

Esp-auto-separation-about-03.png

Esp-auto-separation-about-04.png

This is where a Form Type's child objects come into play: the Page Type. The Page Type objects are critical for making separation decisions during ESP Auto Separation.


Most importantly, for all Pagination types, are the trained examples of a "Page 1" Page Type. ESP will look to the training data on these objects to establish the primary separation points. Whenever a loose page matches a "Page 1" Page Type, the ESP Auto Separation provider will create a new Batch Folder for the new document.


The different Pagination types determine if the subsequent secondary pages are appended to the document folder. Each one makes the determination a little differently, allowing you to alter the document separation depending on the Document Type's pagination structure.


Some Pagination types require the secondary pages to match the trained example Page Types more explicitly than others. Some are more forgiving to variable length documents or unconfident secondary page matches.

Esp-auto-separation-about-05.png

Esp-auto-separation-about-06.png

Structured

If a Document Type has a Structured pagination, documents should match a trained example page-for-page during ESP Auto Separation.

This is applicable for documents whose form or structure is always set. They will always have the same number of pages as the trained example (saved as a Document Type's Form Type). Furthermore, each page should look more or less the same from document to document. All first pages should look the same. All second pages should look the same. And so on.

These typically are standardized forms. For example, government forms such as a W-4 withholding form. At least for a given year, all those forms will be the same number of pages and structure. Aside from the specific information filled in each field, one W-4 is no different from the other.

Imagine you have two different Document Types, the "Blue" Document Type and the "Orange" Document Type. You train one example document for each Document Type.

"Blue" documents are four page documents. "Orange" documents are two page documents.

Both have their Pagination property set to Structured.

Esp-auto-separation-about-07.png

The ESP Auto Separation provider classifies each page according to the training data for each Document Type. The first page in the Batch matches the "Page 1" example of the "Blue" Document Type.

When a page in a Batch matches a trained example's first page, ESP uses this as a separation point to create a new Batch Folder.

Esp-auto-separation-about-08.png

At this point, the provider establishes where the document's Batch Folder should be inserted.


How the subsequent secondary pages are appended to the folder is determined by the Pagination type chosen for the Document Type.

Esp-auto-separation-about-09.png

The Structured pagination type assumes the document should match the trained examples page-for-page.

Here, the subsequent three pages match the subsequent "Blue" example pages. Page 2 matches Page 2. Page 3 matches Page 3. Page 4 matches Page 4.

Esp-auto-separation-about-10.png

Since each of the following pages in the Batch matches each of the following example document's pages, the pages are applied to the document folder.

We now have officially separated four loose pages into a document folder using page level classification!

Esp-auto-separation-about-11.png

Furthermore, since classification data is used to make page by page separation decisions, Grooper can go one step further. As well as establishing the separation points for folder creation, the ESP Auto Separation provider will also classify the created folders as well.

Since these pages were foldered using the "Blue" Document Type's training data, the folder is assigned the "Blue" Document Type.

Esp-auto-separation-about-12.png

The ESP provider continues through the Batch page by page, looking for positive first page matches to establish separation points for folder creation.


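The next first page match belongs to the "Orange" Document Type, so a new folder is created. Because "Orange" is also a Structured paginated Document Type and the subsequent page matches the second page of the "Orange" trained document, that page is applied to the new folder.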


After all qualifying pages have been appended to the new folder, it is assigned the "Orange" Document Type, the Document Type whose training data matched Page 1 of the document.

Esp-auto-separation-about-13.png


FYI While the document should match page-for-page using the Structured Pagination type, Grooper does build in some "wiggle room".
  1. Secondary pages are allowed to be out of order. If the example document's pages follow a sequence of "1, 2, 3, 4", but the pages in the batch follow a sequence of "1, 4, 3, 2", the pages will still be placed in a new document folder. The important thing is that the first page in the sequence is Page 1, and the following three pages match one of the three secondary pages, not necessarily that they match in that order.
  2. You will also find cases during ESP Auto Separation where a Structured Document Type will include a page (sometimes even an extra page) that does not confidently match one of the secondary Page Type examples. This allows better document separation when poor OCR results lead to poor classification matches, and allows for some variance in document structure where additional pages may or may not be present.
However, the Structured Pagination type should not include pages that confidently match "Page 1" examples as secondary pages. "Page 1" matches are generally "fixed points" for folder creation during ESP Auto Separation. Except in certain cases where you specifically choose to override this behavior, when a page matches a "Page 1" example of a Document Type, a new folder will always be created.
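
The page-for-page matching described in this section, including the tolerance for out-of-order secondary pages, can be sketched roughly as follows. This is a simplified model under stated assumptions, not Grooper's actual logic: each page is represented by a hypothetical `(Document Type, trained page number)` tuple standing in for a page-level classification result, `None` stands for a page with no confident match, and the sketch is strict about unmatched pages (it leaves them loose and ignores the second kind of "wiggle room").

```python
from typing import Dict, List, Optional, Tuple

# Page-level result: (Document Type, trained page number), or None if unmatched.
Page = Optional[Tuple[str, int]]

def separate_structured(pages: List[Page],
                        page_counts: Dict[str, int]) -> List[Tuple[str, List[int]]]:
    """Folder pages that match a trained example page-for-page, in any order."""
    folders: List[Tuple[str, List[int]]] = []
    i = 0
    while i < len(pages):
        page = pages[i]
        if page is None or page[1] != 1:
            i += 1                      # not a first page: left loose
            continue
        doc_type = page[0]
        expected = set(range(2, page_counts[doc_type] + 1))
        members = [i]
        i += 1
        while i < len(pages):
            nxt = pages[i]
            if nxt is None or nxt[0] != doc_type or nxt[1] not in expected:
                break                   # includes any "Page 1" match
            expected.discard(nxt[1])    # each secondary page is used once
            members.append(i)
            i += 1
        folders.append((doc_type, members))
    return folders

# "Blue" pages arrive as 1, 4, 3, 2 and are still foldered together.
batch = [("Blue", 1), ("Blue", 4), ("Blue", 3), ("Blue", 2),
         ("Orange", 1), ("Orange", 2)]
print(separate_structured(batch, {"Blue": 4, "Orange": 2}))
# [('Blue', [0, 1, 2, 3]), ('Orange', [4, 5])]
```

Because a "Page 1" match is never in the set of expected secondary pages, it always ends the current folder, which mirrors the "fixed point" behavior described above.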

Repeating Last Page

The Structured pagination type has a unique configurable property to account for documents that may have a last page that repeats. For example, some forms may have a signature page sent to multiple parties to sign and send back to the sender. The Repeating Last Page property can be set to True when a Document Type has duplicate last pages. This will instruct ESP Auto Separation to extend the document folder, appending these repeating last pages.
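
As a rough illustration of what extending a folder with repeated last pages could look like, here is a small sketch. It is an assumption-laden model rather than the provider's real behavior: pages are hypothetical `(Document Type, trained page number)` tuples, and the function name and parameters are invented for this example.

```python
from typing import List, Optional, Tuple

# Page-level result: (Document Type, trained page number), or None if unmatched.
Page = Optional[Tuple[str, int]]

def extend_with_repeating_last_page(pages: List[Page], folder: List[int],
                                    doc_type: str, last_page: int) -> List[int]:
    """Append pages after the folder while they match the trained last page."""
    extended = list(folder)
    i = folder[-1] + 1
    while i < len(pages) and pages[i] == (doc_type, last_page):
        extended.append(i)
        i += 1
    return extended

# A two-page form whose signature page (trained Page 2) came back three times.
batch = [("Form", 1), ("Form", 2), ("Form", 2), ("Form", 2), ("Other", 1)]
print(extend_with_repeating_last_page(batch, [0, 1], "Form", 2))
# [0, 1, 2, 3]
```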

Fixed

The Fixed Pagination option separates documents according to a set number of pages during ESP Auto Separation. Fixed paginated documents can be considered a special variety of structured documents. Documents with a Structured pagination are expected to match (more or less) page for page. So, the first page should match a trained first page, the second a trained second page, and so on.

Fixed pagination differs in two ways.

  1. The page count is always the same (or fixed), specified by the Page Count property.
    • For the Structured pagination, trained examples are expected to match page-for-page according to the page length of each example. One trained example may be four pages, resulting in a Form Type with four Page Types. One may be seven, resulting in a second Form Type with seven Page Type objects. The Structured pagination allows for variance, as long as the pages match one or the other Form Type.
    • Fixed pagination is used for documents expected to be the same number of pages every time. If a company's job application form is always five pages long, a "Job Application" Document Type should be created with a Fixed pagination and a page count of "5".
  2. Only the first page needs to match to create a new Batch Folder and folder the subsequent pages.
    • Once a positive first page is matched, the remaining number of pages in the page count will be included in the folder (unless one of those pages positively matches a first page of the same or another Document Type, in which case a new Batch Folder is created).
      • Even if the secondary pages are classified as secondary pages of another Document Type (or not classified at all), as long as they are not the first page of a Document Type, they will be included in the Batch Folder.

Imagine we have our "Blue" and "Orange" Document Types. Both of them have their Pagination property set to Structured.


However, in this case, the "?" pages after the Blue Page 2 are not confidently classified. They may have missed the classification threshold for a Page 3 or Page 4. They may just be random pages not clearly modeled by training data that should nonetheless be included in the folder with Page 1 and 2 above them.

Esp-auto-separation-about-14.png

If set to Structured, the ESP Auto Separation provider will still create a folder for the "Blue" document, using the matching Page 1 as the initial separation point. Page 2 also matches Page 2 of the "Blue" example document. So, it is also included in the folder.


However, the two subsequent "?" pages do not match anything and are left as loose pages.


The Orange pages do match the trained example document and separate just fine.

Esp-auto-separation-about-15.png

If instead, you choose Fixed for the Pagination type and set its Page Count to 4, these pages are separated quite differently.


Once the initial separation point is established by matching the first page of the "Blue" Document Type's trained first pages, it doesn't matter that the subsequent pages did not confidently classify as secondary pages of the "Blue" Document Type. The folder will simply consume up to four pages, until another page matches a Page 1 of a Document Type.


This gives us a four page document, even though only the first two pages match the trained example for the "Blue" Document Type.

Esp-auto-separation-about-16.png

Be aware that the Fixed pagination will append pages up to the assigned Page Count.


Take this Batch for example. Assume we've set the "Blue" Document Type to the Fixed Pagination with a Page Count of 4.

Esp-auto-separation-about-17.png

ESP Auto Separation will create a Batch Folder with only three Batch Pages instead of four.


The page after the "?" page matches a first page example of a Document Type. This supersedes the Page Count. ESP considers a confident Page 1 match to be more important to create the new folder, rather than appending it to the 4 page Fixed Document Type above it.

Esp-auto-separation-about-18.png

This also means document folders created with the Fixed Pagination type cannot exceed the Page Count assigned.

Here, there is a page confidently matching the Blue Page 4 of the trained example document. However, it falls outside of the Page Count set at 4.

Esp-auto-separation-about-19.png

Notice the Blue Page 4 in the Batch is left out of the folder. It's just a loose page in the Batch after separation. Even though it was classified accurately as "Blue Page 4", placing it in the folder would make it a five page document. This would exceed the Page Count of 4 for this Fixed Pagination configuration.

Esp-auto-separation-about-20.png
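
The three Fixed pagination behaviors walked through above (consuming unmatched pages, stopping early at a first page, and never exceeding the Page Count) can be captured in a short sketch. This is a simplified model for illustration only, not the provider's implementation. Pages are hypothetical `(Document Type, trained page number)` tuples, `None` stands for a "?" page, and the three example batches are invented to resemble the scenarios described in the text.

```python
from typing import List, Optional, Tuple

# Page-level result: (Document Type, trained page number), or None if unmatched.
Page = Optional[Tuple[str, int]]

def fixed_folder(pages: List[Page], start: int, page_count: int) -> List[int]:
    """Return the page indexes foldered for a Fixed document opened at `start`."""
    members = [start]
    i = start + 1
    while i < len(pages) and len(members) < page_count:
        if pages[i] is not None and pages[i][1] == 1:
            break                 # a confident Page 1 match wins over Page Count
        members.append(i)         # any other page is consumed, matched or not
        i += 1
    return members

# Only the first two pages match "Blue"; the "?" pages are still consumed.
print(fixed_folder([("Blue", 1), ("Blue", 2), None, None, ("Orange", 1)], 0, 4))
# [0, 1, 2, 3]

# A first page arrives early, so the folder holds three pages, not four.
print(fixed_folder([("Blue", 1), ("Blue", 2), None, ("Orange", 1)], 0, 4))
# [0, 1, 2]

# A fifth page is never added, even though it matches "Blue" Page 4.
print(fixed_folder([("Blue", 1), ("Blue", 2), None, None, ("Blue", 4)], 0, 4))
# [0, 1, 2, 3]
```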

Extended

Extended pagination can aid separation of documents of variable length during ESP Auto Separation. Once a page matches the trained first page of an Extended paginated Document Type, a new Batch Folder is created. Subsequent Batch Pages are placed in that folder until one classifies as a first page of a Document Type, at which point a new Batch Folder is created. This can result in a single or multiple page document.

The Extended pagination can be very useful for documents of variable length, such as semi-structured documents like invoices. Generally, the first page of semi-structured documents will be different in enough ways from the secondary pages that it can be easily classified as a first page. Then, it doesn't necessarily matter what comes next in the batch. Until ESP matches a page to another Page 1 Page Type of a Document Type, it will keep adding pages to the created folder.

This Pagination type can also be useful for forms that can have a lot of extra documents optionally attached to them. For example, certain types of application may have optional reference material attached to them. If you don't want to classify that material as its own type of document, but know it always follows the application, you can use the Extended mode to just attach those unclassified pages to the properly classified and foldered application.

In this Batch, there are two "Orange" first pages.


After the first one, at the top of the Batch, there's a secondary page of the "Orange" Document Type followed by a bunch of unclassified pages, and a secondary page of the "Blue" Document Type as well. And then finally there's another first page of an "Orange" document.


Imagine we set the "Orange" Document Type to use the Extended Pagination mode in this case.

Esp-auto-separation-about-21.png

For Extended mode, the thing that really matters for folder creation is a confident first page match. Until another confident first page match is found for another document, it will keep applying pages to the folder.


This can result in multi-page documents, like the first "Orange" document, or a single page document like the second "Orange".


Note: The Extended mode will also folder secondary pages of another Document Type (such as the "Blue" Page 2 here). This can be good or bad depending on the situation. It is bad when a page following the Extended separation point should have started a new document but fails to classify as a first page of a Document Type. Without a confident first page match, ESP will fail to create the new separation point, and that page will get lumped in with the Extended document folder.

Esp-auto-separation-about-22.png
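
The Extended behavior in this example can be sketched in a few lines. The model below is only illustrative and assumes every Document Type in the Batch uses Extended pagination. Pages are hypothetical `(Document Type, trained page number)` tuples, with `None` for unclassified pages, and the batch is an invented stand-in for the one pictured.

```python
from typing import List, Optional, Tuple

# Page-level result: (Document Type, trained page number), or None if unmatched.
Page = Optional[Tuple[str, int]]

def separate_extended(pages: List[Page]) -> List[Tuple[str, List[int]]]:
    """Open a folder at each first page and extend it until the next one."""
    folders: List[Tuple[str, List[int]]] = []
    for index, page in enumerate(pages):
        if page is not None and page[1] == 1:
            folders.append((page[0], [index]))   # confident first page: new folder
        elif folders:
            folders[-1][1].append(index)         # anything else extends the folder
    return folders

batch = [("Orange", 1), ("Orange", 2), None, None, ("Blue", 2), None, ("Orange", 1)]
print(separate_extended(batch))
# [('Orange', [0, 1, 2, 3, 4, 5]), ('Orange', [6])]
```

Note how the "Blue" Page 2 at index 4 ends up in the first "Orange" folder. If that page had really been the start of a "Blue" document whose first page failed to classify, the same thing would happen, which is the failure case the note above warns about.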

Trigger On Any Page

The Extended pagination type has a unique configurable property to trigger folder creation on any page from the training example, not just the first page. With Trigger On Any Page set to True, a new folder will be created when any page matches the training data, regardless of whether it matches the first page of the example document or any other page.

Unstructured

Document Types with an Unstructured pagination are used to separate documents with an unstructured form and variable length during ESP Auto Separation. Unstructured documents differ from structured ones in that the semantic information, while present, isn't always found in the same location and the same way across multiple documents. They use sentences and paragraphs rather than fixed or semi-fixed fields to convey and record information.

Contracts are an example of unstructured documents. While there are similarities across types of contracts (an Oil and Gas Lease Agreement for example), how and where that information is presented varies from one contract to another. There may be different clauses in the contract in different locations from one to another or removed altogether. One contract may be one page, one may be three pages, one may be twenty, but all are the same kind of contract.

During separation, a new folder will be created when a page matches the first page of trained example documents. The document will extend to include all pages meeting the minimum confidence similarity for the middle and last pages of the trained examples of the Document Type.

The Unstructured Pagination type creates Form Type objects and their child Page Type objects very differently from the other Pagination types. Instead of a Form Type for each trained example for each page length, each with their own "Page 1", "Page 2", etc. Page Type objects, a single Form Type with three Page Types is created: "First", "Middle", and "Last". For unstructured Document Types, what usually makes them unique is found on the first and last pages. The middle pages tend to blur from document to document. For example, Page 2 of one document might look like Page 7 of another, but their first and last pages will look more or less the same. Using this feature of unstructured documents, we have found placing all example pages into three categories, "First", "Middle", and "Last", of a single Form Type improves both classification results and processing efficiency.

  • Note: This functionality is changed somewhat by the Training Scope property. Visit the Training Scope section of this article for more information.

Here, we have an "NDA" Document Type with the Pagination property set to Structured.

For this example, the "NDA" Document Type is configured to classify "Non-Disclosure Agreements". These are contracts wherein one party agrees not to disclose privileged information to the public. For example, an employee agreeing to not disclose proprietary business information when accepting a job from an employer.

We have three trained examples here, resulting in three Form Type objects. One is a four page contract, resulting in a Form Type with four Page Type objects. One is an eight page contract, resulting in a Form Type with eight Page Types. One is five pages, with a Form Type with five Page Types.

If these examples were trained with the Document Type's Pagination property set to Unstructured, the Form Types and Page Types will be created quite differently.

Esp-auto-separation-about-23.png

Here, the "NDA" Document Type is configured to use the Unstructured pagination type.

Notice, the Document Type only has a single Form Type created for all three trained examples. Furthermore, rather than individual Page Type objects for each trained page, we only have three: "First", "Middle", and "Last".

Esp-auto-separation-about-24.png

The "First" Page Type contains all the training data for the three first pages of each trained document.

Esp-auto-separation-about-25.png

The "Middle" Page Type contains all the training data for all middle pages for each trained document (any page not the first or last page of the document).

Esp-auto-separation-about-26.png

The "Last" Page Type contains all the training data for the three last pages of each trained document.

Esp-auto-separation-about-27.png
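
The way trained pages are pooled for an Unstructured Document Type can be illustrated with a small sketch. It only models the grouping described above and says nothing about how Grooper stores the training data internally. The function name and the page labels such as `nda1-p1` are hypothetical, and treating a one-page example as a "First" page only is an assumption made here.

```python
from typing import Dict, List

def bucket_training_pages(examples: List[List[str]]) -> Dict[str, List[str]]:
    """Pool the pages of every trained example into First/Middle/Last."""
    buckets: Dict[str, List[str]] = {"First": [], "Middle": [], "Last": []}
    for pages in examples:
        buckets["First"].append(pages[0])
        if len(pages) > 1:                       # assumption: 1-page docs are "First" only
            buckets["Last"].append(pages[-1])
            buckets["Middle"].extend(pages[1:-1])
    return buckets

# Three NDA examples of four, eight, and five pages, as in the text above.
examples = [[f"nda{n}-p{p}" for p in range(1, length + 1)]
            for n, length in enumerate([4, 8, 5], start=1)]
buckets = bucket_training_pages(examples)
print({name: len(pages) for name, pages in buckets.items()})
# {'First': 3, 'Middle': 11, 'Last': 3}
```

With Structured training the same three examples would produce 4 + 8 + 5 = 17 separate Page Types across three Form Types; here they collapse into three Page Types on one Form Type.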

For most of the Pagination options, you can switch freely between them. You can switch from Structured to Extended for example and the ESP Auto Separation provider will folder the documents according to the new pagination type without any further configuration.

This is not necessarily the case when switching to and from the Unstructured Pagination type.

Because the Unstructured option creates Form Type and Page Type objects in such a different way from the other paginations, you must purge these objects and their training data and recreate them.

  1. Here, we changed the Pagination property from Structured to Unstructured.
  2. However, notice the Form Types and Page Types remain the same. This can cause problems for ESP Auto Separation, inappropriately separating the documents of this Document Type.

We want to see the single Form Type with the "First", "Middle", and "Last" Page Types.

Esp-auto-separation-about-28.png

First things first, you will want to purge this Document Type's training.

  1. Select the Document Type whose data you wish to purge.
    • Caution! Be sure you're selecting a Document Type and NOT the Content Model. If you purge at the Content Model's level, you will purge the training data (and Form Types and Page Types) for all Document Types in the Content Model.
  2. Press the "Purge Training" button.

Esp-auto-separation-about-29.png

Upon successfully purging the training data, you will see two things.

  1. A notification that the purge was successful.
  2. The Document Type will no longer have any Form Types (since this is where training data "lives").

At this point, you will need to re-train this Document Type by either:

  1. Manually training documents using the "Classification Testing" tab of the Content Model
  2. Using the "Rebuild Training" button to train from the documents in the "Testing Batch" of the Content Model.

Esp-auto-separation-about-30.png

Prioritize EPI

The Unstructured pagination type has a unique configurable property to override the EPI Extractor's normal functionality when using extracted page numbers to assist document separation. Typically, the EPI Extractor simply helps extend the document, appending unclassified pages to the created folder as long as they follow a logical page number sequence. However, sometimes unstructured documents will have their second (or third, etc.) page incorrectly classified as the first page. This results in over-separation, breaking up a single document into two (or more) folders.

The Prioritize EPI property will prioritize extracted page numbers over classification results when set to True. That way, if the first page of the document classifies as a "First" page and creates a folder, but a secondary page also classifies as "First", the secondary page will be applied to the document folder, as long as the extracted page number sequence indicates a single document.
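
The effect of this property can be sketched by comparing separation with and without it. The model below is a deliberate simplification and not Grooper's algorithm: each page is a hypothetical `(label, number)` tuple holding its page-level classification and its extracted page number, and "follows the sequence" is assumed to mean the number is exactly one more than the previous page's number.

```python
from typing import List, Optional, Tuple

# (page-level label, extracted page number); either may be None.
Page = Tuple[Optional[str], Optional[int]]

def separate(pages: List[Page], prioritize_epi: bool) -> List[List[int]]:
    """Folder pages, optionally letting page numbers overrule "First" matches."""
    folders: List[List[int]] = []
    previous_number: Optional[int] = None
    for index, (label, number) in enumerate(pages):
        continues = (number is not None and previous_number is not None
                     and number == previous_number + 1)
        if label == "First" and not (prioritize_epi and continues):
            folders.append([index])          # start a new document
        elif folders:
            folders[-1].append(index)        # extend the open document
        previous_number = number
    return folders

# Page 3 of this contract was misclassified as a "First" page.
batch = [("First", 1), ("Middle", 2), ("First", 3), ("Last", 4), ("First", 1)]
print(separate(batch, prioritize_epi=False))   # [[0, 1], [2, 3], [4]]
print(separate(batch, prioritize_epi=True))    # [[0, 1, 2, 3], [4]]
```

The last page restarts its numbering at 1, so it still opens a new folder in both cases.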

For more information on the EPI Extractor, visit the EPI Extractor section of this article.

Secondary Page Extractor

The Unstructured Pagination Type is also unique in that it has a Secondary Page Extractor option. If the extractor produces a result on a page, it will be classified as a "Middle" Page Type of the Document Type.

  • False-positive classification frequently happens on pages besides the first page of a document. In other words, middle pages and last pages, or secondary pages.
  • This property allows you to use an extractor to configure rules when to attach a secondary page to a particular Document Type whose first page has already been identified during ESP Auto Separation.
    • This will override classification results on any page following a page confidently matching a "First" Page Type of an Unstructured Document Type. Even if the page classifies as a "First" page, it will be assigned the "Middle" page level classification instead.
  • You can use this on multi-page Document Types using Unstructured Pagination as a "rules-based" approach to page level classification for unstructured documents. To take advantage of this, configure a Positive Extractor in the Classification section to classify pages as "First" Page Types, and configure the Secondary Page Extractor to identify the "Middle" (or secondary) Page Types. Cheating’s allowed in Grooper.
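
The rules-based approach in the last bullet can be pictured with two regular expressions standing in for the Positive Extractor and the Secondary Page Extractor. This is only a sketch of the idea. The patterns, the sample page text, and the `label_pages` function are all hypothetical, and the choice to close the open document when a page matches neither rule is an assumption of this example rather than documented behavior.

```python
import re
from typing import List, Optional

# Hypothetical rules standing in for the two extractors.
FIRST_PAGE = re.compile(r"NON-DISCLOSURE AGREEMENT")
SECONDARY_PAGE = re.compile(r"Agreement continued, page \d+")

def label_pages(texts: List[str]) -> List[Optional[str]]:
    """Label each page "First", "Middle", or None using the two rules."""
    labels: List[Optional[str]] = []
    in_document = False
    for text in texts:
        if in_document and SECONDARY_PAGE.search(text):
            labels.append("Middle")      # overrides even a "First" match
        elif FIRST_PAGE.search(text):
            labels.append("First")
            in_document = True
        else:
            labels.append(None)
            in_document = False          # assumption: an unmatched page closes the document
    return labels

texts = [
    "NON-DISCLOSURE AGREEMENT between Party A and Party B",
    "as defined in this NON-DISCLOSURE AGREEMENT ... Agreement continued, page 2",
    "Signatures ... Agreement continued, page 3",
    "INVOICE 1001",
]
print(label_pages(texts))
# ['First', 'Middle', 'Middle', None]
```

The second page mentions the title and would match the first-page rule on its own, but the secondary rule is checked first once a document is open, so it is labeled "Middle".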

Training Scope

Adjusting the Training Scope can improve the accuracy and performance of ESP Auto Separation by focusing on what is important when it comes time to separate and classify Unstructured paginated documents. For example, the Normal mode will create a single Form Type and divide trained examples into "First", "Middle" and "Last" pages. From individual document to individual document, often the most meaningful features composing them are found on the first and last pages, and there can be more variance on the pages in between. This is different from the previous approach, which created individual Form Types for each trained example, each with their own "Page X of X" Page Type objects. The Normal mode unifies all trained examples into a single Form Type, making the training and classification of these documents ultimately simpler and more efficient. The FirstLast mode assumes meaningful features for classification are only found on the first and last pages, with the middle pages containing no information needed to make a separation or classification decision. With this mode enabled, only trained examples of the first and last page and their associated features will be saved. This can improve processing time by removing all the features in the middle pages from consideration. The FirstOnly mode narrows this scope even further by only storing features from the first page of trained documents.

Writer's Note: While these properties are available on Structured, Fixed, and Extended Pagination types, the Training Scope options are only pertinent to the Unstructured Pagination type. Other Pagination types will simply train features according to the "Normal" option. Even if you choose a different Training Scope option, you may see their weightings are no different from the "Normal" weightings.

Normal - This is the classic version of capturing training features for a document type.

  • This mode differs slightly for Unstructured Pagination. Instead of trained middle pages (i.e. Page 2 of 4 and Page 3 of 4) created as individual Page Type objects of their multipage Form Type parents, they are created as "Middle" Page Type objects of a single Form Type. Furthermore, last pages (i.e. 4 of 4 or 7 of 7) are combined into a single "Last" Page Type object.
  • This cleans up the Node Tree by creating a single unstructured Form Type for documents of various page lengths, instead of many. This will speed up runtime classification and can potentially yield more accurate results.

FirstLast - This is handy for training only the first and last page of a document type. It lowers the feature training requirements, improves speed, and allows middle pages between the first and last page to be combined with the first and last.

  • Note: This approach assumes the middle pages will not be classified as confident first pages of a Document Type. For ESP Auto Separation. When a page is classified as the first page of a document, it a high priority indicator as the start of a new document.

FirstOnly - If someone has used "Document Titles" extraction with a Positive Extractor for separation in the past, consider this property an upgrade to that approach.

  • A common approach to document classification is to write a positive rule extractor locating the document's title on the first page of the document. However, this approach can break down when an image is poor quality, causing the Document Type's title extractor to miss the title.
  • The FirstOnly mode can allow the continued capture of features in titles combined with other trained features on the page. That way, if the title is missed, the separation engine can fall back on other features from trained first pages of the Document Type.
  • This approach is very similar in concept to Extended Pagination, where the first page is used as the separation point and all pages after it are included in the document folder until another confident first page is found. It differs in that only the first page is trained, eliminating the weighting data from the secondary pages. This can greatly improve runtime classification speed.
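
The three Training Scope modes above can be pictured as a filter over the pages of a trained example. The following is only an illustrative sketch, not Grooper's implementation: the function name `trained_page_labels` is hypothetical, and treating a single-page example as a "First" page is an assumption made here for simplicity.

```python
from typing import List, Optional

# Which page positions keep their trained features under each Training Scope,
# as described in the text above.
SCOPE_KEEPS = {
    "Normal": {"First", "Middle", "Last"},
    "FirstLast": {"First", "Last"},
    "FirstOnly": {"First"},
}


def trained_page_labels(page_count: int, scope: str) -> List[Optional[str]]:
    """Hypothetical helper: for each page of a trained example, return the
    Page Type label it would train ("First", "Middle" or "Last"), or None
    when the chosen scope does not store that page's features."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    keep = SCOPE_KEEPS[scope]
    labels: List[Optional[str]] = []
    for index in range(page_count):
        if index == 0:
            position = "First"  # assumption: a one-page example counts as "First"
        elif index == page_count - 1:
            position = "Last"
        else:
            position = "Middle"
        labels.append(position if position in keep else None)
    return labels
```

For a four-page trained example, "Normal" keeps all four pages (one "First", two "Middle", one "Last"), "FirstLast" drops the two middle pages, and "FirstOnly" keeps only the first page.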


EPI Extractor

Often, you know pages are part of a single document because of the page numbers printed on them. If you see a series of pages and find "Page 1 of 3" on the first, "Page 2 of 3" on the second, and "Page 3 of 3" on the third, that's a good indication all three pages are part of a single document.

The EPI Extractor functionality of the ESP Auto Separation provider follows that same logic. It uses a Data Type extractor configured to return page number information on each page. Even if pages do not meet classification requirements to be foldered into a created document folder, as long as their page numbers follow the sequence established by pages in the created folder before them, they will be added to the document folder.

For example, the "Blue" document here had some page level classification problems. Page 1 correctly classified as a "Blue" Page 1, but pages 2 through 4 didn't classify confidently as anything.


However, we would be able to extract the page numbers at the bottom of each page ("pg 1 of 4", "pg 2 of 4" and so on). Furthermore, the unclassified secondary pages follow the numerical sequence started by the first "Blue" page.

Esp-auto-separation-about-31.png

Because the page numbering sequence keeps going, when using an EPI Extractor, ESP Auto Separation would correctly separate the pages.

The EPI Extractor will stop appending pages to the folder once one of two things happens.

  1. A page confidently matches the first page of a Document Type
    • At which point a new document folder is created, as seen here.
  2. The EPI Extractor fails to return a page number that continues the sequence established by the pages above it.
    • Note: There is some special functionality of the EPI Extractor to account for bad OCR or page numbers missing on certain pages. However, it can cause the page numbering sequence logic to appear off in some cases (and can produce some false positive separation points). This functionality is baked into the EPI Separation provider. For a more in-depth review of this functionality, visit the EPI Separation article.

Esp-auto-separation-about-32.png
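
The two stop conditions above can be sketched as a simple loop. This is a simplified illustration only: the `Page` fields and the `folder_by_page_numbers` function are hypothetical names, the sketch looks only at first-page classification and page numbers, and it deliberately leaves out the special handling for bad OCR or missing page numbers mentioned in the note. Pages that break the sequence are simply reported as loose here; what Grooper does with them is not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Page:
    # Document Type name if the page confidently classified as a first page, else None.
    first_page_of: Optional[str]
    # Value of the "PageNo" group returned by the EPI Extractor, else None.
    page_no: Optional[int]


def folder_by_page_numbers(pages: List[Page]) -> Tuple[List[Tuple[str, List[int]]], List[int]]:
    """Return (folders, loose), where each folder is (Document Type, page indexes)."""
    folders: List[Tuple[str, List[int]]] = []
    loose: List[int] = []
    expected: Optional[int] = None  # next page number that continues the sequence
    for index, page in enumerate(pages):
        if page.first_page_of is not None:
            # Stop condition 1: a confident first page starts a new folder.
            folders.append((page.first_page_of, [index]))
            expected = page.page_no + 1 if page.page_no is not None else None
        elif expected is not None and page.page_no == expected:
            # The numbering continues, so the page joins the current folder.
            folders[-1][1].append(index)
            expected += 1
        else:
            # Stop condition 2: the sequence is broken.
            loose.append(index)
            expected = None
    return folders, loose
```

Applied to the "Blue" example above (a confident "Blue" first page numbered 1, followed by three unclassified pages numbered 2, 3 and 4), the sketch places all four pages in one "Blue" folder and starts a new folder at the next confident first page.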


The EPI Extractor is configured on the Content Model object rather than the Document Type objects.

  1. To use this functionality, select the Content Model you're configuring.
  2. Use the EPI Extractor property to assign the extractor.
    • This can be set to Internal (for a simple regex pattern) or External (to reference a Data Type built elsewhere in the Node Tree).
    • This property only exists for the benefit of the ESP Auto Separation provider. It will have no impact on separation, classification or anything else in Grooper if you are not using ESP Auto Separation.


The extractor must define a group named "PageNo" in the Value Pattern's regular expression in order to properly return page numbers.

For example, if the document contains page numbers such as "Page 1 of 4", the following pattern would generate the page number group:

  • Page (?<PageNo>\d+) of \d+

Optionally, you may define a "PageCount" group to assist in separation. For the same "Page 1 of 4" numbering, the following pattern would generate both the page number and the page count groups:

  • Page (?<PageNo>\d+) of (?<PageCount>\d+)

Esp-auto-separation-about-33.png
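
To see what the named groups return, the pattern can be tried outside of Grooper. The sketch below uses Python's `re` module purely for illustration. Note that Python spells named groups as `(?P<name>...)`, whereas the patterns shown above use the `(?<name>...)` form; the helper name `read_page_number` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Same pattern as above, rewritten in Python's named-group syntax.
PAGE_PATTERN = re.compile(r"Page (?P<PageNo>\d+) of (?P<PageCount>\d+)")


def read_page_number(text: str) -> Optional[Tuple[int, int]]:
    """Return (PageNo, PageCount) from the first match in the text, or None."""
    match = PAGE_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group("PageNo")), int(match.group("PageCount"))
```

A page whose footer reads "Page 2 of 4" yields the pair (2, 4), while a page with no matching text yields nothing.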

    Attachment Rules

    Attachment Rules allow the ESP Auto Separation provider to optionally merge a whole separated and classified document folder of one Document Type with a separated and classified document folder of another Document Type. The attachment Document Type can be appended or prepended to an assigned "host" Document Type when it comes after or before that document in the Batch.

    For example, contracts often have "Exhibits" after them that should be attached to the contract. If you have an "Exhibit" Document Type coming after a "Contract" Document Type in a Batch, you can set up Attachment Rules to append the exhibit to the contract, resulting in a single document.

    Say we want the "Green" Document Type to attach to the "Blue" Document Type.

    If present behind a "Blue" document folder, the pages inside the "Green" document folder should just be placed inside the "Blue" folder.

    Esp-auto-separation-about-34.png

    Without Attachment Rules, the Batch would be separated and classified just as it normally would be using ESP Auto Separation.

    Esp-auto-separation-about-35.png

    Using the Attachment Rules property of the ESP Auto Separation provider, we would do three things:

    1. Assign the "Attachment" Document Type
      • This is the Document Type we want to append or prepend to another Document Type.
      • The "Green" Document Type in this case.
    2. Assign the "Host" Document Type
      • This is the Document Type we want the attachment Document Type to merge with.
      • The "Blue" Document Type in this case.
    3. Assign the "Direction" we expect to see in the Batch.
      • We expect the attachment Document Type to be either before the host Document Type or after it (or in either position).
      • Here, the "Green" documents come after the "Blue" documents.

    Esp-auto-separation-about-36.png

    With these settings configured, we see the Batch separates a little differently.


    The first "Green" document is now merged with the first "Blue" document. The page that was inside the "Green" folder is now appended to the "Blue" folder above it.


    The second "Green" document comes after an "Orange" document, instead of a "Blue" one. Since the "Orange" document is not a Host Document Type, the folder is left alone. In this case, it would separate and classify as its own document folder in the Batch.

    Esp-auto-separation-about-37.png


    You configure this functionality using the Attachment Rules property of the ESP Auto Separation provider.

    Select this property and press the ellipsis button at the end to bring up the "Attachment Rule Collection Editor".

    Esp-auto-separation-about-38.png

    This editor allows you to add one or more attachment rules, specifying:

    1. The Attachment Document Type, using the Attachment Type property.
    2. The Host Document Type(s), using the Host Content Type property.
    3. Whether the attachment document comes before or after the host document, using the Direction property.
    Esp-auto-separation-about-39.png
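
The merging behavior of an attachment rule can be sketched as a pass over already separated folders. This is a minimal illustration under stated assumptions, not Grooper's implementation: the `Folder` and `AttachmentRule` classes, the `apply_attachment_rule` function, and the direction values "After", "Before" and "Either" are all hypothetical stand-ins for the Attachment Type, Host Content Type and Direction properties.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Folder:
    doc_type: str
    pages: List[str]


@dataclass
class AttachmentRule:
    attachment_type: str   # stands in for the Attachment Type property
    host_types: List[str]  # stands in for the Host Content Type property
    direction: str         # hypothetical values: "After", "Before" or "Either"


def apply_attachment_rule(folders: List[Folder], rule: AttachmentRule) -> List[Folder]:
    """Merge attachment folders into a neighboring host folder per the rule."""
    merged: List[Folder] = []
    for folder in folders:
        previous = merged[-1] if merged else None
        if (previous is not None
                and folder.doc_type == rule.attachment_type
                and previous.doc_type in rule.host_types
                and rule.direction in ("After", "Either")):
            # Attachment follows its host: append its pages to the host folder.
            previous.pages.extend(folder.pages)
        elif (previous is not None
                and folder.doc_type in rule.host_types
                and previous.doc_type == rule.attachment_type
                and rule.direction in ("Before", "Either")):
            # Attachment precedes its host: prepend its pages to the host folder.
            merged[-1] = Folder(folder.doc_type, previous.pages + folder.pages)
        else:
            # No rule applies: the folder stays as its own document.
            merged.append(Folder(folder.doc_type, list(folder.pages)))
    return merged
```

With a rule attaching "Green" after "Blue", a Batch separated as Blue, Green, Orange, Green comes out as Blue (now containing the first Green's pages), Orange, Green, matching the example above: the second "Green" folder follows an "Orange" document, which is not a host, so it is left alone.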