Separation and Separation Review
Separation uses various techniques to group pages into classified documents.
- 1 About Separation and Separation Review
- 2 Use Cases
- 3 How To
- 3.1 Separation
- 3.2 Separation Review
- 4 Property Details
- 5 Version Differences
About Separation and Separation Review
Grooper uses various approaches and algorithms to determine the classification of a page or folder. The various settings on a Content Model and Document Type add to the complexity of separating pages into documents. ESP Auto Separation reduces, but does not eliminate, much of the manual work of separating and classifying documents. Separation Review is a new review module designed to make the remaining manual work quick and easy.
Adjusting the Training Scope improves the accuracy and performance of ESP Auto Separation by focusing on what is important when separating and classifying Unstructured paginated documents.
For example, the Normal mode creates a single Form Type and divides trained examples into "First", "Middle" and "Last" pages. From one document to the next, the most meaningful features are often found on the first and last pages, with more variance on the pages in between. This differs from the previous approach, which created an individual Form Type for each trained example, each with its own "Page X of X" Page Type objects. Unifying all trained examples into a single Form Type makes training and classifying these documents simpler and more efficient.
The FirstLast mode assumes meaningful features for classification are found only on the first and last pages, with the middle pages containing no information needed to make a separation or classification decision. With this mode enabled, only trained examples of the first and last pages and their associated features are saved. This can improve processing time by removing the features of the middle pages from consideration.
The FirstOnly mode narrows this scope even further by storing features only from the first page of trained documents.
Writer's Note: While these properties are available on Structured, Fixed, and Extended Pagination types, these Training Scopes are almost entirely pertinent to the Unstructured Pagination type. If you use these Training Scopes on anything besides an Unstructured Document Type, you may find they are always trained according to the "Normal" scope.
Normal - This is the classic version of capturing training features for a document type.
- This mode differs slightly for Unstructured Pagination. Instead of trained middle pages (i.e. Page 2 of 4 and Page 3 of 4) being created as individual Page Type objects of their multipage Form Type parents, they are created as "Middle" Page Type objects of a single Form Type. Furthermore, last pages (i.e. 4 of 4 or 7 of 7) are combined into a single "Last" Page Type object.
- This cleans up the Node Tree by creating a single unstructured Form Type for documents of various page lengths, instead of many. This will speed up runtime classification and can potentially yield more accurate results.
FirstLast - This is handy for training only the first and last page of a document type. It lowers the feature training requirements, improves speed, and allows middle pages between the first and last page to be combined with the first and last.
- Note: This approach assumes the middle pages will not be classified as confident first pages of a Document Type. In ESP Auto Separation, when a page is classified as the first page of a document, it is a high-priority indicator of the start of a new document.
FirstOnly - If someone has used "Document Titles" extraction with a Positive Extractor for separation in the past, consider this property an upgrade to that approach.
- A common approach to document classification is to write a positive rule extractor locating the document's title on the first page of the document. However, this approach can break down when an image is poor quality, causing the Document Type's title extractor to miss the title.
- The FirstOnly mode can allow the continued capture of features in titles combined with other trained features on the page. That way, if the title is missed, the separation engine can fall back on other features from trained first pages of the Document Type.
- This approach is highly similar to the Extended Pagination in concept, where the first page is used as the separation point and all pages after are included in the document folder until another confident first page is found. This approach differs in that only the first page is trained, eliminating the weighting data from the secondary pages. This can improve runtime classification speed greatly.
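The three Training Scope modes described above can be pictured as a filter over the pages of a trained example. The following is only a minimal sketch of that idea, not Grooper's implementation: the `TrainingScope` enum, the `trained_page_labels` function, and the choice to label a single-page example as "First" are all assumptions made for illustration.

```python
from enum import Enum
from typing import List, Optional


class TrainingScope(Enum):
    """Hypothetical stand-in for the Training Scope property values."""
    NORMAL = "Normal"
    FIRST_LAST = "FirstLast"
    FIRST_ONLY = "FirstOnly"


def trained_page_labels(page_count: int, scope: TrainingScope) -> List[Optional[str]]:
    """Return one label per page of a trained example.

    A label of None means the features of that page are not stored
    under the given scope. A single-page example is labeled "First"
    (an assumption; the article does not cover this case).
    """
    if page_count < 1:
        raise ValueError("page_count must be at least 1")

    # Label every page as First, Middle, or Last.
    labels = []
    for index in range(page_count):
        if index == 0:
            labels.append("First")
        elif index == page_count - 1:
            labels.append("Last")
        else:
            labels.append("Middle")

    # Decide which labels the scope keeps.
    if scope is TrainingScope.FIRST_LAST:
        keep = {"First", "Last"}
    elif scope is TrainingScope.FIRST_ONLY:
        keep = {"First"}
    else:
        keep = {"First", "Middle", "Last"}

    return [label if label in keep else None for label in labels]
```

For a four-page trained example, the Normal scope would keep all four pages (one "First", two "Middle", one "Last"), FirstLast would keep two, and FirstOnly would keep one.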
Repeating Last Page
- Contracts whose signature pages are copied, distributed to the involved parties, signed, and then returned to be stored with the Contract document should use this feature.
- Any document type that may or will have a duplicate last page.
Secondary Page Extractor
- False-positive classification frequently happens on pages besides the first page of a document. In other words, middle pages and last pages, or secondary pages.
- This property allows you to use an extractor to configure rules for when to attach a secondary page to a particular Document Type whose first page has already been identified during ESP Auto Separation.
- Use on multi-page Document Types using Unstructured Pagination. To take advantage of this, configure a Positive Extractor in the Classification section and configure the Secondary Page Extractor to identify the second page. Cheating’s allowed in Grooper.
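To make the secondary page idea concrete, here is a rough sketch of first-page-driven separation with a secondary page rule. It is not how Grooper implements ESP Auto Separation. The `Page` tuple, the `separate` function, and the "Page N of M" footer pattern are hypothetical, chosen only to show how a rule match can keep a false-positive first page attached to the document that is already open.

```python
import re
from typing import List, NamedTuple, Optional


class Page(NamedTuple):
    text: str
    # Document Type this page was classified as the first page of, if any.
    first_page_of: Optional[str]


# Hypothetical secondary page rule: a "Page N of M" footer with N >= 2.
SECONDARY_PAGE = re.compile(r"Page\s+(?:[2-9]|[1-9]\d+)\s+of\s+\d+", re.IGNORECASE)


def separate(pages: List[Page]) -> List[dict]:
    """Group pages into documents, starting a new one at each first page.

    A page that matches the secondary page rule is attached to the open
    document even if classification flagged it as a first page.
    """
    documents: List[dict] = []
    for page in pages:
        is_secondary = bool(SECONDARY_PAGE.search(page.text))
        starts_new = page.first_page_of is not None and not (is_secondary and documents)
        if starts_new:
            documents.append({"type": page.first_page_of, "pages": [page]})
        elif documents:
            documents[-1]["pages"].append(page)
        else:
            # Pages before any first page form an unclassified group.
            documents.append({"type": None, "pages": [page]})
    return documents


pages = [
    Page("Invoice Page 1 of 3", "Invoice"),
    Page("Totals Page 2 of 3", "Invoice"),  # false-positive first page
    Page("Terms Page 3 of 3", None),
    Page("Receipt", "Receipt"),
]
documents = separate(pages)
```

In this example the second page looks like a first page to the classifier, but the secondary page rule keeps it with the three-page Invoice instead of splitting the document in two.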
Separation will vary wildly between document types. Here are some real-world configurations using the new separation options. Some Classification Features are configured in the images. It’s common for Classification and Separation to be configured simultaneously. The explanations for each image consider classification and separation for runtime operations.
Use the Separation Review activity (in a Batch Process) any time a user needs to validate that Separation and Classification are 100% accurate. The type of documents being processed (complexity, OCR result variances, etc.) can determine whether a user will need Separation Review.
Following is a (by no means exhaustive) list of industries whose documents have been shown to frequently require Separation Review. These documents are considered unstructured document types because of the complex and extremely variable language on their pages.
- Oil & Gas
Whether the following documents need Separation Review depends on how high a priority accuracy is. These documents are considered structured document types because the pages contain predictable locations of repeated fields, tables, and sections used in both simple and complex classification techniques.
- Accounts Receivable
- Accounts Payable
- Tax Documents
- HR Documents
Configuring Separation relies on understanding the different Separation Provider types and how to configure them.
Change in Value Separation
Please refer to this article on Change in Value Separation for information on this specific provider.
Control Sheet Separation
Please refer to this article on Control Sheet Separation for information on this specific provider.
EPI Separation
Please refer to this article on EPI Separation for information on this specific provider.
ESP Auto Separation
Please refer to this article on ESP Auto Separation for information on this specific provider.
Event-Based Separation
Please refer to this article on Event-Based Separation for information on this specific provider.
Multi Separator
Please refer to this article on Multi Separator for information on this specific provider.
Pattern-Based Separation
Please refer to this article on Pattern-Based Separation for information on this specific provider.
Undo Separation
Please refer to this article on Undo Separation for information on this specific provider.
Configuring Separation Review
Separation Review is a configuration of the Review activity of a Batch Process.
1. First, create a Batch Process.
Attachment Rules – Configure how certain document types should be attached to a regular document type.
Respect Original Page Numbers – If a PDF or multi-page TIF file is used to create individual pages, Grooper will keep track of each page’s original number and use that value to prevent the last page of one document from being attached to the first page of the next document.
Bind Only – If this is set to True, all pages remain individual pages when Separation Review begins. If it is set to False, Separation attempts to classify and separate pages into classified folders before Separation Review begins. As of this writing, setting this property to False seems to make Separation Review easier to complete.
Goto First Invalid Item – Separation Review will auto-select the first item that couldn’t be assigned a document type because that item’s confidence didn’t meet or exceed the threshold for confident classification set on the Content Model.
Flag Messages – Optional property for predefining acceptable messages that can be assigned to items in Separation Review.
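The Goto First Invalid Item behavior described above amounts to finding the first item whose confidence falls below the Content Model threshold. The sketch below illustrates that lookup only; the `first_invalid_index` function, the (name, confidence) pairs, and the 0.0 to 1.0 confidence scale are assumptions, not Grooper's data model.

```python
from typing import List, Optional, Tuple


def first_invalid_index(items: List[Tuple[str, float]], threshold: float) -> Optional[int]:
    """Return the index of the first item whose confidence is below the threshold.

    An item is considered valid when its confidence meets or exceeds
    the threshold. Returns None if every item is valid.
    """
    for index, (_name, confidence) in enumerate(items):
        if confidence < threshold:
            return index
    return None


# Hypothetical review list: (item name, classification confidence).
review_items = [("Page 1", 0.95), ("Page 2", 0.80), ("Page 3", 0.40), ("Page 4", 0.30)]
selected = first_invalid_index(review_items, threshold=0.80)
```

Here "Page 2" meets the threshold exactly and so counts as valid, which leaves "Page 3" as the first item a reviewer would land on.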
The documents are highlighted with two alternating colors, making it easy to see how separation will occur without needing to check the name of the Document Type.
If an item has red text, this means the item has a classification flag. In this image, hovering over Page 9 would show a tooltip with the reason the item is flagged.
This is the Candidates list view showing a list of candidates and their respective confidences based on the current selection.
Appending and Prepending to Classified Groups of Documents
Multi-selecting Pages and Classifying Them
The keyboard should be the only control needed to work through Separation Review. When one or more pages are selected, start typing the name of the Document Type. This will filter the Candidates list view. In this example, typing “paid” reduces the candidate list to five Document Types because their names contain the word “paid”.
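The type-to-filter behavior can be approximated with a case-insensitive substring match. This is an illustrative sketch only: the article does not say whether Grooper matches substrings or whole words, and the `filter_candidates` function and the Document Type names below are hypothetical.

```python
from typing import List


def filter_candidates(candidates: List[str], typed: str) -> List[str]:
    """Keep the candidates whose names contain the typed text, ignoring case.

    Substring matching is an assumption made for this sketch.
    """
    needle = typed.strip().lower()
    if not needle:
        # Nothing typed yet: show the full list.
        return list(candidates)
    return [name for name in candidates if needle in name.lower()]


# Hypothetical Document Type names.
document_types = ["Paid Invoice", "Unpaid Invoice", "Lease", "Paid-Up Lease"]
matches = filter_candidates(document_types, "paid")
```

With substring matching, "Unpaid Invoice" stays in the list because its name contains "paid"; a whole-word match would exclude it.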
Completing Separation Review
When all pages are classified and the Separate button is clicked, the list will change and show all pages separated into classified documents. Notice the Folder icons and Folder names.
Clicking the button next to the Separate button will undo the separation and revert folders back to pages.
All pages are separated into folders.
Clicking the (blue arched arrow) Undo button will undo all manual classification performed during Separation Review and revert the batch of pages to the state it was in when Separation Review started. Note that deleted pages will not be recovered; deletion is permanent.
Training Scope
This makes it possible to alter the separation training and logic if a Document Type has special attributes, pages, or layout that would normally confuse the ESP Separation logic.
Normal - This is the classic version of capturing training features for a document type.
FirstLast - Only trains the first and last page of a document type. See Use Cases for more information.
FirstOnly - Only trains the first page of a document type. See Use Cases for more information.
Repeating Last Page
Some Document Types have duplicate last pages. Normally, Separation would see a last page, complete the separation for the document, and leave any duplicate last pages as loose pages. Enabling this property reduces operator work in Separation Review by reducing the number of page-appending commands needed.
- This property is only available when a Document Type’s Pagination is set to Structured.
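The effect of Repeating Last Page can be sketched as a grouping rule over page labels. The code below is a simplified illustration, not Grooper's separation logic: the `group_pages` function and the "First", "Middle", and "Last" string labels are assumptions used to show duplicate last pages either joining the closed document or being left loose.

```python
from typing import List, Optional, Tuple


def group_pages(labels: List[str], repeating_last_page: bool) -> Tuple[List[List[int]], List[int]]:
    """Group page indexes into documents and collect loose pages.

    labels holds "First", "Middle", or "Last" for each page. Once a
    document has seen its last page it is closed; further "Last" pages
    join it only when repeating_last_page is True.
    """
    documents: List[List[int]] = []
    loose: List[int] = []
    current: Optional[List[int]] = None
    closed = False
    for index, label in enumerate(labels):
        if label == "First":
            # A first page always opens a new document.
            current = [index]
            documents.append(current)
            closed = False
        elif current is None:
            loose.append(index)
        elif not closed:
            current.append(index)
            closed = label == "Last"
        elif label == "Last" and repeating_last_page:
            current.append(index)
        else:
            loose.append(index)
    return documents, loose


labels = ["First", "Middle", "Last", "Last", "Last", "First", "Last"]
with_property = group_pages(labels, repeating_last_page=True)
without_property = group_pages(labels, repeating_last_page=False)
```

With the property enabled, the two duplicate last pages stay in the first document. Without it, they end up as loose pages that an operator would have to append by hand in Separation Review.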
Secondary Page Extractor
Any result returned by this extractor indicates that the page is not the first or last page of a document.
- This property is only available when a Document Type’s Pagination is set to Unstructured.
Object Commands from the Context Menu
Separation in Grooper 2.9 adds three new properties to Document Types.
- Training Scope
- Repeating Last Page
- Secondary Page Extractor
The Pagination property on a Document Type determines whether each new property appears. See Property Details for more information.
Grooper 2.8 Classification Review
Grooper 2.8 and previous versions relied on Classification Review with its various controls to separate loose pages, merge selected pages into documents, and correct misclassified folders or classify unclassified folders. This user interface required moderate-to-large effort to complete classification and separation, especially when reverting one or more folders to loose pages in order to recombine those pages into different documents and classify each newly created document.
Grooper 2.9 Separation Review
Starting with Grooper 2.9, Separation Review replaces Classification Review, although Classification Review remains available for legacy support or other use cases. Don’t be concerned about this being a new module. Slap the name “Classification Review 2.0” on Separation Review, because it really is a true upgrade and an efficient redesign of Classification Review. Consider it a godsend!
The design of Separation Review invites the heavy use of the keyboard and keyboard shortcuts to make quick work of any items needing corrective action. Many quality-of-life improvements now exist and are quickly realized if one has previous experience using Classification Review.
Even though this is Separation Review, some classification tasks take place in this module. In the How To section, items in the list are color coded. The list shows the current classification of the items. This doesn’t necessarily mean they are currently separated into documents. The actual separation occurs later. During Separation Review, the various context menu commands help ensure pages are classified correctly and ready for proper separation.