2.90:Separation Review: Difference between revisions

Revision as of 15:28, 25 March 2020

Separation uses various techniques to group pages into classified documents.

About Separation and Separation Review

Grooper uses several approaches and algorithms to determine the classification of a page or folder. The many settings on a Content Model and Document Type add to the complexity of separating pages into documents. ESP Auto Separation reduces, but does not eliminate, the manual work of separating and classifying documents. Separation Review is a new review module designed to make the remaining manual work quick and easy.

Use Cases

Separation

Training Scope

Normal
- This is the classic version of capturing training features for a document type.
FirstLast
- This is handy for training only the first and last page of a document type. It lowers the feature training requirements, improves speed, and lets the pages that fall between a recognized first and last page be combined with them automatically. Minimal effort, big returns.
FirstOnly
- If you have used a Document Titles extractor as a Positive Extractor to make Separation easy in the past, consider this property an upgrade to that approach.
- Relying only on Document Titles breaks down when poor image quality causes the Document Titles extractor to miss the title.
- With FirstOnly, Document Titles can still be used as features, combined with any other features on the trained page. If the document title is missed, the separation engine can fall back on the features contained in the training for the document type.
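
The three Training Scope values can be pictured as a choice of which pages of a trained sample contribute features. The following is a minimal sketch, not Grooper's API: the `TrainingScope` enum and `pages_used_for_training` function are hypothetical names, and it assumes that Normal keeps the features of every page in the sample.

```python
from enum import Enum


class TrainingScope(Enum):
    """Hypothetical stand-in for the Training Scope property values."""
    NORMAL = "Normal"
    FIRST_LAST = "FirstLast"
    FIRST_ONLY = "FirstOnly"


def pages_used_for_training(pages, scope):
    """Return the pages of one trained sample whose features would be kept."""
    if not pages:
        return []
    if scope is TrainingScope.NORMAL:
        # Assumption: the classic behavior trains on every page.
        return list(pages)
    if scope is TrainingScope.FIRST_LAST:
        # A one-page sample has the same first and last page; keep it once.
        return [pages[0]] if len(pages) == 1 else [pages[0], pages[-1]]
    if scope is TrainingScope.FIRST_ONLY:
        return [pages[0]]
    raise ValueError(f"unknown scope: {scope!r}")


sample = ["page 1", "page 2", "page 3", "page 4"]
for scope in TrainingScope:
    print(scope.value, pages_used_for_training(sample, scope))
```

Fewer trained pages means fewer features to store and compare, which is the speed benefit described for FirstLast.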

Repeating Last Page

Use this feature for contracts whose signature pages are copied, distributed to the involved parties, signed, and returned to be stored with the Contract document.
It applies to any document type that may or will have a duplicate last page.
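
As a rough illustration of the idea, the sketch below takes a fixed number of pages as one document and then appends any immediately following copies of the last page. It is not Grooper code: the function name is hypothetical, pages are represented by simple labels, and plain equality stands in for whatever similarity test the separation engine actually applies.

```python
def split_with_repeating_last_page(pages, page_count, repeating_last_page=True):
    """Split off one document of `page_count` pages from the front of `pages`.

    When `repeating_last_page` is true, pages that directly follow the
    document and equal its last page are appended to it.
    Returns (document_pages, remaining_pages).
    """
    document = list(pages[:page_count])
    rest = list(pages[page_count:])
    if repeating_last_page and document:
        last = document[-1]
        # Keep absorbing copies of the last page until a different page appears.
        while rest and rest[0] == last:
            document.append(rest.pop(0))
    return document, rest


batch = ["cover", "terms", "signature", "signature", "signature", "invoice"]
contract, remaining = split_with_repeating_last_page(batch, page_count=3)
print(contract)
print(remaining)
```

With the flag set to False, the two extra signature pages would be left in the remaining pages and would need to be handled manually.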

Secondary Page Extractor

False-positive classifications frequently occur on pages that are neither the first nor the last page of a document type.
Use this on document types that always have exactly two pages. To take advantage of it, configure a Positive Extractor in the Classification section to identify the first page, and configure the Secondary Page Extractor to identify the second page. Cheating’s allowed in Grooper.
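
The two-page setup can be sketched with two regular expressions standing in for the extractors. Everything here is an assumption made for illustration: the patterns, the sample page text, and the `separate` function are hypothetical and are not Grooper's extractor API.

```python
import re

# Hypothetical patterns for an imaginary two-page form.
POSITIVE = re.compile(r"Application for Credit", re.IGNORECASE)
SECONDARY = re.compile(r"Page 2 of 2", re.IGNORECASE)


def separate(page_texts):
    """Group page texts into documents, each a list of page indexes."""
    documents = []
    for index, text in enumerate(page_texts):
        if POSITIVE.search(text):
            # A first-page match starts a new document.
            documents.append([index])
        elif SECONDARY.search(text) and documents:
            # A secondary-page match is appended to the page above it.
            documents[-1].append(index)
        else:
            # Anything unrecognized is left as its own single-page group.
            documents.append([index])
    return documents


pages = [
    "APPLICATION FOR CREDIT  Name: ...",
    "Signature: ...  Page 2 of 2",
    "Application for Credit  Name: ...",
    "page 2 of 2",
    "Unrelated cover letter",
]
print(separate(pages))
```

Because the second page is recognized by its own rule, it no longer has to be matched by trained features, which is where the false positives described above tend to arise.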

Examples

Separation will vary wildly between document types. Here are some real-world configurations using the new separation options. Some Classification Features are configured in the images. It’s common for Classification and Separation to be configured simultaneously. The explanations for each image consider classification and separation for runtime operations.

Example Number One

Positive Extractor
- Rule-Based Classification occurs first and falls back on Features-Based Classification.
Pagination
- This document can have any number of pages.
Prioritize EPI
- Respects EPI if it is present; otherwise relies on EPI training from Features training.
Secondary Page Extractor
- Determines whether the page is neither the first nor the last page of the document. If this extractor has a result, Separation appends this page to the page above it and expects the next page to be another secondary page, a last page, or the start of a new document.

The Content Model has an EPI Extractor configured (not shown here).

Example Number Two

Allow Training
- No lexical features (as set on the Content Model) are considered for classification or separation.
Positive Extractor
- Separation occurs only when the Positive Extractor has a result. Rule-Based Classification only.
Pagination
- Automatically appends a specified number of pages using training.
Repeating Last Page
- With Structured Pagination and this property set to True, repeating last pages are appended.

Separation uses page-for-page trained samples only.
Copies of the last page, such as signed pages from multiple parties, may exist and will be appended.

Example Number Three

Allow Training
- Lexical features (as set on the Content Model) are considered for classification and separation.
Positive Extractor
- Rule-Based Classification occurs first and falls back on Features-Based Classification.
Negative Extractor
- A result from this extractor excludes this Document Type as a classification option.
Pagination
- This document can have any number of pages.
Training Scope
- Only the features on page 1 are saved in training. If Rule-Based Classification fails, only the first page’s features of a trained sample are used. When separation detects page 1 of this document type, all subsequent pages are appended until another recognized document type is identified.

The Content Model has an EPI Extractor configured and creates a hybrid of Rule-Based and Feature-Based Classification (not shown here).

Example Number Four

Allow Training
- Lexical features (as set on the Content Model) are considered for classification and separation.
Positive Extractor
- Rule-Based Classification occurs first and falls back on Features-Based Classification.
Negative Extractor
- A result from this extractor excludes this Document Type as a classification option.
Pagination
- This document can have any number of pages.
Prioritize EPI
- Respects EPI if it is present; otherwise relies on EPI training from Features training.
Secondary Page Extractor
- Determines whether the page is neither the first nor the last page of the document. If this extractor has a result, Separation appends this page to the page above it and expects the next page to be another secondary page, a last page, or the start of a new document.

The Content Model has an EPI Extractor configured and creates a hybrid of Rule-Based and Feature-Based Classification (not shown here).

Example Number Five

Allow Training
- Lexical features (as set on the Content Model) are considered for classification and separation.
Positive Extractor
- Rule-Based Classification occurs first and falls back on Features-Based Classification.
Pagination
- This document can have any number of pages.
Secondary Page Extractor
- Determines whether the page is neither the first nor the last page of the document. If this extractor has a result, Separation appends this page to the page above it and expects the next page to be another secondary page, a last page, or the start of a new document.

The Content Model has an EPI Extractor configured and creates a hybrid of Rule-Based and Feature-Based Classification (not shown here).
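
The examples above share one decision order: a Negative Extractor result excludes a Document Type, a Positive Extractor result classifies by rule, and otherwise classification falls back on trained features where training is allowed. The sketch below illustrates that order only. The dictionary layout, `classify_page`, `overlap_classifier`, and `TRAINED_WORDS` are hypothetical, and the word-overlap score is a toy stand-in for Grooper's features-based classifier.

```python
import re

# Toy "training": words associated with each Document Type (hypothetical data).
TRAINED_WORDS = {
    "Invoice": {"invoice", "total", "due"},
    "Letter": {"dear", "sincerely"},
}


def overlap_classifier(text, candidates):
    """Pick the candidate sharing the most words with the text, if any."""
    words = set(text.lower().split())
    scored = [(len(words & TRAINED_WORDS.get(name, set())), name) for name in candidates]
    best = max(scored, default=(0, None))
    return best[1] if best[0] > 0 else None


def classify_page(text, document_types, feature_classifier):
    """Return the chosen Document Type name, or None if nothing fits."""
    candidates = []
    for doc_type in document_types:
        negative = doc_type.get("negative")
        if negative and re.search(negative, text):
            continue  # Negative Extractor result: exclude this Document Type
        positive = doc_type.get("positive")
        if positive and re.search(positive, text):
            return doc_type["name"]  # rule-based match
        if doc_type.get("allow_training", True):
            candidates.append(doc_type["name"])
    # No rule matched: fall back on the features-based classifier.
    return feature_classifier(text, candidates)


types = [
    {"name": "Invoice", "positive": r"INVOICE #\d+", "negative": r"PACKING SLIP"},
    {"name": "Letter"},
    {"name": "Contract", "positive": r"AGREEMENT", "allow_training": False},
]
print(classify_page("INVOICE #123 total due", types, overlap_classifier))
print(classify_page("dear customer your invoice total is due", types, overlap_classifier))
```

The first call is decided by the rule; the second has no rule match and is decided by the toy feature comparison. The Contract type mirrors Example Number Two: with training disallowed, it can only ever be chosen by its Positive Extractor.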

Version Differences

Separation

Separation in Grooper 2.9 adds three new properties to Document Types.

Training Scope
Repeating Last Page
Secondary Page Extractor

The Pagination property on a Document Type determines which of the new properties appear. See Property Details for more information.

[Tabbed comparison, one tab each for Structured, Unstructured, Fixed, and Extended Document Properties; each tab shows Grooper 2.8 beside Grooper 2.9. Tab contents are not reproduced here.]

Separation Review

Grooper 2.8 Classification Review

Grooper 2.8 and previous versions relied on Classification Review and its various controls to separate loose pages, merge selected pages into documents, correct misclassified folders, and classify unclassified folders. This user interface required a moderate-to-large effort to complete classification and separation, especially when reverting one or more folders to loose pages in order to recombine those pages into separate documents and then classify each newly created document.

Grooper 2.9 Separation Review

Starting with Grooper 2.9, Separation Review supersedes Classification Review, which remains available for legacy support and other use cases. Don’t be concerned about this being a new module. Think of Separation Review as “Classification Review 2.0”: it is a true upgrade and an efficient redesign of Classification Review.

The design of Separation Review invites heavy use of the keyboard and keyboard shortcuts to make quick work of any items needing corrective action. It includes many quality-of-life improvements, which are quickly apparent to anyone with previous experience using Classification Review.

Even though this is Separation Review, some classification tasks take place in this module. In the How-To section, items in the list are color coded. The list shows the current classification of the items, which doesn’t necessarily mean they are currently separated into documents; the actual separation occurs later. The context menu commands in Separation Review help ensure pages are classified correctly and ready for proper separation.

@@ Line 57: / Line 57: @@
 Separation will separate using page-for-page trained samples only.<br/>
 Copies of the last page, such as signed pages from multiple parties, may exist and will be appended.<br/>
+|}
+</tab>
+<tab name="Example Number Three" style="margin:25px">
+{| class="wikitable"
+| style="padding: 10px" | [[File:separation_and_review_11.png]] || style="padding: 10px" |
+'''Allow Training'''<br/>
+Lexical features (as set on the Content Model) are considered for classification and separation.<br/><br/>
+'''Positive Extractor'''<br/>
+Rule-Based Classification occurs first and falls back on Features-Based Classification.<br/><br/>
+'''Negative Extractor'''<br/>
+A result from this extractor will exclude this Document Type as a classification option.<br/><br/>
+'''Pagination'''<br/>
+This document can have any number of pages.<br/><br/>
+'''Training Scope'''<br/>
+Only the features on page 1 are saved in training. If Rule-Based Classification fails, only the first page’s features of a trained sample are used. When separation detects page 1 of this document type, all subsequent pages are appended until another recognized document type is identified.<br/><br/>
+The Content Model has an EPI Extractor configured and creates a hybrid of Rule-Based and Feature-Based Classification. ''not shown here''<br/>
+|}
+</tab>
+<tab name="Example Number Four" style="margin:25px">
+{| class="wikitable"
+| style="padding: 10px" | [[File:separation_and_review_12.png]] || style="padding: 10px" |
+'''Allow Training'''<br/>
+Lexical features (as set on the Content Model) are considered for classification and separation.<br/><br/>
+'''Positive Extractor'''<br/>
+Rule-Based Classification occurs first and falls back on Features-Based Classification.<br/><br/>
+'''Negative Extractor'''<br/>
+A result from this extractor will exclude this Document Type as a classification option.<br/><br/>
+'''Pagination'''<br/>
+This document can have any number of pages.<br/><br/>
+'''Prioritize EPI'''<br/>
+Respects EPI if it is present; otherwise relies on EPI training from Features training.<br/><br/>
+'''Secondary Page Extractor'''<br/>
+Determines whether the page is neither the first nor the last page of the document. If this extractor has a result, Separation appends this page to the page above it and expects the next page to be another secondary page, a last page, or the start of a new document.<br/><br/>
+The Content Model has an EPI Extractor configured and creates a hybrid of Rule-Based and Feature-Based Classification. ''not shown here''<br/>
+|}
+</tab>
+<tab name="Example Number Five" style="margin:25px">
+{| class="wikitable"
+| style="padding: 10px" | [[File:separation_and_review_13.png]] || style="padding: 10px" |
+'''Allow Training'''<br/>
+Lexical features (as set on the Content Model) are considered for classification and separation.<br/><br/>
+'''Positive Extractor'''<br/>
+Rule-Based Classification occurs first and falls back on Features-Based Classification.<br/><br/>
+'''Pagination'''<br/>
+This document can have any number of pages.<br/><br/>
+'''Secondary Page Extractor'''<br/>
+Determines whether the page is neither the first nor the last page of the document. If this extractor has a result, Separation appends this page to the page above it and expects the next page to be another secondary page, a last page, or the start of a new document.<br/><br/>
+The Content Model has an EPI Extractor configured and creates a hybrid of Rule-Based and Feature-Based Classification. ''not shown here''<br/>
 |}
 </tab>
 </tabs>
 ==Version Differences==