2023:Separation Mockup - RP: Difference between revisions

Revision as of 14:53, 10 January 2024

Separation is the process of taking an unorganized batch of loose pages and organizing them into folders. Each folder contains one document. This is done so Grooper can later assign a Document Type to each document in a process known as Classification.

About

Let's revisit the first three of the five phases of Grooper.

Acquire
- This involves bringing a Batch into Grooper. Usually, documents are scanned into Grooper and the initial Batch looks like just one long document with individual pages.
Condition
- This involves running Recognize and OCR on the Batch to allow Grooper to read the text and clean up the document if needed.
Organize
- This is where separation takes place.

What is Separation?

Imagine you have a box of documents, but the pages are all just loose in the box. You might want to go through that box and organize them. So, first you get a filing cabinet and you put all of the pages in that filing cabinet. This is similar to importing documents into a Batch.

Now, let's say you need to look for a specific type of document in the filing cabinet. Going through all of the loose pages would be slow, because it is hard to tell where one document ends and another begins. So, you take those pages and you sort them into folders. Each folder contains one document, which is made up of one or more pages. Now, it is much easier to tell the documents apart.

This is essentially how separation works. It organizes loose pages into documents so that Grooper can tell one document from another.
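
The filing analogy can be sketched in a few lines of Python. This is only an illustration, not Grooper's implementation or API: the function name `separate` and the `starts_document` callback are hypothetical, and pages are represented as plain strings.

```python
from typing import Callable, List


def separate(pages: List[str], starts_document: Callable[[str], bool]) -> List[List[str]]:
    """Group a flat list of pages into folders, one folder per document."""
    folders: List[List[str]] = []
    for page in pages:
        # Open a new folder for the very first page, or whenever
        # the callback says this page begins a new document.
        if not folders or starts_document(page):
            folders.append([])
        folders[-1].append(page)
    return folders


# Hypothetical batch: five loose pages in scan order.
loose_pages = ["INVOICE page 1", "page 2", "INVOICE page 1", "REPORT page 1", "page 2"]
folders = separate(loose_pages, lambda p: p.startswith(("INVOICE", "REPORT")))
```

Here `folders` holds three lists, one per document. The only decision the sketch makes is where a document begins, which is the same question a Separation Provider answers.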

Why do we need to separate documents?

When you bring documents into Grooper, they often come in as just a group of pages. Scanned documents come in the order the pages were scanned. Odds are that a Batch contains more than one document. When Grooper gets a fresh set of pages, it has no way to know what the documents are, where one document starts or stops, or how they should be organized.

We must find a way to tell Grooper where a document begins and ends. Once Grooper can determine that, it can separate the Batch into Folders and tell one document from another. This matters because Grooper later assigns Document Types to documents in a process called Classification.

To tell Grooper how documents need to be separated, we configure Separation Providers to automatically separate documents through a Batch Process.

When do we need to separate documents?

Sometimes you need to separate documents, and sometimes you don't. How do you tell the difference? It depends on whether your documents are already separated.

If you bring in a Batch that is essentially just a bunch of loose pages and not organized by document, then you will need to separate the pages into folders by document.
If you bring in a Batch that already has individual documents contained in their own folders, then there is no need to perform separation.
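
As a rough sketch of that decision, assume a Batch is modeled as a Python list whose items are either loose pages (strings) or folders (lists of pages). This data model and the name `needs_separation` are hypothetical and are not how Grooper stores a Batch.

```python
from typing import List, Union

Page = str
Folder = List[Page]


def needs_separation(batch: List[Union[Page, Folder]]) -> bool:
    """Return True if any page sits loose at the top level of the batch."""
    return any(isinstance(item, str) for item in batch)


# Scanned pages arrive loose; this imported batch is already foldered.
scanned_batch = ["page 1", "page 2", "page 3"]
imported_batch = [["contract p1", "contract p2"], ["invoice p1"]]
```

With these inputs, `needs_separation(scanned_batch)` is true and `needs_separation(imported_batch)` is false.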

FYI

Scanned documents come in as loose pages into the Batch and will always need to be separated. Imported documents may or may not need to be separated depending on how they are brought into Grooper and if the documents are already in separate folders within the Batch.

How do we separate?

Separation is a process that happens as part of a Batch Process. You will need to add a separation Batch Process Step and configure it with a Separation Provider.

There are 8 different Separation Providers you can configure. Each one is explained briefly here; for a deeper understanding, visit their individual articles:

Change in Value Separation - Grooper will separate when it detects a value (that you configure) changes from one page to another, like an invoice number for example.
Control Sheet Separation - Grooper will separate a document at the point it detects a "Control Sheet".
EPI Separation - Grooper will separate based on extracted page numbers and will detect a new document when the page number resets or when a lower page number comes up in the Batch.
ESP Auto Separation - One of the more complicated separation techniques involving Lexical training.
Event-Based Separation - Grooper will separate based on an "event" such as after X number of pages or any time Grooper encounters a blank page.
Multi Separator - This provider allows you to use multiple Separation Providers at once.
Pattern-Based Separation - Grooper will separate based on text patterns, such as a document title or label.
Undo Separation - This provider actually turns separated documents back into loose pages.
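
To make two of these ideas concrete, here is a minimal Python sketch of the logic behind Change in Value Separation and EPI Separation. It is not Grooper code: the function names are hypothetical, the per-page values and page numbers are assumed to have been extracted already, and treating a repeated page number as a reset is an assumption of this sketch.

```python
from typing import List


def split_on_value_change(values: List[str]) -> List[List[int]]:
    """Group page indexes, starting a new document when the tracked value changes."""
    documents: List[List[int]] = []
    previous = None
    for index, value in enumerate(values):
        if not documents or value != previous:
            documents.append([])
        documents[-1].append(index)
        previous = value
    return documents


def split_on_page_number_reset(page_numbers: List[int]) -> List[List[int]]:
    """Group page indexes, starting a new document when the page number does not increase."""
    documents: List[List[int]] = []
    previous = None
    for index, number in enumerate(page_numbers):
        # Assumption: a number equal to or lower than the last one is a reset.
        if previous is None or number <= previous:
            documents.append([])
        documents[-1].append(index)
        previous = number
    return documents


# Hypothetical extracted data, one entry per page in batch order.
by_invoice = split_on_value_change(["A100", "A100", "A200", "A200", "A200", "A300"])
by_page_no = split_on_page_number_reset([1, 2, 3, 1, 2, 1])
```

Both calls return three documents as lists of page indexes: `[[0, 1], [2, 3, 4], [5]]` for the invoice numbers and `[[0, 1, 2], [3, 4], [5]]` for the page numbers.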

Lexical vs. Real-Time

It is important to note that the majority of Separation Providers are "Lexical" providers. That means they require readable data from a document to work. The Recognize activity must be performed on the documents before any Lexical provider is used. The following are Lexical providers:

Change in Value Separation
EPI Separation
ESP Auto Separation
Pattern-Based Separation

"Real-Time" providers do not need readable data from documents to work. That means it is possible to run these providers as early as when you scan in documents. The Recognize activity is not required prior to running a Real-Time provider. The following are Real-Time Providers:

Control Sheet Separation
Event-Based Separation
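
The two events mentioned earlier for Event-Based Separation (a fixed page count and a blank page) need no text at all, which the following Python sketch illustrates. It is an illustration under stated assumptions, not Grooper's implementation: the function names are hypothetical, blank pages are assumed to be flagged already (for example by an image check), and dropping the blank page itself is a choice made for this sketch.

```python
from typing import List


def split_every_n_pages(page_count: int, n: int) -> List[List[int]]:
    """Start a new document after every n pages."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [list(range(start, min(start + n, page_count)))
            for start in range(0, page_count, n)]


def split_on_blank_pages(is_blank: List[bool]) -> List[List[int]]:
    """Treat each blank page as a separator and drop it from the output."""
    documents: List[List[int]] = []
    current: List[int] = []
    for index, blank in enumerate(is_blank):
        if blank:
            # Close the current document, if it has any pages.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(index)
    if current:
        documents.append(current)
    return documents


fixed_size = split_every_n_pages(7, 3)
by_blanks = split_on_blank_pages([False, False, True, False, True, True, False])
```

Here `fixed_size` is `[[0, 1, 2], [3, 4, 5], [6]]` and `by_blanks` is `[[0, 1], [3], [6]]`. Neither function ever looks at page text, which is why this kind of rule can run before Recognize.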

@@ Line 58: / Line 58: @@
 * [[ESP Auto Separation]] - One of the more complicated separation techniques involving Lexical training.
 * [[Event-Based Separation]] - Grooper will separate based on an "event" such as after X number of pages or any time Grooper encounters a blank page.
-* [[Multi-Separator]] - This provider allows you to use multiple '''''Separation Providers''''' at once.
+* [[Multi Separator]] - This provider allows you to use multiple '''''Separation Providers''''' at once.
-* [[Pattern Based Separation]] - Grooper will separate based on text patterns, such as a document title or label.
+* [[Pattern-Based Separation]] - Grooper will separate based on text patterns, such as a document title or label.
 * [[Undo Separation]] - This provider actually turns separated documents back into loose pages.