2023:Separation Mockup - RP: Difference between revisions

From Grooper Wiki

Latest revision as of 11:04, 15 January 2024

Separation is the process of taking an unorganized Batch of loose pages and organizing them into document folders. This is done so Grooper can later assign a Document Type to each document folder in a process known as Classification.

About

Let's revisit the five phases of Grooper.

  1. Acquire
    • Either physical pages are scanned into Grooper or digital files are imported into a Batch in Grooper.
  2. Condition
    • This involves running Recognize and OCR on the Batch to allow Grooper to read the text and clean up the pages if needed.
  3. Organize
    • This is where you separate the pages in the Batch into individual document folders.
    • After the pages have been separated, the document folders are classified.
  4. Collect
    • Data is extracted from the documents.
  5. Deliver
    • The extracted data is exported from Grooper to the destination of your choice.

What is Separation?

Imagine you have a bunch of pages and you put them in a box. This is similar to importing pages into a Batch.

Now, let's say you need to look for a specific type of document in the box. It would be difficult to go through all of the loose pages to find the documents you are looking for, because nothing shows where one document ends and another begins.

So, you take those pages and you sort them into folders. Each folder contains one document, which is made up of one or more pages. Now, it is much easier to tell the documents apart!

This is how separation works. It organizes the pages so that Grooper can tell one document from another.

Why do we need to separate?

The point of separation is so Grooper can later classify the documents. Classification is the process of assigning Document Types to the document folders. Classification can only be applied to a document folder, not loose pages. So, even if you are bringing only a single document into Grooper, you need to make sure the loose pages are contained within a document folder in order to classify them.

The point of Classification is to let Grooper know what to do with the documents. For more information on Classification, please take a look at our Classification wiki article.

When do we need to separate?

There are two ways to bring documents into Grooper. You can scan physical copies of the pages directly into a Batch or you can import digital documents into Grooper. After you have your documents in your Batch, whether or not you need to separate depends on whether your documents are "Discrete" or "Packeted" documents.


Discrete Documents
"Discrete" documents are digital documents where each file contains one document. When a file is imported into Grooper, it is automatically put in its own document folder. If the file only contains one individual document, then it will already be in its own document folder and there is no need to separate.


Packeted Documents
Sometimes when you import digital documents, each file might contain multiple documents. These are considered "Packeted" documents. In this case, Grooper will bring the file in as a document folder, but it will need to be separated so that each document is contained within its own document folder.

When you scan physical pages into Grooper, they come into the Batch as loose pages. Scanned pages will ALWAYS need to be separated.
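The rules above can be summed up as a small decision function. This is only an illustrative sketch in Python, not part of Grooper: the names `Source` and `needs_separation` are made up for this example, and in practice you make this decision yourself when you build your Batch Process.

```python
from enum import Enum


class Source(Enum):
    """How the pages got into the Batch (hypothetical names for illustration)."""
    SCANNED = "scanned"
    IMPORTED = "imported"


def needs_separation(source: Source, documents_per_file: int = 1) -> bool:
    """Return True when the pages must be separated before classification."""
    if source is Source.SCANNED:
        # Scanned pages arrive as loose pages, so they always need separating.
        return True
    # An imported file is already placed in its own document folder.
    # Only a "Packeted" file (more than one document in the file) needs
    # to be split further.
    return documents_per_file > 1
```

For example, `needs_separation(Source.IMPORTED, documents_per_file=3)` describes a packeted file and returns `True`, while an imported file holding a single document returns `False`.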

How do we separate?

Separation is a process that happens as part of a Batch Process. You will need to add a Separate Step to your Batch Process and configure it with a Separation Provider. A Separation Provider is a property that tells Grooper how we want to run separation on the pages. For example, if each of our documents is four pages long, we might want to tell Grooper to separate every four pages. We could also separate every time Grooper finds a title at the top of a page.
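To make the two examples concrete, here is a rough Python sketch of the logic behind "separate every four pages" and "separate when a title appears at the top of a page". It is not how Grooper is configured (that is done through the Separation Provider's properties), and the function names and the plain-text page representation are assumptions made for this illustration.

```python
import re
from typing import List


def separate_every_n(pages: List[str], n: int) -> List[List[str]]:
    """Start a new document folder after every n pages."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [pages[i:i + n] for i in range(0, len(pages), n)]


def separate_on_title(pages: List[str], title_pattern: str) -> List[List[str]]:
    """Start a new document folder whenever a page's first line matches the title pattern.

    Each page is represented by its recognized text (an assumption for this sketch).
    """
    title = re.compile(title_pattern)
    folders: List[List[str]] = []
    for text in pages:
        first_line = text.lstrip().split("\n", 1)[0]
        # The very first page always opens a folder, even without a title.
        if not folders or title.search(first_line):
            folders.append([])
        folders[-1].append(text)
    return folders
```

With nine pages and `n=4`, `separate_every_n` yields folders of four, four, and one page. The last folder is short because the page count does not divide evenly, which is the kind of case where a fixed page count stops being a safe rule.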

There are eight different Separation Providers you can configure. Here we give a brief explanation of each provider, but for a deeper understanding you will need to visit each of their articles individually:

  • Change in Value Separation - Grooper will separate when it detects a value (that you configure) changes from one page to another, such as an invoice number.
  • Control Sheet Separation - During the scanning process, you can make sure to place a "Control Sheet" in between each document. Grooper will separate at the point it detects a "Control Sheet".
  • EPI Separation - Grooper will separate based on extracted page numbers and will detect a new document when the page number resets or when a lower page number comes up in the Batch.
  • ESP Auto Separation - One of the more complicated separation techniques involving Lexical training.
  • Event-Based Separation - Grooper will separate based on an "event" such as after X number of pages or any time Grooper encounters a blank page.
  • Multi Separator - This provider allows you to use multiple Separation Providers at once.
  • Pattern-Based Separation - Grooper will separate based on text patterns, such as a document title or label.
  • Undo Separation - This provider actually turns separated documents back into loose pages.
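
Two of the ideas in this list, separating when a value changes and separating when a page number resets, can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions, not Grooper's implementation: the function names are hypothetical, and the extracted values and page numbers are assumed to already be available as one entry per page.

```python
from typing import List, Optional, Sequence


def separate_on_value_change(values: Sequence[Optional[str]]) -> List[List[int]]:
    """Group page indexes, starting a new folder when the extracted value changes.

    A page with no value (None) stays with the current document. That choice
    is an assumption made for this sketch.
    """
    folders: List[List[int]] = []
    last: Optional[str] = None
    for index, value in enumerate(values):
        changed = value is not None and last is not None and value != last
        if not folders or changed:
            folders.append([])
        folders[-1].append(index)
        if value is not None:
            last = value
    return folders


def separate_on_page_number_reset(page_numbers: Sequence[int]) -> List[List[int]]:
    """Group page indexes, starting a new folder when the page number does not increase.

    Treating a repeated number (for example 1 followed by 1) as a new
    document is also an assumption of this sketch.
    """
    folders: List[List[int]] = []
    previous: Optional[int] = None
    for index, number in enumerate(page_numbers):
        if previous is None or number <= previous:
            folders.append([])
        folders[-1].append(index)
        previous = number
    return folders
```

For instance, invoice numbers `["A", "A", None, "B", "B"]` would give two folders (pages 0 to 2 and pages 3 to 4), and page numbers `[1, 2, 3, 1, 2, 1]` would give three.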

Text-Based vs. Scan Supported

It is important to note that the majority of Separation Providers are "Text-Based" providers. That means they require readable data from your pages to work. OCR must be performed prior to any Text-Based provider being used. The following are Text-Based providers:

  • Change in Value Separation
  • EPI Separation
  • ESP Auto Separation
  • Pattern-Based Separation

"Scan Supported" providers do not need readable data to work. That means it is possible to run these providers as early as when you are scanning in physical pages. OCR is not required prior to running a Scan Supported provider. The following are Scan Supported Providers:

  • Control Sheet Separation
  • Event-Based Separation