2023.1:Separation (Concept)

From Grooper Wiki
Revision as of 10:24, 27 August 2024 by Randallkinard (talk | contribs)

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


Separation is the process of taking an unorganized Batch of loose Batch Pages and organizing them into documents represented by Batch Folders in Grooper. This is done so Grooper can later assign a Document Type to each document folder in a process known as "classification".

About

Let's revisit the Five Phases of Grooper.

  1. Acquire
    • Either physical pages are scanned into Grooper or digital files are imported into a Batch in Grooper.
  2. Condition
    • This involves running Recognize and OCR on the Batch to allow Grooper to read the text and clean up the pages if needed.
  3. Organize
    • This is where you separate the pages in the Batch into individual document folders.
    • After the pages have been separated, then the document folders are classified.
  4. Collect
    • Data is extracted from the documents.
  5. Deliver
    • The extracted data is exported from Grooper to the destination of your choice.

What is Separation?

Imagine you have a bunch of pages and you put them in a box. This is similar to importing pages into a Batch.

Now, let's say you need to look for a specific type of document in the box. It would be difficult to just go through all of the loose pages to find the documents you are looking for. It's difficult to determine where one document ends and another begins.

So, you take those pages and you sort them into folders. Each folder contains one document, which is made up of one or more pages. Now, it is much easier to tell the documents apart!

This is how separation works. It is organizing the pages so that Grooper can identify one document from another.

Why do we need to separate?

The point of separation is so Grooper can later Classify the documents. Classification is the process of assigning Document Types to the document folders. Classification can only be applied to a document folder, not loose pages. So, even if you are bringing only a single document into Grooper, you need to make sure its loose pages are contained within a document folder before you can classify it.

The point of document classification is to let Grooper know what to do with the documents.

  • For example, Grooper won't know what data to extract from a document without a Data Model. Where does that Data Model come from? A Document Type. How does a document folder get a Document Type? Classification!
  • For more information on classification, please take a look at our Classification wiki article.

When do we need to separate?

There are two ways to bring documents into Grooper. You can scan physical copies of the pages directly into a Batch or you can import digital documents into Grooper. Once your documents are in a Batch, whether you need to separate depends first on whether they were scanned or imported. If they were imported, it then depends on whether they are "Discrete" or "Packeted" documents.

Scanned Documents

Documents scanned into Grooper always need to be separated. Documents are scanned page by page into a Batch. Separation takes those loose pages and turns them into documents by creating a folder for each document and placing its pages inside.

Imported Documents: Discrete Documents

"Discrete" documents are digital documents where each file contains one document. When a file is imported into Grooper, it is automatically put in its own document folder. If the file only contains one individual document, then it will already be in its own document folder and there is no need to separate.

Imported Documents: Packeted Documents

Sometimes when you import digital documents, each file might contain multiple documents. These are considered "Packeted" documents. In this case, Grooper will bring the file in as a document folder, but it will need to be separated so that each document is contained within its own document folder.
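
The decision described in this section can be summarized in a short Python sketch. It is an illustration only and not part of Grooper: the function name `needs_separation` and its parameters are hypothetical, chosen to mirror the rules above (scanned pages always need separation; imported files need it only when one file holds more than one document).

```python
def needs_separation(source: str, documents_per_file: int = 1) -> bool:
    """Return True if the pages need a Separate step, per the rules above.

    Hypothetical helper for illustration; this is not a Grooper function.
    source: "scanned" or "imported".
    documents_per_file: how many documents one imported file contains.
    """
    if source == "scanned":
        # Scanned pages always arrive loose, so they always need separation.
        return True
    if source == "imported":
        # A file holding one document is already in its own folder on import.
        # A file holding several documents must still be split apart.
        return documents_per_file > 1
    raise ValueError(f"unknown source: {source!r}")


print(needs_separation("scanned"))                         # True
print(needs_separation("imported", documents_per_file=1))  # False
print(needs_separation("imported", documents_per_file=3))  # True
```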

How do we separate?

Separation is a process that happens as part of a Batch Process. You will need to add a Separate step (a Batch Process Step with the Activity property set to Separate) to your Batch Process and configure it with a Separation Provider. A Separation Provider is a property that tells Grooper how we want to run separation on the pages. For example, if each of our documents is four pages long, we might tell Grooper to separate every four pages. We could also separate every time Grooper finds a title at the top of a page.
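
To make the "separate every four pages" idea concrete, here is a minimal Python sketch. It only models the concept: pages are plain strings, folders are plain lists, and the function name `separate_every_n_pages` is hypothetical. In Grooper itself this is done by configuring a Separation Provider, not by writing code.

```python
from typing import List


def separate_every_n_pages(pages: List[str], pages_per_document: int) -> List[List[str]]:
    """Group loose pages into folders of a fixed size (hypothetical helper)."""
    if pages_per_document < 1:
        raise ValueError("pages_per_document must be at least 1")
    # Each slice becomes one "document folder"; the last one may be shorter.
    return [
        pages[start:start + pages_per_document]
        for start in range(0, len(pages), pages_per_document)
    ]


batch = [f"page {n}" for n in range(1, 9)]  # eight loose pages
folders = separate_every_n_pages(batch, 4)
print(len(folders))  # 2
print(folders[1])    # ['page 5', 'page 6', 'page 7', 'page 8']
```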

There are 8 different Separation Providers you can configure. Each is explained briefly here; for a deeper understanding, visit each provider's individual article:

  • Change in Value Separation - Grooper will separate when it detects a value (that you configure) changes from one page to another, such as an invoice number.
  • Control Sheet Separation - During the scanning process, you can make sure to place a Grooper Control Sheet in between each document. Grooper will separate at the point it detects a Control Sheet.
  • EPI Separation - Grooper will separate based on extracted page numbers and will detect a new document when the page number resets or when a lower page number comes up in the Batch.
  • ESP Auto Separation - One of the more complicated separation techniques involving Lexical training.
    • Since ESP Auto Separation uses a Content Model's training data (as well as classification rules set on its Document Types), it both separates and classifies documents during the Separate activity.
  • Event-Based Separation - Grooper will separate based on an "event" such as after X number of pages or any time Grooper encounters a blank page. The following are events that can be configured for this provider:
    • Blank Page - A blank page will trigger a new folder.
    • Barcode - A scanned barcode will trigger a new folder.
    • Content Type - This Separation Event uses Lexical or Visual training examples to trigger folder creation. Whenever a page confidently matches a trained example document's first page, a new folder is created.
    • Page Count - This is for fixed page separation. A new folder is created each time a set number of pages has been counted.
    • Shape - A new folder is created every time a "shape feature" is detected.
  • Multi Separator - This provider allows you to use multiple Separation Providers at once.
  • Pattern-Based Separation - Grooper will separate based on text patterns, such as a document title or label.
  • Undo Separation - This provider actually turns separated documents back into loose pages.
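
Two of the ideas in this list, Change in Value Separation and EPI Separation, share one shape: walk the pages in order and start a new folder when some condition is met. The Python sketch below models only that shape. The `Page` class, its fields, and all function names are hypothetical, and the values are assumed to have been extracted from each page already. It does not reflect how Grooper implements these providers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Page:
    name: str
    invoice_number: Optional[str] = None  # hypothetical extracted value
    page_number: Optional[int] = None     # hypothetical extracted page number


def separate(pages: List[Page],
             starts_new_document: Callable[[Page, Page], bool]) -> List[List[Page]]:
    """Start a new folder at the first page and wherever the rule fires."""
    folders: List[List[Page]] = []
    previous: Optional[Page] = None
    for page in pages:
        if previous is None or starts_new_document(previous, page):
            folders.append([])
        folders[-1].append(page)
        previous = page
    return folders


def value_changed(previous: Page, current: Page) -> bool:
    # Change in Value idea: the extracted value differs from the prior page.
    return current.invoice_number != previous.invoice_number


def page_number_reset(previous: Page, current: Page) -> bool:
    # EPI idea: the page number resets to 1 or drops below the prior page's.
    if current.page_number is None:
        return False  # assumption: an unnumbered page stays with the current document
    if current.page_number == 1:
        return True
    return previous.page_number is not None and current.page_number < previous.page_number


invoices = [Page("a", invoice_number="100"),
            Page("b", invoice_number="100"),
            Page("c", invoice_number="200")]
print([len(folder) for folder in separate(invoices, value_changed)])      # [2, 1]

numbered = [Page(f"p{i}", page_number=n) for i, n in enumerate([1, 2, 3, 1, 2])]
print([len(folder) for folder in separate(numbered, page_number_reset)])  # [3, 2]
```

Passing the rule in as a function keeps the page-walking loop identical for both cases, which is why the two providers can be shown side by side here.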

Text-Based vs. Scan Supported

It is important to note that the majority of Separation Providers are "Text-Based" providers. That means they require readable data from your pages to work. OCR must be performed prior to any Text-Based provider being used. The following are Text-Based providers:

  • Change in Value Separation
  • EPI Separation
  • ESP Auto Separation
  • Pattern-Based Separation

"Scan Supported" providers do not need readable data to work. That means it is possible to run these providers as early as when you are scanning in physical pages. OCR is not required prior to running a Scan Supported provider. The following are Scan Supported Providers:

  • Control Sheet Separation
  • Event-Based Separation


Glossary

Activity: Grooper Activities define specific document processing operations done to a Batch, Batch Folder, or Batch Page. In a Batch Process, each Batch Process Step executes a single Activity (determined by the step's "Activity" property).

  • Batch Process Steps are frequently referred to by the name of their configured Activity followed by the word "step". For example: "Classify step".

Batch Page: Batch Page nodes represent individual pages within a Batch. Batch Pages are created in one of two ways: (1) when images are scanned into a Batch using the Scan Viewer, or (2) when split from a PDF or TIFF file using the Split Pages activity.

  • Batch Pages are frequently referred to simply as "pages".

Batch Process Step: Batch Process Steps are specific actions within a Batch Process sequence. Each Batch Process Step performs an "Activity" specific to some document processing task. Each Activity is either a "Code Activity" or a "Review" activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Process Steps are frequently referred to as simply "steps".
  • Because a single Batch Process Step executes a single Activity configuration, they are often referred to by their referenced Activity as well. For example, a "Recognize step".

Batch Process: Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a Batch. Each step consists of a "Code Activity" or a "Review" activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps, which are added as child nodes.
  • A Batch Process is often referred to as simply a "process".

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Change in Value Separation: The Change in Value Separation Separation Provider creates a new folder and separates every time an extracted value changes from one Batch Page to another.

Classification: Classification is the process of identifying and organizing documents into categorical types based on their content or layout. Classification is key for efficient document management and data extraction workflows. Grooper has different methods for classifying documents. These include methods that use machine learning and text pattern recognition. In a Grooper Batch Process, the Classify Activity will assign a Content Type to a Batch Folder.

Classify: Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Document Type.

  • Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
  • Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Content Type: Content Types are a class of node types used to classify Batch Folders. They represent categories of documents (Content Models and Content Categories) or distinct types of documents (Document Types). Content Types serve an important role in defining Data Elements and Behaviors that apply to a document.

Control Sheet Separation: Control Sheet Separation is a Separation Provider that uses Grooper Control Sheets to separate documents.

Data Model: Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) define data extraction logic and settings for how data is reviewed in a Data Viewer.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

EPI Separation: The EPI Separation Separation Provider uses embedded page information ("EPI") to Separate loose pages into document folders. A Data Extractor is used to find page numbers from the text on a page and Grooper uses this information to separate the pages.

ESP Auto Separation: ESP Auto Separation is a Separation Provider used for document separation. It is unique in that it both separates and classifies documents at the same time. It uses page-level classification training examples (among other things) to determine where to insert document folders in a Batch.

Event-Based Separation: Event-Based Separation is a Separation Provider that Separates documents using one or more "Separation Events". Each Separation Event triggers the creation of a new folder.

Five Phases of Grooper: The "Five Phases of Grooper" is a conceptual term that seeks to build understanding of how documents are processed through Grooper.

Lexical: "Lexical" is a Classify Method that classifies Batch Folders based on the text content of trained document examples. This is achieved through the statistical analysis of word frequencies that identify Document Types.

Multi Separator: The Multi Separator Separation Provider performs separation using multiple Separation Providers. It allows users to create a list of any of the other Separation Providers. If the first provider on the list fails to separate a page (or, as more often is the case, a series of pages), the next one will be applied. If that fails, the next, and so on.
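
The fallback behavior described for the Multi Separator can be pictured with a small Python sketch. It is a simplified model under stated assumptions: each "provider" is a hypothetical function that returns a list of folders, or `None` when it cannot separate the pages, and the fallback is applied to the whole series of pages at once. None of these names come from Grooper.

```python
from typing import Callable, List, Optional

Folders = List[List[str]]
Provider = Callable[[List[str]], Optional[Folders]]


def split_on_title(pages: List[str]) -> Optional[Folders]:
    """Hypothetical provider: new folder at each page starting with 'INVOICE'."""
    if not any(page.startswith("INVOICE") for page in pages):
        return None  # no title found, so this provider "fails"
    folders: Folders = []
    for page in pages:
        if page.startswith("INVOICE") or not folders:
            folders.append([])
        folders[-1].append(page)
    return folders


def split_every_two(pages: List[str]) -> Optional[Folders]:
    """Hypothetical provider: fixed two-page documents."""
    return [pages[i:i + 2] for i in range(0, len(pages), 2)] or None


def multi_separate(pages: List[str], providers: List[Provider]) -> Optional[Folders]:
    # Try each provider in list order; the first one that succeeds wins.
    for provider in providers:
        folders = provider(pages)
        if folders is not None:
            return folders
    return None  # every provider failed


order = [split_on_title, split_every_two]
print(multi_separate(["INVOICE 1", "details", "INVOICE 2"], order))
# [['INVOICE 1', 'details'], ['INVOICE 2']]
print(multi_separate(["a", "b", "c"], order))
# [['a', 'b'], ['c']]
```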

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized so it can be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.

Pattern-Based Separation: Pattern-Based Separation is a Separation Provider that creates a new document folder every time a value returned by a defined pattern is encountered on a page.

Pattern-Based: Pattern-Based is a Collation Provider option for Data Type extractors. Pattern-Based uses regular expressions to sequence returned results into a final result set.

Recognize: Recognize is an Activity that obtains machine-readable text from Batch Pages and Batch Folders. When properly configured with an OCR Profile, Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also reference an IP Profile to collect "layout data" like lines, checkboxes, and barcodes. Other Activities then use this machine-readable text and layout data for document analysis and data extraction.

Separate: Separate is an Activity that sorts Batch Pages into individual Batch Folders. This distinguishes "loose pages" from the documents formed by those pages. Once loose pages are separated into Batch Folder documents, they can be further processed by Classify, Extract, Export, and other Activities that need to run on the folder (i.e. document) level.

Separation Provider: The Provider property of the Separate Activity defines the type of separation to be performed at the designated Scope.

Separation: Separation is the process of taking an unorganized Batch of loose Batch Pages and organizing them into documents represented by Batch Folders in Grooper. This is done so Grooper can later assign a Document Type to each document folder in a process known as "classification".

Undo Separation: Undo Separation is a Separation Provider. Instead of putting loose Batch Pages into Batch Folders, this Separation Provider removes Batch Folders, leaving only loose pages.

Visual: "Visual" is a Classify Method that uses image analysis instead of text data to determine the Document Type assigned to a Batch Folder during classification. Instead of using text-based extractors, an "Extract Features" IP Command in an IP Profile is used to collect image-based data from a Batch Folder's image(s). This image-based data is compared against that of previously trained document examples of each Document Type to classify the Batch Folder.