2023.1:EPI Separation (Separation Provider)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


The EPI Separation Separation Provider uses embedded page information ("EPI") to Separate loose pages into document folders. A Data Extractor is used to find page numbers from the text on a page and Grooper uses this information to separate the pages.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

For this Separation Provider, an Extractor is used to find page numbers from the text on a page. The extractor must define the page number as group "PageNo" in the regular expression (regex) pattern. If the page number is formatted as Page X of Y (Page 1 of 3) then a second group must be defined as "PageCount" in its regular expression pattern.

The pattern Page (?<PageNo>\d+) of (?<PageCount>\d+) would group the "1" and "3" of our earlier example properly.
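As a quick check of the pattern above, here is a minimal Python sketch run against the "Page 1 of 3" example. It is only an illustration outside Grooper: Python's `re` module spells named groups `(?P<name>...)`, whereas the article's pattern uses the `(?<name>...)` form.

```python
import re

# Python spelling of the article's pattern:
#   Page (?<PageNo>\d+) of (?<PageCount>\d+)
PAGE_X_OF_Y = re.compile(r"Page (?P<PageNo>\d+) of (?P<PageCount>\d+)")

match = PAGE_X_OF_Y.search("Page 1 of 3")
page_no = int(match.group("PageNo"))        # the current page number
page_count = int(match.group("PageCount"))  # the total number of pages
print(page_no, page_count)  # prints: 1 3
```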



If the value of PageNo is 1, a new folder is created. Each subsequent page is added to that folder as long as its PageNo value follows in sequence.

If the page is out of sequence (or the extractor fails to produce a result), it is left as a loose page. If this is not the desired result, the Miss Disposition property can be configured to append all subsequent loose pages to the previous Document Folder.
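The rules above can be sketched in Python as follows. This is a simplified model, not Grooper's implementation: the function name `separate` and the `append_on_miss` flag (standing in for the Miss Disposition property) are hypothetical, and the sketch assumes that once a sequence breaks, pages stay loose until the next page with a PageNo of 1.

```python
from typing import List, Optional, Tuple


def separate(page_numbers: List[Optional[int]],
             append_on_miss: bool = False) -> Tuple[List[List[int]], List[int]]:
    """Group page indexes into folders from extracted PageNo values.

    page_numbers[i] is the PageNo found on page i, or None when the
    extractor returned nothing. Returns (folders, loose_pages).
    """
    folders: List[List[int]] = []
    loose: List[int] = []
    expected: Optional[int] = None  # PageNo that would continue the folder
    for index, page_no in enumerate(page_numbers):
        if page_no == 1:
            folders.append([index])        # PageNo 1 starts a new folder
            expected = 2
        elif expected is not None and page_no == expected:
            folders[-1].append(index)      # in sequence: same folder
            expected += 1
        elif append_on_miss and folders:
            folders[-1].append(index)      # assumed Miss Disposition behavior
            expected = None
        else:
            loose.append(index)            # out of sequence or no result
            expected = None
    return folders, loose


numbers = [1, 2, 3, 1, None, 3, 1, 2]
print(separate(numbers))
# prints: ([[0, 1, 2], [3], [6, 7]], [4, 5])
print(separate(numbers, append_on_miss=True))
# prints: ([[0, 1, 2], [3, 4, 5], [6, 7]], [])
```

In the first call, pages 4 and 5 stay loose because page 4 has no result. In the second call they are appended to the folder that began at page 3.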

How To

Setting the Provider

  1. Add a Separate Batch Process Step to a Batch Process.
  2. Set the Provider property to EPI Separation.
  3. Set the Value Extractor. In this example we are setting the Value Extractor to a Reference and then referencing a Data Type.


Using Capture Groups

The Value Extractor must return the page numbers needed for EPI Separation. For Grooper to recognize which extracted number to use, that number must be contained in a specifically named Capture Group.

  • A capture group named PageNo holds the actual page number, which is the number used for EPI Separation.
  • A capture group named PageCount indicates the total number of pages in a document, if that number is present on a page.

If the extracted number is not in an appropriately named Capture Group, Grooper won't know what to do with it.

FYI

While typing out your regex pattern, pressing Ctrl+G will insert a capture group into your pattern. Then all you have to do is enter a name for your capture group.

The PageNo Capture Group

  1. The Data Type in this example has multiple Value Readers as children. Each Value Reader is configured with a Pattern Match to return the page number format for each page in this Batch.
  2. The regex pattern Page:? (?<PageNo>\d+) is collecting the page numbers from the first document in the Batch. The number we want to use for separation is contained within the Capture Group "PageNo".
  3. "Page: 1" is being returned from the first page of the Batch.
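To see what the optional colon in this pattern allows, the following Python sketch runs it against a few sample strings. The sample strings and the helper name `find_page_no` are made up for illustration, and the pattern is written in Python's `(?P<name>...)` named-group syntax.

```python
import re
from typing import Optional

# Python spelling of: Page:? (?<PageNo>\d+)
# The ":?" makes the colon optional, so "Page: 1" and "Page 1" both match.
PAGE_NO = re.compile(r"Page:? (?P<PageNo>\d+)")


def find_page_no(text: str) -> Optional[str]:
    """Return the PageNo group for a line of text, or None on a miss."""
    m = PAGE_NO.search(text)
    return m.group("PageNo") if m else None


results = [find_page_no(s) for s in ["Page: 1", "Page 2", "Invoice total: 40"]]
print(results)  # prints: ['1', '2', None]
```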


The PageCount Capture Group

When you have page numbers on a document that indicate the total number of pages (such as "page 1 of 4"), you can use the PageCount capture group name to return that value.

You might ask yourself why we would want to collect that information. There are times you might have a page with poor OCR or perhaps the page number is missing altogether. Using the PageCount capture group helps Grooper understand how many pages to expect in a document, so even if there are OCR errors, it has a higher probability of still separating appropriately.

In the example below, we show you how to use the PageCount capture group.

  1. The second Value Reader child of the Data Type in this example is configured to return a different page number format.
  2. The regex pattern Page:? (?<PageNo>\d+) of (?<PageCount>\d+) collects two numbers.
    • The actual page number is contained within the "PageNo" Capture Group.
    • The total number of pages in the document is contained within the "PageCount" Capture Group.
  3. "Page: 1 of 4" is returned with this regex pattern.
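One way to picture why the total page count helps is the sketch below. It is only an assumed heuristic for illustration, not a description of how Grooper uses PageCount internally: when a page yields no result but the previous page was "N of M" with N less than M, the page is treated as page N+1 of the same document. The function names and sample page text are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Python spelling of: Page:? (?<PageNo>\d+) of (?<PageCount>\d+)
PAGE_OF = re.compile(r"Page:? (?P<PageNo>\d+) of (?P<PageCount>\d+)")


def read_page(text: str) -> Optional[Tuple[int, int]]:
    """Return (PageNo, PageCount) for a page's text, or None on a miss."""
    m = PAGE_OF.search(text)
    if m is None:
        return None
    return int(m.group("PageNo")), int(m.group("PageCount"))


def fill_gaps(pages: List[str]) -> List[Optional[Tuple[int, int]]]:
    """Assumed heuristic: a miss that follows "N of M" (N < M) becomes N+1 of M."""
    filled: List[Optional[Tuple[int, int]]] = []
    previous: Optional[Tuple[int, int]] = None
    for text in pages:
        current = read_page(text)
        if current is None and previous is not None and previous[0] < previous[1]:
            current = (previous[0] + 1, previous[1])
        filled.append(current)
        previous = current
    return filled


pages = ["Page: 1 of 4", "Page: 2 of 4", "P@ge: 3 0f 4", "Page: 4 of 4", "no page number"]
print(fill_gaps(pages))
# prints: [(1, 4), (2, 4), (3, 4), (4, 4), None]
```

The third page simulates an OCR error. Because the previous page reported 2 of 4, the sketch infers it is page 3. The last page is not inferred, since the document before it was already complete.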


Using Lexicons

In the example below, we have a document in the Batch where the page number is spelled out (one, two, three, etc.) instead of a numeric value (1, 2, 3, etc.). Grooper cannot interpret spelled-out words as numbers on its own. However, we can use a Lexicon to translate the words to numeric values.

Looking at the Lexicon

  1. In the example below, our Value Reader is collecting the page numbers that are spelled out rather than numeric values.
  2. We still use the same regex capture group of "PageNo" for the number, but we need to change the words to something Grooper can understand.


  1. The NumberWords Lexicon in the node tree has been configured to help translate these words to numbers. Select the Lexicon.
  2. In the right panel, we can see that each word is matched up with its corresponding numerical value in a list using the syntax one=1. This Lexicon can be used to translate words like "one" to numerical values like "1", which Grooper can use to separate the pages.
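The key=value idea behind the Lexicon can be illustrated with a small Python sketch. The `parse_lexicon` helper and the three sample entries are hypothetical, and this is not how Grooper stores or loads Lexicons. It only shows how entries written as one=1 form a lookup table.

```python
from typing import Dict


def parse_lexicon(entries: str) -> Dict[str, str]:
    """Parse key=value lines (for example "one=1") into a lookup table."""
    table: Dict[str, str] = {}
    for line in entries.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        key, _, value = line.partition("=")
        table[key.strip().lower()] = value.strip()
    return table


number_words = parse_lexicon("""
one=1
two=2
three=3
""")
print(number_words["two"])  # prints: 2
```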


Applying the Lexicon

  1. Go back to the Value Reader in the node tree.
  2. Click on the "Tester" tab.
  3. Click on the "Properties" tab.
  4. Click the ellipsis icon to the right of the Group Options property under the "LOOKUP" category.


  1. On the left side of the "Group Options" window that pops up, select the relevant "Group Name" if not already selected.
  2. Open up the Vocabulary property.
  3. Click on the ellipsis icon to the right of the Included Lexicons property.


  1. Navigate through the folders in the "Included Lexicons" window that pops up and click the check box next to the Lexicon you wish to apply.
  2. Click "OK" at the top right of the window to apply the Lexicon.


  1. Finally, open up the Lookup Options property under "LOOKUP SETTINGS".
  2. Click the checkbox next to the Translate property.
    • This is required for Grooper to translate the text to numerical values.
  3. Click "OK" in the top right of the window to apply changes.
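The effect of the Translate setting can be sketched in Python as follows. Everything here is an assumption for illustration: the article does not show the Value Reader's regex, so the pattern is hypothetical, and the dictionary is a stand-in for the NumberWords Lexicon.

```python
import re
from typing import Optional

# Hypothetical stand-in for the NumberWords Lexicon entries (key=value).
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4"}

# Hypothetical pattern: the PageNo group captures a spelled-out number.
words = "|".join(NUMBER_WORDS)
PAGE_WORD = re.compile(rf"Page (?P<PageNo>{words})", re.IGNORECASE)


def page_no(text: str, translate: bool = True) -> Optional[str]:
    """Return the PageNo value, translated to digits when translate is on."""
    m = PAGE_WORD.search(text)
    if m is None:
        return None
    raw = m.group("PageNo")
    return NUMBER_WORDS[raw.lower()] if translate else raw


print(page_no("Page Two"))                   # prints: 2
print(page_no("Page Two", translate=False))  # prints: Two
```

With translation off, the captured value stays as the word "Two", which cannot be compared as a page number. With translation on, the lookup yields "2".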


Glossary

Batch Process Step: Batch Process Step objects are specific actions within the sequence defined by a Batch Process. A Batch Process Step plays a critical role in automating and managing the flow of documents through the various stages of processing within Grooper.

  • Batch Process Steps are frequently referred to as simply "steps".
  • Because a single Batch Process Step executes a single Activity configuration, they are often referred to by their referenced Activity as well. For example, a "Recognize step".

Batch Process: Batch Process objects are crucial components in Grooper's architecture. A Batch Process orchestrates the document processing strategy and ensures each Batch of documents is managed systematically and efficiently.

  • Batch Processes by themselves do nothing. Instead, the workflows they execute are designed by adding child Batch Process Steps.
  • A Batch Process is often referred to as simply a "process".

Batch: Batch objects are fundamental in Grooper's architecture as they are the containers of documents that get moved through Grooper's workflow mechanisms known as Batch Processes.

Data Extractor: Data Extractor (or just "extractor") refers to all Extractor Types and extractor node objects. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Data Type: Data Type objects hold a collection of child, referenced, and locally defined Data Extractors and settings that manage how multiple (even differing) matches from Data Extractors are consolidated (via Collation) into a result set.

EPI Separation: The EPI Separation Separation Provider uses embedded page information ("EPI") to Separate loose pages into document folders. A Data Extractor is used to find page numbers from the text on a page and Grooper uses this information to separate the pages.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Lexicon: Lexicon node objects are dictionary objects that store a list of keys or key-value pairs. Lexicons can define local entries and/or import entries from other Lexicons and even import entries using a Data Connection. The entries in a Lexicon can be utilized in different areas of Grooper, such as data extraction, fuzzy matching, or OCR correction, providing a reference point that enhances the accuracy and consistency of the software's operations.

Lookup: A Lookup Specification defines a "lookup operation", where existing Grooper fields (called "lookup fields") are used to query an external data source, such as a database. The results of the lookup can be used to validate or populate field values (called "target fields") in Grooper. Lookup Specifications are created on "container elements" (Data Models, Data Sections and Data Tables) using their Lookups property. Lookups may query using all single-instance fields relative to the container element (including those defined on parent elements up to the root Data Model), but cannot be used to populate a field value on a parent of the container element.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized, in order to be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.

Pattern Match: Pattern Match is an Extractor Type that extracts values from a document that match a specified regular expression, providing data collection following a known format or pattern.

Project: Project node objects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as Content Models, Batch Processes, profile objects, and more are organized and managed. It allows for the encapsulation and modularization of these resources for easier management and reusability.

Reference: Reference is an Extractor Type used to reference an external extractor object within a Grooper property configuration. This allows users to create re-usable extractors and use the more complex Data Type and Field Class extractors throughout Grooper.

Separate: Separate is an Activity that sorts Batch Pages into individual Batch Folders. This distinguishes "loose pages" from the documents formed by those pages. Once loose pages are separated into Batch Folder documents, they can be further processed by Classify, Extract, Export and other Activities that need to run on the folder (i.e. document) level.

Separation Provider: The Provider property of the Separate Activity defines the type of separation to be performed at the designated Scope.

Separation: Separation is the process of taking an unorganized Batch of loose Batch Pages and organizing them into documents represented by Batch Folders in Grooper. This is done so Grooper can later assign a Document Type to each document folder in a process known as "classification".

Value Reader: Value Reader objects define a single data extraction operation. You set the Extractor Type on the Value Reader that matches the specific data you're aiming to capture. For example, you would use the Pattern Match Extractor Type to return data using regular expression. You would use a Value Reader when you need to extract a single result or list of simple results from a document.