2023.1:EPI Separation (Separation Provider)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


The EPI Separation Separation Provider uses embedded page information ("EPI") to Separate loose pages into document folders. A Data Extractor is used to find page numbers from the text on a page and Grooper uses this information to separate the pages.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

For this Separation Provider, an Extractor is used to find page numbers from the text on a page. The extractor must define the page number as group "PageNo" in the regular expression (regex) pattern. If the page number is formatted as Page X of Y (Page 1 of 3) then a second group must be defined as "PageCount" in its regular expression pattern.

The pattern Page (?<PageNo>\d+) of (?<PageCount>\d+) would group the "1" and "3" of our earlier example properly.
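As a rough illustration outside of Grooper, the same two named groups can be tried in Python. This is only a sketch: Python's re module writes named groups as (?P<name>...) rather than the (?<name>...) form used in this article, and the helper name read_page_info is hypothetical.

```python
import re

# Same idea as the article's pattern, rewritten in Python's named-group
# syntax: (?P<name>...) instead of (?<name>...).
PAGE_X_OF_Y = re.compile(r"Page (?P<PageNo>\d+) of (?P<PageCount>\d+)")


def read_page_info(text):
    """Return (PageNo, PageCount) as ints, or None if the pattern misses."""
    match = PAGE_X_OF_Y.search(text)
    if match is None:
        return None
    return int(match.group("PageNo")), int(match.group("PageCount"))


print(read_page_info("Invoice summary    Page 1 of 3"))  # (1, 3)
print(read_page_info("No page marker here"))             # None
```

The group names are what matter here: the digits are only useful for separation once they are labeled "PageNo" and "PageCount".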



If the value of PageNo is 1, a new folder is created. As long as each subsequent page's PageNo value follows in sequence, they are included in the folder.

If the page is out of sequence (or the extractor fails to produce a result), it is left as a loose page. If this is not the desired result, the Miss Disposition property can be configured to append all subsequent loose pages to the previous Document Folder.
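The rules above can be sketched in a few lines of Python. This is an illustrative model only, not Grooper's implementation: the function name separate, the append_misses flag (a stand-in for the Miss Disposition property), and the choice to treat every page after a miss as a miss until the next PageNo of 1 are all assumptions.

```python
def separate(page_numbers, append_misses=False):
    """Group pages into folders from their extracted PageNo values.

    page_numbers: one entry per page, an int PageNo or None for a miss.
    append_misses: hypothetical stand-in for the Miss Disposition setting.
    Returns (folders, loose), both holding zero-based page indexes.
    """
    folders, loose = [], []
    expected = None  # PageNo that would continue the current folder
    for index, page_no in enumerate(page_numbers):
        if page_no == 1:
            folders.append([index])      # PageNo 1 starts a new folder
            expected = 2
        elif page_no is not None and page_no == expected:
            folders[-1].append(index)    # in sequence: stays in the folder
            expected += 1
        elif append_misses and folders:
            folders[-1].append(index)    # miss appended to the previous folder
            expected = None
        else:
            loose.append(index)          # out of sequence or no result
            expected = None
    return folders, loose


pages = [1, 2, 3, 1, None, 1, 2]
print(separate(pages))                      # ([[0, 1, 2], [3], [5, 6]], [4])
print(separate(pages, append_misses=True))  # ([[0, 1, 2], [3, 4], [5, 6]], [])
```

In the first call the fifth page has no extracted number, so it stays loose. In the second call it is appended to the folder that precedes it.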

How To

Setting the Provider

  1. Add a Separate Batch Process Step to a Batch Process.
  2. Set the Provider property to EPI Separation.
  3. Set the Value Extractor. In this example we are setting the Value Extractor to a Reference and then referencing a Data Type.


Using Capture Groups

The Value Extractor needs to be set to something that will return the page numbers needed for EPI Separation. For Grooper to understand that a number needs to be used for EPI Separation we need to contain that extracted number in a specifically named Capture Group.

  • A capture group named PageNo indicates the number being used for the EPI Separation or the actual page number.
  • A capture group named PageCount indicates the total number of pages in a document, if that number is present on a page.

Without being in an appropriately named Capture Group, Grooper won't know what to do with the extracted information.

FYI

While typing out your regex pattern, pressing Ctrl+G will insert a capture group into your pattern. Then all you have to do is enter a name for your capture group.

The PageNo Capture Group

  1. The Data Type in this example has multiple Value Readers as children. Each Value Reader is configured with a Pattern Match to return the page number format for each page in this Batch.
  2. The regex pattern Page:? (?<PageNo>\d+) is collecting the page numbers from the first document in the Batch. The number we want to use for separation is contained within the Capture Group "PageNo".
  3. "Page: 1" is being returned from the first page of the Batch.
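The idea of one parent extractor with several child patterns, one per page number format, can be approximated in Python as a list of patterns tried in order. This is a sketch under assumptions: PAGE_PATTERNS and find_page_number are hypothetical names, Python's (?P<name>...) syntax replaces the article's (?<name>...), and a Data Type in Grooper may combine its children's results differently.

```python
import re

# Hypothetical stand-ins for the Value Reader children. Each pattern covers
# one page number format; the more specific format is listed first.
PAGE_PATTERNS = [
    re.compile(r"Page:? (?P<PageNo>\d+) of (?P<PageCount>\d+)"),
    re.compile(r"Page:? (?P<PageNo>\d+)"),
]


def find_page_number(text):
    """Return the named groups of the first matching pattern, else None."""
    for pattern in PAGE_PATTERNS:
        match = pattern.search(text)
        if match:
            return {name: int(value) for name, value in match.groupdict().items()}
    return None


print(find_page_number("Page: 1"))      # {'PageNo': 1}
print(find_page_number("Page 2 of 4"))  # {'PageNo': 2, 'PageCount': 4}
```

The optional colon (:?) lets one pattern accept both "Page: 1" and "Page 1".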


The PageCount Capture Group

When you have page numbers on a document that indicate the total number of pages (such as "page 1 of 4"), you can use the PageCount capture group name to return that value.

You might ask yourself why we would want to collect that information. There are times you might have a page with poor OCR or perhaps the page number is missing altogether. Using the PageCount capture group helps Grooper understand how many pages to expect in a document, so even if there are OCR errors, it has a higher probability of still separating appropriately.

In the example below, we show you how to use the PageCount capture group.

  1. The second Value Reader child of the Data Type in this example is configured to return a different page number format.
  2. The regex pattern Page:? (?<PageNo>\d+) of (?<PageCount>\d+) collects two numbers.
    • The actual page number is contained within the "PageNo" Capture Group.
    • The total number of pages in the document is contained within the "PageCount" Capture Group.
  3. "Page: 1 of 4" is returned with this regex pattern.
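To see why a page count helps when a page number is unreadable, consider the simplified Python model below. It is an assumption-laden sketch, not a description of Grooper's internal logic: the name separate_with_count is hypothetical, and the model simply trusts the PageCount found on page 1 to decide how many following pages belong to the folder.

```python
def separate_with_count(results):
    """Group pages using PageCount to bridge pages with no extracted number.

    results: per page, a (PageNo, PageCount) tuple or None for a miss.
    Returns (folders, loose), both holding zero-based page indexes.
    """
    folders, loose = [], []
    remaining = 0  # pages still expected in the current folder
    for index, result in enumerate(results):
        if result is not None and result[0] == 1:
            folders.append([index])     # PageNo 1 starts a new folder
            remaining = result[1] - 1   # PageCount says how many more to expect
        elif remaining > 0:
            folders[-1].append(index)   # the count says this page still belongs
            remaining -= 1
        else:
            loose.append(index)         # nothing indicates where the page goes
    return folders, loose


# The third page failed OCR (None), yet it lands in the first document
# because page 1 announced a total of 4 pages.
results = [(1, 4), (2, 4), None, (4, 4), (1, 2), (2, 2)]
print(separate_with_count(results))  # ([[0, 1, 2, 3], [4, 5]], [])
```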


Using Lexicons

In the example below, we have a document in the Batch where the page number is spelled out (one, two, three, etc.) instead of a numeric value (1, 2, 3, etc.). Grooper is unable to understand the meaning of written words. However, we can use a Lexicon to translate the words to numeric values.

Looking at the Lexicon

  1. In the example below, our Value Reader is collecting the page numbers that are spelled out rather than numeric values.
  2. We still use the same regex capture group of "PageNo" for the number, but we need to change the words to something Grooper can understand.


  1. The NumberWords Lexicon in the node tree has been configured to help translate these words to numbers. Select the Lexicon.
  2. In the right panel, we can see that each word is matched up with its corresponding numerical value in a list using the syntax one=1. This Lexicon can be used to translate words like "one" to numerical values like "1", which Grooper can use to separate the pages.
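The word-to-number translation can be mimicked in Python by parsing entries written in the same one=1 syntax into a dictionary and applying it to the captured "PageNo" text. This is a minimal sketch: the sample entries, the pattern, and the names parse_lexicon and translated_page_no are hypothetical, and Grooper performs this step through the Lexicon and its lookup settings rather than through code.

```python
import re

# Sample entries in the one=1 syntax described above (illustrative only).
LEXICON_TEXT = """\
one=1
two=2
three=3
four=4
"""


def parse_lexicon(text):
    """Parse word=value lines into a dict, skipping lines without '='."""
    entries = {}
    for line in text.splitlines():
        if "=" in line:
            word, value = line.split("=", 1)
            entries[word.strip().lower()] = value.strip()
    return entries


NUMBER_WORDS = parse_lexicon(LEXICON_TEXT)

# Python named-group syntax; the article's form would be (?<PageNo>...).
WORD_PAGE = re.compile(r"Page (?P<PageNo>[A-Za-z]+)")


def translated_page_no(text):
    """Return the page number as an int after translating the word, or None."""
    match = WORD_PAGE.search(text)
    if match is None:
        return None
    value = NUMBER_WORDS.get(match.group("PageNo").lower())
    return int(value) if value is not None else None


print(translated_page_no("Page Two"))     # 2
print(translated_page_no("Page eleven"))  # None (no entry for this word)
```

A word with no entry cannot be translated, so the list of entries needs to cover every page number expected in the Batch.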


Applying the Lexicon

  1. Go back to the Value Reader in the node tree.
  2. Click on the "Tester" tab.
  3. Click on the "Properties" tab.
  4. Click the ellipsis icon to the right of the Group Options property under the "LOOKUP" category.


  1. On the left side of the "Group Options" window that pops up, select the relevant "Group Name" if not already selected.
  2. Open up the Vocabulary property.
  3. Click on the ellipsis icon to the right of the Included Lexicons property.


  1. Navigate through the folders in the "Included Lexicons" window that pops up and click the check box next to the Lexicon you wish to apply.
  2. Click "OK" at the top right of the window to apply the Lexicon.


  1. Finally, open up the Lookup Options property under "LOOKUP SETTINGS".
  2. Click the checkbox next to the Translate property.
    • This is required for Grooper to translate the text to numerical values.
  3. Click "OK" in the top right of the window to apply changes.


Glossary

Batch Process Step: Batch Process Steps are specific actions within a Batch Process sequence. Each Batch Process Step performs an "Activity" specific to some document processing task. These Activities will be either "Code Activities" or "Review" activities. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Process Steps are frequently referred to as simply "steps".
  • Because a single Batch Process Step executes a single Activity configuration, they are often referred to by their referenced Activity as well. For example, a "Recognize step".

Batch Process: Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a Batch. Each step is composed of a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps, which are added as child nodes.
  • A Batch Process is often referred to as simply a "process".

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Data Extractor: Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

EPI Separation: The EPI Separation Separation Provider uses embedded page information ("EPI") to Separate loose pages into document folders. A Data Extractor is used to find page numbers from the text on a page and Grooper uses this information to separate the pages.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Lexicon: Lexicons are dictionaries used throughout Grooper to store lists of words, phrases, weightings for Fuzzy RegEx, and more. Users can add entries to a Lexicon, Lexicons can import entries from other Lexicons by referencing them, and entries can be dynamically imported from a database using a Data Connection. Lexicons are commonly used to aid in data extraction, with the "List Match" and "Word Match" extractors utilizing them most commonly.

Lookup: A Lookup Specification defines a "lookup operation", where existing Grooper fields (called "lookup fields") are used to query an external data source, such as a database. The results of the lookup can be used to validate or populate field values (called "target fields") in Grooper. Lookup Specifications are created on "container elements" (Data Models, Data Sections and Data Tables) using their Lookups property. Lookups may query using all single-instance fields relative to the container element (including those defined on parent elements up to the root Data Model), but cannot be used to populate a field value on a parent of the container element.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized, in order to be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine readable, encoded text.

Pattern Match: Pattern Match is a Value Extractor that extracts values from a document that match a specified regular expression, providing data collection following a known format or pattern.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects, such as Content Models, Batch Processes, and profile objects, are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Reference: Reference is a Value Extractor used to reference an Extractor Node. This allows users to create re-usable extractors and use the more complex Data Type and Field Class extractors throughout Grooper.

Separate: Separate is an Activity that sorts Batch Pages into individual Batch Folders. This distinguishes "loose pages" from the documents formed by those pages. Once loose pages are separated into Batch Folder documents, they can be further processed by Classify, Extract, Export and other Activities that need to run on the folder (i.e. document) level.

Separation Provider: The Provider property of the Separate Activity defines the type of separation to be performed at the designated Scope.

Separation: Separation is the process of taking an unorganized Batch of loose Batch Pages and organizing them into documents represented by Batch Folders in Grooper. This is done so Grooper can later assign a Document Type to each document folder in a process known as "classification".

Value Reader: Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions).

  • Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.