2023.1:Change in Value Separation (Separation Provider)

From Grooper Wiki

Revision as of 09:26, 8 May 2024

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

The Change in Value Separation provider (a Separation Provider) creates a new folder each time an extracted value changes from one Batch Page to the next.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023.1). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

  • 2023.1 Wiki Change-In-Value-Separation Batches.zip
  • 2023.1 Wiki Change-In-Value-Separation Project.zip

Glossary

Activity: Grooper Activities define specific document processing operations done to a Batch, Batch Folder, or Batch Page. In a Batch Process, each Batch Process Step executes a single Activity (determined by the step's "Activity" property).

  • Batch Process Steps are frequently referred to by the name of their configured Activity followed by the word "step". For example: "Classify step".

Array: Array is a Collation Provider option for Data Type extractors. Array matches a list of values arranged in horizontal, vertical, or text-flow order, combining instances that qualify into a single result.

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Batch Folder: The Batch Folder is an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or Batch Page nodes as children.

  • Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Batch Process: Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a Batch. Each step is comprised of a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps which are added as children nodes.
  • A Batch Process is often referred to as simply a "process".

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Data Context: Data Context refers to contextual information used to extract data, such as a label that identifies the value you want to collect.

Data Element: Data Elements are a class of node types used to collect data from a document. These include: Data Models, Data Sections, Data Fields, Data Tables, and Data Columns.

Data Extractor: Data Extractor (or just "extractor") refers to all Value Extractors and Extractor Nodes. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).

Data Model: Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) define data extraction logic and settings for how data is reviewed in a Data Viewer.

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Extractor Type:

Field Class: Field Classes are NLP (natural language processing) based extractor nodes. They find values based on some natural language context near that value. Values are positively or negatively associated with text-based "features" nearby by training the extractor. During extraction, the extractor collects values based on these training weightings.

  • Field Classes are most useful when attempting to find values within the flow of natural language.
  • Field Classes can be configured to distinguish values within highly structured documents, but this type of extraction is better suited to simpler "extractor nodes" like Value Readers or Data Types.
  • Advances in large language models (LLMs) have largely made Field Classes obsolete. LLM-based extraction methods in Grooper (such as AI Extract) can achieve similar results with far less setup.

Key-Value List: Key-Value List is a Collation Provider option for Data Type extractors. Key-Value List matches instances where a key and a list of one or more values appear together on the document, adhering to a specific layout pattern.

Key-Value Pair: Key-Value Pair is a Collation Provider option for Data Type extractors. Key-Value Pair matches instances where a key is paired with a value on the document in a specific layout. Note: Key-Value Pair is an older technique in Grooper. In most cases, the Labeled Value extractor is preferable to Key-Value Pair collation.

Labeled Value: Labeled Value is a Value Extractor that identifies and extracts a value next to a label. This is one of the most commonly used extractors to extract data from structured documents (such as a standardized form) and static values on semi-structured documents (such as the header details on an invoice).

Ordered Array: Ordered Array is a Collation Provider option for Data Type extractors. Ordered Array finds sequences of values where one result is present for each extractor, in the order they appear, according to a specified horizontal, vertical or text-flow layout.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects, such as Content Models, Batch Processes, and profile objects, are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Regular Expression: Regular Expression (or regex) is a standard syntax designed to parse text strings. This is a way of finding information in text. It is the primary method by which Grooper extracts and returns data from documents.

Split: Split is a Collation Provider option for Data Type extractors. Split separates a data instance at each match returned by the Data Type. The results are used as anchor points to "split" text into one or more smaller parts.

Tab Marking: Tab Marking allows you to insert tab characters into a document's text data.

Value Reader: Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions).

  • Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.

About

A Data Extractor is written to find a value on a page (such as an invoice number on invoices or a report number on a report). This is set on the Value Extractor property. When the extractor returns a result on a page, the page is placed in a new folder, creating a new document. All subsequent pages returning the same value are included in the folder. Once a page is encountered returning a different value, a new Document Folder (and thus new document) is created.

If the extractor fails to produce a result, no folder will be created. The page will remain loose in the Batch and the provider will move on to the next page to check if its value is different from the last one produced. If this is not the desired result, the Miss Disposition property can be used to Append or Merge the pages to another folder.
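
The behavior described above can be sketched in a few lines of Python. This is only an illustration of the logic, not Grooper's implementation: the function name separate_by_value_change and the sample values are hypothetical, and the sketch assumes that a page with no result does not reset the "last value" used for comparison.

```python
from typing import Optional


def separate_by_value_change(page_values: list[Optional[str]]):
    """Group page indexes into folders, starting a new folder when the value changes.

    page_values holds the extracted value for each page, or None on a miss.
    Pages with no value stay loose (the default behavior described above).
    """
    folders: list[list[int]] = []
    loose: list[int] = []
    last_value: Optional[str] = None
    for index, value in enumerate(page_values):
        if value is None:
            # Miss: no folder is created and the page remains loose.
            loose.append(index)
            continue
        if value != last_value:
            # The value changed, so a new folder (document) starts here.
            folders.append([])
            last_value = value
        folders[-1].append(index)
    return folders, loose


# Hypothetical extracted values for a five-page Batch.
values = ["1001", "1001", "1002", None, "1003"]
folders, loose = separate_by_value_change(values)
print(folders)  # [[0, 1], [2], [4]]
print(loose)    # [3]
```

Pages 0 and 1 share a value and land in one folder, page 3 returns nothing and stays loose, and every change in value opens a new folder.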

How To

Setting the Provider

  1. In this example we have added a Separate Step to the Batch Process.
  2. We have set the Provider to Change in Value Separation.
  3. Click the hamburger menu to the right of the Value Extractor property.
  4. For this example we are going to use a Pattern Match.


  1. We have put in a Value Pattern of Report #: (\d+|[A-Z]\d{2}-\d{3}) to return the report numbers from the documents in our Batch.
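
To see what this pattern accepts, it can be tried outside Grooper. The sketch below uses Python's re module, whose syntax agrees with Grooper's for this particular pattern as far as we can tell. The helper name report_number and the sample strings are made up for illustration, and returning the captured group (rather than the whole match) is a choice made for this sketch.

```python
import re

# The Value Pattern from the step above: a plain number, or a letter,
# two digits, a hyphen, and three digits.
REPORT_NUMBER = re.compile(r"Report #: (\d+|[A-Z]\d{2}-\d{3})")


def report_number(page_text):
    """Return the report number found in the page text, or None on a miss."""
    match = REPORT_NUMBER.search(page_text)
    return match.group(1) if match else None


# Hypothetical page text, not taken from the sample Batch.
print(report_number("Report #: 48213"))               # 48213
print(report_number("Report #: B07-112"))             # B07-112
print(report_number("Continued from previous page"))  # None
```

The last call shows the case that matters for separation: a page without a "Report #:" label returns nothing.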


  1. When we run Separation, at first glance it looks like all of the documents were separated appropriately.


  1. If we look closer, we see that several pages were not separated into a folder and remain as loose pages.
  2. On page 2 and all subsequent pages of the fifth report, the report number is missing. Because the extractor returned nothing on these pages, Grooper had no value to compare and left them as loose pages.


The Miss Disposition Property

In the previous section we ended up with several pages that were not separated into folders and remained loose. This happened because the extractor returned no result on those pages, so Grooper had nothing to compare. In this section, we are going to look at how the Miss Disposition property can solve this problem for us.

  1. We are going to go back into our Separate Step.
  2. Take a look at the Miss Disposition property located under the "ACTIVITY PROPERTIES" panel. Click on the hamburger icon to access the drop-down menu.
  3. For this example, we are going to set the Miss Disposition to Append.


  1. With the Miss Disposition property set to Append, any page that does not return a result will be appended to the previous folder. Now when we run separation, these pages will be separated appropriately.
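
The effect of Append on our example can be sketched as follows. As before, this is an illustration and not Grooper's code: the function name separate_with_append and the page values are hypothetical, and leaving a miss loose when no folder exists yet is an assumption of this sketch.

```python
def separate_with_append(page_values):
    """Change-in-value grouping where misses join the previous folder."""
    folders = []
    loose = []
    last_value = None
    for index, value in enumerate(page_values):
        if value is None:
            if folders:
                # Append: a page with no result joins the previous folder.
                folders[-1].append(index)
            else:
                # Assumption: with no previous folder, the page stays loose.
                loose.append(index)
            continue
        if value != last_value:
            folders.append([])
            last_value = value
        folders[-1].append(index)
    return folders, loose


# Hypothetical values: the fifth report only shows its number on its first page.
values = ["101", "102", "103", "104", "105", None, None, "106"]
folders, loose = separate_with_append(values)
print(folders)  # [[0], [1], [2], [3], [4, 5, 6], [7]]
print(loose)    # []
```

The two pages that returned nothing now sit in the fifth report's folder instead of remaining loose.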

FYI

If you set the Miss Disposition property to Merge, it works the same way as Append, but an additional setting called Maximum Gap becomes available. Maximum Gap sets the maximum number of pages that can be appended to the folder.