2024:Flow Collation (Concept)

From Grooper Wiki
Revision as of 18:02, 20 November 2024 by Randallkinard (talk | contribs)

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

"Flow Collation" refers to the text-flow based layout option used by various Collation Providers for Data Type extractors.

This is particularly useful when processing natural language. The Flow Layout property and/or the Flow Combine Method is available to the following Collation Providers:

  • Key-Value Pair
  • Key-Value List
  • Array
  • Ordered Array
  • Combine

You may download and import the file below into your own Grooper environment (version 2024). This contains a Batch with the example document(s) discussed in this article and a Project containing a Content Model configured according to its instructions.

About

When extracting data from a document's text, there are three important Data Context relationships to consider:

Structural

  • Data always has a structure that indicates what that piece of data is. Specific characters in a specific order give data its structural syntax. For example, dates have a syntax that makes it obvious that what you're reading is a date. When you see 12/25/2020, you instantly know this collection of digits and slashes is a date. This is because two digits, a slash, two digits, another slash, and four digits form a standard month/day/year format. Without the slashes, it's less clear. "12252020" could be a date, but it could also just be a string of eight digits.
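
The structural idea above can be sketched with a regular expression. This is a minimal illustration in Python, not Grooper's extractor syntax; the pattern and the names `DATE_PATTERN` and `looks_like_date` are assumptions made for this example.

```python
import re

# Hypothetical pattern: two digits, slash, two digits, slash, four digits.
DATE_PATTERN = re.compile(r"\d{2}/\d{2}/\d{4}")

def looks_like_date(text: str) -> bool:
    """Return True when the whole string has the MM/DD/YYYY structure."""
    return DATE_PATTERN.fullmatch(text) is not None

with_slashes = looks_like_date("12/25/2020")    # slashes give it date syntax
without_slashes = looks_like_date("12252020")   # just eight digits
print(with_slashes, without_slashes)
```

The slashes are what let the pattern separate a date from an arbitrary run of digits.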

Semantic

  • Words themselves follow a syntax. The letters "a" through "z" are the characters in that syntax. However, the character strings "turtle" and "uttelr" are two different things. One is a semi-aquatic reptile. The other is nonsense. The characters "turtle" mean something. They have semantic value. You can use the semantic relationships between pieces of text to target the data you want to return.
  • For example, "Date: 12252020"
    • Without the slashes, "12252020" may be a date or just a bunch of numbers. However, it's clearly a date if you see the word "Date" in front of it. You're using the semantic value of the word "date" to understand that string of digits.
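
A label can be used the same way in a pattern. The sketch below is a plain Python illustration, not a Grooper configuration, and the names `LABELED_DATE` and `find_labeled_date` are hypothetical. It only accepts the eight digits when the word "Date" comes right before them.

```python
import re

# Hypothetical pattern: the label "Date:" followed by eight digits.
LABELED_DATE = re.compile(r"Date:\s*(\d{8})")

def find_labeled_date(text: str):
    """Return the eight digits that follow a "Date:" label, or None."""
    match = LABELED_DATE.search(text)
    return match.group(1) if match else None

labeled = find_labeled_date("Date: 12252020")   # label supplies the meaning
unlabeled = find_labeled_date("Ref 12252020")   # same digits, no label
print(labeled, unlabeled)
```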

Spatial

  • Spatial relationships refer to how the layout of text on a document informs the meaning of specific data elements. How a label is positioned next to a value provides the context for understanding that value. Its position next to the value uses a spatial relationship. For example, documents may call out data elements horizontally or vertically.
    • Horizontal layout (label and value on the same line):
      Date: 12252020
    • Vertical layout (label above the value):
      Date:
      12252020
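
The horizontal and vertical cases can be approximated in a few lines of Python. This sketch works on lines of text rather than on page coordinates, which is a simplifying assumption, and the function name `value_for_label` is hypothetical.

```python
def value_for_label(lines, label):
    """Look for a value to the right of a label, then directly below it."""
    for i, line in enumerate(lines):
        if label in line:
            right = line.split(label, 1)[1].strip()
            if right:
                return ("horizontal", right)
            # Nothing to the right: fall back to the next line down.
            if i + 1 < len(lines) and lines[i + 1].strip():
                return ("vertical", lines[i + 1].strip())
    return None

horizontal = value_for_label(["Date: 12252020"], "Date:")
vertical = value_for_label(["Date:", "12252020"], "Date:")
print(horizontal, vertical)
```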

Using Flow to Find Values in a Text Block

Understanding these relationships is important to understanding how to target and return the values you want from your text. Flow-based methods use a document's text flow as its spatial relationship. English reads left to right from the top of the page to the bottom. While you probably don't even think about it when you're reading, this spatial relationship of characters and words is critical to understanding what you're reading.

Take this intro paragraph from the Wikipedia entry on Linnaeus's two-toed sloths:


If we, as readers, want to know where these sloths live, it's relatively easy, right? We just read along until we find the word "from" followed by the region "South America".


Flow-based collation methods work much the same way. We could set up a Data Type in Grooper using Key-Value Pair Collation with Flow Layout enabled. The Key extractor would locate the word "from" on the document. The Value extractor would locate the region, "South America". Using Flow Layout, the Data Type reads through the text much like you would as an English reader. From the point the Key "from" is located, the Data Type steps through the characters going left to right and down the page looking for the Value "South America".

  1. For Data Type objects...
  2. ...the Flow Layout property can be Enabled when using specific Collation settings, like Key-Value Pair.
  3. Rather than using the horizontal or vertical alignment of a Key next to a Value, it uses the flow of the text data.
  4. With the Value coming after the Key in the text, it is returned as a result.


In the example above, the value was right next to the key, but the beauty of the Key-Value Pair Collation's flow layout is that it can be set up to crawl past multiple characters until it finds the value it's looking for.

  • Consider the following text snippet:
"...from tropical rainforests in northern South America."
  • In this case the Key-Value Pair travels along the text flow until it finds the value, passing multiple characters until the correct value is located.
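
The search along the text flow can be sketched in Python. This is an illustration of the idea only, not Grooper's implementation. The function `find_value_after_key` and its `max_distance` parameter are hypothetical names, and the text is treated as one string in reading order.

```python
import re

def find_value_after_key(text, key_pattern, value_pattern, max_distance):
    """Return the first value that follows a key in reading order.

    max_distance is the number of characters allowed between the end of
    the key and the start of the value.
    """
    for key in re.finditer(key_pattern, text):
        # Only look at text that comes after the key in the flow.
        value = re.search(value_pattern, text[key.end():])
        if value and value.start() <= max_distance:
            return value.group(0)
    return None

snippet = "...from tropical rainforests in northern South America."
near = find_value_after_key(snippet, r"from", r"South America", 40)
too_far = find_value_after_key(snippet, r"from", r"South America", 10)
print(near, too_far)
```

In this snippet 34 characters sit between the key and the value, so a limit of 40 finds the value and a limit of 10 does not.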


For more information on how to set up a Key-Value Pair using Flow Layout, see the How To section of this article.

The Array, Ordered Array, and Key-Value List Collation Providers can also utilize this Flow Layout.

Using Flow to Combine

The "flow" method can also return the text between two or more values in a text flow. The Combine, Array, and Ordered Array Collation Providers all have a "Combine Method" property. Setting this property to "Flow" will combine all the characters between the returned values in the text flow.

  1. Any Collation Provider with a Combine Method property can be set to Flow.
  2. Consider the following text snippet:
"...from tropical rainforests in northern South America."
    • "from" is found as well as "South America".
  3. The Flow Combine Method returns each result and every character between them in the text flow, producing "from tropical rainforests in northern South America" as one result.


This method is useful to return larger sections of text where you can anchor off individual values within the full document.
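
Combining by flow can be sketched as taking everything from the start of the first match to the end of the last one. The Python below illustrates that idea under this assumption. It is not Grooper's implementation, and `combine_by_flow` is a hypothetical name.

```python
import re

def combine_by_flow(text, patterns):
    """Return the span of text from the first match to the last match."""
    matches = [m for p in patterns for m in re.finditer(p, text)]
    if not matches:
        return None
    start = min(m.start() for m in matches)
    end = max(m.end() for m in matches)
    return text[start:end]

snippet = "...from tropical rainforests in northern South America."
combined = combine_by_flow(snippet, [r"from", r"South America"])
print(combined)
```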

Use Cases

Flow Collation methods are part of Grooper's natural language processing solution. Unstructured documents, such as contracts, use language to define data, such as the terms in the contract. Natural language presents several challenges for data extraction, including the fact that data may appear within a paragraph in ways that are not easily predictable.

What you can predict, however, is that the text will follow a normal lexical flow. For documents in English, you know text will always read left to right and top to bottom. Flow methods utilize this structure for various data extraction purposes.

For example, take the two oil and gas leases below. The term "Lessee" should relate to a particular person, company, or other party. This could be an opportunity to use Key-Value Pair collation, using "Lessee" as the key to find the party's name. Using the Vertical or Horizontal layouts would not be adequate methods to find the lessee in these contracts. There's no guarantee that the party will be listed beside a key vertically or horizontally.

However, we can infer that the word "Lessee" comes after the lessee party in the contract. Using Flow layout, we can take advantage of this and return the parties.

How To

For this example we will configure a Data Type with Key-Value Pair Collation using Flow Layout. It will have two child Value Readers. The first Value Reader will serve as the "Key" and use a List Match extractor type to find a key phrase within an oil and gas lease document. The second Value Reader will serve as the "Value" and reference another extractor pre-configured to find date patterns.

With this configuration the parent Data Type will find a specific kind of date within the flow of language of the provided document.

Create and Configure a Data Type

  1. Right-click on your Project, or a folder within the Project.
  2. From the pop-out menu select "Add > Data Type".
  3. In the "Add" window name the Data Type.
  4. Click the "Execute" button.


  1. With the Data Type created, click the drop-down button for the Collation property.
  2. From the drop-down menu, select Key-Value Pair.


  1. Click the drop-down arrow to the left of the Collation property to expose the sub-properties of the Key-Value Pair collation method.
  2. Click the check box on the Flow Layout property to enable it.
  3. Set the Maximum Character Distance property to 50.
    • This property allows the collation to "skip" over a number of characters that may exist between the Key and the Value. This integer value is arbitrary, but it should be a close approximation of the number of characters you expect to exist between the Key and a potential Value.
    • You may instead want to use the Separator Expression property. Here, you can set a regular expression that will match a pattern of characters you expect to exist between the Key and a potential Value. You could use a pattern like .* to match zero or more characters of any kind.
  4. Click the "Save" button to save all changes.
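
The difference between limiting the gap by length and limiting it by pattern can be sketched in Python. This is an illustration only. How Grooper evaluates these two properties internally is not described here, so the sketch simply treats them as alternatives, and the names `gap_is_allowed`, `max_distance`, and `separator` are hypothetical stand-ins.

```python
import re

def gap_is_allowed(gap, max_distance=None, separator=None):
    """Decide whether the text between a key and a value is acceptable.

    gap: the characters found between the key and the value.
    max_distance: stand-in for a Maximum Character Distance setting.
    separator: stand-in for a Separator Expression (a regex).
    """
    if separator is not None:
        # DOTALL lets "." also match line breaks inside the gap.
        return re.fullmatch(separator, gap, flags=re.DOTALL) is not None
    if max_distance is not None:
        return len(gap) <= max_distance
    return gap == ""

short_gap = gap_is_allowed(" this ", max_distance=50)
long_gap = gap_is_allowed("x" * 60, max_distance=50)
any_chars = gap_is_allowed(" this ", separator=r".*")
spaces_only = gap_is_allowed(" this ", separator=r"\s*")
print(short_gap, long_gap, any_chars, spaces_only)
```

A length limit is simple to tune, while a pattern lets you say what kind of text may sit between the Key and the Value.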

Add Child "Key" Extractor

  1. Right-click the newly created and configured Data Type.
  2. From the pop-out menu select "Add > Value Reader".
  3. In the "Add" window name the Value Reader.
  4. Click the "Execute" button.


  1. With the Value Reader created, click the drop-down for the Extractor property.
  2. From the drop-down menu, select List Match.
    • This configuration is specific to this example. This extractor could be configured to get just about anything, so feel free to configure this extractor to target your specific data.


  1. With the Extractor property set, click the "Save" button to save changes.
  2. Click the ellipsis button on the Extractor property to open the "Extractor" window.


  1. In the "Extractor" window, select an appropriate Document from an appropriate Batch from the Batch Viewer.
  2. Enter the desired pattern into the Local Entries field.
    • For the purposes of this example that pattern will be:
made and entered into
  3. The result will be highlighted in the Document Viewer, and displayed in the Results list.
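
Matching a list of literal phrases can be sketched as follows. This Python is only an illustration of the idea, not the List Match extractor itself. The function `list_match` and the sample lease text are hypothetical.

```python
import re

def list_match(text, entries):
    """Return (start, phrase) pairs for each literal entry found in text."""
    results = []
    for entry in entries:
        # re.escape treats the entry as literal text, not as a regex.
        for m in re.finditer(re.escape(entry), text):
            results.append((m.start(), m.group(0)))
    return sorted(results)

# Hypothetical sample text, not taken from the example Batch.
lease = "THIS AGREEMENT, made and entered into on 03/05/2019, by and between the parties."
keys = list_match(lease, ["made and entered into"])
print(keys)
```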

Add Child "Value" Extractor

  1. Right-click the parent Data Type.
  2. From the pop-out menu select "Add > Value Reader".
  3. In the "Add" window name the Value Reader.
  4. Click the "Execute" button.


  1. With the Value Reader created, click the drop-down for the Extractor property.
  2. From the drop-down menu, select Reference.
    • This configuration is specific to this example. This extractor could be configured to get just about anything, so feel free to configure this extractor to target your specific data.


  1. With the Extractor property set to Reference, click the drop-down arrow to the left of the property to expose its sub-properties.
  2. Click the drop-down button on the Extractor sub-property.
  3. Select an extractor for this Value Reader to reference. It should be configured to extract the type of data this extractor is meant to collect.
    • For the purposes of this example, the provided Value Reader named "VE_Date" will work.

Test the Results

  1. Select the parent Data Type.
  2. Click on the "Tester" tab.
  3. Be sure an appropriate Document is selected from an appropriate Batch in the Batch Viewer.
  4. Be sure the "Auto Extract" toggle is activated.
  5. In the Document Viewer, you will see a blue box around the "key" value and a green highlight around the returned value. The characters skipped because of the Maximum Character Distance property will not be highlighted.
  6. The returned value will also be displayed in the Result List.
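
The configuration built in this section can be approximated end to end in Python. This is a rough sketch under stated assumptions, not Grooper's behavior. The date regex is a hypothetical stand-in for "VE_Date", the lease text is invented for illustration, and `run_key_value_pair` is a made-up name. The returned parts mirror what the tester shows: the key, the skipped characters, and the value.

```python
import re

KEY = re.escape("made and entered into")   # the List Match entry
DATE = r"\d{2}/\d{2}/\d{4}"                # hypothetical stand-in for VE_Date
MAX_DISTANCE = 50                          # Maximum Character Distance

def run_key_value_pair(text):
    """Return the key, the skipped characters, and the value, or None."""
    for key in re.finditer(KEY, text):
        rest = text[key.end():]
        value = re.search(DATE, rest)
        if value and value.start() <= MAX_DISTANCE:
            return {
                "key": key.group(0),
                "skipped": rest[:value.start()],
                "value": value.group(0),
            }
    return None

# Hypothetical sample text, not taken from the example Batch.
lease = "THIS AGREEMENT, made and entered into on 03/05/2019, by and between the parties."
result = run_key_value_pair(lease)
print(result)
```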

Glossary

Array: Array is a Collation Provider option for Data Type extractors. Array matches a list of values arranged in horizontal, vertical, or text-flow order, combining instances that qualify into a single result.

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Combine: Combine is a Collation Provider option for Data Type extractors. Combine combines instances from returned results based on a specified grouping, controlling how extractor results are assembled together for output.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Data Context: Data Context refers to contextual information used to extract data, such as a label that identifies the value you want to collect.

Data Model: Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) defines data extraction logic and settings for how data is reviewed in a Data Viewer.

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift, and manipulate results further.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Flow Collation: "Flow Collation" refers to the text-flow based layout option used by various Collation Providers for Data Type extractors.

Key-Value List: Key-Value List is a Collation Provider option for Data Type extractors. Key-Value List matches instances where a key and a list of one or more values appear together on the document, adhering to a specific layout pattern.

Key-Value Pair: Key-Value Pair is a Collation Provider option for Data Type extractors. Key-Value Pair matches instances where a key is paired with a value on the document in a specific layout. Note: Key-Value Pair is an older technique in Grooper. In most cases, the Labeled Value extractor is preferable to Key-Value Pair collation.

Node Tree: The Node Tree is the hierarchical list of Grooper node objects found in the left panel in the Design Page. It is the basis for navigation and creation in the Design Page.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized, in order to be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.

Ordered Array: Ordered Array is a Collation Provider option for Data Type extractors. Ordered Array finds sequences of values where one result is present for each extractor, in the order they appear, according to a specified horizontal, vertical, or text-flow layout.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects, such as Content Models, Batch Processes, and profile objects, are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Reference: Reference is a Value Extractor used to reference an Extractor Node. This allows users to create re-usable extractors and use the more complex Data Type and Field Class extractors throughout Grooper.

Test Batch: "Test Batch" is a specialized Import Provider designed to facilitate the import of content from an existing Batch in the test environment. This provider is most commonly used for testing, development, and validation scenarios, and is not intended for production use.

  • Looking for information on "production" vs "test" Batches in Grooper? See here.

Value Reader: Value Reader nodes define a single data extraction operation. Each Value Reader executes a single Value Extractor configuration. The Value Extractor determines the logic for returning data from a text-based document or page. (Example: Pattern Match is a Value Extractor that returns data using regular expressions.)

  • Value Readers can be used on their own or in conjunction with Data Types for more complex data extraction and collation.