2024:Flow Collation (Concept)

From Grooper Wiki
Revision as of 16:41, 27 August 2025 by Dgreenwood

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


"Flow Collation" refers to the text-flow based layout option used by various Collation Providers for Data Type extractors.

This is particularly useful when processing natural language. The Flow Layout property and/or the Flow Combine Method is available to the following Collation Providers: Key-Value Pair, Key-Value List, Array, Ordered Array, and Combine.

You may download and import the file below into your own Grooper environment (version 2024). This contains a Batch with the example document(s) discussed in this article and a Project containing a Content Model configured according to its instructions.

About

When extracting data from a document's text, there are three important Data Context relationships to consider:

Structural

  • Data always has a structure that indicates what that piece of data is. Specific characters in a specific order give data its structural syntax. For example, dates have a syntax that makes it obvious you are reading a date. When you see 12/25/2020, you instantly know this collection of digits and slashes is a date. This is because two digits, a slash, two digits, another slash, and four digits form a standard month, day, year format. Without the slashes, it's less clear. "12252020" could be a date, but it could also just be a string of eight digits.
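
The structural point above can be illustrated outside of Grooper with a plain regular expression. This is only a sketch in Python, not Grooper's own extractor syntax; the names DATE_WITH_SLASHES, EIGHT_DIGITS, and looks_like_date are made up for illustration.

```python
import re

# Two digits, slash, two digits, slash, four digits: the structural syntax of a date.
DATE_WITH_SLASHES = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")
# Eight bare digits: structurally ambiguous (a date, an ID, or anything else).
EIGHT_DIGITS = re.compile(r"\b\d{8}\b")


def looks_like_date(text: str) -> bool:
    """Return True only when the slash-delimited date syntax is present."""
    return DATE_WITH_SLASHES.search(text) is not None


print(looks_like_date("12/25/2020"))  # True
print(looks_like_date("12252020"))    # False: the structure alone is not enough
print(EIGHT_DIGITS.search("12252020") is not None)  # True: it is still eight digits
```

The second call shows why structure alone can fall short: the same digits without slashes no longer match the date syntax.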

Semantic

  • Words themselves follow a syntax. The letters "a" through "z" are the characters in that syntax. However, the character strings "turtle" and "uttelr" are two different things. One is a semi-aquatic reptile. The other is nonsense. The characters "turtle" mean something. They have semantic value. You can use the semantic relationships between pieces of text to target the data you want to return.
  • For example, "Date: 12252020"
    • Without the slashes, "12252020" may be a date or just a bunch of numbers. However, it's clearly a date if you see the word "Date" in front of it. You're using the semantic value of the word "date" to understand that string of digits.
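
A rough Python sketch of the same idea: the label supplies the meaning that the bare digits lack. The pattern and the function name find_labeled_date are illustrative assumptions, not part of Grooper.

```python
import re
from typing import Optional

# Hypothetical pattern: the word "Date", a colon, optional spaces, then eight digits.
LABELED_DATE = re.compile(r"\bDate:\s*(\d{8})\b", re.IGNORECASE)


def find_labeled_date(text: str) -> Optional[str]:
    """Return the eight digits only when the label "Date" gives them meaning."""
    match = LABELED_DATE.search(text)
    return match.group(1) if match else None


print(find_labeled_date("Date: 12252020"))  # 12252020
print(find_labeled_date("Ref 12252020"))    # None: no semantic cue
```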

Spatial

  • Spatial relationships refer to how the layout of text on a document informs the meaning of specific data elements. How a label is positioned relative to a value provides the context for understanding that value. For example, documents may call out data elements horizontally or vertically.
    • Horizontal: "Date: 12252020" (the label and the value sit on the same line)
    • Vertical: "Date:" on one line, with "12252020" on the line directly beneath it

Using Flow to Find Values in a Text Block

Understanding these relationships is important to understanding how to target and return the values you want from your text. Flow based methods use a document's text flow as the spatial relationship. English reads left to right from the top of the page to the bottom. While you probably don't even think about it when you're reading, this spatial relationship of characters and words is critical to understanding what you're reading.

Take this intro paragraph from the Wikipedia entry on Linnaeus's two-toed sloths:


If we, as readers, want to know where these sloths live, it's relatively easy, right? We just read along until we find the word "from" and then the region "South America".


Flow based collation methods work much the same way. We could set up a Data Type in Grooper using Key-Value Pair Collation with Flow Layout Enabled. The Key extractor would locate the phrase "from" on the document. The Value extractor would locate the region, "South America". Using Flow Layout, the Data Type reads through the text much like you would as an English reader. From the point the Key "from" is located, the Data Type steps through the characters going left to right and down the page looking for the Value "South America".

  1. For Data Type objects...
  2. ...the Flow Layout property can be Enabled when using specific Collation settings, like Key-Value Pair.
  3. Rather than using the horizontal or vertical alignment of a Key next to a Value, it uses the flow of the text data.
  4. With the Value coming after the Key in the text, it is returned as a result.


In the example above, the value was right next to the key, but the beauty of the Key-Value Pair Collation's Flow Layout is that it can be set up to move past multiple characters until it finds the value it's looking for.

  • Consider the following text snippet:
"...from tropical rainforests in northern South America."
  • In this case the Key-Value Pair travels along the text flow until it finds the value, passing multiple characters until the correct value is located.
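
The behavior described above can be approximated in a few lines of Python. This is a minimal sketch of the idea, not Grooper's implementation: the function flow_key_value and its max_distance parameter are hypothetical, and line breaks are simply collapsed into spaces to stand in for reading order.

```python
import re
from typing import Optional


def flow_key_value(text: str, key: str, value_pattern: str,
                   max_distance: Optional[int] = None) -> Optional[str]:
    """Find the first value that follows the key in reading order.

    max_distance (hypothetical) caps how many characters may sit between
    the end of the key and the start of the value.
    """
    # Collapse line breaks so the text is one left-to-right, top-to-bottom flow.
    flow = " ".join(text.split())
    key_match = re.search(re.escape(key), flow)
    if key_match is None:
        return None
    # Only look for the value after the key, following the text flow.
    value_match = re.compile(value_pattern).search(flow, key_match.end())
    if value_match is None:
        return None
    gap = value_match.start() - key_match.end()
    if max_distance is not None and gap > max_distance:
        return None
    return value_match.group(0)


snippet = "...from tropical rainforests\nin northern South America."
print(flow_key_value(snippet, "from", r"South America"))
print(flow_key_value(snippet, "from", r"South America", max_distance=10))
```

The first call returns "South America" even though the key and value sit on different lines and are separated by 34 characters. The second call returns None because the gap exceeds the 10-character limit.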


For more information on how to set up a Key-Value Pair using Flow Layout, see the How To section of this article.

The Array, Ordered Array, and Key-Value List Collation Providers can also utilize this Flow Layout.

Using Flow to Combine

The "flow" method can also return the text between two or more values in a text flow. The Combine, Array, and Ordered Array Collation Providers all have a "Combine Method" property. Setting this property to "Flow" will combine the returned values and all the characters between them in the text flow.

  1. Any Collation Provider with a Combine Method property can be set to Flow.
  2. Consider the following text snippet:
"...from tropical rainforests in northern South America."
    • "from" is found as well as "South America".
  3. The Flow Combine Method returns each result and every character between them in the text flow, producing the text snippet given above as one result.
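
As a rough illustration, combining by flow amounts to taking everything from the start of the earliest match to the end of the latest one. The sketch below assumes that reading of the property; flow_combine is a hypothetical helper, not a Grooper API.

```python
import re
from typing import List, Optional


def flow_combine(text: str, patterns: List[str]) -> Optional[str]:
    """Return the matched values plus every character between them."""
    flow = " ".join(text.split())
    spans = []
    for pattern in patterns:
        match = re.search(pattern, flow)
        if match is None:
            return None  # every anchor must be found for the combine to succeed
        spans.append(match.span())
    start = min(begin for begin, _ in spans)
    end = max(finish for _, finish in spans)
    return flow[start:end]


snippet = "...from tropical rainforests in northern South America."
print(flow_combine(snippet, [r"from", r"South America"]))
# from tropical rainforests in northern South America
```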


This method is useful to return larger sections of text where you can anchor off individual values within the full document.

Use Cases

Flow Collation methods are part of Grooper's natural language processing solution. Unstructured documents, such as contracts, use language to define data, such as the terms in the contract. Natural language presents several challenges for data extraction. Chief among them is that a given piece of data can appear in many different positions within a paragraph, and those positions are not easily predictable.

What you can predict, however, is that the text will follow a normal lexical flow. For documents in English, you know text will always read left to right and top to bottom. Flow methods utilize this structure for various data extraction purposes.

For example, take the two oil and gas leases below. The term "Lessee" should relate to a particular person, company, or other party. This could be an opportunity to use Key-Value Pair collation, using "Lessee" as the key to find the party's name. Using the Vertical or Horizontal layouts would not be adequate methods to find the lessee in these contracts. There's no guarantee that the party will be listed beside a key vertically or horizontally.

However, we can infer that the word "Lessee" comes after the lessee party in the contract. Using Flow layout, we can take advantage of this and return the parties.

How To

For this example we will configure a Data Type with Key-Value Pair Collation using Flow Layout. It will have two child Value Readers. The first Value Reader will serve as the "Key" and use a List Match extractor to find a key phrase within an oil and gas lease document. The second Value Reader will serve as the "Value" and reference another extractor pre-configured to find date patterns.

With this configuration the parent Data Type will find a specific kind of date within the flow of language of the provided document.

Create and Configure a Data Type

  1. Right-click on your Project, or a folder within the Project.
  2. From the pop-out menu select "Add > Data Type".
  3. In the "Add" window name the Data Type.
  4. Click the "Execute" button.


  1. With the Data Type created, click the drop-down button for the Collation property.
  2. From the drop-down menu, select Key-Value Pair.


  1. Click the drop-down arrow to the left of the Collation property to expose the sub-properties of the Key-Value Pair collation method.
  2. Click the check box on the Flow Layout property to enable it.
  3. Set the Maximum Character Distance property to 50.
    • This property lets the extractor "skip" over a number of characters that may exist between the Key and the Value. The integer value of 50 is arbitrary, but it should be a close approximation of the number of characters you expect to exist between the Key and a potential Value.
    • You may instead want to use the Separator Expression property. Here, you can set a regular expression that will match a pattern of characters you expect to exist between the Key and a potential Value. You could use a pattern like .* to match zero to many characters of any kind.
  4. Click the "Save" button to save all changes.
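
To make the difference between the two properties concrete, here is a sketch in plain Python regular expressions. It assumes Maximum Character Distance behaves like a bounded gap and Separator Expression behaves like a pattern that must fill the gap exactly; both functions and the sample sentence are hypothetical and are not Grooper code.

```python
import re
from typing import Optional

DATE = r"\d{2}/\d{2}/\d{4}"  # illustrative date pattern


def match_within_distance(flow: str, key: str, value: str,
                          max_distance: int) -> Optional[str]:
    """Allow up to max_distance characters of any kind between key and value."""
    pattern = re.compile(
        f"{re.escape(key)}.{{0,{max_distance}}}?({value})", re.DOTALL)
    match = pattern.search(flow)
    return match.group(1) if match else None


def match_with_separator(flow: str, key: str, separator: str,
                         value: str) -> Optional[str]:
    """Require the separator expression to match everything between key and value."""
    pattern = re.compile(
        f"{re.escape(key)}(?:{separator})({value})", re.DOTALL)
    match = pattern.search(flow)
    return match.group(1) if match else None


text = "This lease is made and entered into on 05/03/2021 by the parties."
key = "made and entered into"
print(match_within_distance(text, key, DATE, 50))       # 05/03/2021
print(match_within_distance(text, key, DATE, 2))        # None: gap is 4 characters
print(match_with_separator(text, key, r"\s+on\s+", DATE))  # 05/03/2021
print(match_with_separator(text, key, r"\s+", DATE))       # None: "on" is not whitespace
```

A distance limit is the simpler choice when the intervening text varies. A separator expression is stricter and is useful when you know what the intervening text should look like.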

Add Child "Key" Extractor

  1. Right-click the newly created and configured Data Type.
  2. From the pop-out menu select "Add > Value Reader".
  3. In the "Add" window name the Value Reader.
  4. Click the "Execute" button.


  1. With the Value Reader created, click the drop-down for the Extractor property.
  2. From the drop-down menu, select List Match.
    • This configuration is specific to this example. This extractor could be configured to get just about anything, so feel free to configure this extractor to target your specific data.


  1. With the Extractor property set, click the "Save" button to save changes.
  2. Click the ellipsis button on the Extractor property to open the "Extractor" window.


  1. In the "Extractor" window, select an appropriate Document from an appropriate Batch from the Batch Viewer.
  2. Enter the desired pattern into the Local Entries field.
    • For the purposes of this example that pattern will be:
made and entered into
  3. The result will be highlighted in the Document Viewer, and displayed in the Results list.

Add Child "Value" Extractor

  1. Right-click the parent Data Type.
  2. From the pop-out menu select "Add > Value Reader".
  3. In the "Add" window name the Value Reader.
  4. Click the "Execute" button.


  1. With the Value Reader created, click the drop-down for the Extractor property.
  2. From the drop-down menu, select Reference.
    • This configuration is specific to this example. This extractor could be configured to get just about anything, so feel free to configure this extractor to target your specific data.


  1. With the Extractor property set to Reference, click the drop-down arrow to the left of the property to expose its sub-properties.
  2. Click the drop-down button on the Extractor sub-property.
  3. Select an extractor for this Value Reader to reference. It should be configured to extract the type of data this extractor is meant to collect.
    • For the purposes of this example, the provided Value Reader titled "VE_Date" will work.

Test the Results

  1. Select the parent Data Type.
  2. Click on the "Tester" tab.
  3. Be sure an appropriate Document is selected from an appropriate Batch in the Batch Viewer.
  4. Be sure the "Auto Extract" toggle is activated.
  5. In the Document Viewer, a blue box surrounds the "key" value and a green highlight surrounds the returned value. The characters skipped under the Maximum Character Distance property are not highlighted.
  6. The returned value will also be displayed in the Result List.
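
For readers who want to reason about the expected result before opening the Tester, the whole configuration can be mimicked in Python. Everything here is an assumption made for illustration: the date pattern is only a stand-in for whatever "VE_Date" actually matches, the lease sentence is invented, and extract is not a Grooper function.

```python
import re
from typing import NamedTuple, Optional


class FlowResult(NamedTuple):
    key: str       # what the Tester would outline in blue
    skipped: str   # characters passed over, not highlighted
    value: str     # what the Tester would highlight in green


KEY_ENTRIES = ["made and entered into"]                   # the List Match entry
DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")   # stand-in for VE_Date
MAX_CHARACTER_DISTANCE = 50


def extract(text: str) -> Optional[FlowResult]:
    """Find a key, then the first date that follows it within the distance limit."""
    flow = " ".join(text.split())
    for entry in KEY_ENTRIES:
        key_at = flow.find(entry)
        if key_at == -1:
            continue
        key_end = key_at + len(entry)
        match = DATE_PATTERN.search(flow, key_end)
        if match and match.start() - key_end <= MAX_CHARACTER_DISTANCE:
            return FlowResult(entry, flow[key_end:match.start()], match.group(0))
    return None


# Invented sample text, wrapped across two lines like a scanned lease.
lease = "THIS AGREEMENT, made and entered into\nas of 03/15/2019, by and between ..."
print(extract(lease))
```

With this sample, the key is "made and entered into", the skipped text is " as of ", and the returned value is "03/15/2019".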