2.80:Flow Collation (Concept)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

"Flow Collation" refers to the text-flow based layout option used by various Collation Providers for Data Type extractors.

This is particularly useful when processing natural language. The Flow Layout property and/or Flow Combine Method is available to the following Collation Providers:

Glossary

Array: Array is a Collation Provider option for Data Type extractors. Array matches a list of values arranged in horizontal, vertical, or text-flow order, combining instances that qualify into a single result.

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Collation Provider: The Collation property of a Data Type defines the method for converting its raw results into a final result set. It is configured by selecting a Collation Provider. The Collation Provider governs how initial matches from the Data Type's extractor(s) are combined and interpreted to produce the Data Type's final output.

Combine: Combine is a Collation Provider option for Data Type extractors. Combine merges instances from returned results based on a specified grouping, controlling how extractor results are assembled together for output.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Data Context: Data Context refers to contextual information used to extract data, such as a label that identifies the value you want to collect.

Data Model: Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) defines data extraction logic and settings for how data is reviewed in a Data Viewer.

Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

Flow Collation: "Flow Collation" refers to the text-flow based layout option used by various Collation Providers for Data Type extractors.

Key-Value List: Key-Value List is a Collation Provider option for Data Type extractors. Key-Value List matches instances where a key and a list of one or more values appear together on the document, adhering to a specific layout pattern.

Key-Value Pair: Key-Value Pair is a Collation Provider option for Data Type extractors. Key-Value Pair matches instances where a key is paired with a value on the document in a specific layout. Note: Key-Value Pair is an older technique in Grooper. In most cases, the Labeled Value extractor is preferable to Key-Value Pair collation.

Node Tree: The Node Tree is the hierarchical list of Grooper node objects found in the left panel in the Design Page. It is the basis for navigation and creation in the Design Page.

OCR: OCR stands for Optical Character Recognition. It allows text on paper documents to be digitized so that it can be searched or edited by other software applications. OCR converts typed or printed text from digital images of physical documents into machine-readable, encoded text.

Ordered Array: Ordered Array is a Collation Provider option for Data Type extractors. Ordered Array finds sequences of values where one result is present for each extractor, in the order they appear, according to a specified horizontal, vertical or text-flow layout.

Reference: Reference is a Value Extractor used to reference an Extractor Node. This allows users to create re-usable extractors and use the more complex Data Type and Field Class extractors throughout Grooper.

Test Batch: "Test Batch" is a specialized Import Provider designed to facilitate the import of content from an existing Batch in the test environment. This provider is most commonly used for testing, development, and validation scenarios, and is not intended for production use.

  • Looking for information on "production" vs "test" Batches in Grooper? See here.

About

When extracting data from a document's text, there are three important Data Context relationships to consider:

Structural

  • Data always has a structure to it that indicates what that piece of data is. Specific characters in a specific order give data its structural syntax. For example, dates have a certain syntax that makes it obvious what you're reading is a date. When you see 12/25/2020, you instantly know this collection of numbers and slashes is a date. This is because the syntax of two digits, a slash, two digits, a slash, and four digits is a standard month, day, year format that makes it clear you're looking at a date. Without the slashes, it's less clear. "12252020" could be a date, but it could also just be a string of eight digits.

Semantic

  • Words themselves follow a syntax. The letters "a" through "z" are the characters in that syntax. However, the character strings "turtle" and "uttelr" are two different things. One is a semi-aquatic reptile. The other is nonsense. The characters "turtle" mean something. They have semantic value. You can use the semantic relationships between pieces of text to target the data you want to return.
  • For example, "Date: 12252020"
    • Without the slashes, "12252020" may be a date or just a bunch of numbers. However, it's clearly a date if you see the word "Date" in front of it. You're using the semantic value of the word "date" to understand that string of digits.

Spatial

  • Spatial relationships refer to how the layout of text on a document informs the meaning of specific data elements. How a label is positioned next to a value provides the context for understanding that value. Its position next to the value uses a spatial relationship. For example, documents may call out data elements horizontally or vertically.
Horizontal        Vertical
Date: 12252020    Date:
                  12252020

Using Flow to Find Values in a Text Block

Understanding these relationships is important to understanding how to target and return the values you want from your text. Flow based methods use a document's text flow as the spatial relationship. English reads left to right from the top of the page to the bottom. While you probably don't even think about it when you're reading, this spatial relationship of characters and words is critical to understanding what you're reading.

Take this intro paragraph from the Wikipedia entry on Linnaeus's two-toed sloths:



If we, as readers, want to know where these sloths live, it's relatively easy, right? We just read along until we find the words "found in" and the region "South America".



Flow based collation methods work much the same way. We could set up a Data Type in Grooper using Key-Value Pair collation in Flow Layout mode. The Key extractor would locate the phrase "found in" on the document. The Value extractor would locate the region, "South America". Using Flow Layout, the Data Type reads through the text much like you would as an English reader. From the point the Key "found in" is located, the Data Type steps through the characters going left to right and down the page looking for the Value "South America".



In the example above, the value was right next to the key, but the beauty of Key-Value Pair collation's Flow Layout is that it can be set up to scan past multiple characters until it finds the value it's looking for.



For more information on how to set up a Key-Value Pair using Flow Layout, see the How To section of this article.

The Array, Ordered Array, and Key-Value List Collation Providers can also utilize this Flow Layout.
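
The key-then-value scan described above can be approximated outside of Grooper with ordinary regular expressions. The following is a minimal Python sketch, not Grooper's implementation: the function name `find_value_after_key` and the sample sentence are hypothetical, and the text is treated as one string read left to right and top to bottom.

```python
import re


def find_value_after_key(text, key_pattern, value_pattern):
    """Return the first value match found at or after the end of the key match."""
    key = re.search(key_pattern, text)
    if key is None:
        return None
    # Continue reading the text flow from the point where the key ended.
    value = re.compile(value_pattern).search(text, key.end())
    return value.group(0) if value else None


# Hypothetical sample sentence; the line break sits between key and value.
sample = "The two-toed sloth is a species of sloth found in\nSouth America."
region = find_value_after_key(sample, r"found in", r"South America")
print(region)
```

Because the search simply continues from the end of the key, the line break between "found in" and "South America" does not prevent the match.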

Using Flow to Combine

The "flow" method can also return the text between two or more values in a text flow. The Combine, Array, and Ordered Array Collation Providers all have a "Combine Method" property. Setting this property to "Flow" will combine all the characters between the returned values in the text flow.



This method is useful to return larger sections of text where you can anchor off individual values within the full document.
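
As a rough illustration of the idea, the Python sketch below returns everything from the start of the earliest anchor to the end of the latest anchor in the text flow. It is an assumption-laden approximation rather than Grooper's Combine logic: `combine_flow`, the anchor patterns, and the sample text are all hypothetical.

```python
import re


def combine_flow(text, anchor_patterns):
    """Return the text spanning from the first anchor's start to the last anchor's end."""
    matches = [m for p in anchor_patterns for m in re.finditer(p, text)]
    if not matches:
        return None
    start = min(m.start() for m in matches)
    end = max(m.end() for m in matches)
    # Everything between the anchors, including line breaks, is kept.
    return text[start:end]


# Hypothetical sample text with two anchor phrases.
sample = "Intro text. WHEREAS the parties agree\r\nto the terms. NOW THEREFORE they sign."
clause = combine_flow(sample, [r"WHEREAS", r"NOW THEREFORE"])
print(clause)
```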

Use Cases

Flow Collation methods are part of Grooper's natural language processing solution. Unstructured documents, such as contracts, use language to define data, such as the terms in the contract. Natural language presents several challenges for data extraction, including the fact that data may appear in a paragraph flow in various ways that are not easily predictable.

What you can predict, however, is that the text will follow a normal lexical flow. For documents in English, you know text will always read left to right and top to bottom. Flow methods utilize this structure for various data extraction purposes.

For example, take the two oil and gas leases below. The term "Lessee" should relate to a particular person, company, or other party. This could be an opportunity to use Key-Value Pair collation, using "Lessee" as the key to find the party's name. Using the Vertical or Horizontal layouts would not be adequate methods to find the lessee in these contracts. There's no guarantee that the party will be listed beside a key vertically or horizontally.

However, we can infer that the word "Lessee" comes after the lessee party in the contract. Using Flow layout, we can take advantage of this and return the parties.

How To

Using Key-Value Pair Collation and Flow Layout

Before you begin

This tutorial assumes:

  • You've created a Content Model and added its Local Resources Folder
  • You know how to add a Data Type to a Local Resources Folder
  • Basic knowledge of regular expressions

This tutorial also uses a Test Batch of oil and gas leases. If you would like to follow along directly, using the documents in this tutorial, download the zip file listed below and import it into your Grooper environment. The file will import into version 2.80 environments.

This tutorial will also name and folder Grooper objects according to our Asset Management guidelines. If you are unfamiliar with our guidelines and are curious why objects in this tutorial are named how they are named, please visit the Asset Management article.

Add the Data Type Extractor

For this example, we will create a Data Type extractor using Key-Value Pair collation in Flow mode. The extractor will return each lease's date.

1. Add a Data Type to the Local Resources Folder of your Content Model. We will name it "VE KVP-F - Lease Date" following our Data Type Naming Conventions guidance.

2. All Key-Value Pairs need two child extractors, one to find the Key and one to find its Value. Add two child Data Types, the first named "KEY" and the second "VALUE". These will reference the extractors we create in the next two sections.



FYI This is an opportunity to keep your subfolders inside the Local Resources Folder nice and organized. A "Value Extractors" folder can house all extractors referenced by your Data Model. A "Key Extractors" folder can house all Key extractors used by Key-Value Pair collated Data Types.

Add the Key Extractor

Just like a Horizontal or Vertical Key-Value Pair, a Flow Key-Value Pair's Key extractor returns some kind of label or identifier that gives context to the value you want to return on a document. But instead of being positioned above or beside it, it will be somewhere in the text.

For this document set, there are a handful of standard legal phrases identifying when the lease was made. This distinguishes the date from other dates that may appear in the contract. There are three key phrases used:

"made and entered into"
"made and effective"
"made this"

The Key extractor can be made to find these identifying phrases.

1. Add a Data Type to the "Key Extractors" folder (if you've created it) in the Local Resources Folder. Name it "KEY - Lease Date".

2. Add a child Data Format to the "KEY - Lease Date" extractor. Name it "Simple Labels".

3. Select the Data Format named "Simple Labels". From here, use the Pattern Editor to make a regex list of keys.

Value Pattern
made and entered into|
made and effective|
made this
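
To see how such a list of alternatives behaves, the list can be tried in any regex engine. The Python sketch below is only an illustration under stated assumptions: it joins the three phrases with `|` on one line (the line breaks in the Pattern Editor list are assumed to be for readability), and the sample sentence is hypothetical.

```python
import re

# The three key phrases from the list above, joined into one alternation.
KEY_PHRASES = ["made and entered into", "made and effective", "made this"]
KEY_PATTERN = re.compile("|".join(re.escape(p) for p in KEY_PHRASES))

# Hypothetical opening sentence of a lease.
sample = "THIS LEASE, made and entered into this 12th day of June, 1985"
keys = [m.group(0) for m in KEY_PATTERN.finditer(sample)]
print(keys)
```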


Now you have a Key extractor! All we need to do is set it on our "VE KVP-F - Lease Date" extractor.

4. Expand the "VE KVP-F - Lease Date" extractor in the node tree, and select the child Data Type named "KEY".

5. Select the Referenced Extractors property and press the ellipsis button at the end.

6. This will pop up the "Referenced Extractors" window. Press the "Add" button.

7. This will pop up the "Select Items" window. Navigate these folders as you would the Node Tree to find the Data Type created earlier named "KEY - Lease Date". If you're following our foldering convention, expand the following path:

Leases * (local resources) > Key Extractors > KEY - Lease Date

Press "OK" on both popup windows when finished.



The referenced extractor finds and returns the result. Now we have a Key extractor to find our phrases from the list we made.


Add the Value Extractor

Now we need to add an extractor to find just what it is we are looking for. That is the Value extractor's job for Key-Value Pair collation.

We will create a Data Type to find dates in various formats for this document set.

Format 1: ##/##/####

1. Add a Data Type to the Local Resources Folder and simply name it "Date".

2. Add a child Data Format to the "Date" extractor named "##/##/####". This will find standard "mm/dd/yyyy" and "mm-dd-yyyy" formatted dates.

3. Select the Data Format named "##/##/####". From here, use the Pattern Editor to enter a regular expression to find this date format. We will also include a Look Ahead Pattern requiring that the character before the value is not a digit, and a Look Behind Pattern requiring that the character after the value is also not a digit. This will help eliminate false positives such as a social security number (e.g. 123-12-1234), which could otherwise match in part.

Value Pattern
\d{1,2}[/-]\d{1,2}[/-](\d{4}|\d{2})
Look Ahead Pattern
[^\d]|^
Look Behind Pattern
[^\d]|$


FYI The ^ character is a special character in regular expression matching the "beginning of string". The OCR or PDF text of a document is just one big string of characters. So, ^ would match the beginning of a document. The $ character matches the "end of string" or end of a document.

Including ^ in the look ahead pattern will match a value if it starts at the beginning of a document. Including $ in the look behind pattern will match a value if it stops at the end of a document.
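
The effect of these boundary patterns can be checked with a short Python sketch. This is an approximation, not the Grooper Pattern Editor: Python's `re` module does not accept a variable-width lookbehind such as `[^\d]|^`, so the sketch uses the negative assertions `(?<!\d)` and `(?!\d)`, which express the same "no digit before, no digit after" rule. The sample text is hypothetical.

```python
import re

# Value pattern from above, without any boundary checks.
PLAIN = re.compile(r"\d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})")

# Same pattern, but the value may not be preceded or followed by a digit.
BOUNDED = re.compile(r"(?<!\d)\d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})(?!\d)")

sample = "Dated 06/12/1985. SSN 123-12-1234. Renewed 6-1-90"

plain_hits = PLAIN.findall(sample)      # includes part of the SSN
bounded_hits = BOUNDED.findall(sample)  # SSN fragment is rejected
print(plain_hits)
print(bounded_hits)
```

Without the boundary checks, the fragment "23-12-1234" inside the social security number is returned as if it were a date.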

Format 2: Month ##, ####

1. Add another Data Format to the "Date" extractor named "Month ##, ####"

2. For our Value Pattern, we want to match dates such as "June 12, 1985". We will take advantage of the @MonthNames variable, which is simply a list of months in the year. We will also make the comma "optional" with the ? character, meaning the pattern will match whether or not the comma was included before the year.

Value Pattern
(@MonthNames) \d{1,2},? \d{4}
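
For readers who want to try the pattern outside of Grooper, the sketch below expands the variable by hand. It assumes @MonthNames is equivalent to the twelve full English month names; the actual contents of the Grooper variable may differ (for example, it may include abbreviations).

```python
import re

# Assumed stand-in for Grooper's @MonthNames variable.
MONTH_NAMES = (
    "January|February|March|April|May|June|"
    "July|August|September|October|November|December"
)

# Month, one or two digit day, optional comma, four digit year.
MONTH_DATE = re.compile(r"(?:" + MONTH_NAMES + r") \d{1,2},? \d{4}")

with_comma = MONTH_DATE.search("executed on June 12, 1985 by the parties")
without_comma = MONTH_DATE.search("executed on June 12 1985 by the parties")
print(with_comma.group(0), "|", without_comma.group(0))
```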



As with any scanned document, a regex pattern is only going to match the document's text if it matches the OCR'd text. Any time you have to scan a document and use OCR to recognize characters, some characters are going to be mistranslated or lost in the mix. This can be a particular problem for unstructured documents because a lot of the use cases center around processing legal documents obtained from a County Records department. If you obtain the physical documents, you'll need to scan and OCR them. If they give you digital copies, you have no control over how they scanned them, and you'll still need to OCR them.

Luckily, Grooper offers a solution with FuzzyRegEx mode. This uses a Levenshtein distance equation to measure the difference between the regex pattern and the matched text. If that difference is minimal enough, the result will be returned.

1. Switch to the Properties tab.

2. Select the Mode property and change it from RegEx to FuzzyRegEx

Now we have a match coming in at 94% confidence. Grooper is 94% "sure" the string of characters it returned matches the regular expression pattern. If you switch over to the "Text" tab, you can see OCR did not pick up the space character between the day and the year. However, it's just one character off of our pattern. FuzzyRegEx allows you to take these "close enough" matches and return them as the value you're looking for.
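
To give a feel for how an edit distance can be turned into a score, here is a small Python sketch. It is not Grooper's scoring formula: it compares two literal strings rather than a pattern and a match, and the `similarity` function (one minus the distance divided by the longer length) is an assumption chosen for illustration.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Hypothetical score: 1.0 for identical strings, lower as edits accumulate."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


expected = "June 12, 1985"
ocr_text = "June 12,1985"  # OCR dropped the space before the year
distance = levenshtein(expected, ocr_text)
score = similarity(expected, ocr_text)
print(distance, round(score, 2))
```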



FYI Calling out a Data Type or Data Format that uses fuzzy mode in its name can be helpful. This will let you and other Design Studio users know this pattern uses fuzzy matching instead of strict literal matching.

We will rename this format to "Month ##, #### [Fuzzy]". This might help you troubleshoot later if the extractor gets some false positives.


Format 3: ##th day of Month, ####

This format is very common in contracts. This will match dates like "the 12th day of June, 1985".

1. Add a Data Format named "##th day of Month, ####"

Value Pattern
\d{1,2}.{0,4}day of (@MonthNames),? \d{4}
FYI The . character is a "wildcard" in regular expression. It will match any character. .{0,4} will match any character with a length of zero to four, meaning it will match numerical suffixes, such as the "st" in 1st, "nd" in 2nd, "rd" in 3rd, "th" in 4th and so on. Since it can have a "zero" length, it will also match dates that don't use numerical suffixes.



Notice that for "Document 7", this pattern does not match. This is because part of the date is on a new line. There are two characters at the end of every line of text: the carriage return \r and the line feed \n. Because those characters aren't in our regex pattern, the pattern doesn't match.



Having these characters in the text can be very helpful. They can serve as anchors for the beginning or end of a line (much like ^ and $ are anchors for the beginning or end of a string). This can be particularly helpful for structured document extraction.

However, for unstructured documents, the natural flow of language gets in the way. For anything you want to extract longer than a single word, there is always the possibility the data will be interrupted by a new line of text in the paragraph.

One solution to this problem is to simply remove the \r\n pairs.

1. Switch to the Properties tab.

2. Expand the Preprocessing Options property by double-clicking it or clicking the caret next to it.

3. Select Ignore Control Characters and check the box next to NewLine

This will remove all \r\n pairs from the text data, allowing the pattern to match.
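
The effect can be reproduced with a short Python sketch. It is an illustration under assumptions, not Grooper's preprocessing: @MonthNames is assumed to expand to the twelve full month names, `strip_newlines` is a hypothetical helper, and the sample text is assumed to keep a trailing space before the line break so that the words still have a space between them once the \r\n pair is removed.

```python
import re

# Assumed stand-in for Grooper's @MonthNames variable.
MONTH_NAMES = (
    "January|February|March|April|May|June|"
    "July|August|September|October|November|December"
)

DAY_OF_MONTH = re.compile(r"\d{1,2}.{0,4}day of (?:" + MONTH_NAMES + r"),? \d{4}")


def strip_newlines(text):
    """Remove every carriage return / line feed pair from the text."""
    return text.replace("\r\n", "")


# Hypothetical text where the date wraps onto a second line.
raw = "made this 12th day of \r\nJune, 1985, by and between"

before = DAY_OF_MONTH.search(raw)                 # interrupted by \r\n
after = DAY_OF_MONTH.search(strip_newlines(raw))  # matches once removed
print(before, after.group(0))
```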



Reference the Value Extractor

1. Navigate back to the "VE KVP-F - Lease Date" extractor and expand its children.

2. Select the "VALUE" child extractor.

3. Select the Referenced Extractors property and press the ellipsis button at the end.

4. This will pop up the "Referenced Extractors" window. Press the "Add" button.

5. This will pop up the "Select Items" window. Navigate these folders as you would the Node Tree to find the Data Type created earlier named "Date". If you're following our foldering convention, expand the following path:

Leases * (local resources) > Date



Set the Parent's Collation to Key-Value Pair

1. Navigate to the "VE KVP-F - Lease Date" Data Type in the node tree.

2. Select the Collation property and choose Key-Value Pair from the dropdown menu.



3. Select the Flow Layout property and change it from Disabled to Enabled.



If you run extraction at this point, you will notice you get no results. This is because Grooper does not know how far in the text past the key you're willing to go to find the value. To do this we need to alter one of two properties, either the Separator Expression or Maximum Character Distance.

4a. Expand the Flow Layout property and select Separator Expression. This property allows you to enter a regular expression pattern. Once the Key is found, Flow Layout will look at this Separator Expression to match text between the key and the value. If the characters between the key and value's result match, the extractor will return the value's result.

The least restrictive expression you can use is .*. This expression will match any character of any length. Essentially, you would be telling the extractor "Keep going along the text flow after the key until you find the value". It doesn't matter if the value is right after the key or forty pages later. This expression will return it as long as the value follows the key somewhere in the text (You can think of this as an analog to not setting a maximum distance using Horizontal or Vertical Layout).



4b. Alternatively, use the Maximum Character Distance to define how many characters can be between the key and the value. For these documents, you may expect the date to be fairly close to the key, only coming a few words after it.

We can set the Maximum Character Distance property to 25. If the extractor finds the value before that 25 character limit, it will be returned. If there's a value after 25 characters, it will not return it.



! If you do not use a Separator Expression or set a Maximum Character Distance, the Key-Value Pair will default to the key and value separated by 0 characters or a single space character.

Effectively, this will only produce key-value matches that are right next to each other in the text flow.
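
The interaction of the two properties can be sketched in Python. This is a simplified model under stated assumptions, not Grooper's implementation: `flow_key_value` is a hypothetical function, only the first value after each key is considered, the separator is tested against the whole gap with "." allowed to cross line breaks, and if both limits are supplied the sketch requires both to pass.

```python
import re


def flow_key_value(text, key_pattern, value_pattern,
                   separator=None, max_distance=None):
    """Return values that follow a key in the text flow, subject to the gap rules."""
    value_re = re.compile(value_pattern)
    results = []
    for key in re.finditer(key_pattern, text):
        value = value_re.search(text, key.end())
        if value is None:
            continue
        gap = text[key.end():value.start()]  # characters between key and value
        if separator is None and max_distance is None:
            # Default described above: nothing or a single space in between.
            ok = gap in ("", " ")
        else:
            ok = True
            if separator is not None:
                ok = ok and re.fullmatch(separator, gap, flags=re.DOTALL) is not None
            if max_distance is not None:
                ok = ok and len(gap) <= max_distance
        if ok:
            results.append(value.group(0))
    return results


# Hypothetical lease sentence and a simplified date pattern.
sample = "This lease is made and entered into this 12th day of June, 1985."
KEY = r"made and entered into"
VALUE = r"\d{1,2}\w{0,2} day of \w+,? \d{4}"

default_hits = flow_key_value(sample, KEY, VALUE)
distance_hits = flow_key_value(sample, KEY, VALUE, max_distance=25)
separator_hits = flow_key_value(sample, KEY, VALUE, separator=r".*")
print(default_hits, distance_hits, separator_hits)
```

In this sample the gap between key and value is " this " (six characters), so the default setting returns nothing while either a 25 character limit or the `.*` separator returns the date.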

Version Differences

There are no version differences to point out at this time.