2.80:Flow Collation (Concept): Difference between revisions
Dgreenwood (talk | contribs) |
Dgreenwood (talk | contribs) No edit summary |
||
| Line 100: | Line 100: | ||
This tutorial assumes: | This tutorial assumes: | ||
* You've created a '''Content Model''' and added its '''Local Resources Folder''' | * You've created a '''Content Model''' and added its '''Local Resources Folder''' | ||
* You know how to add a [[Data Type]] to a '''Local Resources Folder''' | * You know how to add a '''[[Data Type]]''' to a '''Local Resources Folder''' | ||
* Basic knowledge of regular expressions | * Basic knowledge of regular expressions | ||
| Line 106: | Line 106: | ||
* [[File:Lease Examples 2.80.zip]] | * [[File:Lease Examples 2.80.zip]] | ||
This tutorial will also name and folder Grooper objects according to our [[Asset Management]] guidelines. If you are unfamiliar with our guidelines and are curious why objects in this tutorial are named how they are named, please visit the [[Asset Management]] article. | |||
</tab> | |||
<tab name="Step 1" style="margin:20px"> | |||
==== Add the Data Type Extractor ==== | |||
For this example, we will create a '''Data Type''' extractor using '''''Key-Value Pair''''' collation in ''Flow'' mode. The extractor will return each lease's date. | |||
1. Add a '''Data Type''' to the '''Local Resources Folder''' of your '''Content Model'''. We will name it "VE KVP-F - Lease Date" following our [[Asset Management#Data Type Naming Conventions|Data Type Naming Conventions]] guidance. | |||
2. All Key-Value Pairs need two child extractors, one to find the Key and one to find its Value. Add two child '''Data Types''', the first named "KEY" and the second "VALUE". This will reference the extractors we create in the next two steps. | |||
[[File:Flow 8.png|1000px]] | |||
{|cellpadding="10" cellspacing="5" | |||
|-style="background-color:#36b0a7; color:white" | |||
|style="font-size:14pt"|'''FYI'''||This is an opportunity to keep your subfolders inside the '''Local Resources Folder''' nice and organized. A "Value Extractors" folder can house all extractors referenced by your '''Data Model'''. A "Key Extractors" folder can house all Key extractors used by '''''Key-Value Pair''''' collated '''Data Types''' | |||
|} | |||
</tab> | |||
<tab name = "Step 2" style="margin:20px"> | |||
==== Add the Key Extractor ==== | |||
Just like a Horizontal or Vertical Key-Value Pair, a Flow Key-Value Pair's Key extractor returns some kind of label or identifier that gives context to the value you want to return on a document. But instead of being positioned above or beside it, it will be somewhere in the text. | |||
For this document set, there are a handful of standard legal phrases identifying when the lease was made. This distinguishes the date from other dates that may appear in the contract. There are three key phrases used: | |||
{|style="margin: auto" cellpadding=10 | |||
| "made and entered into" || [[file:flow_11.png|center|700px|border]] | |||
|- | |||
| "made and effective" || [[file:flow_10.png|center|700px|border]] | |||
|- | |||
| "made this" || [[file:flow_9.png|center|700px|border]] | |||
|} | |||
The Key extractor can be made to find these identifying phrases. | |||
1. Add a '''Data Type''' to the "Key Extractors" folder (if you've created it) in the '''''Local Resources Folder'''''. Name it "KEY - Lease Date". | |||
2. Add a child '''Data Format''' to the "KEY - Lease Date" extractor. Name it "Simple Labels". | |||
3. Select the '''Data Format''' named "Simple Labels". From here, use the Pattern Editor to make a regex list of keys. | |||
{|style="margin:auto" | |||
|-style="text-align:center" | |||
|'''Value Pattern''' | |||
|- | |||
| | |||
<pre> | |||
made and entered into| | |||
made and effective| | |||
made this | |||
</pre> | |||
|} | |||
[[File:Flow 12.png|950px]] | |||
Now you have a Key extractor! All we need to do is set it on our "VE KVP-F - Lease Date" extractor. | |||
4. Expand the "VE KVP-F - Lease Date" extractor in the node tree, and select the child '''Data Type''' named "KEY". | |||
5. Select the '''''Referenced Extractors''''' property and press the ellipsis button at the end. | |||
6. This will pop up the "Referenced Extractors" window. Press the "Add" button. | |||
7. This will pop up the "Select Items" window. Navigate these folders as you would the Node Tree to find the '''Data Type''' created earlier named "KEY - Lease Date". If you're following our foldering convention, expand the following path: | |||
<code>Leases (local resources) > Key Extractors > KEY - Lease Date</code> | |||
Press "OK" on both popup windows when finished. | |||
[[File:Flow 13.png|950px]] | |||
</tab> | </tab> | ||
</tabs> | </tabs> | ||
Revision as of 15:09, 15 April 2020
Flow collation methods allow Data Type extractors to parse data using the flow of text within a document.
This is particularly useful when processing natural language. The "Flow" property is available to the following Collation Providers:
About
When extracting data from a document's text, there are three important relationships to consider:
Syntactical
- Data always has a syntax that indicates what that piece of data is. Specific characters in a specific order give data its syntax. For example, dates have a syntax that makes it obvious that what you're reading is a date. When you see 12/25/2020, you instantly know this collection of numbers and slashes is a date. This is because two digits, a slash, two digits, another slash, and four digits form a standard month, day, year format. Without the slashes, it's less clear. "12252020" could be a date, but it could also just be a string of eight digits.
Semantic
- Words themselves follow a syntax. The letters "a" through "z" are the characters in that syntax. However, the character strings "turtle" and "uttelr" are two different things. One is a semi-aquatic reptile. The other is nonsense. The characters "turtle" mean something. They have semantic value. You can use the semantic relationships between pieces of text to target the data you want to return.
- For example, "Date: 12252020"
- Without the slashes, "12252020" may be a date or just a bunch of numbers. However, it's clearly a date if you see the word "Date" in front of it. You're using the semantic value of the word "date" to understand that string of digits.
Spatial
- Spatial relationships refer to how the layout of text on a document informs the meaning of specific data elements. How a label is positioned next to a value provides the context for understanding that value. Its position next to the value uses a spatial relationship. For example, documents may call out data elements horizontally or vertically.
| Horizontal | Vertical |
| Date: 12252020 | Date: 12252020 |
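The syntactic and semantic relationships described above can be sketched with ordinary regular expressions. This is a minimal Python illustration, not Grooper configuration: the function name, the two patterns, and the sample text are assumptions made for the example, and it assumes forward slashes as the date separator.

```python
import re

# Syntactic: two digits, a slash, two digits, a slash, four digits.
DATE_WITH_SLASHES = re.compile(r"\b\d{2}/\d{2}/\d{4}\b")

# Semantic: eight bare digits only count as a date when the label
# "Date:" appears directly in front of them.
LABELED_DATE = re.compile(r"\bDate:\s*(\d{8})\b")


def find_dates(text):
    """Return date strings found by syntax alone or by a "Date:" label."""
    dates = [m.group(0) for m in DATE_WITH_SLASHES.finditer(text)]
    dates += [m.group(1) for m in LABELED_DATE.finditer(text)]
    return dates


if __name__ == "__main__":
    sample = "Signed 12/25/2020. Ref 12252020. Date: 01152021"
    print(find_dates(sample))
```

The unlabeled "12252020" is skipped because nothing in its syntax or its surrounding words says it is a date.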
Using Flow to Find Multiple Values in a Text Block
Understanding these relationships is important to understanding how to target and return the values you want from your text. Flow-based methods use a document's text flow as its spatial relationship. English reads left to right from the top of the page to the bottom. While you probably don't even think about it when you're reading, this spatial relationship of characters and words is critical to understanding what you're reading.
Take this intro paragraph from the Wikipedia entry on Linnaeus's two-toed sloths:

If we, as readers, want to know where these sloths live, it's relatively easy, right? We just read along until we find the words "found in" and the region "South America".

Flow-based collation methods work much the same way. We could set up a Data Type in Grooper using Key-Value Pair collation in Flow Layout mode. The Key extractor would locate the phrase "found in" on the document. The Value extractor would locate the region, "South America". Using Flow Layout, the Data Type reads through the text much like you would as an English reader. From the point the Key "found in" is located, the Data Type steps through the characters going left to right and down the page looking for the Value "South America".

In the example above, the value was right next to the key, but the beauty of the Key-Value Pair collation's Flow Layout is that it can be set up to step past multiple characters until it finds the value it's looking for.

For more information on how to set up a Key-Value Pair using Flow Layout, see the How To section of this article.
The Array, Ordered Array, and Key-Value List Collation Providers can also utilize this Flow Layout.
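The Key-Value Pair behavior described in this section can be approximated outside Grooper. The following is a rough Python sketch of the idea only, not Grooper's implementation: the function name, the `max_distance` parameter, and the sample sentence are all hypothetical.

```python
import re


def flow_key_value(text, key_pattern, value_pattern, max_distance=None):
    """Pair each key with the first value that follows it in reading order.

    max_distance is a hypothetical limit on how many characters may sit
    between the end of the key and the start of the value.
    """
    value_re = re.compile(value_pattern)
    pairs = []
    for key in re.finditer(key_pattern, text):
        # Search forward from the end of the key, following the text flow.
        value = value_re.search(text, key.end())
        if value is None:
            continue
        if max_distance is not None and value.start() - key.end() > max_distance:
            continue
        pairs.append((key.group(0), value.group(0)))
    return pairs


if __name__ == "__main__":
    sentence = "This sloth is found in the rainforests of South America."
    print(flow_key_value(sentence, r"found in", r"(?:North|South) America"))
```

Here the value is not adjacent to the key. The search simply keeps moving through the characters after "found in" until the value pattern matches, unless a distance limit stops it first.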
Using Flow to Combine
The Flow method can also return the text between two or more values in a text flow. The Combine, Array, and Ordered Array Collation Providers all have a "Combine Method" property. Setting this property to "Flow" will combine all the characters between the returned values in the text flow.

This method is useful to return larger sections of text where you can anchor off individual values within the full document.
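As a rough illustration of combining along the text flow, the Python sketch below returns everything from a first anchor through a second anchor. It is an assumption-laden approximation rather than Grooper's behavior: the function name and sample text are hypothetical, and including the anchors themselves in the result is a choice made for this sketch.

```python
import re


def flow_combine(text, start_pattern, end_pattern):
    """Return the text from the first anchor through the next end anchor."""
    start = re.search(start_pattern, text)
    if start is None:
        return None
    # Look for the closing anchor only after the opening anchor.
    end = re.compile(end_pattern).search(text, start.end())
    if end is None:
        return None
    return text[start.start():end.end()]


if __name__ == "__main__":
    doc = "Intro. Section 1: Term of five years. Section 2: Rent."
    print(flow_combine(doc, r"Section 1", r"Section 2"))
```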
Use Cases
Flow Collation methods are part of Grooper's natural language processing solution. Unstructured documents, such as contracts, use language to define data, such as the terms in the contract. Natural language presents several challenges for data extraction, including the fact that the data may appear within a paragraph's flow in ways that are not easily predictable.
What you can predict, however, is that the text will follow a normal lexical flow. For documents in English, you know text will always read left to right and top to bottom. Flow methods utilize this structure for various data extraction purposes.
For example, take the two oil and gas leases below. The term "Lessee" should relate to a particular person, company, or other party. This could be an opportunity to use Key-Value Pair collation, using "Lessee" as the key to find the party's name. Using the Vertical or Horizontal layouts would not be adequate methods to find the lessee in these contracts. There's no guarantee that the party will be listed beside a key vertically or horizontally.
However, we can infer that the word "Lessee" comes after the lessee party in the contract. Using Flow layout, we can take advantage of this and return the parties.
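The "Lessee" case, where the value comes before the key in the text, can be sketched in Python as a backward look along the flow. Everything here is hypothetical: the function name, the simple capitalized-name pattern, and the sample clause are invented for illustration and are not taken from the example leases or from Grooper.

```python
import re


def value_before_key(text, key_pattern, value_pattern):
    """For each key, return the nearest value that ends before the key starts."""
    values = list(re.finditer(value_pattern, text))
    found = []
    for key in re.finditer(key_pattern, text):
        preceding = [v for v in values if v.end() <= key.start()]
        if preceding:
            # The last preceding match is the closest one in reading order.
            found.append(preceding[-1].group(0))
    return found


if __name__ == "__main__":
    clause = ("by and between John Smith, hereinafter called Lessor, "
              "and Acme Oil Company, hereinafter called Lessee.")
    # Hypothetical party pattern: two or more capitalized words in a row.
    party = r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"
    print(value_before_key(clause, r"\bLessee\b", party))
```

A real party-name extractor would need a far more robust pattern. The point is only that the key anchors the search and the text flow determines which candidate value belongs to it.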
Version Differences
There are no version differences to point out at this time.
How To
Using Key-Value Pair Collation and Flow Layout
Before you begin
This tutorial assumes:
- You've created a Content Model and added its Local Resources Folder
- You know how to add a Data Type to a Local Resources Folder
- Basic knowledge of regular expressions
This tutorial also uses a Test Batch of oil and gas leases. If you would like to follow along directly, using the documents in this tutorial, download the zip file listed below and import it into your Grooper environment. The file will import into version 2.80 environments.
This tutorial will also name and folder Grooper objects according to our Asset Management guidelines. If you are unfamiliar with our guidelines and are curious why objects in this tutorial are named how they are named, please visit the Asset Management article.
Add the Data Type Extractor
For this example, we will create a Data Type extractor using Key-Value Pair collation in Flow mode. The extractor will return each lease's date.
1. Add a Data Type to the Local Resources Folder of your Content Model. We will name it "VE KVP-F - Lease Date" following our Data Type Naming Conventions guidance.
2. All Key-Value Pairs need two child extractors, one to find the Key and one to find its Value. Add two child Data Types, the first named "KEY" and the second "VALUE". These will reference the extractors we create in the next two steps.
| FYI | This is an opportunity to keep your subfolders inside the Local Resources Folder nice and organized. A "Value Extractors" folder can house all extractors referenced by your Data Model. A "Key Extractors" folder can house all Key extractors used by Key-Value Pair collated Data Types |
Add the Key Extractor
Just like a Horizontal or Vertical Key-Value Pair, a Flow Key-Value Pair's Key extractor returns some kind of label or identifier that gives context to the value you want to return on a document. But instead of being positioned above or beside the value, the key will be somewhere in the surrounding text.
For this document set, there are a handful of standard legal phrases identifying when the lease was made. This distinguishes the date from other dates that may appear in the contract. There are three key phrases used:
| "made and entered into" |
| "made and effective" |
| "made this" |
The Key extractor can be made to find these identifying phrases.
1. Add a Data Type to the "Key Extractors" folder (if you've created it) in the Local Resources Folder. Name it "KEY - Lease Date".
2. Add a child Data Format to the "KEY - Lease Date" extractor. Name it "Simple Labels".
3. Select the Data Format named "Simple Labels". From here, use the Pattern Editor to make a regex list of keys.
| Value Pattern |
made and entered into|
made and effective|
made this
Now you have a Key extractor! All we need to do is set it on our "VE KVP-F - Lease Date" extractor.
4. Expand the "VE KVP-F - Lease Date" extractor in the node tree, and select the child Data Type named "KEY".
5. Select the Referenced Extractors property and press the ellipsis button at the end.
6. This will pop up the "Referenced Extractors" window. Press the "Add" button.
7. This will pop up the "Select Items" window. Navigate these folders as you would the Node Tree to find the Data Type created earlier named "KEY - Lease Date". If you're following our foldering convention, expand the following path:
Leases (local resources) > Key Extractors > KEY - Lease Date
Press "OK" on both popup windows when finished.
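If you want to sanity-check the key phrases from step 3 outside Grooper, the same alternation can be tried in Python. This is only a sketch: it assumes the three lines of the Value Pattern act as a single alternation, and the function name and sample sentences are hypothetical rather than taken from the example leases.

```python
import re

# The three key phrases from the "Simple Labels" Data Format, joined
# into one alternation (an assumption about how the list is read).
KEY_PATTERN = re.compile(
    r"made and entered into|made and effective|made this"
)


def find_keys(text):
    """Return every key phrase found in the text, in reading order."""
    return [m.group(0) for m in KEY_PATTERN.finditer(text)]


if __name__ == "__main__":
    print(find_keys("THIS LEASE, made and entered into on the date below"))
    print(find_keys("THIS AGREEMENT, made this 3rd day of May"))
```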


