2.80:Flow Collation (Concept)

From Grooper Wiki
Revision as of 14:29, 13 April 2020 by Dgreenwood

Flow collation methods allow Data Type extractors to parse data using the flow of text within a document.

This is particularly useful when processing natural language. The "Flow" property is available to the following Collation Providers:

About

When extracting data from a document's text, there are three important relationships to consider:

Syntactical

  • Data always has a syntax that indicates what that piece of data is. Specific characters in a specific order give data its syntax. For example, dates have a syntax that makes it obvious you're reading a date. When you see 12/25/2020, you instantly know this collection of digits and slashes is a date. This is because two digits, a slash, two digits, another slash, and four digits form a standard month, day, year format. Without the slashes, it's less clear. "12252020" could be a date, but it could also just be a string of eight digits.
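To make the syntactical relationship concrete, here is a minimal sketch in plain Python regular expressions. It is not Grooper configuration; the pattern names and helper functions are hypothetical, chosen only for illustration. It shows that a pattern with separators picks out the date, while an eight-digit pattern matches the ambiguous run without telling us what it is.

```python
import re

# Two digits, a slash, two digits, a slash, four digits (MM/DD/YYYY).
# Either slash style is accepted; this is an illustrative choice.
DATE_WITH_SLASHES = re.compile(r"\b\d{2}[/\\]\d{2}[/\\]\d{4}\b")

# Eight digits in a row: possibly a date, possibly any other number.
EIGHT_DIGITS = re.compile(r"\b\d{8}\b")


def find_dates(text):
    """Return every substring that has the slash-separated date syntax."""
    return DATE_WITH_SLASHES.findall(text)


def find_eight_digit_runs(text):
    """Return every standalone run of exactly eight digits."""
    return EIGHT_DIGITS.findall(text)


if __name__ == "__main__":
    print(find_dates("Shipped 12/25/2020 today"))
    print(find_dates("Ref 12252020"))
    print(find_eight_digit_runs("Ref 12252020"))
```

The syntax alone is enough for the first string. For the second, the eight-digit pattern finds a match, but nothing in the syntax says whether it is a date.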

Semantic

  • Words themselves follow a syntax. The letters "a" through "z" are the characters in that syntax. However, the characters "turtle" and "uttelr" are two different things. One is a semi-aquatic reptile. The other is nonsense. The characters "turtle" mean something. They have semantic value. You can use the semantic relationships between pieces of text to target the data you want to return.
  • For example, "Date: 12252020"
    • Without the slashes, "12252020" may be a date or just a bunch of numbers. However, it's clearly a date if you see the word "Date" in front of it. You're using the semantic value of the word "date" to understand that string of digits.
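The same idea can be sketched in plain Python, outside of Grooper. The pattern and function names below are hypothetical. The sketch only accepts an eight-digit run as a date when the word "Date" appears directly in front of it, which is the semantic cue described above.

```python
import re

# The label "Date:" gives the following eight digits their meaning.
LABELED_DATE = re.compile(r"\bDate:\s*(\d{8})\b", re.IGNORECASE)


def labeled_date(text):
    """Return the eight digits that follow a "Date:" label, or None."""
    match = LABELED_DATE.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(labeled_date("Date: 12252020"))
    print(labeled_date("Order 12252020"))
```

Without the label, the function returns None: the digits are still there, but nothing says they are a date.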

Spatial

  • Spatial relationships refer to how the layout of text on a document informs the meaning of specific data elements. How a label is positioned relative to a value provides the context for understanding that value. For example, documents may call out data elements horizontally or vertically.
Horizontal          Vertical
Date: 12252020      Date:
                    12252020
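As a rough illustration of the two layouts, the sketch below works on plain text lines rather than real page coordinates, which is a simplifying assumption. The function name and its return format are hypothetical and are not part of Grooper. It looks for a value to the right of the label first, then on the next non-empty line below it.

```python
def value_for_label(page_text, label="Date:"):
    """Find the value for a label, either beside it or below it.

    Returns a (layout, value) tuple, or None if no value is found.
    """
    lines = page_text.splitlines()
    for index, line in enumerate(lines):
        position = line.find(label)
        if position == -1:
            continue
        # Horizontal layout: the value sits to the right of the label.
        right_of_label = line[position + len(label):].strip()
        if right_of_label:
            return ("horizontal", right_of_label)
        # Vertical layout: the value sits on the next non-empty line.
        for below in lines[index + 1:]:
            if below.strip():
                return ("vertical", below.strip())
    return None


if __name__ == "__main__":
    print(value_for_label("Date: 12252020"))
    print(value_for_label("Date:\n12252020"))
```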

Understanding these relationships is important to understanding how to target and return the values you want from your text. Flow-based methods use a document's text flow as the spatial relationship. English reads left to right, from the top of the page to the bottom. While you probably don't even think about it when you're reading, this spatial relationship of characters and words is critical to understanding what you're reading.

Take this intro paragraph from the Wikipedia entry on Linnaeus's two-toed sloths:

If we, as readers, want to know where these sloths live, it's relatively easy, right? We just read along until we find the words "found in" and the region "South America".

Flow-based collation methods work much the same way. We could set up a Data Type in Grooper using Key-Value Pair collation in Flow Layout mode. The Key extractor would locate the phrase "found in" on the document. The Value extractor would locate the region.
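The idea can be approximated in a few lines of plain Python. This is a sketch of the concept only, not how Grooper implements Key-Value Pair collation. The region list, the `max_gap` parameter, and the function name are all assumptions made for illustration, and the sample sentence is invented rather than quoted from Wikipedia. The text is treated as one continuous stream, so a key at the end of one line can pair with a value at the start of the next.

```python
import re

# Key extractor: the phrase that introduces the value.
KEY = re.compile(r"found in", re.IGNORECASE)

# Value extractor: a small, hypothetical list of regions.
VALUE = re.compile(
    r"North America|South America|Central America|Africa|Asia|Europe"
)


def flow_key_value(text, max_gap=40):
    """Return the first value that follows a key in reading order.

    max_gap (an assumed parameter) limits how many characters may
    separate the end of the key from the start of the value.
    """
    # Collapse line breaks and extra spaces into one stream of text.
    flow = " ".join(text.split())
    for key in KEY.finditer(flow):
        value = VALUE.search(flow, key.end())
        if value and value.start() - key.end() <= max_gap:
            return value.group(0)
    return None


if __name__ == "__main__":
    sample = "These sloths are found in\nSouth America, where they live in trees."
    print(flow_key_value(sample))
```

The line break between "found in" and "South America" does not matter here, because the pairing follows the order in which the text is read rather than where it sits on the page.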