2023:Data Context (Concept): Difference between revisions
Dgreenwood (talk | contribs) No edit summary |
Dgreenwood (talk | contribs) |
Revision as of 14:40, 11 August 2020
Data without context is meaningless.
About
Context is critical to understanding and modeling the relationships between pieces of information on a document. Without context, it’s impossible to distinguish one data element from another. Context helps us understand what data refers to or “means”.
This allows us to build extraction logic using Data Type and Field Class extractors in order to build and populate a Data Model.
There are three fundamental data context relationships:
- Syntactic - Context given by the syntax of data.
- Semantic - Context given by the lexical content associated with the data.
- Spatial - Context given by where the data exists on the page, in relationship with other data.
Using the context these relationships provide allows us to understand how to target data with extractors.
Syntactic
All data follows some kind of syntax. Characters, their positions in a text string, and the order in which they appear inform what that data is. For example, US currency is easily identified by its syntax. Once you see a dollar sign ("$") followed by some digits ("0" through "9"), you instantly know that piece of information is referring to currency values.
Not only is the dollar sign itself important, but its position is important as well. If you see "$100", you instantly know you're talking about a hundred dollars. But if you see "100$", it's more confusing. Maybe this refers to a currency value, but maybe it doesn't. It lacks the agreed-upon understanding of currency values that the syntactic context of a dollar sign at the beginning of the numbers provides.
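As a rough illustration of how position matters, here is a minimal Python sketch. The pattern and the `looks_like_currency` function are assumptions made up for this example, not something taken from Grooper.

```python
import re

# Assumed pattern for illustration: a dollar sign FIRST, then one or more digits.
CURRENCY = re.compile(r"\$\d+")

def looks_like_currency(text: str) -> bool:
    """Return True only when the whole string is '$' followed by digits."""
    return CURRENCY.fullmatch(text) is not None

print(looks_like_currency("$100"))  # dollar sign in front of the digits
print(looks_like_currency("100$"))  # same characters, different order
```

The same characters in a different order no longer satisfy the pattern, which is the whole point of syntactic context.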
Similarly, dates follow some agreed upon syntactic structures.
07/20/1969
07.20.1969
07-20-1969
You may not have thought a lot about it, but you actually know quite a bit just from the syntax of a date. First, you know months, days, and years are separated by certain characters: slashes, hyphens, and dots. You know (for certain parts of the world) the first series of digits refers to a month, the middle to a day, and the last to a year. This is understood purely from the syntax: the kinds of characters used and their order in the text string.
The string below uses the same characters as the dates above but in a different configuration. This is no longer a date! The syntactic context is no longer there.
20-1969/07
Even the English language follows a syntax! It has an alphabet, a list of characters used ("A" through "Z"). It has certain rules, such as adding an "s" (or sometimes "es") to the end of a noun to (usually) make it plural. The apostrophe is a special character used to combine two words, turning "do not" into "don't". How characters are used, and in what order, informs how we read the written word.
Syntactic Context and Extraction
The great thing about syntax is it is highly structured. This structure allows us to capture data based on the syntax's pattern. Take our example of the various dates (and the non-date) above.
The three dates follow a pattern:
The non-date string "20-1969/07" does not follow this pattern.
Targeting these syntactic contexts is quite easy with regular expression pattern matching. If you know the syntax informing what a piece of data is, you can write a regex pattern to match that syntax. In this case, the regex pattern
Syntactic Context and Regular Expression
Data Type and Data Format extractors use regular expression pattern matching to do just this. Here, the Value Pattern is set to a regular expression pattern matching dates:
The non-matching "non-date" doesn't follow any of the date syntaxes. So, it does not match the regular expression and is not returned.
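The same idea can be tried outside of Grooper with a short Python sketch. The pattern below is an assumption written for this example (two digits, a separator, two digits, a separator, four digits) and is not necessarily the Value Pattern used on this page. The `find_dates` function is likewise hypothetical.

```python
import re

# Assumed date syntax: 2 digits, separator, 2 digits, separator, 4 digits.
# The separator may be a slash, a dot, or a hyphen.
DATE = re.compile(r"\b\d{2}[/.-]\d{2}[/.-]\d{4}\b")

def find_dates(text: str) -> list[str]:
    """Return every substring that follows the assumed date syntax."""
    return DATE.findall(text)

sample = "07/20/1969 07.20.1969 07-20-1969 20-1969/07"
print(find_dates(sample))
# ['07/20/1969', '07.20.1969', '07-20-1969']
```

Note that this simple pattern would also accept mixed separators such as "07/20-1969". A stricter pattern could require the second separator to repeat the first.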
Semantic
More often than not, syntax alone doesn't provide enough context to identify a value. For example, take a social security number. Typically, an SSN uses a standard syntax.
We can easily identify this number as a social security number just from its syntax.
However, without the context the dashes provide, it's much more ambiguous. Maybe the number below is an SSN, but it could be something else. We don't have enough context to tell what this piece of data is.
In cases like this, or for data that does not have a unique syntax, something else needs to provide context. We simply need more information to determine what this data is.
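A small Python sketch shows why the dashes matter. The sample numbers, the `classify` function, and its return labels are all made up for illustration. The only fact relied on is the standard three-two-four digit SSN layout.

```python
import re

# Standard SSN layout: 3 digits, dash, 2 digits, dash, 4 digits.
SSN_WITH_DASHES = re.compile(r"\d{3}-\d{2}-\d{4}")
# The same nine digits with no dashes carry far less syntactic context.
NINE_DIGITS = re.compile(r"\d{9}")

def classify(text: str) -> str:
    """Label a string by how much its syntax alone tells us."""
    if SSN_WITH_DASHES.fullmatch(text):
        return "likely SSN"
    if NINE_DIGITS.fullmatch(text):
        return "ambiguous"  # could be an SSN, an account number, etc.
    return "no match"

print(classify("123-45-6789"))
print(classify("123456789"))
```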
By far one of the most common ways of identifying data on the page is with language. If we stick a label in front of that number, it becomes much easier to tell what it is.
Now we know we're talking about a social security number due to the semantic context of the text label. In a very simple way, because we know what the words mean, we know what the data means.
Not only can semantic context identify data, but it can distinguish it as well.
The data here are all dates, which we can easily infer from the date syntax. However, each label further identifies each date and distinguishes them from one another: the order date from the delivery date from the due date. At the end of the day, how useful is it to know only that what you're seeing is a date? You want to know what that date refers to, what makes it different from another date, and what it means. That is what semantic context helps you do.
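Here is a hedged Python sketch of labels distinguishing one date from another. The sample text, the date values other than "01/01/2020", and the `dates_by_label` function are hypothetical.

```python
import re

# Each date is identified by the label in front of it (semantic context)
# and recognized as a date by its shape (syntactic context).
LABELED_DATE = re.compile(
    r"(?P<label>Order Date|Delivery Date|Due Date):\s*(?P<date>\d{2}/\d{2}/\d{4})"
)

def dates_by_label(text: str) -> dict[str, str]:
    """Map each label found in the text to the date that follows it."""
    return {m["label"]: m["date"] for m in LABELED_DATE.finditer(text)}

text = "Order Date: 01/01/2020 Delivery Date: 01/15/2020 Due Date: 02/01/2020"
print(dates_by_label(text))
```

All three values share the same syntax. Only the labels tell them apart.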
Semantic Context and Extraction
Semantic context can also be targeted by regular expression. A word or phrase providing context can be explicitly matched. You just need to know what word or phrase is providing the context!
While the regular expression
Semantic Context and Regular Expression
For the example above, ultimately we just want the date. We don't want the whole string "Order Date: 01/01/2020". We just want the date value "01/01/2020". We still want to use the semantic context of our label to identify the date. But when it comes time to return a value, we don't want the label included.
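In plain Python, one way to use a label for matching without returning it is a capture group. This is only a sketch of the general idea, not how Grooper does it. The names `LABEL`, `VALUE`, and `extract_order_date` are assumptions for this example.

```python
import re
from typing import Optional

LABEL = r"Order Date:\s*"     # semantic context: identifies the value
VALUE = r"\d{2}/\d{2}/\d{4}"  # syntactic context: the date itself

# The label must be present to match, but only the group is returned.
ORDER_DATE = re.compile(LABEL + "(" + VALUE + ")")

def extract_order_date(text: str) -> Optional[str]:
    """Return just the date that follows the 'Order Date:' label, if any."""
    match = ORDER_DATE.search(text)
    return match.group(1) if match else None

print(extract_order_date("Order Date: 01/01/2020"))
# 01/01/2020
```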
Essentially, we can break up our longer pattern
Semantic Context and Key-Value Pair Collation
A Key-Value Pair extractor refers to a Data Type whose Collation property has been set to Key-Value Pair. This is easily the most common way semantic context is used to target data in Grooper.
The Key Extractor will locate the text label for a particular value, in our case using the semantic context of the "order date". Then, Grooper will look for a result returned by the (Paired) Value Extractor that is nearby. If one is found, it will pass that value up to the parent Data Type as the ultimately returned value.
The Key Extractor
The (Paired) Value Extractor
The Collation Provider
Collation Providers manipulate and re-order extraction results. As is, this extractor is returning four results, the phrase "order date" and the three dates.
The Key-Value Pair Layout
The critical part of the Key-Value Pair setup is its Layout setting. This determines where the extractor "looks" for the value in relation to the key. This can be "Horizontal", "Vertical", or "Flow".
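To make the Layout idea concrete, here is a rough Python sketch of pairing a key with a nearby value. It is not Grooper's implementation: the `Hit` class, the `pair_value` function, the coordinates, and the tolerance are all hypothetical, and the "Flow" layout is not modeled.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Hit:
    """One extraction result and the (hypothetical) position of its top-left corner."""
    text: str
    x: float
    y: float

def pair_value(key_hit: Hit, values: list[Hit], layout: str, tol: float = 5.0) -> Optional[Hit]:
    """Return the nearest value to the right of (Horizontal) or below (Vertical) the key."""
    if layout == "Horizontal":
        # Same line as the key, located to its right.
        candidates = [v for v in values if abs(v.y - key_hit.y) <= tol and v.x > key_hit.x]
        return min(candidates, key=lambda v: v.x - key_hit.x, default=None)
    if layout == "Vertical":
        # Same column as the key, located beneath it.
        candidates = [v for v in values if abs(v.x - key_hit.x) <= tol and v.y > key_hit.y]
        return min(candidates, key=lambda v: v.y - key_hit.y, default=None)
    raise ValueError(f"layout not modeled in this sketch: {layout}")

key_hit = Hit("Order Date", x=10, y=100)
dates = [Hit("01/01/2020", 120, 100), Hit("01/15/2020", 120, 130), Hit("02/01/2020", 10, 160)]
print(pair_value(key_hit, dates, "Horizontal").text)  # the date on the key's line
print(pair_value(key_hit, dates, "Vertical").text)    # the date under the key
```

The same key and the same three dates give a different pairing depending on the layout, which is why this setting matters so much.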
FYI: The Key and (Paired) Value Extractors can also be the parent Data Type's internal Pattern extractor results as well as a Referenced Extractor. However, there are specific rules for which counts as the "key" and which counts as the "value" when you use these properties instead of child extractor objects. It all depends on which value is returned "first". The "key" extractor must always execute first in the order of operations. The basic order of extractor execution is this: Pattern > Children > References
However, to avoid confusion, most users will use two child extractors, even if one or both of those children are Data Types that reference other extractors.
Spatial
Spatial relationships refer to how some object is located in space relative to another reference object. Understanding and using spatial relationships is critical for successful extraction techniques. It's so critical we often don't even realize we're using them.
Going back to our very first example, we discussed how "$100" is easily distinguishable as a currency value, but "100$" is more ambiguous data. This is purely because of the spatial relationship between the dollar sign "$" and the numerical value "100". When the dollar sign is physically located before the number, it's clear the value is a currency value. The simple spatial context of a dollar sign in front of the number instead of behind it makes this data mean something.
Written language itself has a spatial relationship. We read English from left to right and from the top of the page to the bottom of the page. It doesn't make sense any other way (for English, anyway)! Spaces give the important spatial context of where one word starts and another stops. Indentation gives readers spatial clues as to where paragraphs begin.
While we often take these spatial contexts for granted, they can become crucially important for understanding how to target and extract the data you want.
We don't even need text to understand spatial context.
How are the two green boxes spatially related? The green boxes are in horizontal alignment with each other. All the boxes are pretty close to one another. But what makes the green boxes different (besides being green) is that they are next to each other horizontally. This spatial relationship distinguishes them from the grey one.
Here, the green boxes are in vertical alignment with each other. One is on top of the other.
They are in horizontal alignment with the blue circles. Remember Key-Value Pair collation? Our earlier example worked much the same way. The values you're looking for (here, the green boxes) are distinguished by something else (here, the blue circles) being spatially related in one way or another (here, the boxes being next to the circles horizontally).
Spatial relationships provide important context to what it is you want to find. The green box here is unique in terms of the context of the blue circle above it. The vertical alignment between the two objects is important to distinguishing the green box from the grey shapes. Note: The green box technically has a diagonal relationship with the other blue circles. However, when it comes to locating and extracting data elements on documents, this spatial relationship is fairly uncommon (which is not to say that there aren't ways to use diagonal relationships in Grooper!).
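Horizontal and vertical alignment can be expressed as simple overlap tests on bounding boxes. The sketch below is only an illustration: the `Box` class, the function names, and every coordinate are made up, since the shapes in the figures have no published positions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """A hypothetical bounding box: top-left corner plus size."""
    left: float
    top: float
    width: float
    height: float

def _overlaps(a_start: float, a_len: float, b_start: float, b_len: float) -> bool:
    """True when two 1-D ranges share at least some extent."""
    return a_start < b_start + b_len and b_start < a_start + a_len

def horizontally_aligned(a: Box, b: Box) -> bool:
    """Side by side: the boxes share some vertical extent (same 'row')."""
    return _overlaps(a.top, a.height, b.top, b.height)

def vertically_aligned(a: Box, b: Box) -> bool:
    """One above the other: the boxes share some horizontal extent (same 'column')."""
    return _overlaps(a.left, a.width, b.left, b.width)

circle_above = Box(left=100, top=0, width=20, height=20)
other_circle = Box(left=0, top=0, width=20, height=20)
green_box = Box(left=100, top=50, width=20, height=20)

print(vertically_aligned(green_box, circle_above))  # directly under this circle
print(vertically_aligned(green_box, other_circle))  # only diagonal to this one
```

A diagonal relationship, as in the note above, is simply the case where neither test passes.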
How are the grey boxes spatially related?
Not a trick question. They are in horizontal alignment with each other. The green boxes are also in horizontal alignment with each other.
How are the green boxes different from the grey boxes?
There are simply more boxes in the green row. Sometimes, it's important how many of one thing or another share a spatial relationship, or just that multiple items share this relationship. This can give important context to what this group of data is and means for your extraction.
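Counting how many items share a row can be sketched in a few lines of Python. The box counts, the y-coordinates, and the `group_into_rows` function are hypothetical. They only illustrate grouping items by a shared horizontal alignment.

```python
def group_into_rows(items: list[tuple[str, float]], tol: float = 5.0) -> list[list[str]]:
    """Group (name, y) items into rows of items whose y values are within tol of the row's first item."""
    rows: list[tuple[float, list[str]]] = []
    for name, y in sorted(items, key=lambda item: item[1]):
        if rows and abs(y - rows[-1][0]) <= tol:
            rows[-1][1].append(name)  # same row as the previous item
        else:
            rows.append((y, [name]))  # start a new row
    return [names for _, names in rows]

# Hypothetical layout: a short grey row above a longer green row.
items = [("grey", 0), ("grey", 1), ("green", 40), ("green", 41), ("green", 40), ("green", 39)]
rows = group_into_rows(items)
print([len(row) for row in rows])
# [2, 4]
```

Both rows are horizontally aligned internally. The difference that matters here is how many items each row contains.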











