2023:Data Context (Concept): Difference between revisions

Revision as of 07:54, 11 August 2020

Data without context is meaningless.

About

Context is critical to understanding and modeling the relationships between pieces of information on a document. Without context, it’s impossible to distinguish one data element from another. Context helps us understand what data refers to or “means”.

Understanding context allows us to build extraction logic using Data Type and Field Class extractors in order to build and populate a Data Model.

There are three fundamental data context relationships:

Syntactic - Context given by the syntax of data.
Semantic - Context given by the lexical content associated with the data.
Spatial - Context given by where the data exists on the page, in relationship with other data.

Using the context these relationships provide allows us to understand how to target data with extractors.

Syntactic

All data follows some kind of syntax. Characters, their positions in a text string, and the order in which they appear inform what that data is. For example, US currency is easily identified by its syntax. Once you see a dollar sign ("$") followed by some digits ("0" through "9"), you instantly know that piece of information is referring to currency values.

Not only is the dollar sign itself important, but its position is important as well. If you see "$100", you instantly know you're talking about a hundred dollars. But if you see "100$", it's more confusing. Maybe this refers to a currency value, but maybe it doesn't. It lacks the agreed-upon understanding of currency values that a dollar sign at the beginning of the number provides.
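As a minimal sketch of this idea, the Python snippet below checks whether a string follows one assumed currency syntax: a leading dollar sign, one or more digits, and an optional two-digit cents part. The pattern and the `looks_like_currency` helper are illustrative assumptions, not part of Grooper.

```python
import re

# Assumed currency syntax for illustration only:
# a dollar sign, one or more digits, then optionally a dot and two digits.
CURRENCY = re.compile(r"\$\d+(?:\.\d{2})?")


def looks_like_currency(text: str) -> bool:
    """Return True when the whole string follows the assumed currency syntax."""
    return CURRENCY.fullmatch(text) is not None


print(looks_like_currency("$100"))  # dollar sign leads the digits
print(looks_like_currency("100$"))  # dollar sign trails the digits
```

Under this assumed rule, "$100" is accepted and "100$" is rejected, because the position of the dollar sign is part of the syntax.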

Similarly, dates follow agreed-upon syntactic structures.


	07/20/1969
	07.20.1969
	07-20-1969

You may not have thought a lot about it, but you actually know quite a bit just from the syntax of a date. First, you know months, days, and years are separated by certain characters: slashes, hyphens, and dots. You know (for certain parts of the world) the first series of digits refers to a month, the middle to a day, and the last to a year. This is understood purely from the syntax: the kinds of characters used and their order in the text string.

The string below uses the same characters as the dates above but in a different configuration. This is no longer a date! The syntactic context is no longer there.

20-1969/07

Even the English language follows a syntax! It has an alphabet, a list of characters used ("A" through "Z"). It has certain rules, such as adding an "s" or "es" to the end of a noun to pluralize it. The apostrophe is a special character used to combine two words, as when "do not" becomes "don't". How characters are used and in what way informs how we read the written word.

Syntactic Context and Extraction

The great thing about syntax is it is highly structured. This structure allows us to capture data based on the syntax's pattern. Take our example of the various dates (and the not date) above.


	07/20/1969 ✔
	07.20.1969 ✔
	07‑20‑1969 ✔
	20‑1969/07 ✘

The three dates follow a pattern:

Two digits, then a slash, hyphen or a dot, then two digits, then another slash, hyphen or dot, and last, four digits.

The non-date string "20-1969/07" does not follow this pattern.

Targeting these syntactic contexts is quite easy with regular expression pattern matching. If you know the syntax informing what a piece of data is, you can write a regex pattern to match that syntax.

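In this case, the regex pattern \d{2}[/.-]\d{2}[/.-]\d{4} would return the top three dates but not the "non-date" at the bottom. The hyphen is placed last in the character class so that it is read as a literal character rather than as a range operator.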
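The date pattern can be tried outside Grooper as well. The sketch below uses Python's standard `re` module rather than Grooper's own regex engine, and writes the separator class with the hyphen last (`[/.-]`) so it is treated as a literal character. The variable names are illustrative.

```python
import re

# Two digits, a separator (slash, dot, or hyphen), two digits,
# another separator, then four digits.
DATE = re.compile(r"\d{2}[/.-]\d{2}[/.-]\d{4}")

samples = ["07/20/1969", "07.20.1969", "07-20-1969", "20-1969/07"]

# Keep only the strings that match the date syntax from start to end.
matches = [s for s in samples if DATE.fullmatch(s)]
print(matches)  # ['07/20/1969', '07.20.1969', '07-20-1969']
```

Note that this pattern checks syntax only. It would also accept mixed separators such as "07/20-1969" and impossible dates such as "99/99/9999".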

Syntactic Context and Grooper Extraction

Data Type and Data Format extractors use regular expression pattern matching to do just this.

Here, the Value Pattern is set to a regular expression pattern matching dates:

\d{2}[/.-]\d{2}[/.-]\d{4}

This regex pattern matches all the dates using the three date syntaxes on the page ("##/##/####", "##.##.####", and "##-##-####"). The matches are highlighted in green in the Document Viewer and returned in the Results Viewer.

The unmatching "non-date" doesn't follow any of the date syntaxes. So, it does not match the regular expression and is not returned.

Semantic

More often than not, syntax alone doesn't provide enough context to identify a value. For example, take a social security number. Typically, an SSN uses a standard syntax.

441‑12‑1234

We can easily identify this number as a social security number just from its syntax.

441121234

However, without the context the dashes provide, it's much more ambiguous. Maybe the number above is an SSN, but it could be something else. We don't have enough context to tell what this piece of data is.

In cases like this, or for data that does not have a unique syntax, something else needs to provide context. We simply need more information to determine what this data is.

By far one of the most common ways of identifying data on the page is with language. If we stick a label in front of that number, it becomes much easier to tell what it is.

Social Security Number: 441121234

Now we know we're talking about a social security number due to the semantic context of the text label.
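A rough sketch of this reasoning in Python: a dashed number is accepted on syntax alone, while a bare nine-digit number is accepted only when the label supplies semantic context. The two patterns and the `find_ssn` helper are assumptions made for illustration, not Grooper functionality.

```python
import re
from typing import Optional

# Syntactic context: the ###-##-#### shape is enough on its own.
DASHED_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Semantic context: nine bare digits count only when the label precedes them.
LABELED_SSN = re.compile(r"Social Security Number:\s*(\d{9})\b")


def find_ssn(text: str) -> Optional[str]:
    """Return an SSN found by syntax or by label, or None if neither applies."""
    dashed = DASHED_SSN.search(text)
    if dashed:
        return dashed.group(0)
    labeled = LABELED_SSN.search(text)
    if labeled:
        return labeled.group(1)
    return None


print(find_ssn("441-12-1234"))                        # found by syntax
print(find_ssn("441121234"))                          # ambiguous, so None
print(find_ssn("Social Security Number: 441121234"))  # found by label
```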

In a very simple way, because we know what the words mean, we know what the data means.

Not only can semantic context identify data, but it can distinguish it as well.


	Order Date: 01/01/2020
	Delivery Date: 01/15/2020
	Due Date: 01/30/2020

The data here are all dates, which we can easily infer from the date syntax.

However, each label further identifies each date and distinguishes the dates from one another: the order date from the delivery date from the due date.

At the end of the day, how useful is it that what you're seeing is a date? You want to know what that date refers to. What makes it different from another date. What it means. That is what semantic context helps you do.

Semantic Context and Extraction

Semantic context can also be targeted by regular expression.
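As one hedged illustration, a pattern can include the label text itself so that each date is captured together with the word that gives it meaning. The sketch below uses Python's `re` module; the pattern and variable names are assumptions for this example and are not taken from a Grooper configuration.

```python
import re

# The label (semantic context) is matched alongside the date (syntactic context).
LABELED_DATE = re.compile(r"(Order|Delivery|Due) Date:\s*(\d{2}/\d{2}/\d{4})")

text = """Order Date: 01/01/2020
Delivery Date: 01/15/2020
Due Date: 01/30/2020"""

# findall returns one (label, date) tuple per match when there are two groups.
dates = {label: value for label, value in LABELED_DATE.findall(text)}

print(dates["Order"])     # 01/01/2020
print(dates["Delivery"])  # 01/15/2020
print(dates["Due"])       # 01/30/2020
```

All three values share the same date syntax. Only the captured label tells them apart.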
