2023:Data Context (Concept)
Data without context is meaningless.
About
Context is critical to understanding and modeling the relationships between pieces of information on a document. Without context, it’s impossible to distinguish one data element from another. Context helps us understand what data refers to or “means”.
Understanding context allows us to build extraction logic using Data Type and Field Class extractors, which in turn lets us build and populate a Data Model.
There are three fundamental data context relationships:
- Syntactic - Context given by the syntax of data.
- Semantic - Context given by the lexical content associated with the data.
- Spatial - Context given by where the data exists on the page, in relationship with other data.
Using the context these relationships provide allows us to understand how to target data with extractors.
Syntactic
All data follows some kind of syntax. Characters, their positions in a text string, and the order in which they appear inform what that data is. For example, US currency is easily identified by its syntax. Once you see a dollar sign ("$") followed by some digits ("0" through "9"), you instantly know that piece of information is referring to currency values.
Not only is the dollar sign itself important, but its position is important as well. If you see "$100", you instantly know you're talking about a hundred dollars. But if you see "100$", it's more confusing. Maybe this refers to a currency value, but maybe it doesn't. It lacks the agreed-upon understanding of currency values provided by the syntactic context of a dollar sign placed before the digits.
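The point about position can be sketched with a regular expression. This is a minimal, generic Python illustration, not a Grooper extractor configuration; the pattern and the function name looks_like_currency are assumptions made up for this example.

```python
import re

# Hypothetical pattern for illustration: a dollar sign first, then one or more digits.
CURRENCY = re.compile(r"\$\d+")

def looks_like_currency(text: str) -> bool:
    """Return True only if the whole string is '$' followed by digits."""
    return CURRENCY.fullmatch(text) is not None

print(looks_like_currency("$100"))  # True
print(looks_like_currency("100$"))  # False: same characters, wrong order
```

Both strings contain exactly the same characters. Only the order differs, and the order is what the pattern keys on.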
Similarly, dates follow some agreed-upon syntactic structures.
You may not have thought a lot about it, but you actually know quite a bit just from the syntax of a date. First, you know months, days, and years are separated by certain characters: slashes, hyphens, or dots. You know (for certain parts of the world) the first series of digits refers to a month, the middle to a day, and the last to a year. All of this is understood purely from the syntax: the kinds of characters used and their order in the text string.
The string below uses the same characters as the dates above but in a different configuration. This is no longer a date! The syntactic context is no longer there.
Even the English language follows a syntax! It has an alphabet, a list of characters used ("A" through "Z"). It has certain rules: for example, adding an "s" or "es" to the end of a noun will often pluralize that noun. The apostrophe is a special character used to combine two words, as when "do not" becomes "don't". How characters are used, and in what order, informs how we read the written word.
Syntactic Data Contexts and Extraction
The great thing about syntax is that it is highly structured. This structure allows us to capture data based on the syntax's pattern. Take our example of the various dates (and the "not-date") above.
The three dates follow a pattern: two digits, then a slash, hyphen, or dot, then two digits, then another slash, hyphen, or dot, and last, four digits.
The "not-date" does not follow that pattern.
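The pattern described above translates almost word for word into a regular expression. The sketch below is a generic Python illustration, not a Grooper Data Type configuration. The sample strings are hypothetical stand-ins, since the original examples are not reproduced here, and the name is_date is an assumption for this example.

```python
import re

# Two digits, a separator (slash, hyphen, or dot), two digits,
# another separator, then four digits.
DATE = re.compile(r"\d{2}[/.-]\d{2}[/.-]\d{4}")

def is_date(text: str) -> bool:
    """Return True if the whole string follows the date pattern."""
    return DATE.fullmatch(text) is not None

# Hypothetical sample values: three dates and one "not-date"
# built from the same kinds of characters.
samples = ["12/25/2023", "12-25-2023", "12.25.2023", "20/23.1-225"]

for sample in samples:
    print(sample, is_date(sample))
```

The first three samples match and the last one does not, even though it uses only digits, a slash, a dot, and a hyphen. Note that the pattern checks syntax only: a string like "99/99/9999" would also match, because nothing in the pattern knows which month or day values are valid.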

