Data Context (Concept)


This article was migrated from an older version and has not been updated for the current version of Grooper.

This tag will be removed upon article review and update.

This article is about the current version of Grooper.

Note that some content may still need to be updated.


Data Context refers to contextual information used to extract data, such as a label that identifies the value you want to collect.

About

Context is critical to understanding and modeling the relationships between pieces of information on a document. Without context, it’s impossible to distinguish one data element from another. Context helps us understand what data refers to or “means”.

Context allows us to build extraction logic using Data Type, Field Class, or other extractors in order to build and populate a Data Model.

There are three fundamental data context relationships:

  • Structural - Context given by the syntax of data.
  • Semantic - Context given by the lexical content associated with the data.
  • Spatial - Context given by where the data exists on the page, in relationship with other data.

Using the context these relationships provide allows us to understand how to target data with extractors.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2024). The first contains a Project with resources used in examples throughout this article. The second contains one or more Batches of sample documents.


Structural Context

All data follows some kind of syntax. Characters, their positions in a text string, and the order in which they appear inform what that data is. For example, US currency is easily identified by its syntax. Once you see a dollar sign ("$") followed by some digits ("0" through "9"), you instantly know that piece of information is referring to currency.

Not only is the dollar sign itself important, but its position is important as well. If you see "$100", you instantly know you're talking about a hundred dollars. But, if you see "100$", it's more confusing. Maybe this refers to a currency value, but maybe it doesn't. It does not have that agreed upon understanding of currency values provided by the structural context of a dollar sign at the beginning of the numbers.

Syntax can give you quite a bit of context to understand what your data is and how to target it for extraction.
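As a rough illustration of the currency example, the sketch below uses Python's re module rather than Grooper itself. The pattern and the function name find_currency are assumptions made for this example only.

```python
import re

# Illustrative pattern (an assumption, not a Grooper default):
# a dollar sign followed directly by one or more digits.
CURRENCY = re.compile(r"\$\d+")


def find_currency(text: str) -> list[str]:
    """Return every substring with the '$' + digits syntax."""
    return CURRENCY.findall(text)


print(find_currency("Total due: $100"))  # ['$100']
print(find_currency("Total due: 100$"))  # []
```

The second string contains the same characters, but the dollar sign is in the wrong position, so the structural context is missing and nothing is returned.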

07/20/1969
07.20.1969
07‑20‑1969

Similarly, dates follow some agreed-upon structural conventions.

In the most basic sense, a date is just a string of text characters. You may not have thought a lot about it, but you actually know quite a bit just from the syntax of a date.

  • You know months, days, and years are separated by certain characters: slashes, hyphens, and dots.
  • You know (for certain parts of the world) the first series of digits refers to a month, the middle to a day, and the last to a year.
  • This information is purely understood by the date syntax, the kinds of characters used and their order in the text string.


20‑1969/07

This string uses the same characters as the dates above but in a different configuration.

This is no longer a date! The structural context of what makes a string of characters a date is no longer there.

Even the English language follows a syntax! It has an alphabet, a list of characters used ("A" through "Z"). It has certain rules, such as adding an "s" (or sometimes "es") to the end of a noun to (usually) make it plural. The apostrophe is a special character used to combine two words, as when "do not" becomes "don't". How characters are used and in what way informs how we read the written word.

Structural Context and Extraction

The great thing about syntax is it is (often) highly structured. This structure allows us to capture data based on the syntax's pattern. Take our example of the various dates (and the not date) above.

07/20/1969‑‑
07.20.1969‑‑
07‑20‑1969‑‑
20‑1969/07‑‑

The three dates follow a pattern:

  • Two digits, then a slash, hyphen or a dot, then two digits, then another slash, hyphen or dot, and last, four digits.

The non-date string "20-1969/07" does not follow this pattern.


Targeting these structural contexts is quite easy with regular expression pattern matching. If you know the syntax informing what a piece of data is, you can write a regex pattern to match that syntax.

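In this case, the regex pattern \d{2}[/.-]\d{2}[/.-]\d{4} would return the top three dates but not the "non-date" at the bottom. (The hyphen is placed last inside the brackets so it is read as a literal hyphen and not as a character range.)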
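The same idea can be tried outside Grooper. The following is a minimal sketch using Python's re module; the sample text simply restates the four strings from this section with plain ASCII hyphens, and the names DATE and SAMPLE are illustrative.

```python
import re

# Two digits, a separator, two digits, a separator, four digits.
# The hyphen sits last in the character class so it is a literal hyphen.
DATE = re.compile(r"\d{2}[/.-]\d{2}[/.-]\d{4}")

SAMPLE = "07/20/1969\n07.20.1969\n07-20-1969\n20-1969/07\n"

matches = DATE.findall(SAMPLE)
print(matches)  # ['07/20/1969', '07.20.1969', '07-20-1969']
```

The string "20-1969/07" is skipped because its digits and separators do not appear in the required order.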

Structural Context and Regular Expression

Data Type and Value Reader extractors use regular expression pattern matching to do just this.

  1. In Grooper, a Pattern Match extractor can use a regular expression to extract information using syntactic context. Here, the pattern has been set up to match dates: \d{2}[/.-]\d{2}[/.-]\d{4}
  2. All three dates are returned. The matches are highlighted in green in the Document Viewer and listed in the Results Viewer.
  3. Notice the last string is not returned, as it does not follow the required syntactic pattern of a date.

Semantic Context

More often than not, syntax alone doesn't provide enough context to identify a value. For example, take a social security number. Typically, an SSN uses a standard syntax.

441‑12‑1234

We can easily identify this number as a social security number just from its syntax.

441121234

However, without the context the dashes provide, it's much more ambiguous. Maybe the number is an SSN, but it could be something else. We don't have enough context to tell what this piece of data is.

In cases like this, or for data that does not have a unique syntax, something else needs to provide context. We simply need more information to determine what this data is.

By far one of the most common ways of identifying data on the page is with language. If we stick a label in front of that number, it becomes much easier to tell what it is.

Social Security Number: 441121234

Now we know we're talking about a social security number due to the semantic context of the text label.

In a very simple way, because we know what the words mean, we know what the data means.
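To make the SSN example concrete, here is a small Python sketch (not Grooper configuration). The two patterns and the function find_ssn are assumptions for illustration: one relies on the dashed syntax alone, the other relies on the label to give nine bare digits their meaning.

```python
import re
from typing import Optional

# Structural context: the dashed 3-2-4 digit syntax identifies the value by itself.
DASHED_SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# Semantic context: nine bare digits are accepted only when the label precedes them.
LABELED_SSN = re.compile(r"Social Security Number: (\d{9})\b")


def find_ssn(text: str) -> Optional[str]:
    """Return an SSN if the syntax or the label provides enough context."""
    dashed = DASHED_SSN.search(text)
    if dashed:
        return dashed.group(0)
    labeled = LABELED_SSN.search(text)
    if labeled:
        return labeled.group(1)
    return None


print(find_ssn("441-12-1234"))                        # 441-12-1234
print(find_ssn("441121234"))                          # None
print(find_ssn("Social Security Number: 441121234"))  # 441121234
```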


Not only can semantic context identify data, but it can distinguish it as well.

Order Date: 01/01/2020
Delivery Date: 01/15/2020
Due Date: 01/30/2020

The data here are all dates, which we can easily infer from the date syntax.

However, each label further identifies each date and distinguishes the order date from the delivery date and from the due date.

At the end of the day, how useful is it that what you're seeing is a date? You want to know what that date refers to. What makes it different from another date. What it means. That is what semantic context helps you do.

Semantic Context and Extraction

Semantic context can also be targeted by regular expression. A word or phrase providing context can be explicitly matched. You just need to know what word or phrase is providing the context!

Order Date: 01/01/2020‑‑
Delivery Date: 01/15/2020‑‑
Due Date: 01/30/2020‑‑

While the regular expression \d{2}[/.-]\d{2}[/.-]\d{4} would match all three dates, what if we only wanted to match the line containing the order date?


The regular expression Order Date: \d{2}[/.-]\d{2}[/.-]\d{4} would match only the line containing the order date, but not any of the others.


We use the semantic context of what the phrase "order date" means in combination with the structural context of how dates are patterned to narrow our result down to what we want.
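A short Python sketch of this comparison follows. It uses the re module rather than Grooper, and the sample text (with \r\n line endings) is an assumption that mirrors the three lines above.

```python
import re

TEXT = (
    "Order Date: 01/01/2020\r\n"
    "Delivery Date: 01/15/2020\r\n"
    "Due Date: 01/30/2020\r\n"
)

# Structural context only: every date qualifies.
ALL_DATES = re.compile(r"\d{2}[/.-]\d{2}[/.-]\d{4}")
# Structural plus semantic context: the label must come first.
ORDER_LINE = re.compile(r"Order Date: \d{2}[/.-]\d{2}[/.-]\d{4}")

all_dates = ALL_DATES.findall(TEXT)
order_lines = ORDER_LINE.findall(TEXT)

print(all_dates)    # ['01/01/2020', '01/15/2020', '01/30/2020']
print(order_lines)  # ['Order Date: 01/01/2020']
```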

Besides writing explicit regular expressions for the words and phrases related to the data you want to extract, you can use various Collation Providers to let language give context to the data. The Key-Value Pair Collation Provider is a very common way to collate extraction results. For this provider, the "Key Extractor" matches the semantic context for a piece of data, typically a label for a field value. The "Value Extractor" matches the data you want to return. When the extractor uses Key-Value Pair collation, if the Key Extractor's result is close to a valid result returned by the Value Extractor, it will associate these two pieces of information and return the value.
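The pairing idea can be sketched in a few lines of Python. This is only an analogy and not how Grooper implements Key-Value Pair collation: here "close" is approximated as a small character distance in the text, and the names KEY, VALUE, key_value_pairs, and max_gap are all hypothetical.

```python
import re

KEY = re.compile(r"Order Date:")                    # stands in for a Key Extractor
VALUE = re.compile(r"\d{2}[/.-]\d{2}[/.-]\d{4}")    # stands in for a Value Extractor


def key_value_pairs(text: str, max_gap: int = 3) -> list[str]:
    """Return each value that starts within max_gap characters after a key.

    Assumption: closeness is measured in characters of text, which is a
    simplification of real page layout.
    """
    values = list(VALUE.finditer(text))
    results = []
    for key in KEY.finditer(text):
        for value in values:
            gap = value.start() - key.end()
            if 0 <= gap <= max_gap:
                results.append(value.group(0))
                break
    return results


TEXT = "Order Date: 01/01/2020\nDelivery Date: 01/15/2020\nDue Date: 01/30/2020\n"
print(key_value_pairs(TEXT))  # ['01/01/2020']
```

Only the date sitting next to the key is returned; the other two dates are valid values but have no key nearby.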

Semantic Context and Regular Expression

For the example above, ultimately we just want the date. We don't want the whole string "Order Date: 01/01/2020". We just want the date value "01/01/2020". We still want to use the semantic context of our label to identify the date, but when it comes time to return a value, we don't want the label included.


There are many ways to do this in Grooper. The most basic is the Prefix Pattern of Data Types and Value Readers. For simple cases like this one, this approach can be very effective. Prefix Patterns match a regular expression before the Value Pattern in the text flow.


Essentially, we can break up our longer pattern Order Date: \d{2}[/.-]\d{2}[/.-]\d{4} matching the full line into two.

  • The value we want to match \d{2}[/.-]\d{2}[/.-]\d{4} will comprise the Value Pattern. This is what we want to return.
  • The label before the value Order Date: will comprise the Prefix Pattern. This is the context for the value we want to return.

  1. Here we still have the regular expression to return dates for our pattern match.
  2. We have put Order Date: as a Prefix Pattern. This way Grooper uses the label here for context, but does not return it as part of the result.
  3. See that we now only get returned the date that we want. The other results are thrown out because they do not have the required Prefix Pattern.
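Outside Grooper, a similar effect can be approximated with a regex lookbehind. The sketch below is an analogy in Python, not a description of how the Prefix Pattern property works internally; the names PREFIX, VALUE, and ORDER_DATE are illustrative.

```python
import re

PREFIX = r"Order Date: "                  # context only, never returned
VALUE = r"\d{2}[/.-]\d{2}[/.-]\d{4}"      # the part we want back

# (?<=...) requires the prefix to appear immediately before the value
# without making it part of the match. Python needs a fixed-width
# lookbehind, which this literal label satisfies.
ORDER_DATE = re.compile("(?<=" + PREFIX + ")" + VALUE)

TEXT = (
    "Order Date: 01/01/2020\r\n"
    "Delivery Date: 01/15/2020\r\n"
    "Due Date: 01/30/2020\r\n"
)

result = ORDER_DATE.findall(TEXT)
print(result)  # ['01/01/2020']
```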

Semantic Context and Labeled Value

A Labeled Value extractor uses both a Label Extractor and a Value Extractor to use semantic context to return a desired result.

  1. On this Data Type, we have set the Local Extractor to a Labeled Value.
  2. If we open the sub-properties of the Labeled Value Extractor, we see that the Label Extractor has been set to a List Match and the Value Extractor has been set to a Pattern Match.


  1. The List Match has been set to collect the label of Order Date.


  1. The Pattern Match has been set using the same pattern as before to collect all dates on the document.


  1. We will test the extraction on the Tester tab.
  2. We see that Grooper is returning the date we want using semantic context of the Labeled Value extractor.
  3. The label is outlined in blue and the result is highlighted in green in the Document Viewer.


Spatial Context

Spatial relationships refer to how an object is located in space relative to another reference object. Understanding and using spatial relationships is critical for successful extraction techniques. It's so critical we often don't even realize we're using them.

Going back to our very first example, we discussed how "$100" is easily distinguishable as a currency value, but "100$" is more ambiguous data. This is purely because of the spatial relationship between the dollar sign "$" and the numerical value "100". When the dollar sign is physically located before the number, it's clear the value is a currency value. The simple spatial context of a dollar sign in front of the number instead of behind it makes this data mean something.

Written language itself has a spatial relationship. We read English from left to right and from the top of the page to the bottom of the page. It doesn't make sense any other way! (For English anyway.) Spaces give the important spatial context of where one word stops and another starts. Indentation gives readers spatial clues as to where paragraphs begin.

While we often take these spatial contexts for granted, they can become crucially important for understanding how to target and extract the data you want.

We don't even need text to understand spatial context.

Spatial Alignment

How are the two green boxes spatially related?

The green boxes are in horizontal alignment with each other.


All the boxes are pretty close to one another. But, what makes the green boxes different (besides being green) is that they are next to each other in space horizontally.

This spatial relationship distinguishes them from the brown one.


What about here? How are the two green boxes spatially related?

Here, the green boxes are in vertical alignment with each other. One is on top of the other.

Spatial Anchors


What distinguishes the green boxes from the brown boxes here?

They are in horizontal alignment with the blue circles.

  • The green boxes are anchored in space to the right side of the blue circles. If we know the green boxes are always next to a blue circle, we can use that spatial relationship to find them.

Remember Key-Value Pair collation? Our earlier example of a Key-Value Pair Data Type extractor worked much the same way.

  • The values you're looking for (here, the green boxes) are distinguished by something else (here, the blue circles) being spatially related in one way or another (here, the boxes being next to the circles horizontally).

Spatial relationships provide important context to what it is you want to find.
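The anchor idea can be expressed with simple coordinates. The following Python sketch is purely illustrative: the Shape class, the coordinates, and the tolerance value are all made up for this example and do not correspond to any Grooper object.

```python
from dataclasses import dataclass


@dataclass
class Shape:
    name: str
    kind: str      # "circle" or "box"
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def right_of_anchor(anchor: Shape, candidate: Shape, tolerance: float = 5.0) -> bool:
    """True when candidate is to the right of anchor on roughly the same row."""
    anchor_mid = anchor.y + anchor.height / 2
    candidate_mid = candidate.y + candidate.height / 2
    same_row = abs(anchor_mid - candidate_mid) <= tolerance
    to_the_right = candidate.x >= anchor.x + anchor.width
    return same_row and to_the_right


def anchored_boxes(shapes: list[Shape]) -> list[str]:
    """Names of boxes that have a circle to their left on the same row."""
    circles = [s for s in shapes if s.kind == "circle"]
    boxes = [s for s in shapes if s.kind == "box"]
    return [b.name for b in boxes if any(right_of_anchor(c, b) for c in circles)]


# Hypothetical layout: rows at y=0 and y=60 have a circle, the row at y=30 does not.
SHAPES = [
    Shape("circle1", "circle", 0, 0, 10, 10),
    Shape("A", "box", 20, 0, 10, 10),
    Shape("B", "box", 20, 30, 10, 10),
    Shape("circle2", "circle", 0, 60, 10, 10),
    Shape("C", "box", 20, 60, 10, 10),
]

print(anchored_boxes(SHAPES))  # ['A', 'C']
```

Box B sits close to the others but has no circle on its row, so it is the "brown box" of this layout.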

Here, many shapes share a vertical alignment with the blue circle above them.

  • However, between the green box and the brown box, only the green box has a circle above it. The brown box has a circle to its left side.

The green box and brown box have different spatial contexts in terms of their relationship with blue circles. The green box is spatially anchored to the blue circle above it, whereas the brown box shares a different spatial relationship to a blue circle.

  • When it comes to finding what you want, distinguishing between data's spatial relationships to other data can be critical to properly locating it on the page and extracting it.

How are the two green boxes related here?

They're between two blue circles.

Sometimes, you can find data just knowing it's anchored between other data. This can be helpful to distinguish between other similar data, or if you know you want to return whatever is between two known pieces of information.

Spatial Order

How are the brown boxes spatially related?

Not a trick question. They are in horizontal alignment with each other.

The green boxes are also in horizontal alignment with each other.

How are the green boxes different from the brown boxes?

There are simply more boxes in the green row. Sometimes, it's important how many of one thing or another share a spatial relationship, or just that multiple items share this relationship.

This can give important context to what this group of data is and means for your extraction.

How is the green row of shapes similar to the two brown rows of shapes?

For all three rows, the shapes in each row are aligned horizontally.

All three rows have the same four shapes.

How is the green row of shapes different from the two brown rows of shapes?

The shapes are in a different order.

Sometimes alignment is only half the spatial battle. The order in which items appear (first, second, third, etc.) can itself be critical context to telling one group of data from another.

Spatial Context and Extraction

In Grooper, spatial contexts are used in a variety of ways, but can be placed into two main categories.

  • Through control and anchor characters - The ^, $, \r, \n, \t, and \f that denote large amounts of whitespace and positional information.
  • Through Collation Providers - Including (but not limited to) Key-Value Pair, Key-Value List, Array, Ordered Array, Split.


Spatial Anchor Characters

In the document seen here, there are three columns, each with pairs of numbers of various lengths.

  • We have written a regular expression here to collect all of these numbers with varying lengths from the document. \d+ \d+


However, what if you want to match numbers only in one column or the other?

We need to provide more context to what makes these columns distinct. There's a clear interplay of spatial relationships here. Columns are just collections of text separated in space. Each column is distinguished by a large amount of space between them. We know where the first column starts because it's at the beginning of a line. We know where the last column ends because it's at the end of a line.

This is very easy to understand visually. We just need a way to put this idea into practice with regular expression. We do this through anchor and control characters.

  • ^ - Beginning of string. This character will match the beginning of a document's text flow (or beginning of a data instance).
  • $ - End of string. This character will match the end of a document's text flow (or end of a data instance).
  • \n - New Line. This character will match the start of a line of text.
  • \r - Carriage Return. This character will match the end of a line of text.
  • \f - Form Feed. This character will match the start of a new page for multi-page documents.
  • \t - Tab Character. This character will match large amounts of white space (The Tab Marking properties of a Data Type or Value Reader will determine how wide spaces between characters must be to count as a "tab" and not a single space.)

FYI

If you want to be technical about it, ^ and $ are anchor characters and \r, \n, \t, and \f are control characters. However, for our purposes here, they are all used as spatial anchors, reference points on the page that provide context to where a piece of data is. While it may not be technically correct as far as regex goes to call control characters "anchors", as far as spatial logic goes, they are characters in the text flow used to anchor other text to some point on the page.


Matching the First Column

What do we know about the first column of numbers? What distinguishes them on the page from the other columns?

They're the only ones at the beginning of a line! Knowing this, we can use the control character \n as a Prefix Pattern. The \n character always comes before the numbers in the first column in the text flow. We use the spatial context of the beginning of a line to narrow down the values we want to extract.

  • To collect only the first column, we have added the control character of \n as a prefix pattern. Only values that start on a new line will be returned.
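The same logic can be tried in plain Python. In this sketch the sample text is invented (a heading line followed by two rows of three columns, each line ending in \r\n), and a lookbehind stands in for the Prefix Pattern; it is an approximation, not Grooper's implementation.

```python
import re

# Hypothetical text flow: a heading, then rows of three number pairs.
SAMPLE = (
    "Numbers\r\n"
    "12 345     6789 01     2 34\r\n"
    "567 8     90 1234     56 789\r\n"
)


def first_column(text: str) -> list[str]:
    """Number pairs that start immediately after a line feed."""
    return re.findall(r"(?<=\n)\d+ \d+", text)


print(re.findall(r"\d+ \d+", SAMPLE))
# ['12 345', '6789 01', '2 34', '567 8', '90 1234', '56 789']
print(first_column(SAMPLE))  # ['12 345', '567 8']
```

Note that a value on the very first line of a text flow has no \n in front of it, so in this sketch the heading line is what lets every row of numbers qualify.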


Matching the Last Column

What do we know about the last column of numbers? What distinguishes them on the page from the other columns?

Same story, different character. They're at the end of a line of text. We can use the control character \r as a Suffix Pattern this time. The \r character comes after all the numbers in this column. This is why we have a pair of control characters at the end of every line instead of just one. The \r provides context for the end of a line where \n provides the context for the start.

  • To collect only the last column, we have added the control character of \r as a suffix pattern. Only values that end with a carriage return will be collected.
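A comparable Python sketch for the last column is shown here. The sample text is invented, and a lookahead plays the role of the Suffix Pattern as an approximation of the behavior described above.

```python
import re

# Hypothetical text flow: each line ends with the \r\n pair.
SAMPLE = (
    "Numbers\r\n"
    "12 345     6789 01     2 34\r\n"
    "567 8     90 1234     56 789\r\n"
)


def last_column(text: str) -> list[str]:
    """Number pairs that are immediately followed by a carriage return."""
    return re.findall(r"\d+ \d+(?=\r)", text)


print(last_column(SAMPLE))  # ['2 34', '56 789']
```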


Matching the Middle Column

What about the middle column? How are these numbers different from the numbers in the other columns?

The numbers in the middle column are isolated by large amounts of space on either side of them. This can be targeted through the tab character. Typically, whether a single space between characters or a large gap between characters, all white space gaps are translated as a space character in the text data. With Tab Marking enabled, space characters are replaced with tab characters for wider gaps between characters. By default, Tab Marking is disabled, so you will need to enable it.

  1. Click on the Properties tab.
  2. Open up the Preprocessing sub-properties.
  3. Click the checkbox next to Tab Marking to enable it. Now you should be able to use \t for tabs in your regex.


With Tab Marking enabled, and \t as our Prefix and Suffix Pattern, you can see our pattern matches. Now that we have tab characters in the text data, we can use those characters as spatial anchors to the wide white space gap between text segments.

  • To collect the middle column, we need to use both a Prefix and Suffix pattern. We want anything that matches the Value Pattern between two tabbed spaces.
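The middle-column case can also be approximated in Python. The mark_tabs function below is a stand-in for Tab Marking and rests on a loud assumption: it treats any run of two or more spaces as a "wide gap", whereas Grooper decides this from its Tab Marking properties. The sample text and function names are hypothetical.

```python
import re

SAMPLE = (
    "Numbers\r\n"
    "12 345     6789 01     2 34\r\n"
    "567 8     90 1234     56 789\r\n"
)


def mark_tabs(text: str) -> str:
    """Replace runs of two or more spaces with one tab (assumed threshold)."""
    return re.sub(r" {2,}", "\t", text)


def middle_column(text: str) -> list[str]:
    """Number pairs with a tab directly before and directly after them."""
    return re.findall(r"(?<=\t)\d+ \d+(?=\t)", mark_tabs(text))


print(middle_column(SAMPLE))  # ['6789 01', '90 1234']
```

Without the tab-marking step the text contains no \t characters at all, so the same pattern would return nothing.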