Header-Value (Table Extract Method)

The Header-Value Extract Method's properties

Header-Value is one of three methods available to Data Table elements to extract information from tables on a document set. It uses a combination of column header and column value extractors to determine the table’s structure and extract information from the table’s cells.


About

Where the Row Match method focuses on using a table's rows to model table structure and extract data, Header-Value looks to the table's columns. Extractors are used to find the header labels and the values in those columns. This is very similar to how a human being reads a table. Imagine you're trying to find a piece of information from the table below, say a particular order identification number. The first thing you're going to do is look for the Order ID column. That is what the Header Extractor does. Then, you're going to look for the number you want in that column. That's what the column's Value Extractor is doing (only, of course, our goal in Grooper will be to capture all the values in the column).

The Header Extractor locates a column's header label.

Header labels.png
The Data Column's Value Extractor locates the column's values, using the Header Extractor's result as the anchor point to look down from.
Column values.png


As the name implies, both "Header" extractors and "Value" extractors are required for this method to function. Configuring these extractors is done on each of the Data Columns.


The method you use to extract information from tables is set on Data Table objects added to a Data Model. This is done using the "Extract Method" property.


Method select context.png


Once you select an Extract Method (in this case "Header-Value"), you can see all the configurable properties available to that particular method by expanding the "Extract Method" property.


Header val configure.png


However, most of the "Header-Value" method's configuration is done using the Data Column objects. Here, you will set a Header Extractor and a Value Extractor to model the table's structure and extract information from each column. This can be done referencing extractors made in the Content Model or using simple text patterns configured on Header Extractor and Value Extractor properties.


Header val 1.png


Header val 2.png


Primary vs. Secondary Columns

In order for Header-Value to map out a table's structure, at least one column needs to be set as a "Primary Column". The Value Extractor on a Primary Column matches against the full document text, looking down from the header label's position returned by the Header Extractor. This gives the context of where rows are positioned in the text. Once Grooper knows this, it can create a row instance for every row. Value Extractors on Secondary Columns then match the text inside the row instance, rather than the full document.

With at least one Primary Column, Grooper can find a value... ...which acts like an anchor, from which the row extends.
Primary col 4.png Primary col 5.png
And an instance for each row is created. The Secondary Column's Value Extractors then match against the text in these row instances.
Primary col 6.png


Information in a Primary Column should be present in every row of the table. If a cell is blank and a column is set as a Primary Column, "Header-Value" will skip that row entirely.

For example, if the "Order ID" column is set to Primary... ...and a value is missing...
Primary col 1.png Primary col 2.png
...none of the other columns' information on that row will extract. Grooper would effectively "skip" the row.
Primary col 3.png
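To make the mechanics concrete, here is a minimal Python sketch of the logic described above. This is an illustration only, not Grooper's implementation; the sample rows and the "Order ID"/"Amount" extractor patterns are hypothetical.

 import re

 # Row text with tab-separated cells; the second row's "Order ID" cell
 # is blank (hypothetical sample data).
 rows_text = [
     "1001\tWidget\t$1,500",
     "\tGadget\t$2,300",
     "1003\tSprocket\t$950",
 ]

 primary_re = re.compile(r"\d{4}")        # "Order ID" Value Extractor
 secondary_re = re.compile(r"[$][\d,]+")  # "Amount" Value Extractor

 table = []
 for row in rows_text:
     order_id = primary_re.search(row)
     if not order_id:
         continue  # no Primary value: the whole row is skipped
     # Secondary extractors only see the text inside the row instance.
     amount = secondary_re.search(row)
     table.append((order_id.group(), amount.group() if amount else None))

 print(table)  # [('1001', '$1,500'), ('1003', '$950')]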

Version Differences

Use Cases

The Header-Value method is the second table extraction method created in Grooper. It was made to target tables not easily extracted by Row Match. Row Match loses its efficiency once a table's structure starts to change from document to document. Different companies are going to structure tables however they want, which is well outside your control. Think of all the different ways an invoice can be structured. While the information you want is present in all the different tables (line item amount, count and details, for example), how that data is presented may not be consistent. Even just the column location changing, or columns that are only optionally filled, can present problems for the Row Match method. Row Match's approach of using a Row Extractor to pattern each row may not be able to do the job (or a complicated Row Extractor accounting for multiple row formats may need to be used). For these situations, the Header-Value method is often easier to configure and produces better results.

These are different Oil and Gas Production Reports from various sources. Each one organizes information into tables in different ways. Row Match would work just fine for each individual document. However, while the same information exists on each document, there's enough variability in the table structures that Row Match may not be suited for processing the whole document set. Header-Value is often a better route when presented with varying table structures in a document set.
50939766 Page 1.png Spreadsheet with lines.png 50946163.png 50946164 Page 1.png

Optional data columns, where values may or may not be present in a cell, can complicate things for Row Match as well. Again, a simple Row Extractor may not do the trick. While a more complicated extractor may successfully extract the table's information, the Header-Value method (or the Infer Grid method) may be simpler to set up and produce the same or even better results.

However, the Header-Value method does have its limitations. Perhaps most obviously, header labels are necessary for this method to work. In tables where header labels are not present, Header-Value may not be suitable for use. The Header-Value method may still be able to target these types of tables, but they can be more difficult to set up, or better targeted by the Row Match method.

Furthermore, the Header-Value method requires several extractors to detect a table's structure and extract the values inside: at least two extractors for every Data Column (one for its header and one for its values). Because of this, there are several components to configure in order to extract a table's information. For relatively simple tables, Row Match ends up being simpler to set up, both less time consuming and requiring fewer objects.

The Infer Grid method also has some advantages over Header-Value. There are some specialized use cases, such as reading OMR checkboxes in tables and reprocessing table cells using a secondary OCR Profile, where Infer Grid does things the other two methods simply can’t easily do (and in some cases, simply can't). Infer Grid also performs well when table line information can be saved to a page’s layout data using a Line Detection or Line Removal IP command.

How To

Creating a Data Table in Grooper

Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example for creating a Data Table.

Simpletable.png

Navigate to a Data Model

Using the Node Tree on the left side of Grooper Design Studio, navigate to the Data Model you wish to add the Data Table to. Data Tables can be created as children of any Data Model at any hierarchy in a Content Model.


Create a data table 1.png

Add a Data Table

Right click the Data Model object, mouse over "Add" and select "Data Table".


Create a data table 2.png


The following window will appear. Name the table whatever you would like and press "OK" when finished.


Create a data table 3.png


This creates a new Data Table object in the Node Tree underneath the Data Model.


Create a data table 4.png

Add Data Columns

Right click the Data Table object, mouse over "Add" and select "Data Column".


Create data table 5.png


This brings up the following window to name the Data Column. When finished, press "OK" to create the object.


Create data table 6.png


This creates a new Data Column object in the Node Tree underneath the Data Table.


Create data table 7.png

Repeat Until Finished

Add as many columns as necessary to complete the table. For our example, we have a single Data Table with five Data Columns, each one named for the corresponding column on the document.


Create data table 8.png

Image Processing for Tables

Tables present some specific image processing needs. This section goes over how to make an IP Profile with a few IP Commands useful for table extraction. Below are the examples we will target for image processing.

This is a fairly basic table. It has columns. It has rows. It has cells. It's a table. But, notably for image processing, it has table lines. Removing table lines can be useful to improve OCR results. Furthermore, Grooper's line removal will also save the location of those lines to memory for later use.


Ip table 1.png

This is also a fairly simple table, but it doesn't use table lines to break up rows and columns. Instead, it uses gray color bands to separate rows. It's fairly common for table rows to alternate background colors to help readers break up rows visually. However, these color bands can interfere with OCR, which needs a true black and white image to function best. Grooper can "binarize" the document to end up with a black and white image for OCR.


Ip table 3.png

This is a more complicated table with a combination of table lines and halftone regions. Our IP Profile will target both issues.


Ip table 2.png

Last, direct your attention to the highlighted portion of this invoice. Column headers are often formatted as white text on a black background. This is great for readability. It draws the viewer's attention to the column headers. However, it presents a problem for OCR, which is looking for black pixels to translate into text. Grooper can detect white text on a black background and invert the colors, leaving black text on a white background.


Ip table 4.png

To process these tables we will build an IP Profile with three IP Steps, each of which has a unique IP Command performing a specific image processing operation. Below are the three IP Commands we will use.

The Negative Region Removal command's property panel.

The Negative Region Removal command detects "negative regions" in an image and either inverts them or removes them. Negative regions are portions of a document where white text exists on a black background (or, in the case of some color documents, very light colored text on a darker colored background). In the majority of cases the desire is to invert those regions, turning the white text into black so that OCR can analyze the characters' pixels and translate them into machine readable text. However, negative regions could also just be solid black rectangles or boxes without any rhyme or reason why they are on the document. In those cases, Negative Region Removal could remove them, which would give OCR less to analyze, speeding up the OCR operation and potentially improving results.

Before Negative Region Removal
Neg reg before.png
After Negative Region Removal
Neg reg after.png
The Binarize command's property panel.

Binarize is an IP Command that shows up a lot. Binarize essentially turns color and grayscale images black and white. This could be used as a basic method of cleaning up a document, rendering it black and white to produce better OCR results. However, binarization is often helpful (and sometimes required) to perform other IP Commands. You will often see "Binarization Settings" as a "Preprocessing Option" on other IP Commands (This will temporarily binarize the image to make the IP Command work properly in these cases). For more information on binarization visit the Binarize article.

We will use the Binarize IP Command to remove halftone regions from these tables.

Before Binarize After Binarize
Binarize before.png Binarize after.png
The Line Removal command's property panel.

The Line Removal command removes horizontal and vertical lines from an image. This serves two very important functions when dealing with tables.

First, it gets rid of table lines interfering with OCR results. OCR operates by analyzing black pixels on a page, comparing clusters of pixels to examples of text characters, and making a decision about what machine readable character to assign them. Lines are also comprised of pixels, which means the OCR engine is going to try to figure out what they are too. Removing them will speed up the OCR operation and improve accuracy by removing non-semantic content competing for the OCR engine's attention.

Second, it stores where those lines were for later use. The positional information is saved to the page object's "LayoutData.json" file. Any time Grooper needs to know where those table lines are during extraction, it will reference that file to "see" where those lines are.

Before Line Removal After Line Removal
Line remv before.png Line remv after.png

Once the Line Removal command is applied, the lines' positions are saved to a file named "LayoutData.json" for each Batch Page object in the Batch. You can verify this by viewing it in a page object's list of files. To do this, select a Batch Page in the Node Tree, switch to the "Advanced" tab, and select the "Files" tab. If the Line Removal command successfully detected lines, you will see a file named "LayoutData.json". It will have the length, width, and x-y coordinates of every line detected.


Line remv layoutdata context.png

Next, we will create a simple IP Profile using these three IP Commands and apply it to the Batch of our four documents.

Create a New IP Profile


Navigate to the "Global Resources" folder in the Node Tree. Expand it, and right click the "IP Profiles" folder. Hover your mouse over "Add" and select "IP Profile".


Table ip 1.png


Name the IP Profile "Table IP". Press the "OK" button when finished.


Table ip 2.png

Add the IP Steps

IP Profiles are comprised of a series of IP Steps. Each IP Step will use an IP Command performing a specific image processing function. Press the "Add" button to add a new step in the IP Profile.


Table ip 3.png


Next, we choose which IP Command to perform during the step. The first step will be "Negative Region Removal". Mouse over the "Feature Removal" heading and choose "Negative Region Removal" from the list.


Table ip 4.png


This will invert the white text to black, taking care of table headers that were white text inside a black box.


Table ip 5.png


Next, we will add a "Binarize" step. (1) Press "Add". (2) Mouse over "Miscellaneous" and (3) select "Binarize".


Table ip 6.png


This will take care of turning the images black and white, removing the gray background from rows in the table seen below.


Table ip 7.png


Finally, we will add the "Line Removal" step. (1) Press "Add". (2) Mouse over "Miscellaneous" and (3) select "Binarize".


Table ip 8.png


This will remove table lines and save their position data.


Table ip 10.png


Finally, save the IP Profile by pressing the "Save" button.


Table ip 11.png

A Note on IP Command Properties

Each IP Command has a set of configurable properties. Part of Grooper's design philosophy is to give users as much control of an operation as possible. When possible, we try not to "black box" any operation, forcing users to use one set of settings or another. However, the number of options that level of control brings can be daunting. Some IP Commands, such as Line Removal, have a great deal of configurable properties.


Table ip 12.png


That being said, we have configured the default settings for the largest number of use cases. For these four very different tables, we don't actually need to touch any of the properties in order for this IP Profile to do its job. The property panel is there should you need to make adjustments, but the defaults generally produce good results for most documents.

Should you need to adjust the properties, the diagnostics panel will be an invaluable resource.


Table ip 13.png


The diagnostics panel will provide an "Input Image" showing the document image before the IP Profile is applied and an "Output Image" showing the resulting image. In between are folders containing diagnostic images for every step along the way, including an "Input Image" and "Output Image" for the individual step. These diagnostic images will be an asset when configuring the IP Command's properties. You can use these diagnostic images to test and verify the results of manipulating individual properties.

A Note on Order of Operations

Each IP Command is applied to an image in the order it appears in the IP Profile, from top to bottom, first step to last. This means the image is changed according to the IP Command on the first step, then that altered image is handed to the next step and processed by its IP Command. So, the order in which these steps are processed often does matter.

Even with an example as simple as ours, if we were to move these steps around, we would get different (and undesirable) results. Using the default settings, Negative Region Removal keeps an outline of the black box around the text. In our profile, Line Removal is the last step, so it removes that box (and saves its lines' position data!).


Table ip 14.png


However, if we were to perform Line Removal first, the lines would remain. Its operation would have completed in the first step and would not run again to remove them.


Table ip 15.png


This could result in poorer OCR results for the table headers (the "Part Number/Description" header is close to the lines around it). And, if we ended up needing to know the positions of those lines during extraction, they would not be saved to the page's layout data file. This may not be a huge deal for this particular IP Profile. However, there are cases where changing the order of steps in an IP Profile will drastically hinder or help your outcomes.


Last, we need to apply this IP Profile. An IP Profile can be applied to a document set in one of two ways. It can be done during Image Processing for permanent alteration. Or, it can be applied to an OCR Profile during the Recognize activity for temporary alteration.

This case is most suitable for temporary image cleanup. What we've done here is make some adjustments to improve OCR and save table line data. However, we've drastically altered the original image. Anyone looking at these documents in the future is still going to want to see those table lines!


Configuring Header-Value for the Missing Cells Problem

Many tables have optional columns. Data may or may not exist in those cells for a given row. Since the Row Match method works by making patterns to match each row, this can cause problems. Sometimes, a single pattern doesn't cut it, and multiple patterns must be used in order to properly model each row. You may end up making multiple extractors, using multiple patterns, to account for every row's variation: one for when a value is present in the optional column, one for when it is not. The more optional columns on the table, the more variations in the row's pattern you have to account for. This can become very messy depending on the size of the table.

Header-Value works differently: rather than working off each row's pattern, it looks to the header labels and the values underneath to figure out the table's structure. It can end up being simpler to configure and produce better results.


Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example. This is part of an expense report detailing intangible costs (as opposed to material costs). The raw text data has already been extracted using OCR via the Recognize activity.


Intangibles Table.png

Add a Data Table

Create a Data Table with five Data Columns. The five columns for our example are "Description", "Code", "Dry Hole", "Completion", and "Total". Refer to the Creating a Data Table in Grooper section above for more information on adding a Data Table to a Data Model.


Header val miss cell 1.png

Set the Extract Method

First, set the "Extract Method" property to "Header-Value". (1) Select the Data Table object in the Node Tree, and (2) select the "Extract Method" property.


Header val miss cell 2.png


Using the dropdown list, select "Header-Value".


Header val miss cell 3.png


We will not configure any properties here for the time being. Much of the Header-Value method's setup is done on the Data Table's individual Data Column objects. We will configure those first.

Configure the Header Extractor

(1) Navigate to the "Description" Data Column in the Node Tree. (2) Select the "Header Extractor" property.


Header val miss cell 4.png


The extractor's "Type" can be "Internal" or "Reference". Choosing "Internal" will allow you to write a regular expression pattern straight from the property panel. "Reference" will allow you to point to an extractor built elsewhere in the Node Tree. Internal extractors are typically very simple patterns that don't need the extra properties available to Data Type extractors, such as collation, filtering or post-processing. Our case here is very simple. We will create the Header Extractor using an Internal extractor.

Expand the "Header Extractor" property, select the "Type" property and choose "Internal" from the dropdown list.


Header val miss cell 5.png


Next, select the "Pattern" property and press the ellipsis button at the end to bring up the Pattern Editor.


Header val miss cell 6.png


This brings up Grooper's Pattern Editor to write regular expression patterns. We are just looking to match the header label "INTANGIBLES".


Header val miss cell 7.png


A simple regex pattern intangibles will match just fine.


Header val miss cell 8.png


Press "OK" to save and close the Pattern Editor.


Header val miss cell 9.png


We now have a Header Extractor for the Description Data Column. Next, we need to find the values in that column. That will be done using the Data Column's Value Extractor property.

Configure the Value Extractor

The Data Column's Value Extractor puts the "Value" in "Header-Value". This too can be either an "Internal" or "Reference" extractor.

FYI Version 2.8 added several new extraction methods available to Data Fields and Data Columns besides "Internal" and "Reference". "Internal" has been renamed "Text Pattern", but it functions exactly like an Internal pattern.

We will continue configuring the "Description" Data Column. This extractor will find the line item descriptions for each row below the "Intangibles" column. (1) Select the "Description" Data Column in the Node Tree. (2) Select the "Value Extractor" property and choose (3) "Text Pattern" from the dropdown list.


Header val miss cell 10.png


Expand the "Value Extractor" properties. Select "Pattern" and press the ellipsis button at the end to bring up the Pattern Editor.


Header val miss cell 11.png


We are going to use what we affectionately call the "God Mode Pattern" [^\r\n\t\f]+

This pattern matches any run of one or more characters that are not control characters (carriage returns, line feeds, tabs, or form feeds). This functionally segments the text on the page. For this table, every segment of text inside a table cell is captured by this pattern, including everything in the column labeled "Intangibles".
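As a quick illustration outside of Grooper, here is the pattern run in Python against a hypothetical line of table text in which Tab Marking has inserted tab characters at the column gaps:

 import re

 # One line of table text; Tab Marking has inserted \t at the column
 # gaps (hypothetical sample).
 line = "Drilling Mud\t1001234\t$1,500\t$2,300"

 # The God Mode Pattern: any run of characters that is not a carriage
 # return, line feed, tab, or form feed.
 print(re.findall(r"[^\r\n\t\f]+", line))
 # ['Drilling Mud', '1001234', '$1,500', '$2,300']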

! Remember to enable Tab Marking in the Preprocessing Options of the Properties tab.


Header val miss cell 12.png


! The God Mode Pattern works here because we have a clean table with clean OCR results. If you need to use fuzzy matching to get around OCR errors, you cannot use the variable length modifiers + or *. In these cases a more specific pattern would need to be written for the Value Extractor.

This is all we will configure on this column for the time being. Next, we will add Header Extractors and Value Extractors for the remaining four data columns.

Configure the Remaining Header Extractors

We need to find the header positions for the remaining four Data Columns, using the Header Extractor property. Their configuration will be similar to the "Description" Data Column. We will set the "Header Extractor" property's "Type" to "Internal" and write a simple regex to match the corresponding column's label.

The "Code" Column

Select the "Code" Data Column. Expand the "Header Extractor" property. Select "Type" and choose "Internal" from the dropdown list.


Header val miss cell 13.png


Using the Pattern Editor, we will simply use the regex pattern code to match the Code column's header.


Header val miss cell 14.png


The "Dry Hole" Column

Select the "Dry Hole" Data Column. Expand the "Header Extractor" property. Select "Type" and choose "Internal" from the dropdown list.


Header val miss cell 15.png


Using the Pattern Editor, we will simply use the regex pattern dry hole to match the Dry Hole column's header.


Header val miss cell 16.png


The "Completion" Column

Select the "Completion" Data Column. Expand the "Header Extractor" property. Select "Type" and choose "Internal" from the dropdown list.


Header val miss cell 17.png


Using the Pattern Editor, we will simply use the regex pattern completion to match the Completion column's header. Press the "OK" button when finished.


Header val miss cell 18.png


The "Total" Column

Select the "Total" Data Column. Expand the "Header Extractor" property. Select "Type" and choose "Internal" from the dropdown list.


Header val miss cell 19.png


Using the Pattern Editor, we will simply use the regex pattern total to match the Total column's header.


Header val miss cell 20.png

Configure the Remaining Value Extractors - The "Code" Column

Now that we have all the Header Extractors configured, we need to finish up the Value Extractors on each Data Column. We will start with the "Code" Data Column.

The values in the code column are all seven-digit numbers. We can make a fairly simple regular expression pattern to match these numbers. Select the Value Extractor property and choose "Text Pattern" from the dropdown list.


Header val miss cell 21.png


Expand the "Value Extractor" property. Select the "Pattern" property and press the ellipsis button at the end.


Header val miss cell 22.png


Use the following regex to match each seven-digit code in the column: \d{7}


Header val miss cell 23.png


Press "OK" to save and close the Pattern Editor. Don't forget to Save the Data Column as well.

Configure the Remaining Value Extractors - The "Dry Hole" "Completion" and "Total" Columns

Looking at this document, all of these columns are currency values. There's no need to write three separate Internal extractors for each Value Extractor in each Data Column. We can create a generic currency extractor and use it for all three of the remaining columns.

Right click the Local Resources folder of your Content Model in the Node Tree. Mouse over "Add" and select "Data Type". This will create a new Data Type extractor in the folder that we can use for our Data Columns' Value Extractors.


Header val miss cell 24.png


A window will pop up to name the Data Type extractor. Name it "Generic - Currency". Press "OK" when finished.


Header val miss cell 25.png


Using the property panel, under the "Data Extraction" heading, select "Pattern" and press the ellipsis button at the end of the property. This is where we will write a generic regex to match currency values on the document.


Header val miss cell 26.png


The following regex will match all currency on this document: [$]\d{1,3},\d{3}
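If you want to sanity check the pattern outside of Grooper, here is a quick Python equivalent (the sample values are hypothetical):

 import re

 # [$] escapes the dollar sign inside a character set. \d{1,3} allows
 # one to three digits before the comma; \d{3} requires exactly three
 # digits after it.
 text = "Drilling Mud\t1001234\t$1,500\t$12,300"
 print(re.findall(r"[$]\d{1,3},\d{3}", text))
 # ['$1,500', '$12,300']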

Press "OK" to exit the Pattern Editor when finished.


Header val miss cell 27.png


Press "Save" to finish configuring the "Generic - Currency" extractor.


Header val miss cell 28.png


Now that we have an extractor built matching currency values on the document, we just need to reference it on each Data Column's Value Extractor.

Select the "Dry Hole" Data Column in the Node Tree. Select the "Value Extractor" property and choose "Reference" from the dropdown list.


Header val miss cell 29.png


Expand the "Value Extractor" property. Select the "Extractor" property and using the dropdown menu locate the "Generic - Currency" extractor we created earlier. The dropdown menu replicates portions of the Node Tree. Expand folders as you would in the Node Tree to find the extractor you're looking for. The first folder in the list will be your Content Model's Local Resource folder. This is where we created the "Generic - Currency" extractor.


Header val miss cell 30.png


The Data Column will use the results of the referenced extractor when looking for the values below the extracted Header. Don't forget to Save!


Header val miss cell 31.png


The "Completion" and "Total" columns will also use the "Generic - Currency" extractor as their Value Extractors. Complete the same steps to reference the "Generic - Currency" extractor for the "Completion" and "Total" columns.

Set Primary Columns

If we were to stop right now, go back to our Data Table, and press the "Test Extraction" button, no results would appear.


Header val miss cell 32.png


This is because we haven't set a "Primary Column". Primary Columns are used in the Header-Value method to indicate what data must be present in a row to constitute a legitimate or valid table row.

For our example, the item descriptions in the column labeled "Intangibles" are always present. The item codes in the "Code" column are always present. And, the "Total" column is always filled in. A row on this table would not be valid without those pieces of information. However, the "Dry Hole" and "Completion" columns may or may not have data in a given row.


Header val miss cell 33.png


! At least one Data Column must be made a Primary Column. Setting a Primary Column is part of how the Header-Value method determines a table's structure. If a cell is empty in a Primary Column, the Header-Value method will skip that row. It will not "see" it as part of the table.

We will set the "Description", "Code", and "Total" Data Columns as Primary Columns. Let's start with "Description". Navigate the the "Description" Data Column in the Node Tree and select "Primary Column" under the "Column Settings" heading.


Header val miss cell 34.png


(1) Change the "Primary Column" property from "False" to "True". (2) Press the Save button when finished.


Header val miss cell 35.png


Follow the same process to make the "Code" and "Total" Data Columns Primary Columns.

FYI Data Columns not used as Primary Columns are referred to as "Secondary Columns".

Now that we have Primary Columns, the Header-Value method will be able to model the table's structure and extract the data inside. Navigate back to the Data Table and press the Test Extraction button.


Header val miss cell 36.png

Configuring Header-Value for the Variation Problem


Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

This tutorial goes after problems around variation. The Header-Value extraction method is very useful when targeting tables whose structure changes from document to document, but where the desired information is the same across multiple table styles. These documents are several production reports filed with the Oklahoma Corporation Commission from various oil and gas companies. The raw text data has already been extracted using OCR via the Recognize activity.

Prod rpt1.png Prod rpt2.png Prod rpt3.png Prod rpt4.png
Prod rpt5.png Prod rpt6.png Prod rpt7.png Prod rpt8.png Prod rpt9.png

Add a Data Table

Create a Data Table with four Data Columns. The four columns for our example are "Well Name", "PUN", "Prod Code", and "Volume". Refer to the Creating a Data Table in Grooper section above for more information on adding a Data Table to a Data Model.


Header val variation 1.png

Set the Extract Method

First, set the "Extract Method" property to "Header-Value". (1) Select the Data Table object in the Node Tree. (2) Select the "Extract Method" property. And (3) using the dropdown list, select "Header-Value".


Header val variation 2.png


Get a Game Plan

We went through a simple example of Header-Value extraction in the how-to tutorial above. With a larger document set, it's a good idea to put in a little work on the front end to analyze the documents in front of you and get an idea of what your workflow is going to be. Let's break down what we're doing in Grooper for the Header-Value method.

1. Create Header Extractors

  • Each document will have its own way of labeling columns. For example, a Product Code may be on the page as "Prod Code", "Prod. Code", "PC", "Product Code" and more. A single pattern isn't going to cut it here. We'll want to create extractors for each column header and reference them on the Data Columns.

2. Create Value Extractors

  • Just as the columns have their own formats, the values in those columns likely will too. We can get an idea of the extractors we'll want to write by looking through the document set. We'll probably need a somewhat generic text segment extractor to pull the well names. There are a couple of formats we'll need to tackle for the production unit number. Given the variation in how values appear on each table style, we'll want to build some extractors for each Data Column's Value Extractor as well.

3. Configure the Data Columns

  • This will mostly be setting references to the extractors we build, but we can also start thinking about which Data Columns should be Primary Columns.

4. Configure the Header-Value properties on the Data Table (if needed).

  • We will do some troubleshooting at the end to see if we need to adjust the default properties to get better results.

Create the Header Extractors

Let's Get Organized - Make a Column Headers Folder

In the Local Resources folder of the Content Model, add a folder named "Column Headers". We will create all our Header Extractors here.

Right click the Local Resources folder. Hover over "Add" and select "Folder..."


Header val variation 3.png


Name the folder "Column Headers" and press "OK".


Header val variation 4.png


Create a Header Extractor - Production Unit Number

First, we will create an extractor to find the "PUN" column headers. Right click the "Column Headers" folder. Hover over "Add" and select "Data Type". This will create a new Data Type extractor that we will configure to find all the variations of the header labels containing Production Unit Numbers.


Header val variation 5.png


Name it "COL HEAD - PUN" and press the "OK" button. This naming will help us know the extractor is for a column header (COL HEAD) targeting production unit numbers (PUN).


Header val variation 6.png


Now we need to write regular expressions to match each document's column header. Select the "Pattern" property and press the ellipsis button at the end to bring up the Pattern Editor.


Header val variation 7.png


We have nine different reports in this document set. Almost all nine have a different way of labeling the column that contains the production unit number.

PRODUCTION UNIT # Pun var 1.png
OTC Lease Number / Production Unit Number (PUN) Pun var 2.png
STATE LEASE NUMBER Pun var 3.png
OKLAHOMA PUN Pun var 4.png
OTC LEASE NO. Pun var 5.png
Production Unit Number Pun var 6.png
OTC Lease Number / Production Unit Number (PUN) (again) Pun var 7.png
Production Unit Number (again) Pun var 8.png
PRODUCTION UNIT # (This one will be a special case) Pun var 9.png


Make a Regex List of Labels

For the first eight documents, we can create a list of regular expressions matching these labels separated by the vertical bar |. The following would be the regex entered in the "Value Pattern" window.

PRODUCTION UNIT #|
OTC Lease Number / Production Unit Number [(]PUN[)]|
STATE LEASE NUMBER|
OKLAHOMA PUN|
OTC LEASE NO[.]|
Production Unit Number
FYI Certain characters (called metacharacters) perform special operations in regular expressions. These include the dot (.), which is a wildcard character that can match any character, and parentheses ( and ), which are used to make groups. If you want to literally match these characters, they need to be "escaped". This can be done by placing a backslash \ before the character or by placing the character in a character set [].

So, \. or [.] would match only a period on the page. And \( or [(] would match an opening parenthesis character on the page.
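A quick Python illustration of the difference (the sample string is hypothetical):

 import re

 # An unescaped dot is a wildcard and matches any character...
 print(re.findall(r"NO.", "OTC LEASE NO. NOW"))    # ['NO.', 'NOW']
 # ...while [.] (or \.) matches only a literal period.
 print(re.findall(r"NO[.]", "OTC LEASE NO. NOW"))  # ['NO.']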


Header val variation 8.png


Press the "OK" button when finished to exit the Pattern Editor. Save your work on the "COL HEAD - PUN" extractor.


Header val variation 11.png


Ordered Arrays to Target Header Labels

If you recall, when we looked at all the variations of this header label, we noted one would be a special case. Most of our header labels read left to right on a single line. However, the table below reads top to bottom on two lines. Regular expressions match along the normal text flow, left to right, one character after the next. Since the label in this example is on two lines, the simple regex PRODUCTION UNIT # does not match.


Pun var 9.png


This is very common when dealing with tables. We need to account for this format by creating a child extractor which will return its result to its parent (The "COL HEAD - PUN" extractor). We will add a Data Type child extractor that matches each line individually and uses Ordered Array collation to combine the results, creating a single label for the Header Extractor to use. Since this will be a child extractor, its result will be combined with the list of header labels we wrote earlier.

FYI
Parent child extractor.png
Data Type extractors "inherit" the results of any extractor added as a "child". Every result the child extractors find on a page get passed up to the parent extractor. Child extractors can be a simple Data Format extractor or even another Data Type extractor (even with its own children).


Right click the "COL HEAD - PUN" extractor to add a child extractor. Hover over "Add" and select "Data Type"


Header val variation 12.png


Since we will use Ordered Array collation to align items vertically, and we're putting two lines together, name the extractor "OAV - 2 Line". Press the "OK" button to create the extractor.


Header val variation 13.png


This extractor will have two children of its own, both Data Formats: one looking for the word PRODUCTION and another for the phrase UNIT #. Add two Data Formats by right clicking the "OAV - 2 Line" extractor, selecting "Add" then "Data Format". Name the first "Line 1" and the second "Line 2".


Header val variation 14.png


For the first Data Format, "Line 1", write PRODUCTION in the Value Pattern box.


Header val variation 15.png


For the second Data Format, "Line 2", write UNIT # in the Value Pattern box.


Header val variation 16.png


Navigate back to the "OAV - 2 Line" extractor, the parent of these two Data Formats. For each page, we are getting individual results from each Data Format. By default, Data Type collation is set to "Individual" this gives us a list of all results from the child extractor, one after the other. We want to combine these two results into one.


Header val variation 17.png


Select the "Collation" property and using the dropdown list choose "Ordered Array".


Header val variation 18.png


Ordered Arrays combine the results of child extractors in the order in which they appear. This is important because it impacts the order in which you create the child extractors. We are looking to produce a result for "Production Unit #", not "Unit # Production". The "Line 1" format needs to come first because the first word is "Production".

Ordered Arrays also need a "layout". This can be "Horizontal" (in which results are in a line, left to right), "Vertical" (in which results are in a line, top to bottom), or "Flow" (in which results are in a text flow, left to right, top to bottom). We need to use "Vertical Layout" since our two lines are stacked on top of each other.

Expand the "Collation" property. Select "Vertical Layout" and change the property from "Disabled" to "Enabled". This will combine the results of the child extractors, giving us a single result we can use to match the column header for this document.


Header val variation 19.png


Header val variation 20.png
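For intuition, here is a very loose Python sketch of vertical Ordered Array collation. It only checks that the second child's match sits on the line directly below the first child's; Grooper's actual collation also considers the results' horizontal alignment on the page.

 import re

 # The two-line header as it appears in the text flow.
 lines = "PRODUCTION\nUNIT #".split("\n")

 # Child extractor results, tagged with the line they occurred on.
 line1 = [(i, m.group()) for i, text in enumerate(lines)
          for m in re.finditer(r"PRODUCTION", text)]
 line2 = [(i, m.group()) for i, text in enumerate(lines)
          for m in re.finditer(r"UNIT #", text)]

 # Vertical ordered array: a "Line 1" hit followed by a "Line 2" hit on
 # the next line down combines into a single result.
 combined = [a + " " + b
             for i, a in line1
             for j, b in line2
             if j == i + 1]
 print(combined)  # ['PRODUCTION UNIT #']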

Configure the Remaining Header Extractors

Follow similar steps to configure three more extractors for the remaining column headers. We will detail the regex patterns used below.

The "Volume" Column

Value Pattern:

Gross Volume|Barrels|C|Quantity|Volume|Runs

Look Ahead Pattern:

[\n\t\f]|^

Look Behind Pattern:

[\r\t\f]|$

Properties:

Mode: FuzzyRegEx with Insert( ) weighting

Tab Marking: Enabled

The "Well Name" Column

Value Pattern:

Well Name( and Number)?|Property #|(IES )?Lease Name

Look Ahead Pattern:

[\n\t\f]|^

Look Behind Pattern:

[\r\t\f]|$

Properties:

Mode: FuzzyRegEx

Tab Marking: Enabled

The "Prod Code" Column

Value Pattern:

Prod[.]? Code|PC|PRD

Look Ahead Pattern:

[\n\s\f|]|^

Look Behind Pattern:

[\s\r\f|]|$

Properties:

Tab Marking: Enabled

This column also needs a child extractor using Ordered Array collation to locate a couple of header label formats.

Create the Value Extractors

Let's Get Organized - Make a Value Extractors Folder

In the Local Resources folder of the Content Model, add a folder named "Value Extractors". We will create all our Data Column Value Extractors here.

Right click the Local Resources folder. Hover over "Add" and select "Folder..."


Header val variation 21.png


Name it "Value Extractors" and press "OK" when finished to create the folder.


Header val variation 22.png


A Generic Text Segment Extractor - The "Well Name" Column's Value Extractor

The first extractor we will create is a general purpose text segment extractor. We will use this to target the information in the "Well Name" columns. Right click the "Value Extractors" folder, hover your cursor over "Add" and select "Data Type".


Header val variation 23.png


Name the Data Type "Generic Text Segment" and press "OK" when finished.


Header val variation 24.png


Select the "Pattern" property and press the ellipsis button at the end to bring up the Pattern Editor.


Header val variation 25.png


We will use a variation of the "God Mode Pattern" ([^\r\n\t\f]+) to match segments of text. This means we're going to use tab characters \t in our pattern. We'll need to turn on Tab Marking in the "Properties" tab before we begin.


Header val variation 26.png


The "Tab Marking" property also has its own sub-properties, including ways to manually set how long a space should be in order to qualify as a tab. It also has a super useful property when dealing with tables.

Sometimes tables are not aligned perfectly and information in a cell bumps right next to the next cell, such as is the case in the example below.


Detect lines tab 1.png


With basic Tab Marking, that space between the text in the first column and the text in the second just looks like a regular space. It's not wide enough to count as a tab character. As you can see in the text data below, there is no tab character between the well name "Creek Indian Dutcher Unit" and the production unit number "111-091643-0-2884". But we want that tab character! It gives us an indication where text in the first column ends and text in the second begins.


Detect lines tab 2.png


There is a sub-property of Tab Marking named "Detect Lines". When enabled, this property will use the page's layout data (obtained from a Line Detection or Line Removal IP Command) to determine if there is a vertical line between the characters. If there is a line present, it will insert a tab character between them, allowing us to use it as an indication of a column break.

Expand the "Tab Marking" sub-properties, select "Detect Lines" and change the property from "False" to "True".


Header val variation 27.png


With Detect Lines enabled, a tab character is now between the well name and the production unit number.


Detect lines tab 3.png


Back to the Pattern Editor. We will use the following adjustment of the "God Mode Pattern":

[^\s:][^\r\n\t\f:]+[^\s:]

The God Mode Pattern is very generic. This pattern gives a little more specificity to it. By including [^\s:] at the beginning and end, it forces the segment to be at least three characters long. We'll break it down below.

  • [^\s:] matches a single character that is not a whitespace character or a colon (1 character).
  • [^\r\n\t\f:]+ matches the middle part of the segment, one to any number of characters that aren't a control character or a colon (at least 1 character).
  • [^\s:] matches a single character that is not a whitespace character or a colon (1 character).
  • Total: at least 3 characters.

This pattern both gets rid of results containing a colon character (since colons are often used in text to separate segments of text) and forces the result to be at least three characters long. It also ensures the segment does not start or end with a whitespace character.
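Here is the adjusted pattern run in Python against the example line from earlier, after Detect Lines has inserted a tab between the two columns:

 import re

 # The well name and PUN, separated by the tab Detect Lines inserted.
 line = "Creek Indian Dutcher Unit\t111-091643-0-2884"

 # At least three characters, no colons, and no whitespace at the start
 # or end of the segment.
 print(re.findall(r"[^\s:][^\r\n\t\f:]+[^\s:]", line))
 # ['Creek Indian Dutcher Unit', '111-091643-0-2884']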

Press "OK" to exit the Pattern Editor and "Save" to finish creating the the extractor.

The "Prod Code" Column's Value Extractor

The "Prod Code" Data Column should return a product code, which can be either "0" "1" or "01". This is going to present some particular extraction challenges. First, we are necessarily matching the characters "0" and "1". These commonly get misread by OCR engines. A zero can be read as an "O". A one can be read as a lower case "l" or an uppercase "I". So, we're going to need to do some fuzzy matching to get the right value. However, the pattern is also very small, which means we're going to run into some problems when it comes to character swapping.

First, add a Data Type named "Prod Code" to the "Value Extractors" folder and edit its "Pattern" property.


Header val variation 28.png


The regex value pattern itself will be very simple: 01|[01] (the two-character alternative is listed first so it takes priority over a single digit, since regex alternation tries options left to right).


But we don't want just any old one or zero. We also need a look ahead and look behind pattern to anchor these ones and zeros to a column location.

Look Ahead Pattern: [\n\t\f]|^

Look Behind Pattern: [\r\t\f]|$

! Be sure to enable Tab Marking with Detect Lines turned on.

This pattern is off to a good start. It's limiting matches to (mostly) values in the product code columns, but we don't have to go far before it fails to hit a value. The pattern does not match a value on the second document due to bad OCR.


Header val variation 29.png


OCR isn't perfect. It's the nature of the beast. FuzzyRegEx allows us to get around bad OCR and clean up our data through "fuzzy" matching. Switch to the "Properties" panel. Select the "Mode" property and change it from "RegEx" to "FuzzyRegEx".


Header val variation 30.png


FuzzyRegEx swaps out mismatched characters at the cost of confidence. This is a very short pattern. Even swapping one character is going to drop the confidence of our result way down. We would have to drop the "Minimum Similarity" property all the way down to "50%" in order to match this value, even with fuzzy matching. This is also going to match a bunch of stuff we don't want to match and give us a lot of false positives.


Header val variation 31.png


We also have the ability to define "Fuzzy Match Weightings". These weightings allow us to manually change what it costs to swap one character for another. A "0" and a "Q" look very similar. We can expect OCR to mistake these characters more frequently than others. Using Fuzzy Match Weightings, we can set the cost to swap those two characters at a lower value than normal. We will use a Lexicon that ships with every Grooper installation. It contains a list of common character swaps and their custom weighting values.
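Conceptually, fuzzy matching scores a candidate string against the pattern, and a custom weighting makes swaps between look-alike characters cheaper than ordinary mismatches. The Python sketch below is only an illustration of that idea; the weights and the scoring formula are hypothetical and do not reproduce Grooper's actual FuzzyRegEx algorithm.

 # Illustrative substitution costs for visually similar characters
 # (hypothetical values, not Grooper's weighting lexicon).
 swap_cost = {("0", "Q"): 0.2, ("0", "O"): 0.2,
              ("1", "l"): 0.2, ("1", "I"): 0.2}

 def similarity(expected: str, ocr: str) -> float:
     """Score two equal-length strings; 1.0 is a perfect match."""
     if len(expected) != len(ocr):
         return 0.0
     cost = 0.0
     for e, o in zip(expected, ocr):
         if e != o:
             # Cheap swap for known look-alikes, full cost otherwise.
             cost += swap_cost.get((e, o), swap_cost.get((o, e), 1.0))
     return 1.0 - cost / len(expected)

 print(similarity("01", "0l"))  # 0.9 -- a cheap "1"/"l" swap
 print(similarity("01", "0X"))  # 0.5 -- a full-cost swap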

Expand the "Fuzzy Match Weightings" property. Select the "Included Lexicons" property and using the dropdown menu, navigate to the "Fuzzy Match Weightings" lexicon. It is found traveling the following path in the Node Tree:

Global Resources -> Lexicons -> Downloads -> Weightings -> "Fuzzy Match Weightings"

Check the box next to the "Fuzzy Match Weightings" lexicon.


Header val variation 32.png


This will use the weightings in the lexicon, bumping the confidence up to 88%. Much better. Now we can bump the Minimum Similarity back up to something like 80%.


Header val variation 33.png


Press "OK" to exit the pattern editor and "Save" the extractor when finished.

Configure the Remaining Value Extractors

The "PUN" Column's Value Extractor

The Production Unit Number extractor can be made as a Data Type with four Data Format children, one for each way the production unit number is laid out in these columns. Name this extractor "Production Unit Number".


Data Format 1: 123-123456-1-1234

Value Pattern: \d{3}[.-]\d{6}[.-]\d[.-]\d{4}

Look Ahead Pattern: [\n\t\f]|^

Look Behind Pattern: [\r\t\f]|$


Data Format 2: 123-123456-1

Value Pattern: \d{3}[.-]\d{6}[.-]\d

Look Ahead Pattern: [\n\t\f]|^

Look Behind Pattern: [\r\t\f]|$


Data Format 3: 123-123456

Value Pattern: \d{3}[.-]\d{6}

Look Ahead Pattern: [\n\t\f]|^

Look Behind Pattern: [\r\t\f]|$


Data Format 4: 123456

Value Pattern: \d{6}

Look Ahead Pattern: [\n\t\f]|^

Look Behind Pattern: [\r\t\f]|$


The results of these Data Formats overlap each other. We can deduplicate our results using the parent Data Type's "Deduplicate Locations" property, setting it to "True". This will return the longest result, discarding the smaller overlapping results. (i.e., both Format 4 and Format 3 will match "123-123456", giving two results, "123-123456" and "123456". Deduplicating locations will discard the smaller result "123456" and return only "123-123456".)
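Here is a small Python sketch of that behavior using the formats above. The containment check is illustrative; Grooper's actual deduplication logic may differ.

 import re

 text = "PUN: 111-091643-0-2884"

 # All four formats run against the same text; the shorter formats
 # match inside the longer ones.
 patterns = [r"\d{3}[.-]\d{6}[.-]\d[.-]\d{4}",
             r"\d{3}[.-]\d{6}[.-]\d",
             r"\d{3}[.-]\d{6}",
             r"\d{6}"]
 spans = [m.span() for p in patterns for m in re.finditer(p, text)]

 # Keep only spans not contained inside another span: the longest
 # result wins, and the smaller overlapping results are discarded.
 deduped = [s for s in spans
            if not any(o != s and o[0] <= s[0] and s[1] <= o[1]
                       for o in spans)]
 print([text[a:b] for a, b in deduped])  # ['111-091643-0-2884']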

A Generic Quantity Extractor - The "Volume" Column's Value Extractor

The Volume extractor can be made using a Data Type extractor looking for any quantity up to 999,999.99. Name this extractor "Quantity (Volume)".

Value Pattern: \d{1,3}(,?\d{3})?([.]\d{1,2})?

Look Ahead Pattern: [\n\t\f]|^

Look Behind Pattern: [\r\t\f]|$
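As a quick check of the value pattern in Python (the look ahead and look behind patterns are omitted for brevity, and the sample values are hypothetical):

 import re

 text = "1,500.25\t320.00\t42"
 print([m.group() for m in
        re.finditer(r"\d{1,3}(,?\d{3})?([.]\d{1,2})?", text)])
 # ['1,500.25', '320.00', '42']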

Configure the Data Columns - Set Your References

Now that we have a set of Header Extractors to locate the header labels on these tables and Value Extractors to find their corresponding values in the rows beneath, we can configure our Data Columns. All we need to do is set each Header Extractor on the right Data Column and do the same with the Value Extractors.

Decide Which Columns Are Primary Columns

Also, we must decide which columns are "Primary Columns". For this data set, we expect to find a value in each one of our four columns. There's always a well name, always a production unit number, always a product code, and always a barrels volume. We could, then, make all these columns Primary. However, remember we had some issues with the Prod Code extractor. It's a pattern only one or two characters long, and we had to use fuzzy matching with a weighting lexicon to get it to match reliably. Thinking ahead, if any of our Value Extractors are going to fail to find a value, it's probably going to be the Prod Code extractor. If we make the Prod Code column a Primary Column, and the Prod Code Value Extractor fails to produce a result on a given row, that row will get tossed out of our extracted table. So, there's a reasonable danger we would lose data. Another option is to leave the Prod Code column as a regular column (also called a Secondary Column) and have a human reviewer validate the data and enter the missing cell as needed. That way we still get the data from the other three columns.

So, we will set the "Well Name", "PUN", and "Volume" Data Columns as our Primary Columns. Configure the four Data Columns as seen in the images below.

! At least one Data Column must be made a Primary Column. Setting a Primary Column is part of how the Header-Value method determines a table's structure. If a cell is empty in a Primary Column, the Header-Value method will skip that row. It will not "see" it as part of the table.

The Well Name Column

Header val variation 34.png

The PUN Column

Header val variation 35.png

The Prod Code Column

Header val variation 36.png

The Volume Column

Header val variation 37.png

Verify Extraction

Navigate back to the Data Table object in the Node Tree and check out how we did. Press the "Test Extraction" button to verify our results. This method successfully targets all nine different tables, regardless of the differences in column header names, column positions, and variance in cell data formats. This is the true benefit of the Header-Value method in action. With a little bit of setup to configure Header and Value extractors for each column, we can extract data from tables whose formats differ greatly but whose data is the same.


Header val variation 38.png


Troubleshooting

The Header-Value method has some unique things to keep in mind when it comes to troubleshooting. These mostly have to do with how row instances are created and used.

The first thing to keep in mind is how Primary Columns establish the table's structure by defining the row instances. Primary Columns dictate what information is needed in a row to count as a valid row.

Remember, the Header Extractor locates the column header label's position and the Value Extractor matches the pattern of data within that column's cells. Primary Columns establish the row instances by using the header label's position as the context of where the column begins. The Data Column's Value Extractor then looks line by line to "see" if there is a match in the full text below that header. If one is found, that row of text becomes a row instance and the Value Extractor looks to the next line to see if there is a matching value. If there is, another row instance is created, and so on until no more matches are returned and the table effectively terminates.

For Primary Columns...


Troubleshooting hv 1.png


Troubleshooting hv 2.png


Troubleshooting hv 3.png
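In rough Python terms, the walk-down described above looks something like this (assumed logic with hypothetical sample lines and extractors, not Grooper's implementation):

 import re

 lines = [
     "WELL NAME\tPUN\tPRD\tVOLUME",
     "SMITH 1-1\t111-091643-0-2884\t1\t1,500.25",
     "JONES 2-4\t111-091644-0-2885\t0\t320.00",
     "Report totals follow this line.",
 ]

 header_re = re.compile(r"PUN")                  # Header Extractor
 value_re = re.compile(r"\d{3}-\d{6}-\d-\d{4}")  # Primary Value Extractor

 # Find the line holding the header label...
 header_line = next(i for i, l in enumerate(lines) if header_re.search(l))

 # ...then walk down line by line. Each match creates a row instance;
 # the table terminates at the first line with no match. (Simplified:
 # as noted below, Grooper also considers the header's own line.)
 row_instances = []
 for line in lines[header_line + 1:]:
     if not value_re.search(line):
         break
     row_instances.append(line)

 print(len(row_instances))  # 2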


However, Value Extractors don't just look at the line below the header's location, but the line the header is on as well. This can produce an unexpected or undesirable result. In the example below, a Data Table was set up using the Header-Value method with five Data Columns, one for each column on the document. Only the "Operator Name" Data Column was set as a Primary Column. The "Operator Name" Data Column's Value Extractor uses a very generic regular expression pattern to match segments of text. This is great for matching the operator names in that column, but the text string "Operator Name" is itself a text segment. So, the header line returns as the first row of the table.


Troubleshooting hv 4.png


Note the "Well Name" column also returned the column header "Well Name and Number", but the remaining columns have empty values for the row. This has to do with how these Secondary Column's Value Extractors are matching the produced row instance. The "Well Name" column's Value Extractor also uses a generic text segment extractor which ends up matching the text "Well Name and Number" inside the row instance. However, the remaining three columns' Value Extractors do not match the text in the instance. They are configured using more specific regular expression patterns, matching the pattern of the data in each column more closely. For example, the "Lease Number" column's Value Extractor uses something like \d{3}-\d{6}-\d-\d{4} which does not match the text "OTC Lease Number / Production Unit Number"


Troubleshooting hv 5.png


In order to fix this issue, a different column (or combination of columns) should be made the Primary Column. Knowing the "Lease Number" column's Value Extractor does not match the line of text containing the header labels (and we always expect to find a lease number for every row), if we use that as the Primary Column, we get the result we expect to see.


Troubleshooting hv 6.png

This next issue also deals with row instances, but with how Secondary Columns use row instances to return values rather than how Primary Columns create them.

In this example, the two tables are extracted using the same Data Table, configured with the Header-Value method and the same Data Columns. "Order Date" is the only Secondary Column. The rest are Primary Columns. The only difference is one table has table lines. The other does not. For our purposes, the Value Extractor used for the "Order Date" column is a generic text segment extractor.

For this table, all table values are extracted with no problem. Here, only part of the values in the "Order Date" column are extracted.
Troubleshooting hv 7.png Troubleshooting hv 8.png


What's happening here has to do with how each column's header position is used to extract data from the row instance. The important thing to remember is, for Secondary Columns, the width of the header extracted by the Header Extractor defines the width of the row instance that column's Value Extractor can use. In other words, the width of the header defines the sub-instance of the row instance the Value Extractor is running against.

For Secondary Columns, the width of the column header limits what can be extracted from the row instance
Troubleshooting hv 9.png
It is as if the rest of that row instance just isn't even there.

Troubleshooting hv 10.png


As you can see in the table's "Instance View", the entire line of text was captured for the row instance. However, the Secondary Column's Value Extractor only runs against part of the text, specifically the width of the column header that overlaps that row.


Troubleshooting hv 11.png
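In effect, the Secondary Column's Value Extractor sees only the slice of the row instance that falls under the header's horizontal extent. A tiny Python illustration, using character offsets to stand in for page coordinates (the values are hypothetical):

 # The full row instance text, and the x-extent of the "Order Date"
 # header expressed as character offsets (hypothetical values).
 row_text = "1001  Widget  01/15/2021 10:42"
 header_span = (14, 24)

 # The Value Extractor only runs against this slice of the row.
 print(row_text[header_span[0]:header_span[1]])
 # '01/15/2021' -- the time falls outside the header's width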


So, why did the table with lines capture the entire date? Long story short, because it has lines.

If line positions are saved via a Line Detection or Line Removal IP Command, the "Detect Lines" property will expand the width of the column header to the surrounding lines. That way, the width of the row instance the Value Extractor runs against expands to the boundaries of the cell it is contained in.


Troubleshooting hv 12.png