2.80:Header-Value (Table Extract Method)

The Header-Value Extract Method's properties

Header-Value is one of three methods available to Data Table elements to extract information from tables on a document set. It uses a combination of column header and column value extractors to determine the table’s structure and extract information from the table’s cells.

About

Where the Row Match method focuses on using a table's rows to model table structure and extract data, '''Header-Value''' looks to the table's columns. Extractors are used to find the header labels and the values in those columns. This is very similar to how a human being reads a table. Imagine you're trying to find a piece of information in the table below, say a particular order identification number. The first thing you're going to do is look for the Order ID column. That is what the Header Extractor does. Then, you're going to look for the number you want in that column. That's what the column's Value Extractor does (only, of course, our goal in Grooper will be to capture all the values in the column).

The Header Extractor locates a column's header label.

The Data Column's Value Extractor locates the column's values, using the Header Extractor's result as the anchor point to look down from.


As the name implies, both "Header" extractors and "Value" extractors are required for this method to function. Configuring these extractors is done on each of the Data Columns.
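
To make the idea concrete, below is a minimal sketch of this header-then-values lookup outside of Grooper, written in Python. The word positions, patterns, and alignment tolerance are all invented for illustration; Grooper's actual implementation works on the document's layout data and is considerably more sophisticated.

 import re

 # Hypothetical OCR output: (text, x, y) word positions invented for illustration.
 words = [
     ("Order ID", 100, 50), ("Customer", 300, 50),
     ("A-1001",   100, 80), ("Smith",    300, 80),
     ("A-1002",   100, 110), ("Jones",   300, 110),
 ]

 header_pattern = re.compile(r"Order ID", re.IGNORECASE)   # plays the role of the Header Extractor
 value_pattern  = re.compile(r"A-\d{4}")                   # plays the role of the Value Extractor

 # 1. Find the column header.
 header = next(w for w in words if header_pattern.fullmatch(w[0]))

 # 2. Collect values that sit below the header and are roughly aligned with it.
 column_values = [
     text for (text, x, y) in words
     if y > header[2]                    # below the header
     and abs(x - header[1]) < 50         # horizontally aligned (arbitrary tolerance)
     and value_pattern.fullmatch(text)
 ]

 print(column_values)   # ['A-1001', 'A-1002']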




Version Differences

Use Cases

The Header-Value method is the second table extraction method created in Grooper. It was made to target tables not easily extracted by Row Match. Row Match loses its effectiveness once a table's structure starts to change from document to document. Different companies are going to structure tables however they want, which is well outside your control. Think of all the different ways an invoice can be structured. While the information you want is present in all the different tables, how that data is presented may not be consistent. Even just the column location changing from one document to the next can present problems for Row Match. Its approach of using a Row Extractor to pattern the table may not be able to do the job (or a complicated Row Extractor accounting for multiple row formats may need to be used). For these situations, the Header-Value method is often easier to configure and produces better results.

These are different Oil and Gas Production Reports from various sources. Each one organizes the same kind of information into tables in different ways. Row Match would work just fine for each individual document. However, while the same information exists on each document, there's enough variability in the table structures that Row Match may not be suited for processing the whole document set. Header-Value is usually a better route.

Optional data columns, where values may or may not be present in a cell, can complicate things for Row Match as well. Again, a simple Row Extractor may not do the trick. While a more complicated extractor may successfully extract the table's information, the Header-Value method (or the Infer Grid) may be simpler to set up and produce the same or even better results.

However, the Header-Value method does have its limitations. Perhaps most obviously, header labels are necessary for this method to work. In tables where header labels are not present, Header-Value will not be suitable for use.

Furthermore, the Header-Value method requires several extractors to detect a table's structure and extract the values inside, at least two extractors for every Data Column (one for its header and one for its values). Because of this, there are several components to configure in order to extract a table's information. For relatively simple tables, Row Match ends up being simpler to set up, taking less time and using fewer objects.

The Infer Grid method also has some advantages over Header-Value. There are some specialized use cases, such as reading OMR checkboxes in tables and reprocessing table cells using a secondary OCR profile, where Infer Grid does things the other two methods simply can’t. Infer Grid also performs well when table line information can be saved to a page’s layout data.

How To

Creating a Data Table in Grooper

Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example for creating a Data Table.

Navigate to a Data Model

Using the Node Tree on the left side of Grooper Design Studio, navigate to the Data Model you wish to add the Data Table to. Data Tables can be created as children of any Data Model at any hierarchy in a Content Model.


Add a Data Table

Right click the Data Model object, mouse over "Add" and select "Data Table"



The following window will appear. Name the table whatever you would like and press "OK" when finished.



This creates a new Data Table object in the Node Tree underneath the Data Model.


Add Data Columns

Right click the Data Table object, mouse over "Add" and select "Data Column"



This brings up the following window to name the Data Column. When finished, press "OK" to create the object.



This creates a new Data Column object in the Node Tree underneath the Data Table.


Repeat Until Finished

Add as many columns as necessary to complete the table. For our example, we have a single Data Table with five Data Columns, each one named for the corresponding column on the document.


Image Processing for Tables

Tables present some specific image processing needs. This section goes over how to make an IP Profile with a few IP Commands useful for table extraction.

Configuring Header-Value for the Missing Cells Problem

Many tables have optional columns. Data may or may not exist in those cells for a given row. Since the Row Match method works by making patterns to match each row, this can cause problems. Sometimes a single pattern doesn't cut it, and multiple patterns must be used to properly model each row: one for when a value is present in the optional column, and one for when it is not. The more optional columns on the table, the more variations in the row's pattern you have to account for. This can become very messy depending on the size of the table.

Header-Value works differently. Rather than working off each row's pattern, it looks to the header labels and the values underneath them to figure out the table's structure. It can end up being simpler to configure and produce better results.
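
To see why an optional column trips up a single rigid row pattern, here is a small illustration using Python's re module and two invented, tab-delimited rows. The row text and both patterns are assumptions made up for this example; Grooper itself uses .NET regular expressions, but these patterns behave the same way.

 import re

 # Two invented, tab-delimited rows: the second has an empty optional
 # "Dry Hole" cell, which breaks a rigid row pattern.
 rows = [
     "LABOR\t7010001\t$1,250\t$3,400\t$4,650",   # all cells present
     "PERMITS\t7010002\t\t$1,100\t$1,100",       # optional cell is empty
 ]

 # A rigid Row Match-style pattern: one currency value per column, no gaps.
 rigid_row = re.compile(r"[^\t]+\t\d{7}\t\$[\d,]+\t\$[\d,]+\t\$[\d,]+")

 # One way to cope: make the optional column's cell optional in the pattern.
 flexible_row = re.compile(r"[^\t]+\t\d{7}\t(\$[\d,]+)?\t\$[\d,]+\t\$[\d,]+")

 for row in rows:
     print(bool(rigid_row.fullmatch(row)), bool(flexible_row.fullmatch(row)))
 # True True   <- first row matches either pattern
 # False True  <- the rigid pattern misses the row with the empty cell

With several optional columns, the number of variations a row pattern has to allow for multiplies quickly, which is exactly the mess Header-Value sidesteps.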

! Some of the tabs in this tutorial are longer than the others. Please scroll to the bottom of each step's tab before going to the next step.

Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example. This is part of an expense report detailing intangible costs (as opposed to material costs). The raw text data has already been extracted using OCR via the Recognize activity.


Add a Data Table

Create a Data Table with five Data Columns. The five columns for our example are "Description", "Code", "Dry Hole", "Completion", and "Total". Refer to the Creating a Data Table in Grooper section above for more information on adding a Data Table to a Data Model.


Set the Extract Method

First, set the "Extract Method" property to "Header-Value". (1) Select the Data Table object in the Node Tree, and (2) select the "Extract Method" property.



Using the dropdown list, select "Header-Value".



We will not configure any properties here for the time being. Much of the Header-Value method's setup is done on the Data Table's individual Data Column objects. We will configure those first.

Configure the Header Extractor

(1) Navigate to the "Description" Data Column in the Node Tree. (2) Select the "Header Extractor" property.



The extractor's "Type" can be "Internal" or "Reference". Choosing "Internal" will allow you to write a regular expression pattern straight from the property panel. "Reference" will allow you to point to an extractor built elsewhere in the Node Tree. Internal extractors are typically very simple patterns that don't need the extra properties available to Data Type extractors, such as collation, filtering or post-processing. Our case here is very simple. We will create the Header Extractor using an Internal extractor.

Expand the "Header Extractor" property, select the "Type" property and choose "Internal" from the dropdown list.



Next, select the "Pattern" property and press the ellipsis button at the end to bring up the Pattern Editor.



This brings up Grooper's Pattern Editor to write regular expression patterns. We are just looking to match the header label "INTANGIBLES".



A simple regex pattern intangibles will match just fine.



Press "OK" to save and close the Pattern Editor.



We now have a Header Extractor for the Description Data Column. Next, we need to find the values in that column. That will be done using the Data Column's Value Extractor property.

Configure the Value Extractor

The Data Column's Value Extractor puts the "Value" in "Header-Value". This too can be either an "Internal" or "Reference" extractor.

FYI Version 2.8 added several new extraction methods available to Data Fields and Data Columns besides "Internal" and "Reference". "Internal" has been renamed "Text Pattern", but it functions exactly as it did before.

We will continue configuring the "Description" Data Column. This extractor will find the line item descriptions for each row below the "Intangibles" column. (1) Select the "Description" Data Column in the Node Tree. (2) Select the "Value Extractor" property and choose (3) "Text Pattern" from the dropdown list.



Expand the "Value Extractor" properties. Select "Pattern" and press the ellipsis button at the end to bring up the Pattern Editor.



We are going to use what we affectionately call the "God Mode Pattern": [^\r\n\t\f]+

This pattern matches any run of characters, from one character long to any number of characters long, that does not contain a carriage return, line feed, tab, or form feed character. This functionally segments the text on the page. For this table, every segment of text inside a table cell is captured by this pattern, including everything in the column labeled "Intangibles".

! Remember to enable Tab Marking in the Preprocessing Options on the Properties tab.



! The God Mode Pattern works here because we have a clean table with clean OCR results. If you need to use fuzzy matching to get around OCR errors, you cannot use the variable length modifiers + or *. In these cases a more specific pattern would need to be written for the Value Extractor.
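
As a quick illustration of what the pattern does, here it is run against an invented, tab-delimited line of OCR text using Python's re module (Grooper uses .NET regular expressions, but this pattern behaves the same):

 import re

 # An invented OCR line where Tab Marking has inserted tab characters
 # between the table cells.
 line = "LABOR\t7010001\t$1,250\t$3,400\t$4,650"

 # The "God Mode Pattern": runs of anything that is not a carriage return,
 # line feed, tab, or form feed.
 segments = re.findall(r"[^\r\n\t\f]+", line)
 print(segments)
 # ['LABOR', '7010001', '$1,250', '$3,400', '$4,650']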

This is all we will configure on this column for the time being. Next, we will add Header Extractors and Value Extractors for the remaining four data columns.

Configure the Remaining Header Extractors

We need to find the header positions for the remaining four Data Columns, using the Header Extractor property. Their configuration will be similar to the "Description" Data Column. We will set the "Header Extractor" property's "Type" to "Internal" and write a simple regex to match the corresponding column's label.

The "Code" Column

Select the "Code" Data Column. Expand the "Header Extractor" property. Select "Type" and choose "Internal" from the dropdown list.



Using the Pattern Editor, we will simply use the regex pattern code to match the Code column's header.



The "Dry Hole" Column

Select the "Dry Hole" Data Column. Expand the "Header Extractor" property. Select "Type" and choose "Internal" from the dropdown list.



Using the Pattern Editor, we will simply use the regex pattern dry hole to match the Dry Hole column's header.



The "Completion" Column

Select the "Completion" Data Column. Expand the "Header Extractor" property. Select "Type" and choose "Internal" from the dropdown list.



Using the Pattern Editor, we will simply use the regex pattern completion to match the Completion column's header. Press the "OK" button when finished.



The "Total" Column

Select the "Total" Data Column. Expand the "Header Extractor" property. Select "Type" and choose "Internal" from the dropdown list.



Using the Pattern Editor, we will simply use the regex pattern total to match the Total column's header.


Configure the Remaining Value Extractors - The "Code" Column

Now that we have all the Header Extractors configured, we need to finish up the Value Extractors on each Data Column. We will start with the "Code" Data Column.

The values in the "Code" column are all seven-digit numbers. We can make a fairly simple regular expression pattern to match these numbers. Select the Value Extractor property and choose "Text Pattern" from the dropdown list.



Expand the "Value Extractor" property. Select the "Pattern" property and press the ellipsis button at the end.



Use the following regex to match each seven-digit code in the column: \d{7}



Press "OK" to save and close the Pattern Editor. Don't forget to Save the Data Column as well.

Configure the Remaining Value Extractors - The "Dry Hole" "Completion" and "Total" Columns

Looking at this document, all three of these columns contain currency values. There's no need to write three separate Internal extractors, one for each Data Column's Value Extractor. We can create a generic currency extractor and use it for all three of the remaining columns.

Right click the Local Resources folder of your Content Model in the Node Tree. Mouse over "Add" and select "Data Type". This will create a new Data Type extractor in the folder we can use for our Data Columns' Value Extractor.



A window will pop up to name the Data Type extractor. Name it "Generic - Currency". Press "OK" when finished.



Using the property panel, under the "Data Extraction" heading, select "Pattern" and press the ellipsis button at the end of the property. This is where we will write a generic regex to match currency values on the document.



The following regex will match all currency on this document: [$]\d{1,3},\d{3}
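
As a quick sanity check, here is that pattern run against a few invented sample values with Python's re module (Grooper uses .NET regular expressions, but this pattern behaves the same here):

 import re

 currency = re.compile(r"[$]\d{1,3},\d{3}")

 samples = ["$1,250", "$13,400", "$4,650", "$950", "1,250"]
 print([s for s in samples if currency.fullmatch(s)])
 # ['$1,250', '$13,400', '$4,650']
 # Hypothetical values without a comma group ("$950") or without a dollar
 # sign ("1,250") would need a broader pattern on other documents.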

Press "OK" to exit the Pattern Editor when finished.



Press "Save" to finish configuring the "Generic - Currency" extractor.



Now that we have an extractor built matching currency values on the document, we just need to reference it on each Data Column's Value Extractor.

Select the "Dry Hole" Data Column in the Node Tree. Select the "Value Extractor" property and choose "Reference" from the dropdown list.



Expand the "Value Extractor" property. Select the "Extractor" property and using the dropdown menu locate the "Generic - Currency" extractor we created earlier. The dropdown menu replicates portions of the Node Tree. Expand folders as you would in the Node Tree to find the extractor you're looking for. The first folder in the list will be your Content Model's Local Resource folder. This is where we created the "Generic - Currency" extractor.



The Data Column will use the results of the referenced extractor when looking for the values below the extracted Header. Don't forget to Save!



The "Completion" and "Total" columns will also use the "Generic - Currency" extractor as their Value Extractors. Complete the same steps to reference the "Generic - Currency" extractor for the "Completion" and "Total" columns.

Set Primary Columns

If we were to stop right now, go back to our Data Table, and press the "Test Extraction" button, no results would appear.



This is because we haven't set a "Primary Column". Primary Columns are used in the Header-Value method to indicate what data must be present in a row to constitute a legitimate or valid table row.

For our example, the item descriptions in the column labeled "Intangibles" are always present. The item codes in the "Code" column are always present. And, the "Total" column is always filled in. A row in this table would not be a valid row without those pieces of information. However, the "Dry Hole" and "Completion" columns may or may not have data in a given row.



! At least one Data Column must be made a Primary Column. Setting a Primary Column is part of how the Header-Value method determines a table's structure. If a cell is empty in a Primary Column, the Header-Value method will skip that row. It will not "see" it as part of the table.
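
A loose sketch of that behavior, with invented row data and Python standing in for Grooper's internal logic: a candidate row is kept only when every Primary Column produced a value.

 # Invented candidate rows keyed by column name. None means the column's
 # Value Extractor found nothing in that cell. This is an analogy, not
 # Grooper's internal representation.
 rows = [
     {"Description": "LABOR", "Code": "7010001", "Dry Hole": "$1,250",
      "Completion": "$3,400", "Total": "$4,650"},
     {"Description": "PERMITS", "Code": "7010002", "Dry Hole": None,
      "Completion": "$1,100", "Total": "$1,100"},
     # A stray value picked up outside the table body:
     {"Description": None, "Code": None, "Dry Hole": None,
      "Completion": None, "Total": "$5,750"},
 ]

 primary_columns = ["Description", "Code", "Total"]

 # Keep only rows where every Primary Column produced a value.
 valid_rows = [r for r in rows if all(r[c] is not None for c in primary_columns)]
 print(len(valid_rows))   # 2 -- the stray "$5,750" line is not treated as a table row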

We will set the "Description", "Code", and "Total" Data Columns as Primary Columns. Let's start with "Description". Navigate to the "Description" Data Column in the Node Tree and select "Primary Column" under the "Column Settings" heading.



(1) Change the "Primary Column" property from "False" to "True". (2) Press the Save button when finished.



Follow the same process to make the "Code" and "Total" Data Columns Primary Columns.

Now that we have Primary Columns, the Header-Value method will be able to model the table's structure and extract the data inside. Navigate back to the Data Table and press the Test Extraction button.


Configuring Header-Value for the Variation Problem

! Some of the tabs in this tutorial are longer than the others. Please scroll to the bottom of each step's tab before going to the next step.

Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

This tutorial tackles problems of variation. The Header-Value extraction method is very useful when the desired information is present across multiple documents, but the table structure containing it changes from one document to the next. These documents are several production reports filed with the Oklahoma Corporation Commission by various oil and gas companies. The raw text data has already been extracted using OCR via the Recognize activity.

Add a Data Table

Create a Data Table with four Data Columns. The four columns for our example are "Well Name", "PUN", "Prod Code", and "Volume". Refer to the Creating a Data Table in Grooper section above for more information on adding a Data Table to a Data Model.


Set the Extract Method

First, set the "Extract Method" property to "Header-Value". (1) Select the Data Table object in the Node Tree. (2) Select the "Extract Method" property. And (3) using the dropdown list, select "Header-Value".



Get a Game Plan

We went through a simple example of Header-Value extraction in the how-to tutorial above. With a larger document set, it's a good idea to put in a little work on the front end to analyze the documents in front of you and get an idea of what your workflow is going to be. Let's break down what we're doing in Grooper for the Header-Value method.

1. Create Header Extractors

  • Each document will have its own way of labeling columns. For example, a Product Code may be on the page as "Prod Code", "Prod. Code", "PC", "Product Code" and more. A single pattern isn't going to cut it here. We'll want to create extractors for each column header and reference them on the Data Columns.

2. Create Value Extractors

  • Just as the columns have their own formats, the values in those columns likely will too. We can get an idea of the extractors we'll want to write by looking through the document set. We'll probably need a somewhat generic text segment extractor to pull the well names, and there are a couple of formats we'll need to tackle for the production unit number. Given this variation from one table style to the next, we'll build extractors to reference from each Data Column's Value Extractor as well.

3. Configure the Data Columns

  • This will mostly be setting references to the extractors we build, but we can start thinking about which Data Columns should be Primary Columns.

4. Configure the Header-Value properties on the Data Table

  • We will do some troubleshooting at the end to see if we need to adjust the default properties to get better results.

Create the Header Extractors

Let's Get Organized - Make a Column Headers Folder

In the Local Resources folder of the Content Model, add a folder named "Column Headers". We will create all our Header Extractors here.

Right click the Local Resources folder. Hover over "Add" and select "Folder..."



Name the folder "Column Headers" and press "OK".



Create a Header Extractor - Production Unit Number

First, we will create an extractor to find the "PUN" column headers. Right click the "Column Headers" folder. Hover over "Add" and select "Data Type". This will create a new Data Type extractor that we will configure to find all the variations of the header labels containing Production Unit Numbers.



Name it "COL HEAD - PUN" and press the "OK" button. This naming will help us know the extractor is for a column header (COL HEAD) targeting production unit numbers (PUN).



Now we need to write a regular expression to match each document's column header. Select the "Pattern" property and press the ellipsis button at the end to bring up the Pattern Editor.



We have nine different reports in this document set. Almost all nine have a different way of labeling the column that contains the production unit number.

PRODUCTION UNIT #
OTC Lease Number / Production Unit Number (PUN)
STATE LEASE NUMBER
OKLAHOMA PUN
OTC LEASE NO.
Production Unit Number
OTC Lease Number / Production Unit Number (PUN) (again)
Production Unit Number (again)
PRODUCTION UNIT # (This one will be a special case)


Make a Regex List of Labels

For the first eight documents, we can create a list of regular expressions matching these labels separated by the vertical bar |. The following would be the regex entered in the "Value Pattern" window.

PRODUCTION UNIT #|
OTC Lease Number / Production Unit Number [(]PUN[)]|
STATE LEASE NUMBER|
OKLAHOMA PUN|
OTC LEASE NO[.]|
Production Unit Number
FYI Certain characters (called metacharacters) perform special operations in regular expressions. These include the dot . which is a wildcard character that can match any character, and the parentheses ( and ) which are used to make groups. If you want to literally match these characters, they need to be "escaped". This can be done by placing a backslash \ before the character or by placing the character in a character set [].

So, \. or [.] would match only a period on the page. And \( or [(] will match a starting parenthesis character on the page.
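
Here is the same alternation list, with the escaped metacharacters, tested in Python's re module. Grooper uses .NET regular expressions, but these particular patterns behave identically; the sample labels below are typed in by hand purely for the test.

 import re

 # The list of header label variations, joined with the alternation bar.
 pun_header = re.compile(
     r"PRODUCTION UNIT #|"
     r"OTC Lease Number / Production Unit Number [(]PUN[)]|"
     r"STATE LEASE NUMBER|"
     r"OKLAHOMA PUN|"
     r"OTC LEASE NO[.]|"
     r"Production Unit Number",
     re.IGNORECASE,
 )

 labels = [
     "OTC Lease Number / Production Unit Number (PUN)",
     "OTC LEASE NO.",
     "OKLAHOMA PUN",
 ]
 print(all(pun_header.fullmatch(label) for label in labels))   # True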



Press the "OK" button when finished to exit the Pattern Editor. Save your work on the "COL HEAD - PUN" extractor.



Ordered Arrays to Target Header Labels

If you recall, when we looked at all the variations of this header label, we noted one would be a special case. Most of our header labels read left to right on a single line. However, the label in the table below reads top to bottom across two lines. Regular expressions match along the normal text flow, left to right, one character after the next. Since the label in this example spans two lines, the simple regex PRODUCTION UNIT # does not match.



This is very common when dealing with tables. We need to account for this format by creating a child extractor which will return its result to its parent (the "COL HEAD - PUN" extractor). We will add a Data Type child extractor that matches each line individually and uses Ordered Array collation to combine the results, creating a single label for the Header Extractor to use. Since this will be a child extractor, its result will be combined with the list of header labels we wrote earlier.

FYI
Data Type extractors "inherit" the results of any extractor added as a "child". Every result the child extractors find on a page gets passed up to the parent extractor. A child extractor can be a simple Data Format extractor or even another Data Type extractor (with children of its own).


Right click the "COL HEAD - PUN" extractor to add a child extractor. Hover over "Add" and select "Data Type"



Since we will use Ordered Array collation with a vertical layout to put the two lines together, name the extractor "OAV - 2 Lines". Press the "OK" button to create the extractor.



This extractor will have two children of its own, both Data Formats: one looking for the word PRODUCTION and another for the phrase UNIT #. Add two Data Formats by right clicking the "OAV - 2 Lines" extractor, selecting "Add" and then "Data Format". Name the first "Line 1" and the second "Line 2".



For the first Data Format, "Line 1", write PRODUCTION in the Value Pattern box.



For the second Data Format, "Line 2", write UNIT # in the Value Pattern box.



Navigate back to the "OAV - 2 Lines" extractor, the parent of these two Data Formats. For each page, we are getting individual results from each Data Format. By default, a Data Type's collation is set to "Individual", which gives us a list of all results from the child extractors, one after the other. We want to combine these two results into one.



Select the "Collation" property and using the dropdown list choose "Ordered Array".



Ordered Arrays combine the results of child extractors in the order in which the children appear. This is important because it impacts the order in which you create the child extractors. We are looking to produce a result for "Production Unit #", not "Unit # Production". The "Line 1" format needs to come first because the first word is "Production".

Ordered Arrays also need a "layout". This can be "Horizontal" (in which results are in a line left to right), "Vertical" (in which results are in a line top to bottom), or "Flow" (in which results are in a text flow left to right, top to bottom). We need to use "Vertical Layout" since our two lines are stacked on top of each other.

Expand the "Collation" property. Select "Vertical Layout" and change the property from "Disabled" to "Enabled". This will combine the results of the child extractors, giving us a single result we can use to match the column header for this document.



Dealing With Bad OCR - FuzzyRegEx

Let's check our pattern and do some quick validation. Expand out the first document and quickly check each page.




We don't have to go too far before we find a page where our pattern fails. Check out Page 10. We do not get a match here.

