2.80:Header-Value (Table Extract Method): Difference between revisions

Revision as of 10:54, 23 January 2020

Header-Value is one of three methods available to Data Table elements to extract information from tables on a document set. It uses a combination of column header and column value extractors to determine the table’s structure and extract information from the table’s cells.

About

Where the Row Match method focuses on using a table’s rows to model table structure and extract data, Header-Value looks to the table’s columns. Extractors are used to find the header labels and the values in those columns. This is very similar to how a human being reads a table. Imagine you're trying to find a piece of information in the table below, say a particular order identification number. The first thing you're going to do is look for the Order ID column. That is what the Header Extractor does. Then, you're going to look for the number you want in that column. That's what the column's Value Extractor does (although, of course, our goal in Grooper is to capture all the values in the column).

The Header Extractor locates a column's header label.

The Data Column's Value Extractor locates the column's values, using the Header Extractor's result as the starting point to look down from.

As the name implies, both “Header” extractors and “Value” extractors are required for this method to function. These extractors are configured on each Data Column.
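The idea of pairing a header extractor with a value extractor can be sketched in a few lines of Python. This is only an illustration of the concept, not Grooper's implementation: the `Word` class, the `extract_column` function, the coordinates, and the sample patterns are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Word:
    """A piece of text with its position on the page (hypothetical model)."""
    text: str
    x: float  # left edge
    y: float  # top edge; larger values are lower on the page
    w: float  # width


def extract_column(words, header_pattern, value_pattern):
    """Find the header label, then collect matching values beneath it."""
    header = next(
        (wd for wd in words if re.fullmatch(header_pattern, wd.text)), None
    )
    if header is None:
        return []  # without a header label there is nothing to anchor to
    left, right = header.x, header.x + header.w
    values = [
        wd for wd in words
        if wd.y > header.y                        # below the header
        and wd.x < right and wd.x + wd.w > left   # overlaps it horizontally
        and re.fullmatch(value_pattern, wd.text)  # looks like a column value
    ]
    return [wd.text for wd in sorted(values, key=lambda wd: wd.y)]


# Made-up sample data: two header labels and two rows of values.
words = [
    Word("Order ID", 10, 0, 60), Word("Qty", 100, 0, 30),
    Word("10248", 12, 20, 40), Word("3", 105, 20, 8),
    Word("10249", 12, 40, 40), Word("12", 103, 40, 14),
]
order_ids = extract_column(words, r"Order ID", r"\d{5}")
quantities = extract_column(words, r"Qty", r"\d+")
```

Note that the quantity pattern `\d+` would also match the order numbers on its own. In this sketch, it is the header's position that restricts the search to the right column.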

--- loose explanation using Star Wars table of header-value extraction---

Version Differences

Use Cases

The Header-Value method is the second table extraction method created in Grooper. It was made to target tables not easily extracted by Row Match. Row Match loses its efficiency once a table's structure starts to change from document to document. Different companies are going to structure tables however they want, which is well outside your control. Think of all the different ways an invoice can be structured. While the information you want is present in all the different tables, how that data is presented may not be consistent. Even just a column's location changing can present problems for Row Match. Its approach of using a Row Extractor to pattern the table may not be able to do the job (or a complicated Row Extractor accounting for multiple row formats may need to be used). For these situations, the Header-Value method is often easier to configure and produces better results.
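A small Python sketch can illustrate why a change in column order hurts a row pattern but not a header-driven lookup. Everything here is assumed for illustration: the two "vendor" layouts, the `ROW_PATTERN` regular expression, and both helper functions are hypothetical and do not reflect how Grooper's extractors are written.

```python
import re

# Hypothetical row pattern tuned to one layout: id, then date, then amount.
ROW_PATTERN = re.compile(
    r"(?P<id>\d{5})\s+(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<amount>\d+\.\d{2})"
)

# The same information, laid out in a different column order by each source.
vendor_a = ["Order ID   Date        Amount", "10248      01/23/2020  19.99"]
vendor_b = ["Date        Amount  Order ID", "01/23/2020  19.99   10248"]


def rows_by_pattern(lines):
    """Row-oriented: keep only lines that fit the fixed row pattern."""
    return [m.groupdict() for line in lines if (m := ROW_PATTERN.search(line))]


# Hypothetical header labels, one per column we care about.
HEADERS = {"id": r"Order ID", "date": r"Date", "amount": r"Amount"}


def rows_by_header(lines):
    """Column-oriented: read the column order from the header line first."""
    header, *body = lines
    order = sorted(
        HEADERS, key=lambda name: re.search(HEADERS[name], header).start()
    )
    # Simplification: cells are whitespace-separated and never blank.
    return [dict(zip(order, line.split())) for line in body]
```

With these inputs, `rows_by_pattern(vendor_b)` finds nothing because the row no longer fits the pattern, while `rows_by_header` returns the same record for both layouts.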

These are different Oil and Gas Production Reports from various sources. Each one organizes information into tables in different ways. Row Match would work just fine for each individual document. However, while the same information exists on each document, there's enough variability in the table structures that Row Match may not be suited for processing the whole document set. Header-Value is usually a better route.

Optional data columns, where values may or may not be present in a cell, can complicate things for Row Match as well. Again, a simple Row Extractor may not do the trick. While a more complicated extractor may successfully extract the table's information, the Header-Value method (or the Infer Grid method) may be simpler to set up and produce the same or even better results.

However, the Header-Value method does have its limitations. Perhaps most obviously, header labels are necessary for this method to work. In tables where header labels are not present, Header-Value will not be suitable for use.

Furthermore, the Header-Value method requires several extractors to detect a table’s structure and extract the values inside: at least two extractors for every Data Column (one for its header and one for its values). Because of this, there are several components to configure in order to extract a table’s information. For relatively simple tables, Row Match ends up being simpler to set up, since it takes less time and uses fewer objects.

The Infer Grid method also has some advantages over Header-Value. There are some specialized use cases, such as reading OMR checkboxes in tables and reprocessing table cells using a secondary OCR profile, where Infer Grid does things the other two methods simply can’t. Infer Grid also performs well when table line information can be saved to a page’s layout data.

How To

Creating a Data Table in Grooper


Before you begin

A Data Table is a Data Element used to model and extract a table's information on a document. Just like other Data Elements, such as Data Fields and Data Sections, Data Tables are created as children of a Data Model. This guide assumes you have created a Content Model with a Data Model.

We will use the table below as our example for creating a Data Table.

Navigate to a Data Model

Using the Node Tree on the left side of Grooper Design Studio, navigate to the Data Model you wish to add the Data Table to. Data Tables can be created as children of any Data Model at any level of a Content Model's hierarchy.

Add a Data Table

Right-click the Data Model object, mouse over "Add" and select "Data Table".

The following window will appear. Name the table whatever you would like and press "OK" when finished.

This creates a new Data Table object in the Node Tree underneath the Data Model.

Add Data Columns

Right-click the Data Table object, mouse over "Add" and select "Data Column".

This brings up the following window to name the Data Column. When finished, press "OK" to create the object.

This creates a new Data Column object in the Node Tree underneath the Data Table.

Repeat Until Finished

Add as many columns as necessary to complete the table. For our example, we have a single Data Table with five Data Columns, each one named for the corresponding column on the document.
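The resulting parent-child arrangement can be pictured as a small tree. The sketch below is a rough model only: the `Node` class, the `render` helper, and the table and column names are placeholders invented for illustration, not Grooper objects or the names used in the example document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A named object in a tree of parent and child nodes (hypothetical)."""
    name: str
    kind: str
    children: List["Node"] = field(default_factory=list)

    def add(self, name, kind):
        """Create a child node and return it."""
        child = Node(name, kind)
        self.children.append(child)
        return child


def render(node, depth=0):
    """Return the tree as indented lines, one per node."""
    lines = [f"{'  ' * depth}{node.kind}: {node.name}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Placeholder names: one Data Table with five Data Columns beneath it.
model = Node("Data Model", "Data Model")
table = model.add("Example Table", "Data Table")
for number in range(1, 6):
    table.add(f"Column {number}", "Data Column")

print("\n".join(render(model)))
```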

Configuring Header-Value for the Missing Cells Problem

Many tables have optional columns. Data may or may not exist in those cells for a given row. Since the Row Match method works by making patterns to match each row, this can cause problems. Sometimes, a single pattern doesn't cut it, and multiple patterns must be used in order to properly model each row. You may end up making multiple extractors to account for every row's variation, one for if a value is in the optional column, one for if it is not there. The more optional columns on the table, the more variations in the row's pattern you have to account for. This can become very messy depending on the size of the table.
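To get a feel for how quickly the row variations pile up, the sketch below enumerates every combination of present and absent optional cells. The column names and the `row_variants` function are hypothetical. The count is an upper bound, since in practice one pattern may cover more than one variation.

```python
from itertools import product


def row_variants(columns):
    """List every column combination a row could show.

    `columns` is a list of (name, is_optional) pairs in left-to-right order.
    """
    optional = [name for name, is_opt in columns if is_opt]
    variants = []
    for flags in product([True, False], repeat=len(optional)):
        present = dict(zip(optional, flags))
        # Required columns are always kept; optional ones only when flagged.
        variants.append(
            [name for name, _ in columns if present.get(name, True)]
        )
    return variants


# Hypothetical table with two optional columns.
columns = [
    ("Item", False), ("Description", False),
    ("Discount", True), ("Tax", True), ("Total", False),
]
variants = row_variants(columns)
```

Two optional columns give four possible row shapes, three give eight, and so on, doubling with each optional column added.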

Header-Value works differently. Rather than working off each row's pattern, it looks to the header labels and values underneath to

@@ Line 37: / Line 37: @@
 {|style="margin:auto"
-|+These are different Oil and Gas Production Reports from various sources.  Each one organizes information differently into tables in different ways.  Row Match would work just fine for each individual document.  However, while the same information exists on each document, there's enough variability in the table structures that Row Match may not be suited for processing the whole document set.  [[Header-Value (Table Extract Method)|Header-Value]] is usually a better route.
+|+These are different Oil and Gas Production Reports from various sources.  Each one organizes information differently into tables in different ways.  Row Match would work just fine for each individual document.  However, while the same information exists on each document, there's enough variability in the table structures that Row Match may not be suited for processing the whole document set.  Header-Value is usually a better route.
 |[[File:50939766 Page 1.png|border|200px]]||[[File:Spreadsheet with lines.png|border|200px]]||[[File:50946163.png|border|200px]]||[[File:50946164 Page 1.png|border|200px]]
 |}
@@ Line 51: / Line 51: @@
 == How To ==
-== Configuring Header-Value for the Missing Cells Problem ==
+=== Creating a Data Table in Grooper ===
-== Configuring Header-Value for the Variation Problem ==
+<tabs style="margin:20px">
+<tab name="Prereqs" style="margin:20px">
+==== Before you begin ====
+A Data Table is a Data Element used to model and extract a table's information on a document.  Just like other [[Data Element]]s, such as [[Data Field]]s and [[Data Section]]s, Data Tables are created as children of a [[Data Model]].  This guide assumes you have created a [[Content Model]] with a [[Data Model]].
+We will use the table below as our example for creating a Data Table.
+[[File:Simpletable.png|center]]
+</tab>
+<tab name="Step 1" style="margin:20px">
+==== Navigate to a Data Model ====
+Using the [[Node Tree]] on the left side of Grooper Design Studio, navigate to the [[Data Model]] you wish to add the Data Table to.  Data Tables can be created as children of any Data Model at any hierarchy in a Content Model.
+[[file:create a data table 1.png|900px]]
+</tab>
+<tab name="Step 2" style="margin:20px">
+==== Add a Data Table ====
+Right click the [[Data Model]] object, mouse over "Add" and select "Data Table"
+[[file:create a data table 2.png|900px]]
+The following window will appear.  Name the table whatever you would like and press "OK" when finished.
+[[file:create a data table 3.png|center]]
+This creates a new Data Table object in the [[Node Tree]] underneath the [[Data Model]].
+[[file:create a data table 4.png|center|900px]]
+</tab>
+<tab name="Step 3" style="margin:20px">
+==== Add Data Columns ====
+Right click the Data Table object, mouse over "Add" and select "Data Column"
+[[File:Create data table 5.png|900px]]
+This brings up the following window to name the Data Column.  When finished, press "OK" to create the object.
+[[file:create data table 6.png|center]]
+This creates a new Data Column object in the [[Node Tree]] underneath the [[Data Model]].
+[[file:create data table 7.png|900px]]
+</tab>
+<tab name="Step 4" style="margin:20px">
+==== Repeat Until Finished ====
+Add as many columns as necessary to complete the table.  For our example, we have a single Data Table with five Data Columns, each one named for the corresponding column on the document.
+[[file:create data table 8.png|900px]]
+</tab>
+</tabs>
+=== Configuring Header-Value for the Missing Cells Problem ===
+Many tables have optional columns.  Data may or may not exist in those cells for a given row.  Since the Row Match method works by making patterns to match each row, this can cause problems.  Sometimes, a single pattern doesn't cut it, and multiple patterns must be used in order to properly model each row.  You may end up making multiple extractors to account for every row's variation, one for if a value is in the optional column, one for if it is not there.  The more optional columns on the table, the more variations in the row's pattern you have to account for.  This can become very messy depending on the size of the table.
+Header-Value works differently, rather than working off each row's pattern, it looks to the header labels and values underneath to
+=== Configuring Header-Value for the Variation Problem ===