2.80:Header-Value (Table Extract Method): Difference between revisions

From Grooper Wiki

Revision as of 10:20, 22 January 2020

Header-Value is one of three methods available to Data Table elements to extract information from tables on a document set. It uses a combination of column header and column value extractors to determine the table’s structure and extract information from the table’s cells.

About

Where the Row Match method uses a table’s rows to model its structure and extract data, Header-Value looks to the table’s columns. Extractors find the header labels and the values in those columns. This is very similar to how a human being reads a table. Imagine you're trying to find a piece of information in the table below, say a particular order identification number. The first thing you do is look for the Order ID column. That is what the Header Extractor does. Then you look for the number you want in that column. That is what the column's Value Extractor does (although our goal in Grooper is to capture all the values in the column, not just one).

The Header Extractor locates a column's header label.

The Data Column's Value Extractor locates the column's values, searching downward from the position of the Header Extractor's result.


As the name implies, both “Header” extractors and “Value” extractors are required for this method to function. These extractors are configured on each of the Data Columns.
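The header-then-values idea described above can be sketched in a few lines of Python. This is only a conceptual illustration, not Grooper's implementation: the `Token` class, the `extract_column` function, the `tolerance` parameter, and the sample data are all hypothetical, and the sketch assumes each piece of text on the page comes with a known position.

```python
import re
from dataclasses import dataclass


@dataclass
class Token:
    """A piece of text on the page with its position (hypothetical model)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width


def extract_column(tokens, header_pattern, value_pattern, tolerance=10.0):
    """Find a header label, then collect matching values beneath it."""
    # Step 1 ("Header Extractor"): locate the column's header label.
    header = next(
        (t for t in tokens if re.fullmatch(header_pattern, t.text)), None
    )
    if header is None:
        return []  # without a header label the column cannot be located

    # Step 2 ("Value Extractor"): look down from the header, keeping only
    # tokens that sit under it horizontally and match the value pattern.
    left = header.x - tolerance
    right = header.x + header.w + tolerance
    below = [
        t for t in tokens
        if t.y > header.y
        and left <= t.x <= right
        and re.fullmatch(value_pattern, t.text)
    ]
    return [t.text for t in sorted(below, key=lambda t: t.y)]


# Made-up sample data for illustration only.
tokens = [
    Token("Order ID", 10, 0, 60),
    Token("Customer", 100, 0, 60),
    Token("10452", 12, 20, 40),
    Token("Skywalker", 100, 20, 70),
    Token("10453", 12, 40, 40),
    Token("Organa", 100, 40, 50),
]

print(extract_column(tokens, r"Order ID", r"\d{5}"))      # ['10452', '10453']
print(extract_column(tokens, r"Customer", r"[A-Za-z]+"))  # ['Skywalker', 'Organa']
```

Each column needs its own pair of patterns, which mirrors the requirement that every Data Column has both a header extractor and a value extractor.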



--- loose explanation using Star Wars table of header-value extraction---

Version Differences

Use Cases

The Header-Value method is the second table extraction method created in Grooper. It was made to target tables not easily extracted by Row Match. Row Match loses its efficiency once a table's structure starts to change from document to document. Different companies structure tables however they want, which is well outside your control. Think of all the different ways an invoice can be structured. While the information you want is present in all the different tables, how that data is presented may not be consistent. Even a change in column position can present problems for Row Match. Its approach of using a Row Extractor to pattern the table may not be able to do the job (or may require a complicated Row Extractor that accounts for multiple row formats). For these situations, the Header-Value method is often easier to configure and produces better results.

These are Oil and Gas Production Reports from various sources. Each one organizes its information into tables differently. Row Match would work just fine for each individual document. However, while the same information exists on each document, there's enough variability in the table structures that Row Match may not be suited to processing the whole document set. Header-Value is usually a better route.
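To make the variation problem concrete, here is a rough Python sketch contrasting a single fixed row pattern with a header-driven lookup. It is an analogy only and does not use Grooper's extractors: the two reports, `ROW_PATTERN`, `rows_by_pattern`, and `column_by_header` are invented for illustration, and the tables are assumed to be already split into cells.

```python
import re

# Two made-up production reports holding the same data in different column orders.
report_a = [
    ["Well", "Oil (BBL)", "Gas (MCF)"],
    ["Smith 1", "120", "450"],
    ["Jones 2", "95", "310"],
]
report_b = [
    ["Gas (MCF)", "Well", "Oil (BBL)"],
    ["450", "Smith 1", "120"],
    ["310", "Jones 2", "95"],
]

# Row-oriented approach: one pattern that assumes the order Well, Oil, Gas.
ROW_PATTERN = re.compile(r"(?P<well>[A-Za-z]+ \d+)\s+(?P<oil>\d+)\s+(?P<gas>\d+)")


def rows_by_pattern(rows):
    """Match every body row against the single fixed row pattern."""
    return [
        m.groupdict()
        for row in rows[1:]
        if (m := ROW_PATTERN.fullmatch(" ".join(row)))
    ]


def column_by_header(rows, header_pattern, value_pattern):
    """Column-oriented approach: find the header first, then read down the column."""
    header_row, *body = rows
    for index, label in enumerate(header_row):
        if re.fullmatch(header_pattern, label, flags=re.IGNORECASE):
            return [
                row[index] for row in body
                if re.fullmatch(value_pattern, row[index])
            ]
    return []


print(len(rows_by_pattern(report_a)))                # 2
print(len(rows_by_pattern(report_b)))                # 0, the column order changed
print(column_by_header(report_a, r"oil.*", r"\d+"))  # ['120', '95']
print(column_by_header(report_b, r"oil.*", r"\d+"))  # ['120', '95']
```

The row pattern could be extended to cover the second layout, but every new layout would need another alternative, whereas the header-driven lookup is unaffected by column order.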

Optional data columns, where values may or may not be present in a cell, can complicate things for Row Match as well. Again, a simple Row Extractor may not do the trick. While a more complicated extractor may successfully extract the table's information, the Header-Value method (or the Infer Grid method) may be simpler to set up and produce the same or even better results.
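One way to picture why header positions help with optional columns is the sketch below, which assigns each value to the nearest header so that an empty cell stays empty instead of shifting its neighbors. This is a simplified illustration under assumed inputs, not a description of Grooper's internals: the `Cell` class, `build_rows`, the `tolerance` parameter, and the data are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Text with a position on the page (hypothetical model)."""
    text: str
    x: float
    y: float


def build_rows(headers, cells, tolerance=15.0):
    """Assign each cell to the nearest header, grouping cells by their y position."""
    rows = {}
    for cell in cells:
        nearest = min(headers, key=lambda h: abs(h.x - cell.x))
        if abs(nearest.x - cell.x) > tolerance:
            continue  # not under any known header
        # Every row starts with all columns empty; missing cells stay None.
        row = rows.setdefault(cell.y, {h.text: None for h in headers})
        row[nearest.text] = cell.text
    return [rows[y] for y in sorted(rows)]


# Made-up data: the second row has no value in the optional Discount column.
headers = [Cell("Item", 0, 0), Cell("Discount", 100, 0), Cell("Total", 200, 0)]
cells = [
    Cell("Blaster", 0, 20), Cell("5.00", 100, 20), Cell("95.00", 200, 20),
    Cell("Droid", 0, 40), Cell("300.00", 200, 40),
]

for row in build_rows(headers, cells):
    print(row)
```

Because the Total value is matched to its own header, it is not mistaken for a Discount when the Discount cell is blank, which is the kind of ambiguity a single row pattern has to be written around.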

However, the Header-Value method does have its limitations. Perhaps most obviously, header labels are necessary for this method to work. In tables where header labels are not present, Header-Value will not be suitable for use.

Furthermore, the Header-Value method requires several extractors to detect a table’s structure and extract the values inside: at least two extractors for every Data Column (one for its header and one for its values). As a result, there are more components to configure in order to extract a table’s information. For relatively simple tables, Row Match ends up being simpler to set up, taking less time and using fewer objects.

The Infer Grid method also has some advantages over Header-Value. There are some specialized use cases, such as reading OMR checkboxes in tables and reprocessing table cells using a secondary OCR profile, where Infer Grid does things the other two methods simply can’t. Infer Grid also performs well when table line information can be saved to a page’s layout data.

How To

Configuring Header-Value for the Missing Cells Problem

Configuring Header-Value for the Variation Problem