2.80:Table Extraction (Concept)
Revision as of 12:32, 7 January 2020

Tables are one of the most common ways to organize data in documents. People were recording information in tables before they wrote literature, and even before paper was invented. Tables are excellent structures for representing, in a relatively small space, many pieces of information that share common characteristics. However, targeting the data inside them presents its own set of challenges. A table’s structure can range from simple and straightforward to complex (and even confounding). Different organizations may also lay out the same data differently, producing different tables for what is essentially the same data.
In Grooper, tabular data can be extracted using the Row Match, Header-Value, or Infer Grid table extraction methods.
What Is a Table?
Tables consist of rows and columns. Where a row and a column intersect is a cell, the individual unit of the table that holds a single piece of data. Each row has the same number of columns (although some cells may be empty in a given row). A single column holds the same type of information in every row. For example, an "Order Date" column will always have dates in the cells below it. The rows themselves are usually (but not always) in some order as well, such as ascending date.
[Images: Table row | Table column | Table cell]
This may seem obvious, but understanding how data is structured on the page informs how you will use Grooper to target it.
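To make the row, column, and cell vocabulary concrete, here is a minimal Python sketch of a small table. It is only an illustration of the structure described above, not anything Grooper exposes: the order data, the `columns` and `rows` names, and the `cell` and `column_values` helpers are all made up for this example.

```python
from datetime import date

# Column names for a small, made-up order table (illustrative data only).
columns = ["Order Date", "Item", "Quantity"]

# Each row has exactly one cell per column; None marks an empty cell.
rows = [
    [date(2020, 1, 3), "Widget", 4],
    [date(2020, 1, 5), "Gasket", None],
    [date(2020, 1, 7), "Bracket", 12],
]


def cell(row_index, column_name):
    """Return the cell where the given row and column intersect."""
    return rows[row_index][columns.index(column_name)]


def column_values(column_name):
    """Return every cell in one column, top to bottom."""
    i = columns.index(column_name)
    return [row[i] for row in rows]


# Every row has the same number of columns.
assert all(len(row) == len(columns) for row in rows)

# Every non-empty cell in the "Order Date" column holds a date.
assert all(isinstance(v, date) for v in column_values("Order Date") if v is not None)
```

Here `cell(1, "Quantity")` is an empty cell, yet the second row still has three columns, which is the property the paragraph above describes.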

In Grooper, tables are represented as Data Table objects in a Data Model. Each column is represented as a Data Column object, created as a child of the Data Table. Rows and their individual cells are not configured ahead of time; they are created and populated upon successful data extraction.
How raw text data is targeted and extracted to populate each row in the table is determined by the Data Table object's "Extract Method" property. This can be set to the Row Match, Header-Value, or Infer Grid method.
To the left is the Grooper representation of a table, as seen in the Node Tree.
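The parent and child relationship described above can be pictured with a short Python sketch. This is a hypothetical model for illustration only: Grooper objects are configured in the Grooper interface, and the class names, fields, and `add_row` helper below are assumptions invented here, not Grooper's API. Only the three Extract Method names come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class ExtractMethod(Enum):
    """The three table extraction methods named in the text."""
    ROW_MATCH = "Row Match"
    HEADER_VALUE = "Header-Value"
    INFER_GRID = "Infer Grid"


@dataclass
class DataColumn:
    """Hypothetical stand-in for a Data Column: one per table column."""
    name: str


@dataclass
class DataTable:
    """Hypothetical stand-in for a Data Table and its child columns."""
    name: str
    extract_method: ExtractMethod
    columns: list = field(default_factory=list)  # defined before extraction
    rows: list = field(default_factory=list)     # filled only by extraction

    def add_row(self, values):
        """Store one extracted row as a dict keyed by column name."""
        names = [c.name for c in self.columns]
        if len(values) != len(names):
            raise ValueError("a row needs exactly one cell per column")
        self.rows.append(dict(zip(names, values)))


@dataclass
class DataModel:
    """Hypothetical stand-in for a Data Model that holds Data Tables."""
    name: str
    tables: list = field(default_factory=list)


orders = DataTable(
    "Orders",
    ExtractMethod.ROW_MATCH,
    columns=[DataColumn("Order Date"), DataColumn("Item")],
)
model = DataModel("Invoice", tables=[orders])

# Rows do not exist until something has been extracted.
assert orders.rows == []
orders.add_row(["2020-01-07", "Widget"])
```

The point of the sketch is the split in responsibilities: columns are declared up front as children of the table, while rows and cells appear only as a result of extraction.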