Data Table (Node Type)
STUB
This article is a stub. It contains minimal information on the topic and should be expanded.
A Data Table is a Data Element specialized in extracting tabular data from documents (i.e. data formatted in rows and columns).
- The Data Table itself defines the "Table Extract Method". This is configured to determine the logic used to locate and return the table's rows.
- The table's columns are defined by adding Data Column nodes to the Data Table (as its children).
Data is extracted using one of the Table Extract Methods described below. Each method takes a different approach to modeling a table's structure.
About
Many documents contain data in a table, presented on the page as some kind of grid of information divided into rows and columns. The Data Table object's purpose is to define the processing logic to model and collect tabular data.
A Data Table can be added to your Data Model to extract data from the table's cells. Once added to your Data Model, you will add Data Columns to the Data Table. You can add as many columns as you need to collect data from all (or only some) of the table's columns on the document. Data Columns will exist as children of the Data Table in the node hierarchy.
- Data Columns also allow for additional configuration, such as assigning the Value Type for the extracted data in that column (decimal, string, Boolean, etc.).
- There are several table extraction methods. Some require configuration of a Data Table's Data Columns. Others do not, or make that configuration optional. Please visit the Table Extraction article for more information on tabular data extraction in general.
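The parent/child relationship between a Data Table and its Data Columns, and the per-column Value Type, can be pictured with a small Python sketch. This is only an illustration: `DataTable`, `DataColumn`, `typed_row` and `parse_bool` are hypothetical names invented for this example and are not Grooper classes or APIs.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from typing import Callable


# Hypothetical stand-ins for Grooper nodes; not part of any Grooper API.
@dataclass
class DataColumn:
    name: str
    # Plays the role of a Value Type: converts raw cell text to a typed value.
    value_type: Callable[[str], object] = str


@dataclass
class DataTable:
    name: str
    # Child columns, kept in node order.
    columns: list[DataColumn] = field(default_factory=list)

    def add_column(self, column: DataColumn) -> None:
        self.columns.append(column)

    def typed_row(self, raw_cells: list[str]) -> dict[str, object]:
        """Pair raw cell text with the columns in order and convert each cell."""
        return {
            col.name: col.value_type(text.strip())
            for col, text in zip(self.columns, raw_cells)
        }


def parse_bool(text: str) -> bool:
    """Assumed Boolean convention for this sketch only."""
    return text.lower() in {"true", "yes", "y", "1"}


line_items = DataTable("Line Items")
line_items.add_column(DataColumn("Description"))
line_items.add_column(DataColumn("Quantity", int))
line_items.add_column(DataColumn("Unit Price", Decimal))
line_items.add_column(DataColumn("Taxable", parse_bool))

row = line_items.typed_row(["Widget", " 3 ", "19.99", "Yes"])
print(row)
```

Each extracted row ends up as one typed value per Data Column, which is roughly what a reviewer later sees as the "cells" of that row.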
FYI
Generally, the first Data Column under your Data Table corresponds to the leftmost column on the document, the second Data Column to the second column, and so on, with the last corresponding to the rightmost. However, this is not strictly necessary. You can reorder the Data Columns as you see fit by right-clicking a Data Column in the Node Tree and choosing either "Move Up" or "Move Down". Keyboard shortcuts are also available.
Table Extract Methods
There are six different extraction methods available in Grooper. Using the Data Table's Extract Method property, you select and configure one of the following:
- Row Match - This uses an extractor to match each row. You could reference a Data Type extractor that returns each whole row in the table to populate the rows in the Data Table.
- Grid Layout - This method creates a grid from header positions, using extractors to match column and (sometimes optionally) row headers. Once the grid is created ("inferred" from the column and row header positions) it extracts the corresponding text data from the cells within the grid.
- Tabular Layout - This method is an improvement upon the Header-Value method. It also detects a table's layout using the table's column headers and the value extractors defined on the Data Column objects. However, in general, it requires much less configuration up front and offers more ability to fine-tune the configuration according to your needs. This method can also make efficient use of Label Sets to aid in table extraction.
- Fluid Layout - This method requires Label Sets in order to function. It can be configured to use either the Row Match or the Tabular Layout method based on how a Document Type's labels are collected.
- Delimited Extract - This method allows for efficient extraction of character-delimited text files, such as CSV files.
- Fixed Width - This method reads tabular data from "fixed width" formatted text files.
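To make the difference between the text-oriented methods more concrete, here is a minimal Python sketch of the general ideas behind Row Match (one pattern per row), Delimited Extract (split on a delimiter) and Fixed Width (slice by character position). It is not Grooper's implementation; the function names, the sample pattern and the field positions are all assumptions made up for this example.

```python
import csv
import io
import re

# Row Match idea: a single pattern describes a whole row; named groups become cells.
# The pattern below is an assumed example, not a Grooper extractor.
ROW_PATTERN = re.compile(
    r"^(?P<description>.+?)\s+(?P<quantity>\d+)\s+(?P<amount>\d+\.\d{2})$"
)


def row_match(text: str) -> list[dict[str, str]]:
    """Return one dict per line that matches the row pattern."""
    rows = []
    for line in text.splitlines():
        match = ROW_PATTERN.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


def delimited_extract(text: str, delimiter: str = ",") -> list[dict[str, str]]:
    """Read character-delimited text whose first line holds the column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(record) for record in reader]


def fixed_width(text: str, fields: dict[str, tuple[int, int]]) -> list[dict[str, str]]:
    """Slice each non-empty line at fixed (start, end) character positions."""
    return [
        {name: line[start:end].strip() for name, (start, end) in fields.items()}
        for line in text.splitlines()
        if line.strip()
    ]


items = [("Blue widget", "3", "59.97"), ("Red gadget", "12", "120.00")]

# The same two rows in three shapes: page text, CSV, and fixed-width text.
page_text = "Invoice 1001\nBlue widget 3 59.97\nRed gadget 12 120.00\nTotal 179.97"
csv_text = "description,quantity,amount\n" + "\n".join(",".join(i) for i in items)
fixed_text = "\n".join(f"{d:<15}{q:>4}{a:>10}" for d, q, a in items)
FIELDS = {"description": (0, 15), "quantity": (15, 19), "amount": (19, 29)}

row_rows = row_match(page_text)
delimited_rows = delimited_extract(csv_text)
fixed_rows = fixed_width(fixed_text, FIELDS)
print(row_rows)
```

In this sketch all three functions return the same two rows. The "Invoice 1001" and "Total 179.97" lines are skipped by `row_match` because they do not fit the row pattern, which is the point of matching on the shape of a whole row.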
FYI
Older versions of Grooper included another table extract method: Header-Value. Tabular Layout was created as an improved version of the Header-Value method. However, both methods existed in some Grooper versions. In version 2023.1, Tabular Layout wholly replaced Header-Value.
Property Details
This section expands on the Grooper documentation for various Data Table properties.
Maximum Display Rows
The Maximum Display Rows property specifically has to do with how rows are displayed in the Data Viewer when a user executes the Review activity.
Imagine you have a data-dense document containing a table with several hundred rows. Grooper extracts the table and now it's time to present it to a data entry clerk in Review. It's going to take Grooper a while to load that Data Table. This can be an unnecessary lag point in the user's review experience.
The Maximum Display Rows property allows you to dynamically load rows. Instead of loading them all at once, you can restrict this to only load, say, 50 at a time. After you scroll to the bottom of the first 50 loaded rows, the next 50 will load, then the next, and so on until you reach the end of the table. This way, the user doesn't have to wait for the entire table to load up front and can start reviewing the extracted data sooner.
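The loading behavior described above can be sketched as a simple chunked generator in Python. This is a generic illustration of loading rows in batches, not how the Data Viewer is actually implemented; `load_in_chunks` and its `max_display_rows` parameter are hypothetical names chosen to mirror the property.

```python
from typing import Iterator, Sequence


def load_in_chunks(rows: Sequence[dict], max_display_rows: int = 50) -> Iterator[list[dict]]:
    """Yield the table's rows in batches of at most max_display_rows."""
    if max_display_rows < 1:
        raise ValueError("max_display_rows must be at least 1")
    for start in range(0, len(rows), max_display_rows):
        yield list(rows[start:start + max_display_rows])


# A 120-row table stands in for a large extracted Data Table.
table_rows = [{"line": n} for n in range(1, 121)]

loader = load_in_chunks(table_rows, max_display_rows=50)
first_batch = next(loader)   # shown immediately
second_batch = next(loader)  # fetched only when the user scrolls to the bottom

chunk_sizes = [len(chunk) for chunk in load_in_chunks(table_rows, 50)]
print(chunk_sizes)
```

Because the generator is lazy, nothing past the first batch is prepared until it is requested, which is why the reviewer can start working before the whole table has loaded.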
Glossary
Data Column: Data Columns represent columns in a table extracted from a document. They are added as child nodes of a Data Table. They define the type of data each column holds along with its data extraction properties.
- Data Columns are frequently referred to simply as "columns".
- In the context of reviewing data in a Data Viewer, a single Data Column instance in a single Data Table row is most frequently called a "cell".
Data Model: Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The configuration of the Data Model and its child Data Elements defines the data extraction logic and the settings for how data is reviewed in a Data Viewer.
Data Table: A Data Table is a Data Element specialized in extracting tabular data from documents (i.e. data formatted in rows and columns).
- The Data Table itself defines the "Table Extract Method". This is configured to determine the logic used to locate and return the table's rows.
- The table's columns are defined by adding Data Column nodes to the Data Table (as its children).
Data Type: Data Types are nodes used to extract text data from a document. Data Types have more capabilities than Value Readers. Data Types can collect results from multiple extractor sources, including a locally defined extractor, child extractor nodes, and referenced extractor nodes. Data Types can also collate results using Collation Providers to combine, sift and manipulate results further.
Delimited Extract: The Delimited Extract Table Extract Method extracts tabular data from a delimiter-separated text file, such as a CSV file.
Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:
- They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
- The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
- The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).
Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.
Fluid Layout: The Fluid Layout Table Extract Method will choose between Tabular Layout and Flow Layout configurations, depending on how labels are collected for a Document Type.
Grid Layout: The Grid Layout Table Extract Method uses the positional location of row and column headers to interpret where a tabular grid would be around each value in a table and extract values from each cell in the interpreted grid.
Node Tree: The Node Tree is the hierarchical list of Grooper node objects found in the left panel in the Design Page. It is the basis for navigation and creation in the Design Page.
Review: Review is an Activity that allows user-attended review of Grooper's results. This allows human operators to validate processed Batch Page and Batch Folder content using specialized user interfaces called "Viewers". Different kinds of Viewers assist users in reviewing Grooper's image processing, document classification, data extraction and operating document scanners.
Row Match: The Row Match Table Extract Method uses regular expression pattern matching to determine a table's structure based on the pattern of each row and extract cell data from each column.
Table Extract Method: A Table Extract Method defines the settings and logic for a Data Table to perform extraction. It is set by configuring the Extract Method property of the Data Table.
Table Extraction: "Table Extraction" refers to Grooper's ability to extract data from cells in tables on documents. This is accomplished by configuring the Data Table and its child Data Column elements in a Data Model.
Tabular Layout: The Tabular Layout Table Extract Method uses column header values determined by the Data Columns' Header Extractor results (or labels collected for the Data Columns when a Labeling Behavior is enabled) as well as Data Column Value Extractor results to model a table's structure and return its values.