Tabular Layout (Table Extract Method)

From Grooper Wiki

WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.


This article is about the current version of Grooper.

Note that some content may still need to be updated.


Tabular Layout is a Table Extract Method that models a table's structure and returns its values from two inputs: column header values, determined by the Data Columns' Header Extractor results (or by labels collected for the Data Columns when a Labeling Behavior is enabled), and the Data Columns' Value Extractor results.

Introduction

The Tabular Layout Table Extract Method is a powerful tool in Grooper for extracting structured tabular data from documents. It automatically detects table headers, rows, and footers using a combination of value extractors and layout analysis. Tabular Layout is ideal for documents where tables are clearly defined, such as invoices, statements, and reports.

Unlike other Table Extract Methods (such as Row Match or Delimited Extract), Tabular Layout leverages header and footer labels, supports multi-line and stacked layouts, and provides advanced configuration for handling complex table structures.

When to use

Tabular Layout is best used when:

  • Tables have clearly defined headers and rows.
  • You need to extract data from grid-based tables, including those with merged or stacked cells.
  • Tables may span multiple pages or regions.

Example: Use Tabular Layout to extract line items from an invoice where each row contains "Quantity," "Description," "Unit Price," and "Total," and headers are present.

Drawbacks:

  • Tabular Layout may be less effective for highly irregular tables or lists without clear headers.
  • For simple delimited data (e.g., CSV), Delimited Extract may be more efficient.
  • Requires well-defined header labels or extractors for best results.

What is a table?

A table in document processing is a structured arrangement of data in rows and columns. Its main components are:

  • Headers: The top section that labels each column (e.g., "Quantity", "Description").
  • Rows: Horizontal groupings of related data, each representing a record or item.
  • Columns: Vertical divisions, each capturing a specific type of data (e.g., price, date).
  • Footers: The bottom section, often used for totals or summary information.

Common use cases for tables in documents include:

  • Invoice line items
  • Transaction logs
  • Product lists
  • Financial summaries

Basic setup

Grooper must be able to detect the columns and rows of a table to extract its data. Tabular Layout does this in two parts. First, it identifies the column headers, which indicate where the columns are located on the document. Second, at least one Data Column must have a Value Extractor that returns a result on each row of the table, which tells Grooper where the rows are located.
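The two-part idea above can be illustrated outside of Grooper with a minimal Python sketch. It is not Grooper's implementation: the sample lines, the `find_columns` and `find_rows` helpers, and the use of plain character offsets are all assumptions made for illustration. Header text stands in for Header Extractor results, and a single regular expression stands in for one Data Column's Value Extractor.

```python
import re

# Hypothetical page text; a real document would come from OCR or native text.
PAGE_LINES = [
    "DESCRIPTION        QTY   TOTAL",
    "Widget A             2   10.00",
    "Widget B             1    4.50",
    "Thank you for your business",
]

HEADERS = ["DESCRIPTION", "QTY", "TOTAL"]   # stand-ins for Header Extractor results
ROW_ANCHOR = re.compile(r"\d+\.\d{2}\s*$")  # stand-in for one column's Value Extractor


def find_columns(header_line, headers):
    """Map each header found on the line to its starting character offset."""
    return {h: header_line.index(h) for h in headers if h in header_line}


def find_rows(lines, anchor):
    """Keep only the lines where the anchoring value pattern matches."""
    return [line for line in lines if anchor.search(line)]


columns = find_columns(PAGE_LINES[0], HEADERS)
rows = find_rows(PAGE_LINES[1:], ROW_ANCHOR)
print(columns)
print(rows)
```

In this sketch the headers give the horizontal positions of the columns, and the lines matched by the anchoring pattern give the vertical positions of the rows. The trailing "Thank you" line is ignored because the value pattern does not match it.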

Step 1: Create the Data Elements and select the Extract Method

These instructions assume you already have a Project set up in Grooper with a Content Model, Document Type, and Data Model.

  1. Right click on your Data Model.
  2. Hover over "Add" and select "Data Table..." from the fly out menu.
  3. When the "Add" window appears, enter a name for your Data Table in the Name property.
  4. When satisfied with the naming, click "Execute" to add the Data Table.
  5. Add the Data Columns as children of the Data Table using one of the following methods:
    • One at a time
      1. Right-click on the Data Table.
      2. Hover over "Add" and select "Data Column..." from the fly out menu.
      3. When the "Add" window appears, enter a name for your Data Column in the Name property.
      4. When satisfied, click "Execute" to create the Data Column.
      5. Repeat steps 1-4 to add as many Data Columns as you would like.
    • Multiple at once
      1. Right-click on the Data Table.
      2. Hover over "Contents" and click "Add Multiple Items..." from the fly out menu.
      3. When the "Add Multiple Items" window appears, make sure the Item Type property is set to Data Column.
      4. Click the "..." icon to the right of the Item Names property.
      5. When the Item Names window appears, type in the names you want to give to the Data Columns in the text box. Hit enter after each name.
      6. When finished, click "OK".
      7. Back on the Add Multiple Items window, click "Execute" to create the Data Columns.
  6. Next, select the Data Table in your Node Tree.
  7. Click the "☰" to the right of the Extract Method property.
  8. Click on "Tabular Layout" in the drop-down menu.
  9. Click the save icon at the top of the property grid to save your changes to the Data Table.

Step 2: Configure Header Extractors

Now that the Data Table's Extract Method is set to Tabular Layout, we need to give Grooper some information to determine where tables are located on a document. To do this, we first need to define where the columns of the table are. We can do this by setting a Header Extractor on each Data Column node under the Data Table, one per column header.

  1. Select the first Data Column under your Data Table in your node tree.
  2. Locate the Header Extractor property in the property grid and click the "☰" to the right of the property to open the drop-down menu.
  3. Select an Extractor to use to extract the header of the column.
  4. Configure that extractor to return the text of the corresponding column header on the document.
  5. Save any changes made to the Data Column.
  6. Repeat steps 1-5 for each Data Column in your Data Table.


Step 3: Assign Value Extractors to Data Columns

The Data Table's Extract Method is now set to Tabular Layout, and each Data Column has its header defined by a Header Extractor. Next, we need to give Grooper a little more context to figure out where the rows of our table are. We do this by setting Value Extractors on the Data Columns.

You will need to set a Value Extractor on at least one Data Column. Grooper will use that extractor to determine where the rows in your table are. If you are having issues with Grooper detecting the rows of the table accurately, you can add Value Extractors to other Data Columns to give Grooper more to work with. There are other ways to improve accuracy, which are discussed later in the article.

To set a Value Extractor on a Data Column, follow these instructions:

  1. Select a Data Column in your Node Tree.
  2. Set an extractor on the Value Extractor property to collect values located in the Table's column.
    • Pattern Match is commonly used for Value Extractors on Data Columns, but any extractor can be used.
    • You can also set the Value Extractor to a Reference.
  3. Save your changes.
  4. Set more Value Extractors to the other Data Columns if needed for accurate extraction.

Using a Header Row Extractor

A Header Row is the line (or lines) at the top of a table that contains the column labels, such as "Item No.", "Description", "Qty.", or "Total". In document processing, the header row provides essential context for identifying and aligning data in each column. Accurate header detection ensures that extracted values are mapped to the correct Data Columns, even when table layouts vary between documents.

A Header Row Extractor is a specialized Value Extractor that detects the entire header row at once, rather than relying on individual header extractors for each column. This approach is especially useful when:

  • Table headers span multiple lines or have complex formatting.
  • Column order varies between documents.
  • You want to simplify configuration and improve robustness for tables with dynamic layouts.

Using a Header Row Extractor can reduce manual setup, improve extraction accuracy, and make your solution more adaptable to different document types.

Why use a Header Row Extractor?

  • Use it when header rows follow a predictable format or pattern.
  • You can configure the headers in one place rather than setting a Header Extractor on each Data Column.
  • Matching the whole header row at once can discard false positive column header matches.
  • It can be a better way to take advantage of fuzzy RegEx.

Creating the Extractor

There are two main methods for creating a Header Row Extractor in Grooper:

  1. Using Named Groups in a pattern-based extractor.
  2. Using Child Extractors within a Data Type.

Either one will give you the same result. Which one you choose is completely your preference.

Using Named Groups

A Named Group is part of a regular expression pattern that captures a specific portion of text and assigns it a name. In Grooper, named groups are used to map header labels directly to Data Columns.

  • Named Group Syntax:
(?<Named_Group>RegEx Pattern)

For example, to capture a header row with "Item No.", "Description", and "Qty.", you might use:

(?<ItemNo>Item\s*No\.?)\s+(?<Description>Description)\s+(?<Qty>Qty\.?)
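To see how named groups expose each header separately, here is a small Python sketch of the same pattern. Python's `re` module writes named groups as `(?P<name>...)`, so the pattern is adapted to that spelling. The `header_spans` helper and the sample line are hypothetical and only illustrate the idea; they are not part of Grooper.

```python
import re

# Same header row pattern as above, using Python's (?P<name>...) spelling.
HEADER_ROW = re.compile(
    r"(?P<ItemNo>Item\s*No\.?)\s+(?P<Description>Description)\s+(?P<Qty>Qty\.?)"
)


def header_spans(line):
    """Return {group name: (start, end)} for each header, or {} if the row is absent."""
    match = HEADER_ROW.search(line)
    if match is None:
        return {}
    return {name: match.span(name) for name in match.groupdict()}


spans = header_spans("Item No.   Description      Qty.")
print(spans)
```

One match of the full row yields a separate position for every named group, which is what allows a single extractor to describe all column headers at once.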
  • Step-by-Step: Configuring a Header Row Extractor Using Named Groups
  1. Create a new Extractor Object such as a Data Type or Value Reader.
  2. Set the Value Extractor on the Extractor Object to a Pattern Match.
  3. Write a Regular Expression to return the full header row of the table.
  4. In the extractor's pattern, use named groups for each individual column header you want to detect.
    • Ensure each named group matches the corresponding column label in your table.
    • If a Data Column has a name with a space in it, use an underscore in place of the space in the group name.

The following RegEx pattern is used in the example below. If you would like to follow along, feel free to copy out the RegEx.

(?<Description>DESCRIPTION)\t
(?<Quantity>HRS / QTY)\t
(?<Unit_Price>RATE / PRICE)\t
(?<Line_Total>SUBTOTAL)
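The underscore naming rule can be sketched in Python as follows. This is an illustration under assumptions, not Grooper's matching logic: the pattern is joined onto one logical line and uses Python's `(?P<name>...)` spelling, the Data Column names are inferred from the group names, and `match_headers_to_columns` is a hypothetical helper.

```python
import re

# The example header row pattern, adapted to Python's named group spelling.
HEADER_ROW = re.compile(
    r"(?P<Description>DESCRIPTION)\t"
    r"(?P<Quantity>HRS / QTY)\t"
    r"(?P<Unit_Price>RATE / PRICE)\t"
    r"(?P<Line_Total>SUBTOTAL)"
)

# Assumed Data Column names; spaces become underscores in group names.
DATA_COLUMNS = ["Description", "Quantity", "Unit Price", "Line Total"]


def match_headers_to_columns(line, data_columns):
    """Pair each Data Column name with the header text its named group captured."""
    match = HEADER_ROW.search(line)
    if match is None:
        return {}
    by_group = {name.replace(" ", "_"): name for name in data_columns}
    return {
        by_group[group]: text
        for group, text in match.groupdict().items()
        if group in by_group
    }


result = match_headers_to_columns(
    "DESCRIPTION\tHRS / QTY\tRATE / PRICE\tSUBTOTAL", DATA_COLUMNS
)
print(result)
```

Here the group `Unit_Price` is paired with the Data Column named "Unit Price", mirroring the rule that an underscore in the group name stands in for a space in the column name.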

Using Named Child Extractors

Alternatively, you can use a Data Type with multiple child extractors—one for each header label. This method is ideal when header labels are complex or require different extraction logic.

  • How It Works:

    • Each child extractor is configured to match a specific header label.
    • The Collation Provider can be used to collate results from child extractors, referencing each by name.
    • The combined extractor is assigned as the Header Row Extractor.

  • Step-by-Step: Configuring a Header Row Extractor Using Child Extractors
  1. Create a new Data Type for the header row.
  2. Add a child extractor object (Data Type or Value Reader) for each column header (e.g., "Description", "Qty.", "Unit Price").
    • Ensure that each child extractor object is named exactly the same as the Data Column it corresponds with.
  3. Use a Collation Provider such as Ordered Array to collate the child extractors into a single result.
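The child extractor approach can be approximated in Python as a set of named patterns that must all match, in order, on one line. This is a rough sketch under assumptions: the `CHILD_EXTRACTORS` patterns and the `collate_in_order` helper are hypothetical, and the "all children, left to right" rule is only a simplified stand-in for what an Ordered Array collation does in Grooper.

```python
import re

# One pattern per header, keyed by the Data Column name it corresponds to.
CHILD_EXTRACTORS = {
    "Description": re.compile(r"DESCRIPTION"),
    "Quantity": re.compile(r"HRS\s*/\s*QTY"),
    "Unit Price": re.compile(r"RATE\s*/\s*PRICE"),
}


def collate_in_order(line, extractors):
    """Return {name: (start, end)} if every child matches left to right, else None."""
    spans = {}
    position = 0
    for name, pattern in extractors.items():
        match = pattern.search(line, position)  # only look to the right of the last hit
        if match is None:
            return None
        spans[name] = match.span()
        position = match.end()
    return spans


found = collate_in_order("DESCRIPTION   HRS / QTY   RATE / PRICE", CHILD_EXTRACTORS)
print(found)
```

Because each child is keyed by name, the result can be matched back to Data Columns by name, which is why the child extractor names need to agree with the Data Column names.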

How to Set the Header Row Extractor in Tabular Layout

Once you have created a Header Row Extractor, you need to set it on the Header Row Extractor property.

To use your Header Row Extractor in Tabular Layout:

  1. Select the Data Table node in your node tree.
  2. Set the "Extract Method" property to Tabular Layout if not already set.
  3. Expand the "Header Detection" property and locate "Header Row Extractor".
  4. Set the Header Row Extractor to a Reference.
  5. Assign your configured extractor (either with named groups or as a Data Type with child extractors) to the reference.
  6. Save your changes and run a test extraction to confirm the Header Row Extractor works.

Problems with unlined tables

When a table has no lines marking where its rows and columns are, Grooper has a harder time figuring out where the columns are. The named child extractors method of configuring a Header Row Extractor relies on lines to determine columns, so you can run into extraction issues when the table is unlined.

If you have a table without lines on your documents that you want extracted, it is recommended to use a different method for extracting the data. You may want to set Header Extractors on each Data Column, or look into using Label Sets to detect headers.

Using label sets within Tabular Layout

A Label Set is a group of labels associated with a specific Document Type. Each label represents a possible way a data element might be named or presented in a document. For example, a Label Set for invoices might include "Invoice Number", "Inv #", and "Bill No.", all mapped to the same Data Field.

Label Sets are managed using the "Labels" tab on the Design Page for any Content Type with a Labeling Behavior enabled.

Why use Label Sets

Label Sets enable header- and footer-driven detection for tables. Tabular Layout will:

  • Read table headers from the Label Set to locate and align columns (Header Detection).
  • Read optional footer labels to establish the end of the table (Footer Detection).

This approach is ideal for semi-structured documents where the same data appears with different labels or column order.

How Tabular Layout uses Label Sets

  • Header detection: The engine reads table and column labels from the Label Set to build a Table Header Collection and snap header cells to geometric bounds. This improves alignment for value extraction across rows.
  • Footer detection: When a table's Footer label exists, it establishes the table's end line. Tabular Layout stops row detection above the footer and can optionally capture the footer row as data.
  • Column alignment: With labels present, columns are aligned to their labeled header bounds—even when column order varies—yielding consistent cell extraction.
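The header detection and column alignment ideas above can be sketched in Python. This is a simplified illustration, not Grooper's engine: the `LABEL_SETS` dictionary, the vendor names, the tab-delimited rows, and the `locate_columns` and `read_row` helpers are all assumptions. Real alignment works on geometric bounds rather than text positions.

```python
# Hypothetical Label Sets: one per Document Type, mapping
# Data Column name -> header label text as it appears on that document.
LABEL_SETS = {
    "Vendor A": {"Description": "DESCRIPTION", "Quantity": "QTY", "Line Total": "TOTAL"},
    "Vendor B": {"Description": "Item", "Quantity": "Units", "Line Total": "Amount"},
}


def locate_columns(header_line, labels):
    """Return Data Column names ordered left to right by where their label appears."""
    found = [(header_line.find(label), column) for column, label in labels.items()]
    return [column for position, column in sorted(found) if position >= 0]


def read_row(row, columns):
    """Split a tab-delimited row and pair each cell with the column in the same position."""
    return dict(zip(columns, row.split("\t")))


order_b = locate_columns("Amount\tItem\tUnits", LABEL_SETS["Vendor B"])
row_b = read_row("4.50\tWidget B\t1", order_b)
print(order_b)
print(row_b)
```

Even though the second document lists its columns in a different order and with different wording, each value still lands in the right Data Column because the columns are located by their labels rather than by fixed position.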

Tips

  • Prefer header labels that cover the full header cell without vertical overlap.
  • Use "Dynamic Column Ordering" on the Data Table when documents rearrange columns.
  • For unlabeled or irregular tables, rely more on each column’s "Value Extractor" and the Tabular Layout Options fallback modes.

Benefits:

  • Rapid onboarding of new document types.
  • Increased extraction accuracy for tables with variable layouts.
  • Enables label-driven classification and extraction.

Drawbacks:

  • Requires consistent labeling on documents.
  • May need supplemental extractors for unlabeled data.

Pros and cons vs. traditional Tabular Layout

Pros

  • Faster onboarding: define labels once per Content Type; minimal per-document tuning.
  • Higher accuracy: header cells and footer rows are detected via label text (reduces false positives).
  • Supports dynamic column order: columns are aligned to their labeled headers rather than fixed positions.
  • Works with multi-line/stacked headers when labels identify the full header region.

Cons

  • Requires labeled documents: if a table has no headers or footers, or its labeling is inconsistent, you must rely more on extractors.
  • Label maintenance: changes in label wording/layout across sources require Label Set updates.
  • Overlapping header text can reduce detection accuracy; avoid vertically overlapping header labels.

How to configure:

First you will need to configure the Labeling Behavior on your Content Type (usually the Content Model). For instructions on how to add and configure a Labeling Behavior, please take a look at our wiki article on the Labeling Behavior.

Collecting Labels

Rather than setting up an extractor to collect the header labels of our table, we can use Label Sets to collect the labels instead. Label Sets are set per Document Type, so depending on how many Document Types you have in your Project, it may take more or less time to set up.

  1. Navigate to your Content Type where the Labeling Behavior is set.
  2. Click over to the "Labels" tab.
  3. If needed, select the Batch you will be working with in your Batch Viewer.
  4. Assign Document Types to (classify) the documents in your Batch.
  5. Navigate to the Data Table in your node tree.
  6. Set the Extract Method to Tabular Layout.
    • This is required for you to be able to see the labels on the "Labels" tab of your Content Type.
  7. Return to the "Labels" tab on your Content Type. You should now see labels available for your Data Table and Data Columns.
  8. Collect the full header label for the Data Table label. To do this, click inside the text box next to the Data Table label, click the rubber band icon at the top of the Labels panel, and then draw a box around the header labels of your table in the Document Viewer.
    • While not strictly required, it is considered best practice to always collect a header label for the Data Table label.
    • When you set your Data Column labels, Grooper will only look inside the set Header Label for matches. The Data Table label acts as a parent label for all Data Columns.
  9. Collect the individual column header labels for each Data Column. There are three different ways to do so:
    1. Type in the text of the label on the document.
    2. Double click on the label in the Document Viewer.
    3. Click the rubber band icon and draw a box around the label on the Document Viewer.
  10. Select the second document in your Batch with a different Document Type assigned.
  11. Repeat steps 8-10 until all different Document Types in your Batch have labels.


Setting a Data Column Extractor

Once you have your labels collected, you configure everything else like you would for regular Tabular Layout Extraction. You'll need to configure an extractor on at least one Data Column for Grooper to be able to detect the rows of the table. Without an extractor on a Data Column, Grooper will not be able to detect the rows of the table and so will have no context as to where the table begins and ends.

  1. Select a Data Column in your Node Tree.
  2. Set an extractor on the Value Extractor property to collect values located in the Table's column.
    • Pattern Match is commonly used for Value Extractors on Data Columns, but any extractor can be used.
    • You can also set the Value Extractor to a Reference.
  3. Save your changes.
  4. Set more Value Extractors to the other Data Columns if needed for accurate extraction.

Using Footer labels

Even after setting an extractor on a Data Column, Grooper may not be able to accurately detect rows. Often you might find that Grooper detects a row in a set of data that appears after the table on the document has ended. We can use Footer Labels to tell Grooper where to stop looking for rows, indicating where the table ends.
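The effect of a footer label on row detection can be sketched in Python. This is a hypothetical illustration, not Grooper's logic: `ROW_ANCHOR` stands in for a Data Column's Value Extractor, `FOOTER_LABEL` stands in for a collected Footer Label, and the sample lines and `detect_rows` helper are invented for the example.

```python
import re

ROW_ANCHOR = re.compile(r"\d+\.\d{2}\s*$")  # stand-in for a column's Value Extractor
FOOTER_LABEL = "Subtotal"                   # stand-in for a collected Footer Label

LINES = [
    "Widget A    2   10.00",
    "Widget B    1    4.50",
    "Subtotal        14.50",
    "Shipping         5.00",
]


def detect_rows(lines, footer_label=None):
    """Collect lines matching the row anchor, stopping at the footer label if one is given."""
    rows = []
    for line in lines:
        if footer_label is not None and footer_label in line:
            break  # the footer marks the end of the table
        if ROW_ANCHOR.search(line):
            rows.append(line)
    return rows


without_footer = detect_rows(LINES)
with_footer = detect_rows(LINES, FOOTER_LABEL)
print(len(without_footer), len(with_footer))
```

Without the footer label, the "Subtotal" and "Shipping" lines look like rows because they also end in an amount. With the footer label, detection stops before them.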

To add a Footer Label:

  1. Return to the "Labels" tab on the Content Model.
  2. Click inside of the text box next to the Data Table label.
  3. In the Labels panel toolbar at the top, click the "Add a New Label" icon.
  4. Click "Add Footer".
  5. You should see a new label as a child of the Data Table Label.
  6. Collect a text segment on the document for the Footer Label that will indicate where the table ends.
  7. Test your extraction to verify accuracy.

The scope of the Footer Label is important. Make sure you add your Footer Label as a child of the Data Table rather than the Data Model.