Tabular Layout (Table Extract Method)

From Grooper Wiki

Revision as of 17:20, 18 November 2025

WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.


This article is about the current version of Grooper.

Note that some content may still need to be updated.

Tabular Layout is a Table Extract Method that models a table's structure and returns its values. It does so using column header values, determined by the Data Columns' Header Extractor results (or by labels collected for the Data Columns when a Labeling Behavior is enabled), together with the Data Columns' Value Extractor results.

Introduction

The Tabular Layout Table Extract Method is a powerful tool in Grooper for extracting structured tabular data from documents. It automatically detects table headers, rows, and footers using a combination of value extractors and layout analysis. Tabular Layout is ideal for documents where tables are clearly defined, such as invoices, statements, and reports.

Unlike other Table Extract Methods (such as Row Match or Delimited Extract), Tabular Layout leverages header and footer labels, supports multi-line and stacked layouts, and provides advanced configuration for handling complex table structures.

When to use

Tabular Layout is best used when:

  • Tables have clearly defined headers and rows.
  • You need to extract data from grid-based tables, including those with merged or stacked cells.
  • Tables may span multiple pages or regions.

Example: Use Tabular Layout to extract line items from an invoice where each row contains "Quantity," "Description," "Unit Price," and "Total," and headers are present.

Drawbacks:

  • Tabular Layout may be less effective for highly irregular tables or lists without clear headers.
  • For simple delimited data (e.g., CSV), Delimited Extract may be more efficient.
  • Requires well-defined header labels or extractors for best results.

What is a table?

A table in document processing is a structured arrangement of data in rows and columns. Its main components are:

  • Headers: The top section that labels each column (e.g., "Quantity", "Description").
  • Rows: Horizontal groupings of related data, each representing a record or item.
  • Columns: Vertical divisions, each capturing a specific type of data (e.g., price, date).
  • Footers: The bottom section, often used for totals or summary information.


Common use cases for tables in documents include:

  • Invoice line items
  • Transaction logs
  • Product lists
  • Financial summaries

Basic setup

Grooper must be able to detect the columns and rows of a table to extract its data. Tabular Layout does this by first identifying the column headers, which indicate where the columns are located on the document. Then at least one Data Column must have a Value Extractor that returns a result on each row of the table, which tells Grooper where the rows are located.

Step 1: Create the Data Elements and select the Extract Method

These instructions assume that you already have a Project set up in Grooper with a Content Model, Document Type, and Data Model.

  1. Right-click on your Data Model.
  2. Hover over "Add" and select "Data Table..." from the fly out menu.
  3. When the "Add" window appears, enter a name for your Data Table in the Name property.
  4. When satisfied with the naming, click "Execute" to add the Data Table.
  5. Add the Data Columns as children of the Data Table using one of the following methods:
    • One at a time
      1. Right-click on the Data Table.
      2. Hover over "Add" and select "Data Column..." from the fly out menu.
      3. When the "Add" window appears, enter a name for your Data Column in the Name property.
      4. When satisfied, click "Execute" to create the Data Column.
      5. Repeat steps 1-4 to add as many Data Columns as you would like.
    • Multiple at once
      1. Right-click on the Data Table.
      2. Hover over "Contents" and click "Add Multiple Items..." from the fly out menu.
      3. When the "Add Multiple Items" window appears, make sure the Item Type property is set to Data Column.
      4. Click the "..." icon to the right of the Item Names property.
      5. When the Item Names window appears, type the names you want to give the Data Columns in the text box, one name per line. Press Enter after each name.
      6. When finished, click "OK".
      7. Back on the Add Multiple Items window, click "Execute" to create the Data Columns.
  6. Next, select the Data Table in your Node Tree.
  7. Click the "☰" to the right of the Extract Method property.
  8. Click on "Tabular Layout" in the drop-down menu.
  9. Click the save icon at the top of the property grid to save your changes to the Data Table.

Step 2: Configure Header Extractors

Now that the Data Table's Extract Method is set to Tabular Layout, we need to give Grooper some information to determine where tables are located on a document. To do this, we first need to define where the columns of the table are. We can do this by setting a Header Extractor on each Data Column node under the Data Table, one for each column header.

  1. Select the first Data Column under your Data Table in your node tree.
  2. Locate the Header Extractor property in the property grid and click the "☰" to the right of the property to open the drop-down list.
  3. Select an Extractor to use to extract the header of the column.
  4. Configure that extractor to return the text of the corresponding column header on the document.
  5. Save any changes made to the Data Column.
  6. Repeat steps 1-5 for each Data Column in your Data Table.
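
As a rough illustration of the idea behind per-column Header Extractors, the Python sketch below gives each column its own header pattern and records where each label sits on a header line. It is not Grooper's implementation. The column names, patterns, and sample text are all assumptions made for the example.

```python
import re

# Hypothetical header patterns, one per Data Column (assumed names and labels).
HEADER_PATTERNS = {
    "Quantity": r"Qty\.?|Quantity",
    "Description": r"Description",
    "Unit Price": r"Unit\s+Price",
    "Total": r"Total",
}

def locate_headers(line, patterns):
    """Return {column name: (start, end)} for each header label found on the line."""
    positions = {}
    for column, pattern in patterns.items():
        match = re.search(pattern, line, flags=re.IGNORECASE)
        if match:
            positions[column] = match.span()
    return positions

# Sample header line, assumed for illustration.
header_line = "Qty   Description          Unit Price   Total"
columns = locate_headers(header_line, HEADER_PATTERNS)

# List the columns left to right, as they appear on the page.
for name, (start, end) in sorted(columns.items(), key=lambda item: item[1][0]):
    print(f"{name}: {header_line[start:end]!r}")
```

Each pattern is deliberately tolerant of small label variations (for example "Qty" versus "Qty."), which is the same reason a Header Extractor is configured to return the header text rather than to match one fixed string.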


Step 3: Assign Value Extractors to Data Columns

The Data Table's Extract Method has been set to Tabular Layout, and the headers have been defined with a Header Extractor on each Data Column. Next, we need to give Grooper a little more context to figure out where the rows of our table are. We do this by setting Value Extractors on the Data Columns.

You will need to set a Value Extractor on at least one Data Column. Grooper will use that extractor to determine where the rows in your table are. If you are having issues with Grooper detecting the rows of the table accurately, you can add Value Extractors to other Data Columns to give Grooper more to work with. There are other ways to improve accuracy, which are discussed later in the article.

To set a Value Extractor on a Data Column, follow these instructions:

  1. Select a Data Column in your Node Tree.
  2. Set an extractor on the Value Extractor property to collect values located in the Table's column.
    • Pattern Match is commonly used for Value Extractors on Data Columns, but any extractor can be used.
    • You can also set the Value Extractor to a Reference.
  3. Save your changes.
  4. Set Value Extractors on other Data Columns if needed for accurate extraction.
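
To make the role of the Value Extractor more concrete, here is a minimal Python sketch, assuming a plain-text table and a hypothetical currency pattern for a "Total" column. Every line on which the pattern returns a result is treated as a row. This only illustrates the concept and does not reflect how Grooper works internally.

```python
import re

# Hypothetical value pattern for a "Total" column: a currency amount
# such as 20.00 or 1,250.00 at the end of the line.
TOTAL_PATTERN = re.compile(r"\d{1,3}(?:,\d{3})*\.\d{2}\s*$")

def find_rows(lines, value_pattern):
    """Return the lines on which the value pattern returns a result."""
    return [line for line in lines if value_pattern.search(line)]

# Sample page text, assumed for illustration.
page_text = [
    "Qty   Description        Unit Price   Total",
    "2     Widget             10.00        20.00",
    "10    Gadget, large      125.00       1,250.00",
    "Thank you for your business",
]

rows = find_rows(page_text, TOTAL_PATTERN)
for row in rows:
    print(row)
```

The header line and the closing remark are skipped because neither ends in a currency amount. A pattern that is too loose would start matching non-row lines, which is one reason adding Value Extractors to more Data Columns can help when rows are detected inaccurately.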

Using a header row extractor

A Header Row is the line (or lines) at the top of a table that contains the column labels, such as "Item No.", "Description", "Qty.", or "Total". In document processing, the header row provides essential context for identifying and aligning data in each column. Accurate header detection ensures that extracted values are mapped to the correct Data Columns, even when table layouts vary between documents.


A Header Row Extractor is a specialized Value Extractor that detects the entire header row at once, rather than relying on individual header extractors for each column. This approach is especially useful when:

  • Table headers span multiple lines or have complex formatting.
  • Column order varies between documents.
  • You want to simplify configuration and improve robustness for tables with dynamic layouts.

Using a Header Row Extractor can reduce manual setup, improve extraction accuracy, and make your solution more adaptable to different document types.

Why use a Header Row Extractor?

  • It works well when header rows follow a predictable format or pattern.
  • You can configure all headers in one place rather than setting a Header Extractor on each Data Column.
  • It can discard false-positive column matches.
  • It can be a better way to take advantage of fuzzy RegEx.

Creating the Extractor

There are two main methods for creating a Header Row Extractor in Grooper:

  1. Using Named Groups in a pattern-based extractor.
  2. Using Child Extractors within a Data Type.

Either method gives the same result. Which one you choose is a matter of preference.

Using Named Groups

A Named Group is part of a regular expression pattern that captures a specific portion of text and assigns it a name. In Grooper, named groups are used to map header labels directly to Data Columns.

Named Group Syntax:

(?<Named_Group>RegEx Pattern)

For example, to capture a header row with "Item No.", "Description", and "Qty.", you might use:

(?<ItemNo>Item\s*No\.?)\s+(?<Description>Description)\s+(?<Qty>Qty\.?)

Step-by-Step: Configuring a Header Row Extractor Using Named Groups

  1. Create a new Extractor Object such as a Data Type or Value Reader.
  2. Set the Value Extractor on the Extractor Object to a Pattern Match.
  3. Write a Regular Expression to return the full header row of the table.
  4. In the extractor's pattern, use named groups for each individual column header you want to detect.
    • Ensure the pattern inside each named group matches the corresponding column label in your table, and name the group after the Data Column it maps to.
    • If a Data Column has a name with a space in it, use an underscore in place of the space in the group name.
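
The named-group approach can be tried outside Grooper with a short Python sketch. Note that Python's `re` module spells named groups `(?P<Name>...)` rather than the `(?<Name>...)` form shown above, so the example pattern is adapted accordingly. The `Unit_Price` group, the sample header line, and the `header_columns` helper are hypothetical additions used to show the underscore convention. The mapping from group name to column name is a simplified assumption, not Grooper's actual logic.

```python
import re

# The example header row pattern, rewritten with Python's (?P<name>...) spelling,
# plus a hypothetical "Unit_Price" group for a column named "Unit Price".
HEADER_ROW = re.compile(
    r"(?P<ItemNo>Item\s*No\.?)\s+"
    r"(?P<Description>Description)\s+"
    r"(?P<Qty>Qty\.?)\s+"
    r"(?P<Unit_Price>Unit\s+Price)"
)

def header_columns(line):
    """Map each named group to a column name and the character span of its label."""
    match = HEADER_ROW.search(line)
    if match is None:
        return {}
    # Underscores in group names stand in for spaces in column names.
    return {
        group.replace("_", " "): match.span(group)
        for group in match.groupdict()
    }

# Sample header line, assumed for illustration.
line = "Item No.   Description      Qty.   Unit Price"
for column, (start, end) in header_columns(line).items():
    print(column, repr(line[start:end]))
```

Because the whole row must match as one unit, a stray "Description" elsewhere on the page would not be picked up as a column header, which illustrates how a header row pattern can discard false-positive matches.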