Tabular Layout (Table Extract Method)

{{AutoVersion}}

<blockquote>{{#lst:Glossary|Tabular Layout}}</blockquote>

{|class="download-box"
|
[[File:Asset 22@4x.png]]
|
You may download the ZIP(s) below and upload them into your own Grooper environment (version 2025). The first contains one or more '''Batches''' of sample documents. The second contains one or more '''Projects''' with resources used in examples throughout this article.
* [[Media:2025 Wiki Tabular Layout Batches.zip]]
* [[Media:2025 Wiki Tabular Layout Project.zip]]
|}


== Introduction ==
'''Drawbacks:'''
* Tabular Layout may be less effective for highly irregular tables or lists without clear headers.
* For simple delimited data (e.g., CSV), [[Delimited Extract]] may be more efficient.
* Requires well-defined header labels or extractors for best results.

* Product lists
* Financial summaries
== Basic setup ==
Grooper must be able to detect a table's columns and rows to extract its data. Tabular Layout does this by identifying the column headers, which indicate where the columns are located on the document. At least one [[Data Column]] must also have a Value Extractor that returns a result on each row of the table, which tells Grooper where the rows are located.
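To make the two-stage idea concrete, here is a minimal Python sketch. It is not Grooper's implementation. It assumes a hypothetical list of OCR words with x/y coordinates, uses the header words to locate columns, and then uses a value pattern on one column to locate rows. All names and coordinates are made up for illustration, and columns are matched by exact x position to keep the example short.

```python
import re

# Hypothetical OCR output: (text, x, y), with y increasing down the page.
words = [
    ("Description", 10, 100), ("Qty", 200, 100), ("Total", 300, 100),
    ("Widget", 10, 120), ("2", 200, 120), ("19.98", 300, 120),
    ("Gadget", 10, 140), ("1", 200, 140), ("5.00", 300, 140),
]

header_labels = ["Description", "Qty", "Total"]

# Stage 1: the header words tell us where each column sits horizontally.
columns = {text: x for text, x, y in words if text in header_labels}
header_y = max(y for text, x, y in words if text in header_labels)

# Stage 2: a value pattern on one column ("Total") yields one hit per row.
amount = re.compile(r"^\d+\.\d{2}$")
row_ys = sorted(
    y for text, x, y in words
    if y > header_y and x == columns["Total"] and amount.match(text)
)

# Assemble cells: each word on a row goes to the column sharing its x position.
x_to_column = {x: name for name, x in columns.items()}
rows = [
    {x_to_column[x]: text for text, x, y in words if y == row_y}
    for row_y in row_ys
]
```

In this sketch only the "Total" column has a value pattern, yet all three cells of each row are filled in, because the header positions supply the columns and the pattern hits supply the rows.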
'''Step 1: Create the Data Elements and select the Extract Method'''
These instructions assume you already have a Project in Grooper with a [[Content Model]], [[Document Type]], and [[Data Model]] created.
# Right-click on your Data Model.
# Hover over "Add" and select "Data Table..." from the fly out menu.
# When the "Add" window appears, enter a name for your [[Data Table]] in the Name property.
# When satisfied with the naming, click "Execute" to add the Data Table.
# Add the Data Columns as children of the Data Table using one of the following methods:
#* '''One at a time'''
#*# Right-click on the Data Table.
#*# Hover over "Add" and select "Data Column..." from the fly out menu.
#*# When the "Add" window appears, enter a name for your Data Column in the Name property.
#*# When satisfied, click "Execute" to create the Data Column.
#*# Repeat steps 1-4 to add as many Data Columns as you would like.
#* '''Multiple at once'''
#*# Right-click on the Data Table.
#*# Hover over "Contents" and click "Add Multiple Items..." from the fly out menu.
#*# When the "Add Multiple Items" window appears, make sure the Item Type property is set to Data Column.
#*# Click the "..." icon to the right of the Item Names property.
#*# When the Item Names window appears, type in the names you want to give to the Data Columns in the text box. Hit enter after each name.
#*# When finished, click "OK".
#*# Back on the Add Multiple Items window, click "Execute" to create the Data Columns.
# Next, select the Data Table in your Node Tree.
# Click the "☰" to the right of the Extract Method property.
# Click on "Tabular Layout" in the drop-down menu.
# Click the save icon at the top of the property grid to save your changes to the Data Table.
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.7777777777777777; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmhktdxz400et020j43kzjn0q?embed_v=2&utm_source=embed" loading="lazy" title="01 Add Data Elements and Set Extract Method" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
'''Step 2: Configure Header Extractors'''
Now that the Data Table's Extract Method is set to Tabular Layout, we need to give Grooper some information to determine where tables are located on a document. First, we need to define where the columns of the table are. We do this by setting a Header Extractor on each [[Data Column]] node under the Data Table, one for each column header.
# Select the first Data Column under your Data Table in your node tree.
# Locate the Header Extractor property in the property grid and click the "☰" to the right of the property to access the drop down.
# Select an Extractor to use to extract the header of the column.
# Configure that extractor to return the text of the corresponding column header on the document.
# Save any changes made to the Data Column.
# Repeat steps 1-5 for each Data Column in your Data Table.
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.7777777777777777; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmhl5a5pl00d1yb0klng8gs3s?embed_v=2&utm_source=embed" loading="lazy" title="02 Setting the Header Extractors" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
'''Step 3: Assign Value Extractors to Data Columns'''
The Data Table's Extract Method has been set to Tabular Layout, and each Data Column has a Header Extractor that defines its header. Next, we need to give Grooper a little more context to figure out where the rows of our table are. We do this by setting Value Extractors on the Data Columns.
You will need to set a Value Extractor on at least one Data Column. Grooper will use that extractor to determine where the rows in your table are. If you are having issues with Grooper detecting the rows of the table accurately, you can add Value Extractors to other Data Columns to give Grooper more to work with. There are other ways to improve accuracy, which are discussed later in the article.
To set a Value Extractor on a Data Column, follow these instructions:
# Select a Data Column in your Node Tree.
# Set an extractor on the Value Extractor property to collect values located in the Table's column.
#* [[Pattern Match]] is commonly used for Value Extractors on Data Columns, but any extractor can be used.
#* You can also set the Value Extractor to a [[Reference]].
# Save your changes.
# Set more Value Extractors to the other Data Columns if needed for accurate extraction.
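As a rough illustration of what a pattern-based Value Extractor contributes to row detection, the Python sketch below applies a currency-style regular expression to a hypothetical list of text lines from one column. The pattern and sample lines are assumptions made for this example, not values from a Grooper Project.

```python
import re

# Hypothetical text lines found in the region of a "Total" column.
lines = ["SUBTOTAL", "1,250.00", "75.50", "Thank you", "3.00"]

# Currency-like values: optional thousands separators and two decimals.
amount_pattern = re.compile(r"^\d{1,3}(?:,\d{3})*\.\d{2}$")

# Each matching line index is a candidate row position.
row_indexes = [i for i, line in enumerate(lines) if amount_pattern.match(line)]
```

The header text and the stray note are skipped because they do not fit the pattern, which is why a value pattern that is specific to the column tends to give cleaner row positions.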
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.7777777777777777; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmhw5fn1700jf0h0i92lh0xgd?embed_v=2&utm_source=embed" loading="lazy" title="03 Assigning Value Extractors to Data Columns" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
== Using a Header Row Extractor ==
A '''Header Row''' is the line (or lines) at the top of a table that contains the column labels, such as "Item No.", "Description", "Qty.", or "Total". In document processing, the header row provides essential context for identifying and aligning data in each column. Accurate header detection ensures that extracted values are mapped to the correct [[Data Column]]s, even when table layouts vary between documents.
A '''Header Row Extractor''' is a specialized [[Value Extractor]] that detects the entire header row at once, rather than relying on individual header extractors for each column. This approach is especially useful when:
* Table headers span multiple lines or have complex formatting.
* Column order varies between documents.
* You want to simplify configuration and improve robustness for tables with dynamic layouts.
Using a Header Row Extractor can reduce manual setup, improve extraction accuracy, and make your solution more adaptable to different document types.
'''Why use a Header Row Extractor?'''
* It works well when header rows follow a predictable format or pattern.
* You can configure all headers in one place rather than setting a Header Extractor on each Data Column.
* It can help discard false positive column matches.
* It can be a better way to take advantage of fuzzy RegEx.
=== Creating the Extractor ===
There are two main methods for creating a Header Row Extractor in Grooper:
# '''Using Named Groups''' in a pattern-based extractor.
# '''Using Child Extractors''' within a [[Data Type]].
Either method gives you the same result. Which one you choose is a matter of preference.
==== Using Named Groups ====
A '''Named Group''' is part of a regular expression pattern that captures a specific portion of text and assigns it a name. In Grooper, named groups are used to map header labels directly to Data Columns.
*'''Named Group Syntax:'''
<pre>
(?<Named_Group>RegEx Pattern)
</pre>
For example, to capture a header row with "Item No.", "Description", and "Qty.", you might use:
<pre>
(?<ItemNo>Item\s*No\.?)\s+
(?<Description>Description)\s+
(?<Qty>Qty\.?)
</pre>
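If you want to experiment with this pattern outside Grooper, the following is a minimal Python sketch. Note that Python's re module writes named groups as (?P<Name>...) rather than the (?<Name>...) form shown above. The sample header line is hypothetical.

```python
import re

# Python's re module writes named groups as (?P<name>...); the pattern
# inside each group is the same as in the example above.
header_pattern = re.compile(
    r"(?P<ItemNo>Item\s*No\.?)\s+"
    r"(?P<Description>Description)\s+"
    r"(?P<Qty>Qty\.?)"
)

# Hypothetical header line as it might appear in a document's text.
line = "Item No.   Description   Qty."

match = header_pattern.search(line)

# groupdict() maps each group name to the header text it captured.
headers = match.groupdict()

# span() gives the character range of one header, a rough column position.
qty_span = match.span("Qty")
```

Each group name gives you both the matched header text and its position in the line, which is the information a header-driven table extraction needs for each column.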
*'''Step-by-Step: Configuring a Header Row Extractor Using Named Groups'''
# Create a new Extractor Object such as a [[Data Type]] or [[Value Reader]].
# Set the Value Extractor on the Extractor Object to a [[Pattern Match]].
# Write a Regular Expression to return the full header row of the table.
# In the extractor's pattern, use named groups for each individual column header you want to detect.
#* Ensure each named group matches the corresponding column label in your table.
#* If a Data Column has a name with a space in it, use an underscore in place of the space in the group name.
The following RegEx pattern is used in the example below. If you would like to follow along, feel free to copy out the RegEx.
<pre>
(?<Description>DESCRIPTION)\t
(?<Quantity>HRS / QTY)\t
(?<Unit_Price>RATE / PRICE)\t
(?<Line_Total>SUBTOTAL)
</pre>
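The naming rule (underscores in group names standing in for spaces in Data Column names) can be sketched in Python as follows. This only illustrates the convention; how Grooper performs the mapping internally is not shown here. The tab-separated header line is a hypothetical sample, and the named groups use Python's (?P<Name>...) spelling.

```python
import re

header_pattern = re.compile(
    r"(?P<Description>DESCRIPTION)\t"
    r"(?P<Quantity>HRS / QTY)\t"
    r"(?P<Unit_Price>RATE / PRICE)\t"
    r"(?P<Line_Total>SUBTOTAL)"
)

# Hypothetical tab-separated header row text.
line = "DESCRIPTION\tHRS / QTY\tRATE / PRICE\tSUBTOTAL"

match = header_pattern.search(line)

# Translate group names back to column names: underscore becomes space.
column_headers = {
    name.replace("_", " "): text for name, text in match.groupdict().items()
}
```

Here the group Unit_Price lines up with a Data Column named "Unit Price" even though the header text on the document reads "RATE / PRICE".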
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.7777777777777777; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmi4seqx304400z0isqgy56h3?embed_v=2&utm_source=embed" loading="lazy" title="04 Using Named Groups" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
==== Using Named Child Extractors ====
Alternatively, you can use a [[Data Type]] with multiple child extractors, one for each header label. This method is ideal when header labels are complex or require different extraction logic.
*'''How It Works:'''
** Each child extractor is configured to match a specific header label.
** The [[Collation Provider]] can be used to collate results from child extractors, referencing each by name.
** The combined extractor is assigned as the Header Row Extractor.
*'''Step-by-Step: Configuring a Header Row Extractor Using Child Extractors'''
# Create a new [[Data Type]] for the header row.
# Add a child extractor object (Data Type or Value Reader) for each column header (e.g., "Description", "Qty.", "Unit Price").
#* Ensure that each child extractor object is named exactly the same as the Data Column it corresponds with.
# Use a [[Collation Provider]] such as [[Ordered Array]] to collate the child extractors into a single result.
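The idea of collating named child results into one header row can be approximated in Python as shown below. This is a simplified sketch under the assumption that an ordered collation means every child must match, left to right, on the same line. The function name collate_in_order and the sample patterns are hypothetical and are not Grooper APIs.

```python
import re

# One hypothetical child pattern per Data Column, keyed by column name.
child_patterns = {
    "Description": re.compile(r"Description"),
    "Qty": re.compile(r"Qty\.?"),
    "Unit Price": re.compile(r"Unit\s+Price"),
}

def collate_in_order(line, patterns):
    """Return {name: (start, end)} if every child matches left to right, else None."""
    spans = {}
    position = 0
    for name, pattern in patterns.items():
        # Search only to the right of the previous child's match.
        match = pattern.search(line, position)
        if match is None:
            return None
        spans[name] = match.span()
        position = match.end()
    return spans

header_line = "Description      Qty.   Unit Price"
header_spans = collate_in_order(header_line, child_patterns)

# The labels appear here too, but not in the expected order.
not_a_header = collate_in_order("Qty. of items in Description", child_patterns)
```

Because each result is stored under the child's name, the names can be matched to Data Columns, which is why the child extractors need the same names as their columns.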
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.7777777777777777; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmi63sm2k01aw280hq1hq7x6d?embed_v=2&utm_source=embed" loading="lazy" title="05 Using Named Child Extractors" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
=== How to Set the Header Row Extractor in Tabular Layout ===
Once you have created a Header Row Extractor, you need to assign it to the Header Row Extractor property.
To use your Header Row Extractor in Tabular Layout:
# Select the [[Data Table]] node in your node tree.
# Set the "Extract Method" property to '''Tabular Layout''' if not already set.
# Expand the "Header Detection" property and locate "Header Row Extractor".
# Set the Header Row Extractor to a Reference.
# Assign your configured extractor (either with named groups or as a Data Type with child extractors) to the reference.
# Save your changes and run a test extraction to confirm the Header Row Extractor works.
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.7777777777777777; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmi65d81104qm090i2boy76ng?embed_v=2&utm_source=embed" loading="lazy" title="06 Setting the Header Row Extractor" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
=== Problems with unlined tables ===
When a table has no lines marking its rows and columns, Grooper has a harder time determining where the columns are. A Header Row Extractor configured with named child extractors relies on those lines to determine columns, so you can run into extraction issues when they are absent.
If you need to extract a table without lines, it is recommended to use a different method. You may want to set Header Extractors on each Data Column, or look into using Label Sets to detect headers.
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.7777777777777777; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmi6eipca03750o0imf9k9zv9?embed_v=2&utm_source=embed" loading="lazy" title="07 Problems with Unlined Tables" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
== Using label sets within Tabular Layout ==
A [[Label Sets|Label Set]] is a group of labels associated with a specific Document Type. Each label represents a possible way a data element might be named or presented in a document. For example, a Label Set for invoices might include "Invoice Number", "Inv #", and "Bill No.", all mapped to the same Data Field.
Label Sets are managed using the "Labels" tab on the Design Page for any Content Type with a [[Labeling Behavior]] enabled.
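Conceptually, a Label Set behaves like a lookup from label variants to a data element. The Python sketch below shows that idea with a plain dictionary. The structure and the function name find_field_for_label are assumptions for illustration and do not reflect how Grooper stores labels.

```python
# Hypothetical Label Set: each Data Field lists the label variants seen on documents.
label_set = {
    "Invoice Number": ["Invoice Number", "Inv #", "Bill No."],
}

def find_field_for_label(text, labels):
    """Return the field whose label variants include the given text (case-insensitive)."""
    wanted = text.strip().lower()
    for field, variants in labels.items():
        if wanted in (variant.lower() for variant in variants):
            return field
    return None

field = find_field_for_label("inv #", label_set)
unknown = find_field_for_label("PO Number", label_set)
```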
=== Why use Label Sets ===
Label Sets enable header- and footer-driven detection for tables. Tabular Layout will:
*Read table headers from the Label Set to locate and align columns (''Header Detection'')
*Read optional footer labels to establish the end of the table (''Footer Detection'').
This approach is ideal for semi-structured documents where the same data appears with different labels or column order.
''' How Tabular Layout uses Label Sets '''
* '''Header detection:''' The engine reads table and column labels from the Label Set to build a '''Table Header Collection''' and snap header cells to geometric bounds. This improves alignment for value extraction across rows.
* '''Footer detection:''' When a table's '''Footer''' label exists, it establishes the table's end line. Tabular Layout stops row detection above the footer and can optionally capture the footer row as data.
* '''Column alignment:''' With labels present, columns are aligned to their labeled header bounds—even when column order varies—yielding consistent cell extraction.
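Column alignment by header bounds can be pictured with a small Python sketch. It assumes each labeled header cell has known left and right bounds and assigns a word to the column it overlaps most. The bounds, coordinates, and the function name column_for are hypothetical, and the actual alignment logic in Tabular Layout may differ.

```python
# Hypothetical header cells: column name -> (left, right) horizontal bounds.
header_bounds = {
    "Description": (0, 180),
    "Qty": (190, 250),
    "Total": (260, 340),
}

def column_for(word_left, word_right, bounds):
    """Pick the column whose header bounds overlap the word the most."""
    best_name, best_overlap = None, 0
    for name, (left, right) in bounds.items():
        overlap = min(word_right, right) - max(word_left, left)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name

# Column order on the page does not matter; only the labeled bounds do.
qty_column = column_for(200, 215, header_bounds)
straddling = column_for(165, 200, header_bounds)
outside = column_for(400, 420, header_bounds)
```

Since the lookup depends only on where each labeled header sits, the same configuration keeps working when a document presents the columns in a different order.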
'''Tips'''
* Prefer header labels that cover the full header cell without vertical overlap.
* Use "Dynamic Column Ordering" on the [[Data Table]] when documents rearrange columns.
* For unlabeled or irregular tables, rely more on each column’s "Value Extractor" and the '''Tabular Layout Options''' fallback modes.
'''Benefits:'''
* Rapid onboarding of new document types.
* Increased extraction accuracy for tables with variable layouts.
* Enables label-driven classification and extraction.
'''Drawbacks:'''
* Requires consistent labeling on documents.
* May need supplemental extractors for unlabeled data.
=== Pros and cons vs. traditional Tabular Layout ===
'''Pros'''
* '''Faster onboarding:''' define labels once per Content Type; minimal per-document tuning.
* '''Higher accuracy:''' header cells and footer rows are detected via label text (reduces false positives).
* '''Supports dynamic column order:''' columns are aligned to their labeled headers rather than fixed positions.
* Works with multi-line/stacked headers when labels identify the full header region.
'''Cons'''
* '''Requires labeled documents:''' if a table has no header/footers or inconsistent labeling, you must rely more on extractors.
* '''Label maintenance:''' changes in label wording/layout across sources require Label Set updates.
* Overlapping header text can reduce detection accuracy; avoid vertically overlapping header labels.
=== How to configure ===
First, configure the Labeling Behavior on your Content Type (usually the Content Model). For instructions on adding and configuring a Labeling Behavior, see our wiki article on the [[Labeling Behavior]].
==== Collecting Labels ====
Rather than setting up an extractor to collect the header labels of our table, we can use Label Sets to collect the labels instead. Label Sets are set per Document Type, so depending on how many Document Types you have in your Project, it may take more or less time to set up.
# Navigate to your Content Type where the Labeling Behavior is set.
# Click over to the "Labels" tab.
# If needed, select the Batch you will be working with in your Batch Viewer.
# Classify the documents in your Batch by assigning each one a Document Type.
# Navigate to the Data Table in your node tree.
# Set the Extract Method to Tabular Layout.
#* This is required for you to be able to see the labels on the "Labels" tab of your Content Type.
# Return to the "Labels" tab on your Content Type. You should now see labels available for your Data Table and Data Columns.
# Collect the full header label for the Data Table label. To do this, click inside the text box next to the Data Table label, click the rubber band icon at the top of the Labels panel, and then draw a box around the header labels of your table in the Document Viewer.
#* While not strictly required, it is considered best practice to always collect a header label for the Data Table label.
#* When you set your Data Column labels, Grooper will only look inside the set Header Label for matches. The Data Table label acts as a parent label for all Data Columns.
# Collect the individual column header labels for each Data Column. There are three different ways to do so:
## Type in the text of the label on the document.
## Double click on the label in the Document Viewer.
## Click the rubber band icon and draw a box around the label on the Document Viewer.
# Select the second document in your Batch with a different Document Type assigned.
# Repeat steps 8-10 until all different Document Types in your Batch have labels.
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.7777777777777777; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmiypwtky00eh220jrszv9f91?embed_v=2&utm_source=embed" loading="lazy" title="08 Collecting Table Labels" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
==== Setting a Data Column Extractor ====
Once your labels are collected, configure everything else as you would for regular Tabular Layout extraction. You need an extractor on at least one Data Column so Grooper can detect the rows of the table. Without one, Grooper cannot detect the rows and has no context for where the table begins and ends.
# Select a Data Column in your Node Tree.
# Set an extractor on the Value Extractor property to collect values located in the Table's column.
#* [[Pattern Match]] is commonly used for Value Extractors on Data Columns, but any extractor can be used.
#* You can also set the Value Extractor to a [[Reference]].
# Save your changes.
# Set more Value Extractors to the other Data Columns if needed for accurate extraction.
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.7777777777777777; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmj04m41k006dzm0iad53hatc?embed_v=2&utm_source=embed" loading="lazy" title="09 Setting a Data Column Extractor" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
== Advanced Configuration Options ==
Sometimes when setting up your Tabular Layout Extractor, you may run into situations where the basics just aren't enough to capture the data you're looking for. There are a few extra properties you can configure to improve your results.
=== Maximum Header Distance ===
'''Maximum Header Distance''' controls how far below the detected table header the first row is allowed to appear and still be considered part of the table. The distance is measured as a multiple of the average line height. If the first candidate row is farther than this limit, it will not be treated as part of the table.
==== What it does ====
* Ensures the first row starts close enough to the header.
* Helps avoid mistakenly including unrelated content (decorative lines, spacing, notes) as part of the table.
* When set to '''0''', header proximity is not required (useful when rows begin immediately below or header detection is unreliable).
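The check described above can be written as a short Python sketch. It assumes the distance is measured from the bottom of the header to the top of the first candidate row, which is an assumption made for this example. The function and parameter names are hypothetical.

```python
def first_row_allowed(header_bottom, row_top, avg_line_height, max_header_distance):
    """Check whether the first candidate row is close enough to the header.

    max_header_distance is a multiple of the average line height;
    a value of 0 disables the proximity check.
    """
    if max_header_distance == 0:
        return True
    gap = row_top - header_bottom
    return gap <= max_header_distance * avg_line_height

# A first row 20 units below the header, with an average line height of 12.
strict = first_row_allowed(100, 120, 12, 1.0)
relaxed = first_row_allowed(100, 120, 12, 2.0)
disabled = first_row_allowed(100, 500, 12, 0)
```

With a line height of 12, a setting of 1.0 allows a gap of up to 12 and rejects this row, while 2.0 allows up to 24 and accepts it.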
==== Configuration ====
# Open the [[Data Table]] that uses '''Tabular Layout'''.
# In "Row Detection", set '''Maximum Header Distance''' to a value that fits your layout:
#* '''1.0 or 100%''' — first row can begin up to one line height below the header (strict).
#* '''2.0 or 200%''' — allows a blank line or extra spacing.
#* Higher values — tolerate more vertical space before the first row. 2.0-3.0 (200%-300%) is often enough to capture the rows of the table without returning false positives above the header.
# After adjusting the property, run extraction on a sample document.
# The first data row should be detected directly under the header within the allowed distance.
# If no rows are detected (too strict), increase the value incrementally.
# If unrelated content is being treated as the first row (too permissive), decrease the value.
'''Tips'''
* Keep the value as low as possible while accommodating typical spacing on your documents.
* Pair with "Minimum Cell Count" and per-column '''Tabular Layout Options''' (mark critical columns as '''Required''') to further reduce false positives.
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.78; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmj05xz7c01vj0x0istvuo8ha?embed_v=2&utm_source=embed" loading="lazy" title="11 Maximum Header Distance" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
=== Using Footer labels ===
Even after setting an extractor on a Data Column, Grooper may not be able to accurately detect rows. Often you might find that Grooper detects a row in a set of data that appears after the table on the document has ended. We can use Footer Labels to tell Grooper where to stop looking for rows, indicating where the table ends.
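In effect, a footer acts as a cutoff line for row candidates. The Python sketch below shows the idea with hypothetical row positions and an assumed footer position. It is not how Grooper represents rows internally.

```python
# Hypothetical row candidates as (y_position, text), top to bottom.
candidates = [
    (120, "Widget  2  19.98"),
    (140, "Gadget  1   5.00"),
    (200, "Late fee    2.50"),  # appears after the table has ended
]

# Assumed vertical position of the detected footer text, e.g. "Total Due".
footer_y = 170

# Only candidates above the footer are kept as table rows.
table_rows = [text for y, text in candidates if y < footer_y]
```

The "Late fee" line matches the same value pattern as the real rows, so only its position relative to the footer keeps it out of the table.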
==== Without Label Sets ====
To add a Footer Label when you are not using Label Sets:
# Set up your Tabular Layout and make sure you have your Header Detection Extractor configured.
# Navigate to the Data Table and expand the Extract Method sub properties.
# Locate the Footer Detection property.
# Configure an extractor for the Footer Detection property to return a text segment on the document that appears shortly after the table ends.
#* Use any extractor you wish. In our example below we use a List Match.
# If desired, set the Capture Footer Row property to False.
#* If set to ''True'', you will see a blank row at the end of your extraction when you test to indicate that a Footer is being used.
#* If set to ''False'', the blank row will be hidden during extraction.
# Test your extraction.
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.78; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmj0b8ji90022x70hs8vf8joe?embed_v=2&utm_source=embed" loading="lazy" title="11 Footer Detection" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
==== With Label Sets ====
To add a Footer Label:
# Return to the "Labels" tab on the Content Model.
# Click inside of the text box next to the Data Table label.
# In the Labels panel toolbar at the top, click the "Add a New Label" icon.
# Click "Add Footer".
# You should see a new label as a child of the Data Table Label.
# Collect a text segment on the document for the Footer Label that will indicate where the table ends.
# Test your extraction to verify accuracy.
{|class="attn-box"
|
|
The scope of the Footer Label is important. Make sure you add your Footer Label as a child of the Data Table rather than the Data Model.
|}
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.78; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmj05zuar01zh0x0iny5soeew?embed_v=2&utm_source=embed" loading="lazy" title="10 Footer Labels" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
== Properties overview ==
Below is a comprehensive list of Tabular Layout properties, including their definitions, remarks, and use cases.
=== Tabular Layout Properties ===
; '''Table Style'''
: Defines the style of table extraction (Normal or Floating).
: ''Normal'' extracts tables spanning the full page width and multiple pages. ''Floating'' confines extraction to a region on a single page.
; '''Header Detection'''
: Configures how table headers are detected, using label sets, value extractors, or column header extractors.
: Use to establish table structure and boundaries.
; '''Row Detection'''
: Controls how table rows are identified, including required columns, minimum cell count, and alignment.
: Fine-tune to improve row segmentation and reduce false positives.
; '''Multiline Rows'''
: Enables detection of multi-line or stacked table rows.
: Use for tables with wrapped text, stacked headers, or free-form notes.
; '''Footer Detection'''
: Assigns a [[Value Extractor]] to identify footer content at the end of the table.
: Prevents extraction beyond the intended table end.
; '''Capture Footer Row'''
: If enabled, includes the detected footer row as a data row in the output table.
: Use when footer contains values to extract (e.g., totals).
=== Table Header Detector Properties ===
; '''Header Row Extractor'''
: Value Extractor for identifying header rows by pattern or content.
: Use for pattern-based header detection.
; '''Minimum Cell Count'''
: Minimum number of header cells required for detection.
: Prevents false positives in header recognition.
; '''Maximum Line Spacing'''
: Maximum vertical distance between header lines.
: Adjust for multi-line or spaced headers.
; '''Repair Threshold'''
: Controls how aggressively incomplete headers are repaired.
: Use for documents with inconsistent or missing header cells.
; '''Run Global'''
: Enables global header detection across the document.
: Use for standardized forms with consistent headers.
=== Table Row Detector Properties ===
; '''Minimum Cell Count'''
: Minimum number of column values required for row detection.
: Filters out incomplete or noisy rows.
; '''Maximum Gap'''
: Maximum allowed gap between detected values.
: Adjust for wide or inconsistently spaced tables.
; '''Maximum Header Distance'''
: Maximum allowed distance from header to first row.
: Use for tables with extra space after headers.
; '''Find Column Positions'''
: Dynamically adjusts header cell boundaries based on detected values.
: Improves alignment in variable layouts.
; '''Merge Multiple Instances'''
: Merges multiple detected row regions into a single table.
: Use for multi-page or interrupted tables.
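As an illustration of how a row-level '''Minimum Cell Count''' filters out noisy rows, the Python sketch below keeps only candidates in which enough columns produced a value. The candidate rows and the function name filter_rows are hypothetical.

```python
# Hypothetical row candidates: column name -> extracted value (None if no hit).
candidates = [
    {"Description": "Widget", "Qty": "2", "Total": "19.98"},
    {"Description": None, "Qty": None, "Total": "2025"},  # stray number
    {"Description": "Gadget", "Qty": None, "Total": "5.00"},
]

def filter_rows(rows, minimum_cell_count):
    """Keep rows where at least minimum_cell_count columns produced a value."""
    return [
        row for row in rows
        if sum(value is not None for value in row.values()) >= minimum_cell_count
    ]

kept = filter_rows(candidates, 2)
```

Raising the threshold to 3 in this example would also drop the "Gadget" row, so the value should reflect how many columns are reliably filled on real documents.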
=== Multiline Row Settings Properties ===
; '''Maximum Lines Per Row'''
: Maximum number of text lines per table row.
: Use for wrapped descriptions or notes.
; '''Maximum Leading Lines'''
: Maximum number of leading lines before main row content.
: Include descriptions or comments above rows.
; '''Maximum Line Spacing'''
: Maximum vertical distance between lines in a row.
: Adjust for tightly or widely spaced lines.
; '''Detect Page Wrap'''
: Enables rows to span multiple pages.
: Use for long descriptions or multi-page tables.
; '''Detect Stacked Layout'''
: Enables detection of stacked table layouts.
: Use for tables with multi-level or stacked headers.
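To picture what '''Maximum Lines Per Row''' limits, the Python sketch below attaches continuation lines to the row above them until the limit is reached. It assumes rows start with an item number and that lines beyond the limit are simply left out. Both are simplifications chosen for this example, and group_lines is a hypothetical helper.

```python
import re

def group_lines(lines, row_start, max_lines_per_row):
    """Attach continuation lines to the preceding row, up to a line limit."""
    rows, leftover = [], []
    for line in lines:
        if row_start.match(line):
            rows.append([line])
        elif rows and len(rows[-1]) < max_lines_per_row:
            rows[-1].append(line)
        else:
            leftover.append(line)
    return rows, leftover

# Assumed convention: a new row begins with an item number.
row_start = re.compile(r"^\d+\s")

lines = [
    "1 Widget, stainless steel",
    "  with mounting bracket",
    "2 Gadget",
    "  blue",
    "  extended warranty",
]

rows, leftover = group_lines(lines, row_start, max_lines_per_row=2)
```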
=== Tabular Layout Options (per-column) ===
; '''Row Detection'''
: Controls how the column is used in row detection (Required, Optional, Disabled).
: Mark key columns as required for strict row boundaries.
; '''Secondary Extract'''
: Specifies when secondary extraction is performed (Auto, Always, Never).
: Use for fallback extraction in merged or irregular cells.
; '''Secondary Extract Mode'''
: Determines how secondary extraction is performed (Auto, Geometric, CellExtract, RowExtract).
: Select based on table structure and variability.
== See also ==
* [[Data Table]]
* [[Data Column]]
* [[Label Sets|Label Set]]
* [[Value Extractor]]
* [[Table Extract Method]]

Latest revision as of 15:24, 15 December 2025

This article is about the current version of Grooper.

Note that some content may still need to be updated.

Tabular Layout is a Table Extract Method that uses column header values, determined by the Data Columns' Header Extractor results (or by labels collected for the Data Columns when a Labeling Behavior is enabled), together with Data Column Value Extractor results to model a table's structure and return its values.

Introduction

The Tabular Layout Table Extract Method is a powerful tool in Grooper for extracting structured tabular data from documents. It automatically detects table headers, rows, and footers using a combination of value extractors and layout analysis. Tabular Layout is ideal for documents where tables are clearly defined, such as invoices, statements, and reports.

Unlike other Table Extract Methods (such as Row Match or Delimited Extract), Tabular Layout leverages header and footer labels, supports multi-line and stacked layouts, and provides advanced configuration for handling complex table structures.

When to use

Tabular Layout is best used when:

  • Tables have clearly defined headers and rows.
  • You need to extract data from grid-based tables, including those with merged or stacked cells.
  • Tables may span multiple pages or regions.

Example: Use Tabular Layout to extract line items from an invoice where each row contains "Quantity," "Description," "Unit Price," and "Total," and headers are present.

Drawbacks:

  • Tabular Layout may be less effective for highly irregular tables or lists without clear headers.
  • For simple delimited data (e.g., CSV), Delimited Extract may be more efficient.
  • Requires well-defined header labels or extractors for best results.

What is a table?

A table in document processing is a structured arrangement of data in rows and columns. Its main components are:

  • Headers: The top section that labels each column (e.g., "Quantity", "Description").


  • Rows: Horizontal groupings of related data, each representing a record or item.


  • Columns: Vertical divisions, each capturing a specific type of data (e.g., price, date).


  • Footers: The bottom section, often used for totals or summary information.


Common use cases for tables in documents include:

  • Invoice line items
  • Transaction logs
  • Product lists
  • Financial summaries

Basic setup

Grooper must be able to detect the columns and rows of a table to extract data. The Tabular Layout does this by identifying the column headers, which indicates where the columns are located on the document. Then at least one Value Extractor must be set on a Data Column that will return a result on each row of the table, giving Grooper context for where the rows of the table are located.
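
The idea above can be sketched in Python on plain fixed-width text. This is only an illustration of the concept, not Grooper's implementation: the function names (column_spans, extract_rows) are hypothetical, and character offsets stand in for the page coordinates Grooper actually works with. The header labels locate the columns, and a single "anchor" pattern (playing the role of one Data Column's Value Extractor) decides which lines count as rows.

```python
import re

def column_spans(header_line, labels):
    """Map each label to a (start, end) character span on the header line.
    A column runs from its header's start to the start of the next header."""
    starts = sorted((header_line.index(label), label) for label in labels)
    spans = {}
    for i, (start, label) in enumerate(starts):
        end = starts[i + 1][0] if i + 1 < len(starts) else None
        spans[label] = (start, end)
    return spans

def extract_rows(lines, labels, anchor_label, anchor_pattern):
    """Find the header line, then keep each later line whose anchor column
    matches anchor_pattern and slice it into cells by column span."""
    header_idx = next(
        (i for i, ln in enumerate(lines) if all(lb in ln for lb in labels)), None
    )
    if header_idx is None:
        return []  # no header found, so no columns and no table
    spans = column_spans(lines[header_idx], labels)
    a_start, a_end = spans[anchor_label]
    rows = []
    for line in lines[header_idx + 1:]:
        if re.search(anchor_pattern, line[a_start:a_end]):
            rows.append({lb: line[s:e].strip() for lb, (s, e) in spans.items()})
    return rows
```

In this sketch, a line that has no match in the anchor column (such as a closing remark under the table) is simply not treated as a row, which mirrors why at least one Data Column needs a Value Extractor that hits on every row.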

Step 1: Create the Data Elements and select the Extract Method

These instructions assume you already have a Project set up in Grooper with a Content Model, Document Type, and Data Model.

  1. Right-click on your Data Model.
  2. Hover over "Add" and select "Data Table..." from the fly out menu.
  3. When the "Add" window appears, enter a name for your Data Table in the Name property.
  4. When satisfied with the naming, click "Execute" to add the Data Table.
  5. Add the Data Columns as children of the Data Table using one of the following methods:
    • One at a time
      1. Right-click on the Data Table.
      2. Hover over "Add" and select "Data Column..." from the fly out menu.
      3. When the "Add" window appears, enter a name for your Data Column in the Name property.
      4. When satisfied, click "Execute" to create the Data Column.
      5. Repeat steps 1-4 to add as many Data Columns as you would like.
    • Multiple at once
      1. Right-click on the Data Table.
      2. Hover over "Contents" and click "Add Multiple Items..." from the fly out menu.
      3. When the "Add Multiple Items" window appears, make sure the Item Type property is set to Data Column.
      4. Click the "..." icon to the right of the Item Names property.
      5. When the Item Names window appears, type in the names you want to give to the Data Columns in the text box. Hit enter after each name.
      6. When finished, click "OK".
      7. Back on the Add Multiple Items window, click "Execute" to create the Data Columns.
  6. Next, select the Data Table in your Node Tree.
  7. Click the "☰" to the right of the Extract Method property.
  8. Click on "Tabular Layout" in the drop-down menu.
  9. Click the save icon at the top of the property grid to save your changes to the Data Table.

Step 2: Configure Header Extractors

Now that the Data Table's Extract Method is set to Tabular Layout, we need to give Grooper some information to determine where tables are located on a document. To do this, we first need to define where the columns of the table are. We can do this by setting a Header Extractor on each Data Column node under the Data Table, one for each column header.

  1. Select the first Data Column under your Data Table in your node tree.
  2. Locate the Header Extractor property in the property grid and click the "☰" to the right of the property to access the drop down.
  3. Select an Extractor to use to extract the header of the column.
  4. Configure that extractor to return the text of the corresponding column header on the document.
  5. Save any changes made to the Data Column.
  6. Repeat steps 1-5 for each Data Column in your Data Table.


Step 3: Assign Value Extractors to Data Columns

The Data Table's Extract Method has been set to Tabular Layout, and headers are defined on each Data Column with a Header Extractor. Next, we need to give Grooper a little more context to figure out where the rows of our table are. We do this by setting Value Extractors on the Data Columns.

You will need to set a Value Extractor on at least one Data Column. Grooper will use that extractor to determine where the rows in your table are. If you are having issues with Grooper detecting the rows of the table accurately, you can add Value Extractors to other Data Columns to give Grooper more to work with. There are other ways to improve accuracy, which are discussed later in the article.

To set a Value Extractor on a Data Column, follow these instructions:

  1. Select a Data Column in your Node Tree.
  2. Set an extractor on the Value Extractor property to collect values located in the Table's column.
    • Pattern Match is commonly used for Value Extractors on Data Columns, but any extractor can be used.
    • You can also set the Value Extractor to a Reference.
  3. Save your changes.
  4. Set more Value Extractors to the other Data Columns if needed for accurate extraction.

Using a Header Row Extractor

A Header Row is the line (or lines) at the top of a table that contains the column labels, such as "Item No.", "Description", "Qty.", or "Total". In document processing, the header row provides essential context for identifying and aligning data in each column. Accurate header detection ensures that extracted values are mapped to the correct Data Columns, even when table layouts vary between documents.

A Header Row Extractor is a specialized Value Extractor that detects the entire header row at once, rather than relying on individual header extractors for each column. This approach is especially useful when:

  • Table headers span multiple lines or have complex formatting.
  • Column order varies between documents.
  • You want to simplify configuration and improve robustness for tables with dynamic layouts.

Using a Header Row Extractor can reduce manual setup, improve extraction accuracy, and make your solution more adaptable to different document types.

Why use a Header Row Extractor?

  • Use when header rows follow a predictable format or pattern.
  • You can configure the headers in one place rather than setting a Header Extractor on each Data Column.
  • A Header Row Extractor can throw out false positive column matches.
  • It can be a better way to take advantage of fuzzy RegEx.

Creating the Extractor

There are two main methods for creating a Header Row Extractor in Grooper:

  1. Using Named Groups in a pattern-based extractor.
  2. Using Child Extractors within a Data Type.

Either method gives the same result. Which one you choose is a matter of preference.

Using Named Groups

A Named Group is part of a regular expression pattern that captures a specific portion of text and assigns it a name. In Grooper, named groups are used to map header labels directly to Data Columns.

  • Named Group Syntax:
(?<Named_Group>RegEx Pattern)

For example, to capture a header row with "Item No.", "Description", and "Qty.", you might use:

(?<ItemNo>Item\s*No\.?)\s+
(?<Description>Description)\s+
(?<Qty>Qty\.?)
  • Step-by-Step: Configuring a Header Row Extractor Using Named Groups
  1. Create a new Extractor Object such as a Data Type or Value Reader.
  2. Set the Value Extractor on the Extractor Object to a Pattern Match.
  3. Write a Regular Expression to return the full header row of the table.
  4. In the extractor's pattern, use named groups for each individual column header you want to detect.
    • Ensure each named group matches the corresponding column label in your table.
    • If a Data Column has a name with a space in it, use an underscore in place of the space in the group name.

The following RegEx pattern is used in the example below. If you would like to follow along, feel free to copy out the RegEx.

(?<Description>DESCRIPTION)\t
(?<Quantity>HRS / QTY)\t
(?<Unit_Price>RATE / PRICE)\t
(?<Line_Total>SUBTOTAL)
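
To see how named groups tie header text to Data Column names, here is a small Python sketch of the same pattern. Two things are assumptions for illustration only: Python's re module spells named groups as (?P<Name>...) rather than the (?<Name>...) form shown above, and the helper header_spans is a hypothetical name, not part of Grooper. The underscore-to-space step mirrors the naming rule described in the steps.

```python
import re

# Same header row as above, written with Python's (?P<name>...) syntax.
HEADER_ROW = re.compile(
    r"(?P<Description>DESCRIPTION)\t"
    r"(?P<Quantity>HRS / QTY)\t"
    r"(?P<Unit_Price>RATE / PRICE)\t"
    r"(?P<Line_Total>SUBTOTAL)"
)

def header_spans(line):
    """Map each Data Column name to the (start, end) span of its header label.
    Underscores in group names are turned back into spaces."""
    match = HEADER_ROW.search(line)
    if match is None:
        return {}
    return {
        name.replace("_", " "): match.span(name)
        for name in HEADER_ROW.groupindex
    }
```

Because the whole row has to match at once, a stray "SUBTOTAL" elsewhere on the page would not be picked up as a column header in this sketch.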

Using Named Child Extractors

Alternatively, you can use a Data Type with multiple child extractors—one for each header label. This method is ideal when header labels are complex or require different extraction logic.

  • How It Works:

    • Each child extractor is configured to match a specific header label.
    • The Collation Provider can be used to collate results from child extractors, referencing each by name.
    • The combined extractor is assigned as the Header Row Extractor.

  • Step-by-Step: Configuring a Header Row Extractor Using Child Extractors
  1. Create a new Data Type for the header row.
  2. Add a child extractor object (Data Type or Value Reader) for each column header (e.g., "Description", "Qty.", "Unit Price").
    • Ensure that each child extractor object is named exactly the same as the Data Column it corresponds with.
  3. Use a Collation Provider such as Ordered Array to collate the child extractors into a single result.
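
As a rough mental model of the child-extractor approach, the Python sketch below runs one pattern per header label and only accepts the header row when every child matches in left-to-right order. It approximates the idea of ordered collation under simplifying assumptions; it is not Grooper's Ordered Array implementation, and collate_ordered is a hypothetical helper name.

```python
import re

def collate_ordered(line, children):
    """children: list of (name, pattern) pairs in expected left-to-right order.
    Returns {name: (start, end)} when every child matches in order, else None."""
    position = 0
    result = {}
    for name, pattern in children:
        match = re.compile(pattern).search(line, position)
        if match is None:
            return None  # one missing or out-of-order child rejects the row
        result[name] = match.span()
        position = match.end()  # the next child must appear after this one
    return result
```

Each child is named after the Data Column it feeds, which is how the result keys line up with the columns in this sketch.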

How to Set the Header Row Extractor in Tabular Layout

Once you have created a Header Row Extractor, you need to set it on the Header Row Extractor property.

To use your Header Row Extractor in Tabular Layout:

  1. Select the Data Table node in your node tree.
  2. Set the "Extract Method" property to Tabular Layout if not already set.
  3. Expand the "Header Detection" property and locate "Header Row Extractor".
  4. Set the Header Row Extractor to a Reference.
  5. Assign your configured extractor (either with named groups or as a Data Type with child extractors) to the reference.
  6. Save your changes and run a test extraction to confirm the Header Row Extractor works.

Problems with unlined tables

When a table has no lines marking where its rows and columns are, Grooper has a harder time figuring out where the columns are. The named child extractors method of configuring a Header Row Extractor relies on lines to determine columns, so without lines you can run into issues with your extraction.

If you have a table without lines on your documents that you want extracted, it is recommended to use a different method for extracting the data. You may want to set Header Extractors on each Data Column, or look into using Label Sets to detect headers.

Using label sets within Tabular Layout

A Label Set is a group of labels associated with a specific Document Type. Each label represents a possible way a data element might be named or presented in a document. For example, a Label Set for invoices might include "Invoice Number", "Inv #", and "Bill No.", all mapped to the same Data Field.
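
The invoice example can be pictured as a simple lookup. The Python sketch below is only an analogy: the dictionary layout, the resolve_label name, and the case-insensitive comparison are assumptions made for illustration and do not describe how Grooper stores or matches labels.

```python
# Hypothetical label variants for one Data Field, taken from the example above.
LABEL_SET = {
    "Invoice Number": ["Invoice Number", "Inv #", "Bill No."],
}

def resolve_label(text, label_set):
    """Return the Data Field whose label variants include the given text,
    ignoring case and surrounding whitespace (an assumption of this sketch)."""
    normalized = text.strip().casefold()
    for field, variants in label_set.items():
        if normalized in (variant.casefold() for variant in variants):
            return field
    return None
```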

Label Sets are managed using the "Labels" tab on the Design Page for any Content Type with a Labeling Behavior enabled.

Why use Label Sets

Label Sets enable header- and footer-driven detection for tables. Tabular Layout will:

  • Read table headers from the Label Set to locate and align columns (Header Detection).
  • Read optional footer labels to establish the end of the table (Footer Detection).

This approach is ideal for semi-structured documents where the same data appears with different labels or column order.

How Tabular Layout uses Label Sets

  • Header detection: The engine reads table and column labels from the Label Set to build a Table Header Collection and snap header cells to geometric bounds. This improves alignment for value extraction across rows.
  • Footer detection: When a table's Footer label exists, it establishes the table's end line. Tabular Layout stops row detection above the footer and can optionally capture the footer row as data.
  • Column alignment: With labels present, columns are aligned to their labeled header bounds—even when column order varies—yielding consistent cell extraction.

Tips

  • Prefer header labels that cover the full header cell without vertical overlap.
  • Use "Dynamic Column Ordering" on the Data Table when documents rearrange columns.
  • For unlabeled or irregular tables, rely more on each column’s "Value Extractor" and the Tabular Layout Options fallback modes.

Benefits:

  • Rapid onboarding of new document types.
  • Increased extraction accuracy for tables with variable layouts.
  • Enables label-driven classification and extraction.

Drawbacks:

  • Requires consistent labeling on documents.
  • May need supplemental extractors for unlabeled data.

Pros and cons vs. traditional Tabular Layout

Pros

  • Faster onboarding: define labels once per Content Type; minimal per-document tuning.
  • Higher accuracy: header cells and footer rows are detected via label text (reduces false positives).
  • Supports dynamic column order: columns are aligned to their labeled headers rather than fixed positions.
  • Works with multi-line/stacked headers when labels identify the full header region.

Cons

  • Requires labeled documents: if a table has no header/footers or inconsistent labeling, you must rely more on extractors.
  • Label maintenance: changes in label wording/layout across sources require Label Set updates.
  • Overlapping header text can reduce detection accuracy; avoid vertically overlapping header labels.

How to configure:

First you will need to configure the Labeling Behavior on your Content Type (usually the Content Model). For instructions on how to add and configure a Labeling Behavior, please take a look at our wiki article on the Labeling Behavior.

Collecting Labels

Rather than setting up an extractor to collect the header labels of our table, we can use Label Sets to collect the labels instead. Label Sets are set per Document Type, so depending on how many Document Types you have in your Project, it may take more or less time to set up.

  1. Navigate to your Content Type where the Labeling Behavior is set.
  2. Click over to the "Labels" tab.
  3. If needed, select the Batch you will be working with in your Batch Viewer.
  4. Assign Document Types (Classify) the documents in your Batch.
  5. Navigate to the Data Table in your node tree.
  6. Set the Extract Method to Tabular Layout.
    • This is required for you to be able to see the labels on the "Labels" tab of your Content Type.
  7. Return to the "Labels" tab on your Content Type. You should now see labels available for your Data Table and Data Columns.
  8. Collect the full header label for the Data Table label. To do this, click inside the text box next to the Data Table label, click the rubber band icon at the top of the Labels panel, and then draw a box around the header labels of your table in the Document Viewer.
    • While not strictly required, it is considered best practice to always collect a header label for the Data Table label.
    • When you set your Data Column labels, Grooper will only look inside the set Header Label for matches. The Data Table label acts as a parent label for all Data Columns.
  9. Collect the individual column header labels for each Data Column. There are three different ways to do so:
    1. Type in the text of the label on the document.
    2. Double click on the label in the Document Viewer.
    3. Click the rubber band icon and draw a box around the label on the Document Viewer.
  10. Select the second document in your Batch with a different Document Type assigned.
  11. Repeat steps 8-10 until all different Document Types in your Batch have labels.


Setting a Data Column Extractor

Once you have your labels collected, you configure everything else like you would for regular Tabular Layout Extraction. You'll need to configure an extractor on at least one Data Column for Grooper to be able to detect the rows of the table. Without an extractor on a Data Column, Grooper will not be able to detect the rows of the table and so will have no context as to where the table begins and ends.

  1. Select a Data Column in your Node Tree.
  2. Set an extractor on the Value Extractor property to collect values located in the Table's column.
    • Pattern Match is commonly used for Value Extractors on Data Columns, but any extractor can be used.
    • You can also set the Value Extractor to a Reference.
  3. Save your changes.
  4. Set more Value Extractors to the other Data Columns if needed for accurate extraction.

Advanced Configuration Options

Sometimes when setting up your Tabular Layout Extractor, you may run into situations where the basics just aren't enough to capture the data you're looking for. There are a few extra properties you can configure to improve your results.

Maximum Header Distance

Maximum Header Distance controls how far below the detected table header the first row is allowed to appear and still be considered part of the table. The distance is measured as a multiple of the average line height. If the first candidate row is farther than this limit, it will not be treated as part of the table.

What it does

  • Ensures the first row starts close enough to the header.
  • Helps avoid mistakenly including unrelated content (decorative lines, spacing, notes) as part of the table.
  • When set to 0, header proximity is not required (useful when rows begin immediately below or header detection is unreliable).

Configuration

  1. Open the Data Table that uses Tabular Layout.
  2. In "Row Detection", set Maximum Header Distance to a value that fits your layout:
    • 1.0 or 100% — first row can begin up to one line height below the header (strict).
    • 2.0 or 200% — allows a blank line or extra spacing.
    • Higher values — tolerate more vertical space before the first row. 2.0-3.0 (200%-300%) is often enough to capture the rows of the table without returning false positives above the header.
  3. After adjusting the property, run extraction on a sample document.
  4. The first data row should be detected directly under the header within the allowed distance.
  5. If no rows are detected (too strict), increase the value incrementally.
  6. If unrelated content is being treated as the first row (too permissive), decrease the value.
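
The arithmetic behind this setting can be sketched as follows. The function name, the parameter names, and the use of a single vertical coordinate system are assumptions for illustration; only the rule itself (distance expressed as a multiple of the average line height, with 0 turning the check off) comes from the description above.

```python
def first_row_allowed(header_bottom, row_top, avg_line_height, max_header_distance):
    """Return True if the first candidate row is close enough to the header.

    max_header_distance is a multiple of the average line height;
    a value of 0 means header proximity is not required.
    """
    if max_header_distance == 0:
        return True
    gap = row_top - header_bottom
    return gap <= max_header_distance * avg_line_height
```

For example, with an average line height of 12 units and a setting of 2.0, the first row may start up to 24 units below the header in this sketch.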

Tips

  • Keep the value as low as possible while accommodating typical spacing on your documents.
  • Pair with "Minimum Cell Count" and per-column Tabular Layout Options (mark critical columns as Required) to further reduce false positives.


Using Footer labels

Even after setting an extractor on a Data Column, Grooper may not be able to accurately detect rows. Often you might find that Grooper detects a row in a set of data that appears after the table on the document has ended. We can use Footer Labels to tell Grooper where to stop looking for rows, indicating where the table ends.
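
The effect of a footer can be illustrated with a short Python sketch that cuts off row candidates at the first line matching a footer pattern. This is a simplified text-only model with a hypothetical helper name (trim_at_footer); Grooper works with page positions rather than a list of lines.

```python
import re

def trim_at_footer(lines, footer_pattern):
    """Return only the lines that appear before the first footer match.
    If no footer is found, every line remains a row candidate."""
    for index, line in enumerate(lines):
        if re.search(footer_pattern, line):
            return lines[:index]
    return list(lines)
```

In the test data for this sketch, a line after the footer that happens to contain a number and an amount is no longer considered, which is the false-positive case described above.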

Without Label Sets

To add a Footer Label when you are not using Label Sets:

  1. Set up your Tabular Layout and make sure you have your Header Detection Extractor configured.
  2. Navigate to the Data Table and expand the Extract Method sub properties.
  3. Locate the Footer Detection property.
  4. Configure an extractor for the Footer Detection property to return a text segment on the document that appears shortly after the table ends.
    • Use any extractor you wish. In our example below we use a List Match.
  5. If desired, set the Capture Footer Row property to False.
    • If set to True, you will see a blank row at the end of your extraction when you test to indicate that a Footer is being used.
    • If set to False, the blank row will be hidden during extraction.
  6. Test your extraction.


With Label Sets

To add a Footer Label:

  1. Return to the "Labels" tab on the Content Model.
  2. Click inside of the text box next to the Data Table label.
  3. In the Labels panel toolbar at the top, click the "Add a New Label" icon.
  4. Click "Add Footer".
  5. You should see a new label as a child of the Data Table Label.
  6. Collect a text segment on the document for the Footer Label that will indicate where the table ends.
  7. Test your extraction to verify accuracy.

The scope of the Footer Label is important. Make sure you add your Footer Label as a child of the Data Table rather than the Data Model.


Properties overview

Below is a comprehensive list of Tabular Layout properties, including their definitions, remarks, and use cases.

Tabular Layout Properties

Table Style
    Defines the style of table extraction (Normal or Floating).
    Normal extracts tables spanning the full page width and multiple pages. Floating confines extraction to a region on a single page.
Header Detection
    Configures how table headers are detected, using label sets, value extractors, or column header extractors.
    Use to establish table structure and boundaries.
Row Detection
    Controls how table rows are identified, including required columns, minimum cell count, and alignment.
    Fine-tune to improve row segmentation and reduce false positives.
Multiline Rows
    Enables detection of multi-line or stacked table rows.
    Use for tables with wrapped text, stacked headers, or free-form notes.
Footer Detection
    Assigns a Value Extractor to identify footer content at the end of the table.
    Prevents extraction beyond the intended table end.
Capture Footer Row
    If enabled, includes the detected footer row as a data row in the output table.
    Use when footer contains values to extract (e.g., totals).

Table Header Detector Properties

Header Row Extractor
    Value Extractor for identifying header rows by pattern or content.
    Use for pattern-based header detection.
Minimum Cell Count
    Minimum number of header cells required for detection.
    Prevents false positives in header recognition.
Maximum Line Spacing
    Maximum vertical distance between header lines.
    Adjust for multi-line or spaced headers.
Repair Threshold
    Controls how aggressively incomplete headers are repaired.
    Use for documents with inconsistent or missing header cells.
Run Global
    Enables global header detection across the document.
    Use for standardized forms with consistent headers.

Table Row Detector Properties

Minimum Cell Count
    Minimum number of column values required for row detection.
    Filters out incomplete or noisy rows.
Maximum Gap
    Maximum allowed gap between detected values.
    Adjust for wide or inconsistently spaced tables.
Maximum Header Distance
    Maximum allowed distance from header to first row.
    Use for tables with extra space after headers.
Find Column Positions
    Dynamically adjusts header cell boundaries based on detected values.
    Improves alignment in variable layouts.
Merge Multiple Instances
    Merges multiple detected row regions into a single table.
    Use for multi-page or interrupted tables.

Multiline Row Settings Properties

Maximum Lines Per Row
    Maximum number of text lines per table row.
    Use for wrapped descriptions or notes.
Maximum Leading Lines
    Maximum number of leading lines before main row content.
    Include descriptions or comments above rows.
Maximum Line Spacing
    Maximum vertical distance between lines in a row.
    Adjust for tightly or widely spaced lines.
Detect Page Wrap
    Enables rows to span multiple pages.
    Use for long descriptions or multi-page tables.
Detect Stacked Layout
    Enables detection of stacked table layouts.
    Use for tables with multi-level or stacked headers.
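
A minimal Python sketch of how a per-row line limit might shape multi-line rows is shown below. It is a loose model under stated assumptions, not Grooper's algorithm: a caller-supplied is_row_start test stands in for a Value Extractor hit, continuation lines are attached to the row above them, and lines beyond the limit are dropped. The names group_multiline, is_row_start, and max_lines_per_row are hypothetical.

```python
def group_multiline(lines, is_row_start, max_lines_per_row):
    """Group text lines into rows.

    A line for which is_row_start(line) is true opens a new row. Other lines
    are appended to the current row until it holds max_lines_per_row lines;
    any further continuation lines are ignored.
    """
    rows = []
    for line in lines:
        if is_row_start(line):
            rows.append([line])
        elif rows and len(rows[-1]) < max_lines_per_row:
            rows[-1].append(line)
    return rows
```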

Tabular Layout Options (per-column)

Row Detection
    Controls how the column is used in row detection (Required, Optional, Disabled).
    Mark key columns as required for strict row boundaries.
Secondary Extract
    Specifies when secondary extraction is performed (Auto, Always, Never).
    Use for fallback extraction in merged or irregular cells.
Secondary Extract Mode
    Determines how secondary extraction is performed (Auto, Geometric, CellExtract, RowExtract).
    Select based on table structure and variability.
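
One way to picture how the per-column Row Detection setting and the row detector's Minimum Cell Count could work together is the Python sketch below. The exact interaction is an assumption made for illustration: here, a row is rejected if any Required column is empty, Disabled columns never count toward the total, and the remaining values must reach the minimum. The function name row_qualifies is hypothetical.

```python
def row_qualifies(cells, modes, minimum_cell_count):
    """Decide whether a candidate row should be kept.

    cells: {column name: extracted value or None}
    modes: {column name: "Required", "Optional", or "Disabled"}
    """
    # Every Required column must have a value.
    for column, mode in modes.items():
        if mode == "Required" and not cells.get(column):
            return False
    # Count values only from columns that take part in row detection.
    counted = [
        column for column, value in cells.items()
        if value and modes.get(column, "Optional") != "Disabled"
    ]
    return len(counted) >= minimum_cell_count
```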

See also