Row Match (Table Extract Method)


This article is about the current version of Grooper.

Note that some content may still need to be updated.


Row Match is a Table Extract Method that uses regular expression pattern matching to determine a table's structure based on the pattern of each row and extract cell data from each column.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2025). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

Introduction

Row Match is a Table Extract Method used to build a table by matching whole rows rather than detecting individual column cells visually or by positional header mapping. Each hit returned by a configured row-level extractor (the "Row Extractor") is treated as a table row. Column values are then populated from named child results (such as regex named groups) or, optionally, by running individual column value extractors recursively on the row text.

Compared to other methods:

  • Tabular Layout focuses on locating column headers and aligning values beneath them; ideal for structured, header-driven grids.
  • Grid Layout infers a matrix from intersecting row and column headers (X and Y axes).
  • Fluid Layout dynamically chooses between Tabular Layout and Row Match based on label presence.
  • Delimited Extract (Delimited/CSV) parses external delimited files instead of page content.
  • Fixed Width slices rows by character spans in monospaced text layouts.
  • AI Table Reader uses a Large Language Model to interpret complex or semi-structured regions.

Row Match shines when headers are absent, the layout is irregular, or pattern-based recognition is faster to configure than positional alignment.

Purpose of Row Match

Ideal use cases

  • Lists or line-item style data without reliable column headers (e.g. charge lines, remark code lists).
  • Pattern-consistent rows that can be described with a single regular expression or Data Type collation, e.g. (?<Qty>\d+)\s+(?<Unit_Price>[\d,.]+)\s+(?<Total>[\d,.]+)
  • Semi-structured tables where column order is stable inside a line but headers are missing or unpredictable.
  • Scenarios where rapid onboarding is needed—one extractor produces the entire row.
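
To see how a row pattern of this kind behaves outside Grooper, here is a minimal Python sketch. The sample lines are invented for illustration, and Python writes named groups as (?P<name>...) where the example expression in the list above uses (?<name>...); the rest of the expression is unchanged. It is not Grooper code, only a way to try the pattern.

```python
import re

# The example row pattern, rewritten with Python's (?P<name>...) group syntax.
ROW_PATTERN = re.compile(
    r"(?P<Qty>\d+)\s+(?P<Unit_Price>[\d,.]+)\s+(?P<Total>[\d,.]+)"
)

# Hypothetical page text: two line items and one line that is not a row.
SAMPLE_TEXT = """\
3   19.99   59.97
12  1,250.00  15,000.00
Thank you for your business
"""

def match_rows(text):
    """Return one dict per matched row, keyed by named group."""
    rows = []
    for line in text.splitlines():
        m = ROW_PATTERN.search(line)
        if m:
            rows.append(m.groupdict())
    return rows

rows = match_rows(SAMPLE_TEXT)
for row in rows:
    print(row)
```

Each dictionary plays the role of one table row, and each key plays the role of a Data Column fed by the group of the same name.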

Benefits

  • Faster initial configuration: one Row Extractor can populate all columns.
  • Works with header-less or minimal tables.
  • Supports named group mapping and recursive fallback extraction (via "Options" flags).
  • Label Set–aware for optional header/footer bounding.
  • Can enforce spacing and alignment to reduce false positives.

Drawbacks

  • Less resilient to column reordering mid-document (no per-column header detection).
  • Requires robust row pattern design; errors in the row pattern can misalign multiple columns at once.
  • Limited geometric refinement compared to Tabular Layout for complex multi-line cells.
  • Multi-page logic must be explicitly constrained (e.g. "SinglePage" option) if spillover occurs.

How to add and configure the Row Match Table Extract Method

General setup steps

Following are instructions for general setup of the Row Match Table Extract Method.

FYI

Please see the demos below for example setups with screenshots and highlighted instructions.

  1. Create the Data Table
    1. In the tree, right-click the parent Data Model or Data Section and choose "Add" → "Data Table".
    2. Name the table (e.g. "Line Items"). This becomes the table's "Code Name" reference internally.
  2. Add Data Columns
    1. Select the Data Table and add each required Data Column (e.g. Quantity, Unit Price, Total).
    2. For each Data Column, configure its "Value Extractor" if it will not be supplied by the Row Extractor’s named groups.
    3. Optional: Set a "Footer Mode" for numeric columns (e.g. Calculate, Validate) if totals need validation later.
  3. Set the Extract Method to Row Match
    1. Select the Data Table.
    2. In the properties pane, set "Extract Method" to Row Match.
    3. Expand the Row Match object to display its properties.
  4. Configure the Row Extractor
    1. Choose a suitable Value Extractor (commonly a Pattern Match or a Data Type with a 2D collation provider).
    2. If using regex, define named groups that match your Data Column names (use underscores instead of spaces).
      • This is only necessary if not setting independent Value Extractors on individual Data Columns.
    3. Test the extractor independently (Extractor Test panel) to confirm that each row is matched and named groups return expected values.
  5. (Optional) Configure Header/Footer bounding
    1. If a distinct header or footer line marks the table region, set "Header Extractor" and/or "Footer Extractor".
  6. Test extraction
    1. Run a Batch through an Extract step or use the Tester tab of the Data Table or Data Model.
    2. Review the resulting Table Instance: Confirm row count, column population, spacing, and footer inclusion.
    3. Adjust regex groups, spacing thresholds, or alignment tolerance as needed.
  7. Troubleshoot common issues
    1. Missing rows: Loosen the pattern or relax the "Maximum Row Spacing" constraint; verify header spacing if a header is required.
    2. Extra rows: Tighten regex, raise alignment requirements, or introduce a footer extractor to stop scanning.
    3. Empty cells: Verify group names exactly match Data Column names (case and underscores) or enable "Recursive" option.
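
Empty cells often come down to a group name that does not quite match its column. The following Python sketch shows one way to check a pattern before pasting it into the Row Extractor. The helper name check_group_names is hypothetical, and the sketch assumes only what the steps above state: spaces become underscores and case must match.

```python
import re

def check_group_names(row_pattern, column_names):
    """Compare regex group names with Data Column names.

    Assumption taken from the steps above: spaces in a column name are
    written as underscores in the group name, and case must match.
    """
    groups = set(re.compile(row_pattern).groupindex)
    expected = {name.replace(" ", "_") for name in column_names}
    return {
        "unmapped_columns": sorted(expected - groups),
        "unused_groups": sorted(groups - expected),
    }

# "unit_price" differs from "Unit_Price" only by case, so it is reported.
report = check_group_names(
    r"(?P<Quantity>\d+)\s+(?P<unit_price>[\d,.]+)\s+(?P<Total>[\d,.]+)",
    ["Quantity", "Unit Price", "Total"],
)
print(report)
```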

Example: Value Extractors on Data Columns

One way to extract rows, and subsequently the values within those rows, is to set a simple pattern on the Data Table using Row Match to find entire rows within the target table. Once rows are formed via this pattern on the Data Table, extractors can be configured on each Data Column to target and find specific values within that returned row of information.

  1. Select the "Column Extractors" Data Table from the provided Project in the Node Tree.
  2. Notice the Extract Method property is set to "Row Match".
  3. Click the drop-down arrow to the left of the Extract Method property set to "Row Match" to expand its sub-properties.
    • Notice the Row Extractor property is set to "Pattern Match".
  4. Click the ellipsis button of the Row Extractor property set to "Pattern Match" to open the Row Extractor editor.
  5. Click the "Select Batch" button in the Batch Viewer.
  6. Make sure to select the provided "Row-Match" Batch.
  7. Notice a simple Value, Prefix, and Suffix pattern are set.
    • This simple pattern is enough to return each row instance of the table.
  8. Select a child Data Column of the "Column Extractors" Data Table.
  9. Notice the Value Extractor property is set to "Pattern Match".
  10. Click the ellipsis button of the Value Extractor property set to "Pattern Match" to open the Value Extractor editor.
  11. Notice the Value Pattern is configured to find values appropriate to this Data Column.
  12. You'll see all results appropriate to this Data Column are returned from each row of the table.
  13. Inspect the remaining child Data Columns to see similar configurations and results, with each configuration and result set appropriate to each respective Data Column.
  14. Select the parent "Column Extractors" Data Table, then click the Tester tab.
  15. Select the Batch Folder of the "Row-Match" Batch in the Batch Viewer.
  16. Click the "Test Extraction" button.
  17. The Data Model Preview will be populated with rows of information and appropriate data in each column.
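
The two-stage idea in this example (a row pattern first, then one extractor per column run inside each row) can be sketched in Python as follows. The patterns, column names, and sample text are invented for illustration and are not taken from the provided Project.

```python
import re

# Hypothetical row pattern: any line that starts with a date-like token.
ROW_PATTERN = re.compile(r"^\d{2}/\d{2}/\d{4}\b.*$", re.MULTILINE)

# Hypothetical per-column patterns, one per Data Column.
COLUMN_PATTERNS = {
    "Order_Date": re.compile(r"\d{2}/\d{2}/\d{4}"),
    "Order_Id": re.compile(r"\b[A-Z]{2}-\d{4}\b"),
    "Amount": re.compile(r"\d[\d,]*\.\d{2}\b"),
}

SAMPLE_TEXT = """\
Order Date  Order Id  Amount
01/15/2024  AB-1001   1,200.50
02/03/2024  CD-2040   89.00
"""

def extract_table(text):
    """Stage 1 finds whole rows; stage 2 searches each row per column."""
    table = []
    for row_match in ROW_PATTERN.finditer(text):
        row_text = row_match.group(0)
        row = {}
        for column, pattern in COLUMN_PATTERNS.items():
            hit = pattern.search(row_text)
            row[column] = hit.group(0) if hit else None
        table.append(row)
    return table

table = extract_table(SAMPLE_TEXT)
for row in table:
    print(row)
```

Because the column patterns only ever see the text of a matched row, the header line never reaches them.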

Example: Sub-elements in named regex groups

Another way to use Row Match is to set one regex pattern on the Data Table using Row Match. This one pattern will not only return entire rows of information, but because this pattern will use named capture groups (the names of which should match the target Data Columns exactly) it will return the results within those capture groups to each target Data Column.

  1. Select the "Named Sub-Elements" Data Table from the provided Project in the Node Tree.
  2. Notice the Extract Method property is set to "Row Match".
  3. Click the drop-down arrow to the left of the Extract Method property set to "Row Match" to expand its sub-properties.
  4. Notice the Row Extractor property is set to "Pattern Match".
  5. Click the ellipsis button for the Row Extractor property set to "Pattern Match" to open the Row Extractor editor.
  6. Click the "Select Batch" button in the Batch Viewer.
  7. Make sure to select the provided "Row-Match" Batch.
  8. Select the first Batch Folder in the Batch Viewer.
  9. Notice the expression in the Value Pattern is using named capture groups in its definition of a table row.
    • The names of the capture groups match exactly the names of the child Data Columns of this Data Table.
  10. The pattern successfully returns each row instance of the table.
  11. Click the "Data Inspector" button.
  12. Select the parent instance.
    • This represents an entire row of data from the table.
  13. Select a sub-element child instance.
    • Each sub-element child instance represents data respective to that column formed within the entire row instance defined by the parent.
  14. Select a child Data Column in the Node Tree.
  15. Notice the Value Extractor does not need to be set on child Data Columns when the Row Extractor set on the parent Data Table creates named sub-elements.
  16. Inspecting the remaining child Data Columns will show similar configuration.
  17. Select the parent "Named Sub-Elements" Data Table.
  18. Click the "Tester" tab.
  19. Select the first Batch Folder in the Batch Viewer.
  20. Click the "Test Extraction" button.
  21. The Data Model Preview will be populated with rows of information and appropriate data in each column.

Example: Sub-elements in ordered array

Yet another way to capture rows of information, and return the data for each column, is to create a Data Type with child extractors. Name each child extractor exactly as the Data Column it feeds, and have it return data relevant to that column. The parent Data Type can then be configured with the Ordered Array Collation Method and referenced by the Data Table using Row Match to return the desired tabular data.

  1. Select the "Named Sub-Elements" Data Type from the provided Project in the Node Tree.
    • Notice it is configured as a horizontal ordered array. This configuration will allow all results returned by it, and child elements, to be combined into each row instance.
  2. Select a child Value Reader.
    • Notice the child Value Readers are named exactly the same as the child Data Columns of the target Data Table. Notice also that the Extractor property is configured, in this case as "Pattern Match".
  3. Click the Tester tab.
  4. Click the "Select Batch" button in the Batch Viewer, then make sure to select the provided "Row-Match" Batch.
  5. Select the first Batch Folder in the Batch Viewer.
    • Notice a simple Value Pattern is set to collect dates.
  6. You'll see all results from the table rows appropriate to this object are returned.
  7. Inspecting the remaining child Value Readers will show similar configurations.
  8. Select the parent "Named Sub-Elements" Data Type.
    • All results from the child Value Readers are combined to form each row instance.
  9. Click the "Data Inspector" button.
  10. Select the parent data instance.
    • Notice the entire row instance is returned.
  11. Select one of the child sub-elements.
    • Data appropriate to each sub-element from within each row will be displayed.
  12. Select the "Named Sub-Elements part 2" Data Table.
    • Notice the Extract Method property is set to "Row Match" and it is pointed at the "Named Sub-Elements" Data Type as a reference.
  13. Click the Tester tab.
  14. Select the first Batch Folder in the Batch Viewer, then click the "Test Extraction" button.
  15. The Data Model Preview will be populated with rows of information and appropriate data in each column.
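
As a rough mental model of a horizontal ordered array, the Python sketch below forms a row only when every child pattern produces a hit on the same line, in left-to-right order. This is an illustration under simplifying assumptions (plain text lines instead of page coordinates, invented patterns and sample data), not a description of how Grooper implements the collation.

```python
import re

# Hypothetical child extractors, listed in left-to-right column order.
CHILDREN = [
    ("Order_Date", re.compile(r"\d{2}/\d{2}/\d{4}")),
    ("Order_Id", re.compile(r"[A-Z]{2}-\d{4}")),
    ("Units", re.compile(r"\b\d{1,3}\b")),
]

SAMPLE_TEXT = """\
01/15/2024  AB-1001  12
02/03/2024  CD-2040
03/09/2024  EF-3300  7
"""

def ordered_array_rows(text):
    """Build a row only if every child hits, each to the right of the last."""
    rows = []
    for line in text.splitlines():
        pos = 0
        row = {}
        for name, pattern in CHILDREN:
            hit = pattern.search(line, pos)  # search starts after the previous hit
            if hit is None:
                row = None  # one child missing: no row for this line
                break
            row[name] = hit.group(0)
            pos = hit.end()
        if row is not None:
            rows.append(row)
    return rows

rows = ordered_array_rows(SAMPLE_TEXT)
for row in rows:
    print(row)
```

The second sample line has no Units value, so it produces no row at all in this sketch.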

Example: Key-value lists in ordered array for missing cells

In some circumstances tables will have cells of information that are blank or "null". The previous methods of defining rows with extractors depended on all pieces of information being present in order for a row to be formed. However, this can be circumvented by using key-value lists and combining them in ordered arrays to form the rows and return appropriate data, even in situations where rows may have missing information.

  1. Select the "OrderAmount" Data Type within the "Key-Value Lists" folder of the Local Resources in the provided Project.
    • Note that this and the other Data Types in this folder are named exactly as the child Data Columns of the target Data Table. Notice this Data Type is configured as a vertical key-value list.
  2. Click the Tester tab.
  3. Click the "Select Batch" button in the Batch Viewer, then make sure to select the "Row-Match" Batch.
  4. Select the second Batch Folder in the Batch Viewer.
    • Notice this extractor is collecting all the values listed under the appropriate column header.
  5. Inspecting the other Data Types in this folder will show similar configurations.
  6. Select the "ordered array horizontal all" Data Type within the "Missing Cells" folder of the Local Resources folder in the provided Project.
    • Notice this Data Type is configured as a horizontal ordered array.
  7. The results of this Data Type are supplied by five referenced extractors. Click the ellipsis button to the right of the Referenced Extractors property to open the Referenced Extractors editor.
  8. All of the key-value list Data Types are being referenced. All of these extractors must return results from left-to-right in order for a row instance to be returned.
    • Notice the referenced extractors are ordered from top-to-bottom respective of the left-to-right order of the columns of the table in the document.
  9. Select the "ordered array horizontal OrderDate-OrderId-Units-OrderAmount" Data Type.
    • This Data Type is configured just like the previous Data Type, but it is only referencing four extractors.
  10. Click the ellipsis button to the right of the Referenced Extractors property to open the Referenced Extractors editor.
  11. Notice only four of the five key-value list column extractors are referenced.
    • This allows rows where only these four columns are present to return rows.
  12. The remaining two ordered array Data Types are further variants, each covering rows where only the columns it references are present.
  13. Select the "Missing Cells" Data Type, then click the ellipsis button to the right of the Referenced Extractors property to open the Referenced Extractors editor.
  14. Notice all four variations of the ordered arrays are referenced by this Data Type.
  15. With the "Deduplicate By" property set to "Area", results whose areas overlap are deduplicated.
    • This is necessary as the different variations of the horizontal ordered arrays will have some overlapping values.
  16. Click the Tester tab.
  17. All row instances, even those with missing cells, are fully returned by this one Data Type.
  18. Select the "Missing Cells" Data Table.
    • Notice the Row Match Extract Method is used, referencing the "Missing Cells" Data Type.
  19. Click the Tester tab.
  20. Select the second Batch Folder in the Batch Viewer, then click the "Test Extraction" button.
  21. The Data Model Preview will be populated with rows of information and appropriate data in each column, including those with missing cells.
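
The variant-and-deduplicate approach can be pictured with the Python sketch below. Several column subsets are tried against each line, and where candidates overlap, the widest (then most complete) one is kept. The selection rule, patterns, and sample data are assumptions made for illustration; Grooper's own deduplication works on result areas on the page and may choose differently.

```python
import re

# Hypothetical column extractors (stand-ins for the key-value list Data Types).
COLUMNS = {
    "OrderDate": re.compile(r"\d{2}/\d{2}/\d{4}"),
    "OrderId": re.compile(r"[A-Z]{2}-\d{4}"),
    "Units": re.compile(r"(?<![\d.,])\d{1,3}(?![\d.,])"),
    "OrderAmount": re.compile(r"\d[\d,]*\.\d{2}"),
}

# Each variant lists the columns that must appear, left to right.
VARIANTS = [
    ["OrderDate", "OrderId", "Units", "OrderAmount"],
    ["OrderDate", "OrderId", "OrderAmount"],
    ["OrderDate", "OrderId", "Units"],
]

SAMPLE_TEXT = """\
01/15/2024  AB-1001  12  1,200.50
02/03/2024  CD-2040      89.00
03/09/2024  EF-3300  7
"""

def match_variant(line, names):
    """Return (span_width, column_count, row) if all columns hit in order."""
    pos, start, row = 0, None, dict.fromkeys(COLUMNS)
    for name in names:
        hit = COLUMNS[name].search(line, pos)
        if hit is None:
            return None
        if start is None:
            start = hit.start()
        row[name] = hit.group(0)
        pos = hit.end()
    return pos - start, len(names), row

def rows_with_missing_cells(text):
    rows = []
    for line in text.splitlines():
        candidates = [c for c in (match_variant(line, v) for v in VARIANTS) if c]
        if candidates:
            # Overlapping candidates describe the same row: keep the widest,
            # breaking ties in favour of the variant with more columns.
            rows.append(max(candidates, key=lambda c: c[:2])[2])
    return rows

rows = rows_with_missing_cells(SAMPLE_TEXT)
for row in rows:
    print(row)
```

Missing cells come back as None rather than preventing the row from forming.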

Using Label Sets with Row Match

Label Sets allow Row Match to locate a table's start and end using labeled headers and footers from the document, instead of relying on local extractors. This improves boundary accuracy on semi-structured layouts.

How it works

  • Row Match reads the Label Sets defined on the document's Content Type (via Labeling Behavior).
  • If "Use Labelset" is enabled, Row Match:
    • Detects the table header using the Data Table's header label and (optionally) its column labels.
    • Detects the table footer using the Data Table's "Footer" label.
    • Extracts rows only between the detected header and footer, respecting spacing rules.

Prerequisites

  • The Content Type includes Labeling Behavior (enables the Labels tab and fuzzy label matching).
  • A Label Set exists for the Content Type with:
    • A header label mapped to the target Data Table (the table’s title/start text).
    • Optional column header labels mapped to each Data Column (helps header context and diagnostics).
    • An optional footer label mapped to the Data Table using the special label name Footer.

Configuration (step-by-step)

  1. Open the target Data Table and set "Extract Method" to Row Match.
  2. In Row Match properties, set:
    • "Options" → enable Use Labelset.
    • Leave "Header Extractor" and "Footer Extractor" blank (labels will drive boundaries).
    • Optionally set "Maximum Header Spacing" (percent of font size) to limit the gap from the header label to the first row.
    • Optionally set "Maximum Row Spacing" to limit gaps between rows.
  3. Open the Content Type's Labels tab (provided by the Labeling Behavior), then:
    • Capture a header label for the Data Table by selecting the table's title/start text and mapping it to the Data Table.
    • Optionally capture column header labels for each Data Column.
    • Optionally add and capture a footer label for the Data Table using the label name Footer (e.g., “Total”, “Grand Total”, etc.).

Execution
During extract, Row Match:

  • Loads the Label Set when "Use Labelset" is enabled.
  • Reads table headers for the Data Table (and includes any column labels found).
  • Reads footer labels named Footer for the same Data Table.
  • Runs the "Row Extractor" to find candidate rows, then keeps only rows between the header and footer, applying "Maximum Header Spacing" and "Maximum Row Spacing".
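
The bounding step can be sketched in Python as a filter over candidate rows. This simplified illustration assumes exact substring matching for the labels (Grooper's label matching is fuzzy), treats the header as required, and ignores the spacing properties; the function name, pattern, and sample lines are hypothetical.

```python
import re

# Hypothetical row pattern with two named groups.
ROW_PATTERN = re.compile(r"(?P<Qty>\d+)\s+(?P<Total>[\d,.]+)")

def rows_between_labels(lines, header_label, footer_label="Total"):
    """Keep row matches found after the header line and before the footer line."""
    header_idx = next(
        (i for i, text in enumerate(lines) if header_label in text), None
    )
    if header_idx is None:
        return []  # no header found: this sketch returns no rows
    footer_idx = next(
        (i for i, text in enumerate(lines)
         if i > header_idx and footer_label in text),
        len(lines),  # no footer: scan to the end
    )
    rows = []
    for text in lines[header_idx + 1:footer_idx]:
        m = ROW_PATTERN.search(text)
        if m:
            rows.append(m.groupdict())
    return rows

lines = [
    "Invoice 1001  2024",   # matches the row pattern but sits above the header
    "Line Items",
    "2   40.00",
    "1   15.50",
    "Grand Total  55.50",
    "5   99.99",            # below the footer
]
rows = rows_between_labels(lines, "Line Items")
print(rows)
```

The first and last lines both match the row pattern, yet neither is returned, which is the effect the header and footer labels are meant to have.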

Results and behavior notes

  • If no header label is detected and "Maximum Header Spacing" > 0%, the table is skipped (no rows returned).
  • A detected footer label cleanly ends the table at the correct boundary.
  • Column labels are optional for Row Match; column values are still populated by the "Row Extractor" (named groups or child extractors).

Tips and troubleshooting

  • Ensure the header label is mapped to the Data Table (not to a Data Column).
  • Use Labeling Behavior properties to tune matches:
    • "Header Similarity" for header recognition strictness.
    • "Label Similarity" and "Weightings" for general label fuzziness.
    • "Constrained Wrap" or "Vertical Wrap" to recognize multi-line or stacked headers.
  • If some documents lack a clear header, set "Maximum Header Spacing" to 0% to allow headerless starts (or add a suitable header label).
  • Use "Options" → SinglePage if the table should not span pages.

When to prefer Label Sets
Use Label Sets with Row Match when the table area is reliably signaled by a header/footer on the page, but the rows themselves are best read by a single-row pattern (named groups/child extractors). This hybrid approach yields precise table boundaries with minimal per-document tuning.

Properties overview

Below is a comprehensive list of Row Match properties and nested configuration elements. Each entry includes the property name (UI label), definition, remarks-derived guidance, and primary use case.

  • Row Extractor
    • Definition: The extractor used to match table rows and populate column values from its child results.
    • Remarks: Must return 2D results (row instances) with named children for each Data Column or rely on recursive fallback. Supports regex named groups (no spaces—use underscores) or Data Type child extractors with 2D collation (Array, Ordered Array, Pattern-Based).
    • Purpose: Central logic for converting raw page content into discrete table rows.
  • Header Extractor
    • Definition: Optional extractor marking the start (top) of the table region.
    • Remarks: Rows before header are discarded. Enables enforcing "Maximum Header Spacing". Prefer Label Set footer/header labels when using "UseLabelset".
    • Purpose: Prevents accidental inclusion of content above the table and anchors vertical filtering.
  • Footer Extractor
    • Definition: Optional extractor marking the bottom of the table.
    • Remarks: Rows after footer are discarded. Useful for summary lines or terminating sections.
    • Purpose: Stops extraction at intended boundary and avoids trailing noise.
  • Maximum Header Spacing
    • Definition: Max vertical distance (percent of font height) allowed between the header and first row (0% disables check).
    • Remarks: If > 0 and no headers are detected, extraction halts. Helps eliminate distant unrelated blocks.
    • Purpose: Tightens vertical proximity of the first data row to the header.
  • Maximum Row Spacing
    • Definition: Max vertical gap (percent of font height) allowed between consecutive rows (0% disables).
    • Remarks: Table terminates when gap exceeds this value. Prevents inclusion of separated content lines.
    • Purpose: Maintains continuity of the table region.
  • Row Alignment
    • Definition: Optional Row Alignment Settings object enforcing left, center, right, or any alignment between consecutive rows.
    • Remarks: Filters out misaligned or indented content; applied after header for non-header rows. Supports tolerance-based deviation.
    • Purpose: Improves precision by requiring geometric alignment.
      • Alignment
        • Definition: Required alignment type(s) for row-to-row comparison (Any, Left, Center, Right; flags combinable).
        • Remarks: If multiple flags are set, meeting any one passes. "Any" permits vertical stacking with horizontal gap limited by tolerance.
        • Purpose: Restricts accepted rows to consistent horizontal positioning patterns.
          • Any: Rows must be vertically stacked; horizontal overlap within tolerance is sufficient.
          • Left: Row left edges must align within tolerance.
          • Center: Row center points must align within tolerance.
          • Right: Row right edges must align within tolerance.
      • Tolerance
        • Definition: Horizontal deviation allowed (e.g. 4pt, 0.1in).
        • Remarks: Smaller values enforce strict alignment; larger values absorb scanning or formatting shifts.
        • Purpose: Fine-tunes sensitivity of alignment filtering.
  • Options
    • Definition: A set of RowMatchOptions flags controlling behavior.
    • Remarks: Combine flags to tailor extraction. See enumeration values below.
    • Purpose: Enables page-span control, recursive fill, label-set use, and data cleaning.
      • None: Standard behavior; no special features.
      • SinglePage: Prevents table from spanning multiple pages; rows must remain on header’s page (or first row’s page if no header).
      • Recursive: For columns missing a child match, runs each Data Column’s value extractor against full row text.
      • UseLabelset: Uses Label Set–based header/footer detection instead of local Header/Footer Extractors.
      • CleanControlCharacters: Replaces tab, form feed, newline sequences with spaces for normalized cell values.
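
To make the interplay of "Maximum Row Spacing" and "Row Alignment" more concrete, here is a small Python sketch. The RowHit fields, the way the gap is measured, and the order of the two checks are assumptions made for illustration; only the general behavior (stop on an oversized gap, drop misaligned lines, 0 disables the spacing check) comes from the property descriptions above.

```python
from dataclasses import dataclass

@dataclass
class RowHit:
    text: str
    top: float     # distance from page top, in points
    height: float  # row (font) height, in points
    left: float    # left edge, in points

def filter_rows(hits, max_row_spacing_pct=0.0, left_tolerance=None):
    """Keep consecutive rows until a gap or alignment check fails.

    max_row_spacing_pct: allowed gap as a percent of the previous row's
    height; 0 disables the check.
    left_tolerance: allowed left-edge drift in points; None disables it.
    """
    kept = []
    for hit in sorted(hits, key=lambda h: h.top):
        if kept:
            prev = kept[-1]
            gap = hit.top - (prev.top + prev.height)
            limit = prev.height * max_row_spacing_pct / 100
            if max_row_spacing_pct > 0 and gap > limit:
                break  # gap too large: the table ends here
            if left_tolerance is not None and abs(hit.left - prev.left) > left_tolerance:
                continue  # misaligned line: skip it, keep scanning
        kept.append(hit)
    return kept

hits = [
    RowHit("row 1", top=100, height=10, left=72),
    RowHit("row 2", top=114, height=10, left=73),
    RowHit("indented note", top=128, height=10, left=120),
    RowHit("row 3", top=142, height=10, left=72),
    RowHit("unrelated block", top=300, height=10, left=72),
]
kept = filter_rows(hits, max_row_spacing_pct=400, left_tolerance=4)
print([h.text for h in kept])
```

Tightening left_tolerance below 1 in this example would also drop "row 2", which is why alignment is best tuned last.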

Internal behavior notes (for end-user understanding)

  • Header + spacing interplay: If "Maximum Header Spacing" > 0 but no header (via Header Extractor or Label Set) is found, no rows are returned.
  • Footer handling: Only the first footer match below the header bounds extraction; additional matches are ignored.
  • Recursive population: Only engages for columns whose values were not directly supplied by the Row Extractor result.
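
The recursive fallback can be illustrated with the Python sketch below: the row pattern supplies some columns through named groups, and a column-level pattern is run against the full row text only for columns still empty. The patterns, column names, and the recursive parameter are hypothetical stand-ins for the "Recursive" option, not Grooper's implementation.

```python
import re

# Hypothetical row pattern: it names Qty and Description but not Total.
ROW_PATTERN = re.compile(
    r"(?P<Qty>\d+)\s+(?P<Description>[A-Za-z][A-Za-z ,]*[A-Za-z]).*"
)

# Hypothetical column-level extractor for the column the row pattern omits.
COLUMN_EXTRACTORS = {"Total": re.compile(r"\d[\d,]*\.\d{2}$")}
COLUMNS = ["Qty", "Description", "Total"]

def extract_row(line, recursive=True):
    m = ROW_PATTERN.search(line)
    if m is None:
        return None
    groups = m.groupdict()
    row = {name: groups.get(name) for name in COLUMNS}
    if recursive:
        for name in COLUMNS:
            # Only columns the row pattern left empty are retried.
            if row[name] is None and name in COLUMN_EXTRACTORS:
                hit = COLUMN_EXTRACTORS[name].search(m.group(0))
                row[name] = hit.group(0) if hit else None
    return row

filled = extract_row("2  Widget, blue  40.00")
bare = extract_row("2  Widget, blue  40.00", recursive=False)
print(filled)
print(bare)
```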

Configuration tips

  • Start with a simple, precise regex for the "Row Extractor"; expand cautiously to avoid overmatching.
  • Use named groups identical (case/underscore) to Data Column "Code Name" values for automatic cell mapping.
  • Introduce "Maximum Row Spacing" only after confirming consistent row vertical distances across samples.
  • Apply "Row Alignment" last—overly strict tolerance can exclude valid rows.
  • When migrating from Tabular Layout for headerless tables, replicate per-column extractors inside the row pattern or enable "Recursive".

Testing and troubleshooting

Testing workflow

  1. Run extraction on a representative Batch containing multiple page examples.
  2. Inspect row count and cell values in Review: confirm all required columns populate.
  3. Use the Data Inspector to view header/footer detection and row counts.

Troubleshooting matrix

  • Rows missing – Relax regex anchors; verify header presence if "Maximum Header Spacing" used.
  • Extra rows – Tighten pattern; reduce allowed spacing; consider adding a footer extractor.
  • Misaligned data – Confirm group names; ensure no unintended whitespace grouping; adjust "Tolerance".
  • Empty cells – Check spelling of group names; enable "Recursive"; verify column extractor logic.

When to choose another method

  • Use Tabular Layout if dependable headers exist and multi-line column stacking matters.
  • Use Grid Layout if both row and column headers form a matrix.
  • Use Fixed Width for monospaced character-aligned reports.
  • Use Delimited Extract for external CSV or TSV sources (not page text).
  • Use AI Table Reader for highly variable, narrative-style or semi-structured documents where patterns are not stable.
  • Use Fluid Layout when some document types have full headers and others only a section header.