Row Match (Table Extract Method)

From Grooper Wiki

Revision as of 08:38, 19 November 2025

This article is about the current version of Grooper.

Note that some content may still need to be updated.


Row Match is a Table Extract Method that uses regular expression pattern matching to determine a table's structure based on the pattern of each row and to extract cell data from each column.

You may download the ZIP file(s) below and upload them into your own Grooper environment (version 2025). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

Introduction

Row Match is a Table Extract Method used to build a table by matching whole rows rather than detecting individual column cells visually or by positional header mapping. Each hit returned by a configured row-level extractor (the "Row Extractor") is treated as a table row. Column values are then populated from named child results (such as regex named groups) or, optionally, by running individual column value extractors recursively on the row text.

Compared to other methods:

  • Tabular Layout focuses on locating column headers and aligning values beneath them; ideal for structured, header-driven grids.
  • Grid Layout infers a matrix from intersecting row and column headers (X and Y axes).
  • Fluid Layout dynamically chooses between Tabular Layout and Row Match based on label presence.
  • Delimited Extract (Delimited/CSV) parses external delimited files instead of page content.
  • Fixed Width slices rows by character spans in monospaced text layouts.
  • AI Table Reader uses a Large Language Model to interpret complex or semi-structured regions.

Row Match shines when headers are absent, the layout is irregular, or pattern-based recognition is faster to configure than positional alignment.

Purpose of Row Match

Ideal use cases

  • Lists or line-item style data without reliable column headers (e.g. charge lines, remark code lists).
  • Pattern-consistent rows that can be described with a single regular expression or Data Type collation, e.g.
    (?<Qty>\d+)\s+(?<Unit_Price>[\d,.]+)\s+(?<Total>[\d,.]+)
  • Semi-structured tables where column order is stable inside a line but headers are missing or unpredictable.
  • Scenarios where rapid onboarding is needed, since one extractor produces the entire row.
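
As a rough illustration of the row-matching idea outside Grooper, the sketch below applies the example pattern above to a few lines of text in Python. The sample text and the `extract_rows` helper are assumptions made for illustration. Python's `re` module writes named groups as `(?P<Name>...)`, so the pattern is adjusted from the `(?<Name>...)` form shown above.

```python
import re

# The example row pattern, rewritten with Python's (?P<name>...) group syntax.
ROW_PATTERN = re.compile(
    r"(?P<Qty>\d+)\s+(?P<Unit_Price>[\d,.]+)\s+(?P<Total>[\d,.]+)"
)

# Hypothetical page text: a headerless list of line items.
SAMPLE_TEXT = """\
Widget A   2   10.00   20.00
Widget B   1   1,250.50   1,250.50
Thank you for your business
"""

def extract_rows(text):
    """Treat each pattern hit as one table row; named groups become cells."""
    return [m.groupdict() for m in ROW_PATTERN.finditer(text)]

rows = extract_rows(SAMPLE_TEXT)
for row in rows:
    print(row)
```

Each dictionary stands in for one table row, keyed by the group (column) name. The last line of the sample text does not fit the pattern, so it produces no row.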

Benefits

  • Faster initial configuration: one Row Extractor can populate all columns.
  • Works with headerless or minimal tables.
  • Supports named group mapping and recursive fallback extraction (via "Options" flags).
  • Label Set–aware for optional header/footer bounding.
  • Can enforce spacing and alignment to reduce false positives.

Drawbacks

  • Less resilient to column reordering mid-document (no per-column header detection).
  • Requires robust row pattern design; errors in the row pattern can misalign multiple columns at once.
  • Limited geometric refinement compared to Tabular Layout for complex multi-line cells.
  • Multi-page logic must be explicitly constrained (e.g. "SinglePage" option) if spillover occurs.

How to add and configure the Row Match Table Extract Method

General setup steps

Following are instructions for general setup of the Row Match Table Extract Method.
Walkthroughs with examples will follow this general setup.

  1. Create the Data Table
    1. In the tree, right-click the parent Data Model or Data Section and choose "Add" → "Data Table".
    2. Name the table (e.g. "Line Items"). This becomes the table's "Code Name" reference internally.
  2. Add Data Columns
    1. Select the Data Table and add each required Data Column (e.g. Quantity, Unit Price, Total).
    2. For each Data Column, configure its "Value Extractor" if it will not be supplied by the Row Extractor’s named groups.
    3. Optional: Set a "Footer Mode" for numeric columns (e.g. Calculate, Validate) if totals need validation later.
  3. Set the Extract Method to Row Match
    1. Select the Data Table.
    2. In the properties pane, set "Extract Method" to Row Match.
    3. Expand the Row Match object to display its properties.
  4. Configure the Row Extractor
    1. Choose a suitable Value Extractor (commonly a Pattern Match or a Data Type with a 2D collation provider).
    2. If using regex, define named groups that match your Data Column names (use underscores instead of spaces).
      • This is only necessary if you are not setting independent Value Extractors on individual Data Columns.
    3. Test the extractor independently (Extractor Test panel) to confirm that each row is matched and named groups return expected values.
  5. (Optional) Configure Header/Footer bounding
    1. If a distinct header or footer line marks the table region, set "Header Extractor" and/or "Footer Extractor".
  6. Test extraction
    1. Run a Batch through an Extract step or use the design-time Test feature.
    2. Review the resulting Table Instance: Confirm row count, column population, spacing, and footer inclusion.
    3. Adjust regex groups, spacing thresholds, or alignment tolerance as needed.
  7. Troubleshoot common issues
    1. Missing rows: Loosen the pattern or relax the "Maximum Row Spacing" constraint (raise the value, or set it to 0% to disable the check); verify header spacing if a header is required.
    2. Extra rows: Tighten regex, raise alignment requirements, or introduce a footer extractor to stop scanning.
    3. Empty cells: Verify group names exactly match Data Column names (case and underscores) or enable "Recursive" option.
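
The "Empty cells" check in step 7 can be approximated outside Grooper. The sketch below compares the named groups in a row pattern with a list of Data Column names, replacing spaces with underscores as described in step 4. The `missing_columns` helper and the column names are hypothetical, and the pattern uses Python's `(?P<Name>...)` group syntax.

```python
import re

def missing_columns(row_pattern, column_names):
    """Return Data Column names with no identically named group in the pattern."""
    group_names = set(re.compile(row_pattern).groupindex)
    # Group names cannot contain spaces, so compare against underscored names.
    # The comparison is deliberately case-sensitive.
    return [name for name in column_names
            if name.replace(" ", "_") not in group_names]

# Hypothetical pattern with a deliberate mistake: "total" instead of "Total".
pattern = r"(?P<Qty>\d+)\s+(?P<Unit_Price>[\d,.]+)\s+(?P<total>[\d,.]+)"
columns = ["Qty", "Unit Price", "Total"]

print(missing_columns(pattern, columns))
```

Here the "Total" column is reported because its group was named "total", which is the kind of case mismatch that leaves a cell empty.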

Example: Value Extractors on Data Columns

  1. Select the "Column Extractors" Data Table from the provided Project in the Node Tree.
  2. Notice the Extract Method property is set to "Row Match".
  3. Click the drop-down arrow to the left of the Extract Method property set to "Row Match" to expand its sub-properties. Notice the Row Extractor property is set to "Pattern Match".
  4. Click the ellipsis button of the Row Extractor property set to "Pattern Match" to open the Row Extractor editor.
  5. Click the "Select Batch" button in the Batch Viewer.
  6. Make sure to select the provided "Row-Match" Batch.
  7. Notice that simple Value, Prefix, and Suffix patterns are set.
    • This simple pattern is enough to return each row instance of the table.
  8. Select a child Data Column of the "Column Extractors" Data Table.
  9. Notice the Value Extractor property is set to "Pattern Match".
  10. Click the ellipsis button of the Value Extractor property set to "Pattern Match" to open the Value Extractor editor.
  11. Notice the Value Pattern is configured to find values appropriate to this Data Column.
  12. You'll see all results appropriate to this Data Column are returned from each row of the table.
  13. Inspect the remaining child Data Columns to see similar configurations and results, with each configuration and result set appropriate to each respective Data Column.
  14. Select the parent "Column Extractors" Data Table, then click the Tester tab.
  15. Select the Batch Folder of the "Row-Match" Batch in the Batch Viewer.
  16. Click the "Test Extraction" button.
  17. The Data Model Preview will be populated with rows of information and appropriate data in each column.
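
The configuration in this example can be pictured as a two-stage match: a simple row pattern isolates each row, then each column's own pattern runs against that row's text. The Python sketch below mimics that flow. The patterns, sample text, and column names are invented for illustration and are not the ones in the provided Project.

```python
import re

# Stage 1: a simple row pattern. Each hit is one table row (hypothetical pattern).
ROW_PATTERN = re.compile(r"^\d{4}\s.+$", re.MULTILINE)

# Stage 2: one pattern per column, applied to the text of a single row.
COLUMN_PATTERNS = {
    "Code": re.compile(r"^\d{4}"),
    "Description": re.compile(r"[A-Za-z][A-Za-z ]*[A-Za-z]"),
    "Amount": re.compile(r"\d+\.\d{2}$"),
}

SAMPLE_TEXT = """\
Charges
1001 Office visit 85.00
1002 Lab panel 42.50
Total 127.50
"""

def extract_table(text):
    """Match rows first, then fill each cell from that row's text."""
    table = []
    for row in ROW_PATTERN.finditer(text):
        cells = {}
        for column, pattern in COLUMN_PATTERNS.items():
            hit = pattern.search(row.group(0))
            cells[column] = hit.group(0) if hit else None
        table.append(cells)
    return table

table = extract_table(SAMPLE_TEXT)
for line in table:
    print(line)
```

The row pattern knows nothing about columns; it only decides which lines are rows. The "Charges" and "Total" lines are never offered to the column patterns because they do not match the row pattern.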

Example: Sub-Elements created within the Row Match extractor

  1. Select the "Named Sub-Elements" Data Table from the provided Project in the Node Tree.
  2. Notice the Extract Method property is set to "Row Match".
  3. Click the drop-down arrow to the left of the Extract Method property set to "Row Match" to expand its sub-properties.
  4. Notice the Row Extractor property is set to "Pattern Match".
  5. Click the ellipsis button for the Row Extractor property set to "Pattern Match" to open the Row Extractor editor.
  6. Click the "Select Batch" button in the Batch Viewer.
  7. Make sure to select the provided "Row-Match" Batch.
  8. Select the first Batch Folder in the Batch Viewer.
  9. Notice the expression in the Value Pattern is using named capture groups in its definition of a table row.
    • The names of the capture groups match exactly the names of the child Data Columns of this Data Table.
  10. The pattern successfully returns each row instance of the table.
  11. Click the "Data Inspector" button.
  12. Select the parent instance.
    • This represents an entire row of data from the table.
  13. Select a sub-element child instance.
    • Each sub-element child instance represents data respective to that column formed within the entire row instance defined by the parent.
  14. Select a child Data Column in the Node Tree.
  15. Notice the Value Extractor does not need to be set on child Data Columns when the Row Extractor set on the parent Data Table creates named sub-elements.
  16. Inspecting the remaining child Data Columns will show similar configuration.
  17. Select the parent "Named Sub-Elements" Data Table.
  18. Click the "Tester" tab.
  19. Select the first Batch Folder in the Batch Viewer.
  20. Click the "Test Extraction" button.
  21. The Data Model Preview will be populated with rows of information and appropriate data in each column.
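
What the Data Inspector shows in steps 12 and 13 (a parent instance for the whole row, with one child per named group) has a loose analogue in an ordinary regex match object. In the Python sketch below, the full match stands in for the parent instance and each named group for a sub-element. The sample row and the `describe_row` helper are assumptions, and Python's `(?P<Name>...)` group syntax is used.

```python
import re

ROW_PATTERN = re.compile(
    r"(?P<Qty>\d+)\s+(?P<Unit_Price>[\d,.]+)\s+(?P<Total>[\d,.]+)"
)

def describe_row(match):
    """Return the parent row text plus each named child with its character span."""
    children = {
        name: {"text": match.group(name), "span": match.span(name)}
        for name in match.re.groupindex
    }
    return {"row": match.group(0), "children": children}

# Hypothetical single row of text.
match = ROW_PATTERN.search("3  4.00  12.00")
info = describe_row(match)
print(info["row"])
for name, child in info["children"].items():
    print(name, child)
```

Because every child carries its own name, no per-column extractor is needed to decide which value belongs to which column.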

Properties overview

Below is a comprehensive list of Row Match properties and nested configuration elements. Each entry gives the property name (UI label), its definition, usage remarks, and its primary purpose.

  • Row Extractor
    • Definition: The extractor used to match table rows and populate column values from its child results.
    • Remarks: Must return 2D results (row instances) with named children for each Data Column or rely on recursive fallback. Supports regex named groups (no spaces—use underscores) or Data Type child extractors with 2D collation (Array, Ordered Array, Pattern-Based).
    • Purpose: Central logic for converting raw page content into discrete table rows.
  • Header Extractor
    • Definition: Optional extractor marking the start (top) of the table region.
    • Remarks: Rows before header are discarded. Enables enforcing "Maximum Header Spacing". Prefer Label Set footer/header labels when using "UseLabelset".
    • Purpose: Prevents accidental inclusion of content above the table and anchors vertical filtering.
  • Footer Extractor
    • Definition: Optional extractor marking the bottom of the table.
    • Remarks: Rows after footer are discarded. Useful for summary lines or terminating sections.
    • Purpose: Stops extraction at intended boundary and avoids trailing noise.
  • Maximum Header Spacing
    • Definition: Max vertical distance (percent of font height) allowed between the header and first row (0% disables check).
    • Remarks: If > 0 and no headers are detected, extraction halts. Helps eliminate distant unrelated blocks.
    • Purpose: Tightens vertical proximity of the first data row to the header.
  • Maximum Row Spacing
    • Definition: Max vertical gap (percent of font height) allowed between consecutive rows (0% disables).
    • Remarks: Table terminates when gap exceeds this value. Prevents inclusion of separated content lines.
    • Purpose: Maintains continuity of the table region.
  • Row Alignment
    • Definition: Optional Row Alignment Settings object enforcing left, center, right, or any alignment between consecutive rows.
    • Remarks: Filters out misaligned or indented content; applied after header for non-header rows. Supports tolerance-based deviation.
    • Purpose: Improves precision by requiring geometric alignment.
      • Alignment
        • Definition: Required alignment type(s) for row-to-row comparison (Any, Left, Center, Right; flags combinable).
        • Remarks: If multiple flags are set, meeting any one passes. "Any" permits vertical stacking with horizontal gap limited by tolerance.
        • Purpose: Restricts accepted rows to consistent horizontal positioning patterns.
          • Any: Rows must be vertically stacked; horizontal overlap within tolerance is sufficient.
          • Left: Row left edges must align within tolerance.
          • Center: Row center points must align within tolerance.
          • Right: Row right edges must align within tolerance.
      • Tolerance
        • Definition: Horizontal deviation allowed (e.g. 4pt, 0.1in).
        • Remarks: Smaller values enforce strict alignment; larger values absorb scanning or formatting shifts.
        • Purpose: Fine-tunes sensitivity of alignment filtering.
  • Options
    • Definition: A set of RowMatchOptions flags controlling behavior.
    • Remarks: Combine flags to tailor extraction. See enumeration values below.
    • Purpose: Enables page-span control, recursive fill, label-set use, and data cleaning.
      • None: Standard behavior; no special features.
      • SinglePage: Prevents table from spanning multiple pages; rows must remain on header’s page (or first row’s page if no header).
      • Recursive: For columns missing a child match, runs each Data Column’s value extractor against full row text.
      • UseLabelset: Uses Label Set–based header/footer detection instead of local Header/Footer Extractors.
      • CleanControlCharacters: Replaces tab, form feed, newline sequences with spaces for normalized cell values.
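
Since "Options" is a set of combinable flags, its behavior can be modeled with a flag enumeration. The Python sketch below is a hypothetical stand-in, not Grooper's own type: the member names mirror the values listed above, and the `clean_cell` helper assumes that a run of tab, form feed, or newline characters collapses to a single space.

```python
import re
from enum import Flag

class RowMatchOptions(Flag):
    """Hypothetical Python model of the option flags listed above."""
    NONE = 0
    SINGLE_PAGE = 1
    RECURSIVE = 2
    USE_LABELSET = 4
    CLEAN_CONTROL_CHARACTERS = 8

def clean_cell(value, options):
    """Replace runs of tab, form feed, and newline characters with one space."""
    if RowMatchOptions.CLEAN_CONTROL_CHARACTERS in options:
        return re.sub(r"[\t\f\r\n]+", " ", value)
    return value

# Flags combine with the bitwise OR operator.
options = RowMatchOptions.RECURSIVE | RowMatchOptions.CLEAN_CONTROL_CHARACTERS

print(RowMatchOptions.RECURSIVE in options)
print(clean_cell("Line one\r\nline two\tend", options))
```

With `RowMatchOptions.NONE`, `clean_cell` returns the value unchanged, which corresponds to the standard behavior described for "None".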

Internal behavior notes (for end-user understanding)

  • Header + spacing interplay: If "Maximum Header Spacing" > 0 but no header (via Header Extractor or Label Set) is found, no rows are returned.
  • Footer handling: Only the first footer match below the header bounds extraction; additional matches are ignored.
  • Recursive population: Only engages for columns whose values were not directly supplied by the Row Extractor result.
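
The recursive population note can be sketched as a fill-in pass that touches only empty cells. In the Python example below, the `fill_missing` helper, the row text, and the column patterns are all hypothetical; the point is only that a column already supplied by the row match is never re-extracted.

```python
import re

def fill_missing(row_text, cells, column_patterns):
    """Run a column's own pattern only when the row match left that cell empty."""
    filled = dict(cells)
    for column, pattern in column_patterns.items():
        if filled.get(column) is None:
            hit = re.search(pattern, row_text)
            if hit:
                filled[column] = hit.group(0)
    return filled

row_text = "2 10.00 20.00 USD"
# Cells as supplied by a hypothetical row match; "Currency" had no named group.
cells = {"Qty": "2", "Unit_Price": "10.00", "Total": "20.00", "Currency": None}
column_patterns = {
    "Currency": r"\b[A-Z]{3}\b",
    "Qty": r"\d+\.\d+",  # would return "10.00" if it ran, but Qty is already set
}

result = fill_missing(row_text, cells, column_patterns)
print(result)
```

"Currency" is filled from the row text, while "Qty" keeps the value from the row match even though its column pattern would have found something different.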

Configuration tips

  • Start with a simple, precise regex for the "Row Extractor"; expand cautiously to avoid overmatching.
  • Use named groups identical (case/underscore) to Data Column "Code Name" values for automatic cell mapping.
  • Introduce "Maximum Row Spacing" only after confirming consistent row vertical distances across samples.
  • Apply "Row Alignment" last; an overly strict tolerance can exclude valid rows.
  • When migrating from Tabular Layout for headerless tables, replicate per-column extractors inside the row pattern or enable "Recursive".
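
The interaction between "Maximum Row Spacing" and the "Row Alignment" tolerance can be explored with a small model before changing real settings. The Python sketch below is an approximation under stated assumptions, not Grooper's implementation: the `Row` class, the use of points as the coordinate unit, left-only alignment, and the choice to compare each row with the last accepted row are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Row:
    text: str
    top: float          # all coordinates in points (assumed unit)
    bottom: float
    left: float
    font_height: float

def filter_rows(rows, max_row_spacing_pct=0.0, left_tolerance=None):
    """Keep rows until the vertical gap exceeds the limit; skip misaligned rows.

    max_row_spacing_pct: allowed gap as a percent of font height (0 disables).
    left_tolerance: allowed left-edge deviation in points (None disables).
    """
    kept = []
    for row in rows:
        if kept:
            previous = kept[-1]
            gap = row.top - previous.bottom
            limit = previous.font_height * max_row_spacing_pct / 100.0
            if max_row_spacing_pct > 0 and gap > limit:
                break  # table ends at the first oversized gap
            if left_tolerance is not None and abs(row.left - previous.left) > left_tolerance:
                continue  # misaligned (e.g. indented) row is filtered out
        kept.append(row)
    return kept

rows = [
    Row("1 Widget 5.00", top=100, bottom=110, left=72, font_height=10),
    Row("2 Gadget 7.50", top=114, bottom=124, left=72, font_height=10),
    Row("   see note", top=128, bottom=138, left=90, font_height=10),
    Row("3 Gizmo 2.25", top=142, bottom=152, left=72, font_height=10),
    Row("9 Unrelated 1.00", top=400, bottom=410, left=72, font_height=10),
]

kept = filter_rows(rows, max_row_spacing_pct=200, left_tolerance=4)
print([row.text for row in kept])
```

In this model the indented note is skipped and the distant line ends the table. Shrinking `left_tolerance` too far would also drop rows that are only slightly shifted, which is the risk noted in the tips above.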

Testing and troubleshooting

Testing workflow

  1. Run extraction on a representative Batch containing multiple page examples.
  2. Inspect row count and cell values in Review: confirm all required columns populate.
  3. Use the Data Inspector to view header/footer detection and row counts.

Troubleshooting matrix

  • Rows missing – Relax regex anchors; verify header presence if "Maximum Header Spacing" used.
  • Extra rows – Tighten pattern; reduce allowed spacing; consider adding a footer extractor.
  • Misaligned data – Confirm group names; ensure no unintended whitespace grouping; adjust "Tolerance".
  • Empty cells – Check spelling of group names; enable "Recursive"; verify column extractor logic.

When to choose another method

  • Use Tabular Layout if dependable headers exist and multi-line column stacking matters.
  • Use Grid Layout if both row and column headers form a matrix.
  • Use Fixed Width for monospaced character-aligned reports.
  • Use Delimited Extract for external CSV or TSV sources (not page text).
  • Use AI Table Reader for highly variable, narrative-style or semi-structured documents where patterns are not stable.
  • Use Fluid Layout when some document types have full headers and others only a section header.

Summary

Row Match offers a streamlined, pattern-centric approach to table extraction in Grooper, ideal for headerless or irregular tables. Properly tuned row patterns, alignment constraints, and spacing limits enable fast, reliable capture of repeating data without the overhead of full column header detection.