AI Table Reader (Table Extract Method)

From Grooper Wiki
Revision as of 17:17, 12 December 2025 by Randallkinard (talk | contribs) (Randallkinard moved page AI Table Reader to AI Table Reader (Table Extract Method) over redirect)

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

AI Table Reader is a Table Extract Method that uses generative AI to extract tabular data from document content. It uses a configured LLM Connector to interpret quoted document text, produce a JSON response that matches your table schema, and map that data back to the table’s rows and cells.

You may download the ZIP files below and upload them into your own Grooper environment (version 2025). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

Introduction

AI Table Reader extracts tabular data by sending a schema‑guided prompt and a document quote to a large language model (LLM). The response is parsed, and its values are written to the corresponding columns of the table.
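The flow described above can be sketched in a few lines of Python. This is an illustration only, not Grooper's implementation: the column names, the prompt wording, the "rows" key, and the functions build_prompt and map_response are all assumptions made for the example, and a hard-coded string stands in for the model's reply so that no LLM is called.

```python
import json

# Hypothetical column names; a real Data Table defines its own Data Columns.
COLUMNS = ["OrderDate", "OrderId", "Units"]

def build_prompt(columns, quote, instructions=""):
    """Combine a schema description, optional instructions, and quoted text."""
    schema = {"rows": [{name: "string or null" for name in columns}]}
    parts = [
        "Extract the table from the quoted document text.",
        "Respond with JSON matching this shape: " + json.dumps(schema),
    ]
    if instructions:
        parts.append(instructions)
    parts.append('Document text:\n"""\n' + quote + '\n"""')
    return "\n\n".join(parts)

def map_response(columns, response_text):
    """Parse the reply and place each value under its column, in schema order."""
    data = json.loads(response_text)
    return [[row.get(name) for name in columns] for row in data["rows"]]

quote = "03/01/2025  A-100  12\n03/02/2025  A-101"
prompt = build_prompt(COLUMNS, quote)

# Stand-in for a model reply (assumed shape); keys arrive in a different order.
fake_response = (
    '{"rows": ['
    '{"OrderId": "A-100", "OrderDate": "03/01/2025", "Units": "12"},'
    '{"OrderDate": "03/02/2025", "OrderId": "A-101", "Units": null}]}'
)
rows = map_response(COLUMNS, fake_response)
print(rows)
```

Because values are looked up by column name rather than by position, the order of keys in the reply does not affect which column a value lands in.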

How it differs from other extract methods:

  • Tabular Layout: Detects headers/rows/footers based on labels and layout analysis. AI Table Reader instead interprets text via an LLM and maps JSON output back to table rows.
  • Row Match: Uses a 2D row extractor (regex or data type) to detect rows and map named groups to columns. AI Table Reader does not rely on named groups; it relies on an LLM’s structured output.
  • Grid Layout: Infers a matrix from X/Y headers detected on the page. AI Table Reader is not header‑grid dependent; it interprets quoted content broadly.
  • Fixed Width: Splits rows by fixed character widths. AI Table Reader does not require fixed positions.
  • Delimited Extract: Parses CSV/TSV using delimiters and mapping rules. AI Table Reader avoids strict file parsing and interprets natural language text.
  • Fluid Layout: Chooses between Tabular Layout and Row Match based on label sets. AI Table Reader is a separate AI‑driven approach focused on prompt/schema‑guided extraction.
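To make the contrast with pattern-based methods concrete, here is a small Python sketch of row matching with named groups. It uses Python's re module purely for illustration; the pattern, the sample text, and the column names are made up, and this is not how Grooper evaluates its extractors.

```python
import re

# Hypothetical row pattern: each named group plays the role of a column.
ROW_PATTERN = re.compile(
    r"^(?P<OrderDate>\d{2}/\d{2}/\d{4})\s+(?P<OrderId>[A-Z]-\d+)\s+(?P<Units>\d+)$",
    re.MULTILINE,
)

def match_rows(text):
    """Return one dict per line that fits the pattern, keyed by group name."""
    return [m.groupdict() for m in ROW_PATTERN.finditer(text)]

# Layout the pattern was written for: date, id, units.
expected_layout = "Order Date  Order Id  Units\n03/01/2025  A-100  12\n03/02/2025  A-101  7\n"
pattern_rows = match_rows(expected_layout)

# Same kind of data with the first two columns swapped: the pattern finds nothing.
reordered_layout = "A-102  03/03/2025  5\n"
reordered_rows = match_rows(reordered_layout)

print(pattern_rows)
print(reordered_rows)
```

A pattern like this is fast and predictable on the layout it was written for, but it has to be revised when the layout changes. An LLM-driven method trades that predictability for tolerance of layout variation.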

What is it for?

Ideal use cases:

  • Tables or lists with inconsistent formatting, embedded in narrative text, or where headers/rows are ambiguous.
  • Rapid adaptation to new layouts using schema and instructions rather than writing patterns and label sets.
  • Complex line items or transaction logs that vary across documents.

Benefits:

  • Interprets messy or variable layout content without strict header/line detection.
  • Schema‑guided output enables consistent column mapping.
  • Optional header/footer detection and row alignment help place AI‑extracted values into the correct rows.

Drawbacks:

  • Requires an LLM Connector and network access.
  • Quality depends on prompt clarity and model accuracy; results may vary with noisy inputs.
  • May be slower and incur usage costs compared to rule‑based extractors.
  • Requires careful testing and diagnostics review.

How to add and configure AI Table Reader

The following is a general walkthrough.

FYI

Please see the demo below for an example setup with screenshots and highlighted instructions.

General Setup Steps

  1. In your Data Model, add a Data Table and define its Data Columns.
  2. On the Data Table, set "Extract Method" to AI Table Reader.
  3. Configure "Included Columns" to focus the LLM on needed fields.
  4. Provide "Instructions" to clarify expected formats and rules.
  5. Set "Document Quoting" to limit and format the quote region appropriately.
  6. Adjust "Row Alignment" and, if needed, enable "Detect Header", "Detect Footer", and configure "Multiline Rows".
  7. Review "Alignment" defaults; keep or refine for your layout style.
  8. Ensure your Root node has a working LLM Connector; AI Table Reader requires it.
  9. Use the Tester tab of the Data Table. Select a Batch Folder in the Batch Viewer, then click the "Test Extraction" button.

Tips for testing and troubleshooting

  • AI Table Reader and AI Extract run at different points when Extract runs, because extractors run before Fill Methods.
    • AI Table Reader is an Extract Method. It executes during primary extraction when Extract runs.
    • AI Extract is a Fill Method. It executes after primary extraction finishes when Extract runs.
    • This lets designers test AI table extraction on its own, without running a whole AI Extract operation on a Data Model.
  • If results are incomplete, narrow scope using "Included Columns" and refine "Instructions" (e.g., expected formats).
  • If rows misalign, adjust "Row Alignment", enable "Detect Header"/"Detect Footer", or configure "Multiline Rows".
  • If prompts are too large/noisy, adjust "Document Quoting" to target only the relevant region.
  • Use several sample documents with varying layouts to validate robustness.
  • If the table is empty, verify the LLM connector, reduce quote scope, or strengthen "Instructions".
  • Keep "Included Columns" minimal at first; expand after confirming core columns are reading correctly.
  • Review diagnostics:
    • Chat Log.json: Shows prompts and responses.
    • Response Data.json: Shows raw JSON the LLM returned.
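When reviewing a saved diagnostics file, it can help to look at the overall shape of the JSON before reading individual values. The Python sketch below walks any parsed JSON document and reports its structure without assuming a particular layout. The sample string is made up for the example and is not the actual format of Response Data.json; the function name summarize is likewise hypothetical.

```python
import json

def summarize(node, path="$", depth=0, max_depth=3):
    """Return lines describing the shape of parsed JSON, whatever its layout."""
    lines = []
    if isinstance(node, dict):
        lines.append(f"{path}: object with keys {sorted(node)}")
        if depth < max_depth:
            for key, value in node.items():
                lines.extend(summarize(value, f"{path}.{key}", depth + 1, max_depth))
    elif isinstance(node, list):
        lines.append(f"{path}: array of {len(node)} item(s)")
        # Only the first item is inspected, to keep the report short.
        if node and depth < max_depth:
            lines.extend(summarize(node[0], f"{path}[0]", depth + 1, max_depth))
    else:
        lines.append(f"{path}: {type(node).__name__}")
    return lines

# Made-up sample standing in for a saved diagnostics file.
sample = json.loads('{"rows": [{"OrderId": "A-100", "Units": null}]}')
report = summarize(sample)
print("\n".join(report))
```

To run it against a real file, replace the sample with the result of json.load on a file you saved from the Diagnostics view.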

Example: Setting up AI Table Reader

  1. Check the Options property on the Grooper Root Node and verify that an LLM Connector has been established.
  2. Select the "AI Table Reader" Data Table from the provided Project in the Node Tree.
    • You may note that it has child Data Columns.
  3. Click the drop-down for the Extract Method property, and select "AI Table Reader".
  4. Click the drop-down arrow to the left of the Extract Method property to expand its sub-properties, then do the same for the Generator property.
  5. Click the "..." button to the right of the Model property to open the Model editor.
  6. Select an LLM model in the Model window.
    • Extraction results will vary depending on the chosen model.
  7. Click the "..." button to the right of the Included Columns property to open the Included Columns editor.
  8. Select the columns you want to perform extraction for.
    • Depending on the size of the Data Table, you may want to start by selecting a few to test results, then expand your selections as your tests prove fruitful.
  9. Click the "..." button to the right of the Instructions property to open the Instructions editor.
  10. Add useful instructions to allow for the best extraction results.
    • The quality of your instructions plays a large role in the quality of the extraction results.
  11. Click the "Save" button to save the changes made, then click the Tester tab.
  12. Click the "Select Batch" button in the Batch Viewer, then be sure to select the provided "AI-Table-Reader" Batch.
  13. Select the first Batch Folder in the Batch Viewer, then click the "Test Extraction" button.
  14. Notice all results are returned.
  15. Select the second Batch Folder in the Batch Viewer, and notice some columns have no data in some cells.
  16. Click the "Test Extraction" button.
  17. Notice all results are returned, including the missing cells.
  18. Select the third Batch Folder in the Batch Viewer, and notice this table has no lines.
  19. Click the "Test Extraction" button.
  20. Notice all results are returned.
  21. Select the fourth Batch Folder in the Batch Viewer, and notice that the columns in this table are ordered differently from those in the other documents and in the Data Model.
  22. Click the "Test Extraction" button.
  23. Notice all results are returned, in spite of the different ordering of the columns.
  24. Click the "Diagnostics" button.
  25. Here you can view the JSON files that represent the data sent to and received from the LLM. This can be useful for troubleshooting.

Instructions used in this example are below:

This table contains order-level transactional data.
Extract one row per order from the quoted document text.
For each row, return exactly one value per column, mapped to the schema below.
Do not merge rows, skip rows, or infer values that are not explicitly present in the document.

Column definitions:
OrderDate – The date the order was placed. Return in the same format as shown in the document.
OrderId – The unique identifier for the order.
Salesperson – The name of the salesperson associated with the order.
Units – The number of units ordered (numeric).
OrderAmount – The total monetary amount of the order (numeric, exclude currency symbols).

If a cell value is missing or blank in the source table, return null for that cell or leave it blank.
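The column rules in these instructions can also be checked after extraction. The Python sketch below validates rows against the five columns named above: every column must be present, blank or null cells are allowed, and Units and OrderAmount must be numeric with no currency symbol. The row data and the check_row helper are made up for illustration; Grooper's own validation features are not shown here.

```python
import re

COLUMNS = ["OrderDate", "OrderId", "Salesperson", "Units", "OrderAmount"]
NUMERIC = {"Units", "OrderAmount"}
NUMBER = re.compile(r"^-?\d+(\.\d+)?$")

def check_row(row):
    """Return a list of problems found in one extracted row (a dict)."""
    problems = []
    for name in COLUMNS:
        if name not in row:
            problems.append(f"{name}: missing key")
            continue
        value = row[name]
        if value is None or value == "":
            continue  # blank cells are allowed by the instructions
        # Thousands separators are tolerated here; currency symbols are not.
        if name in NUMERIC and not NUMBER.match(str(value).replace(",", "")):
            problems.append(f"{name}: not numeric ({value!r})")
    for name in row:
        if name not in COLUMNS:
            problems.append(f"{name}: unexpected column")
    return problems

# Made-up rows for illustration.
good = {"OrderDate": "03/01/2025", "OrderId": "A-100", "Salesperson": "Lee",
        "Units": 12, "OrderAmount": "1,250.00"}
bad = {"OrderDate": "03/02/2025", "OrderId": "A-101", "Salesperson": None,
       "Units": "7", "OrderAmount": "$84.50"}

good_problems = check_row(good)
bad_problems = check_row(bad)
print(good_problems)
print(bad_problems)
```

A check like this is one way to spot rows where the model ignored an instruction, such as leaving a currency symbol in OrderAmount.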

Properties overview

Below is a property reference for AI Table Reader (the table’s Extract Method). Property names are shown as they appear in the UI.

  • Generator
    • Definition: Selects the data generator component used to communicate with the LLM.
    • Remarks: The generator builds the chat request using your schema and messages, runs the completion, and returns structured data.
    • Purpose/Use case: Choose the generator that matches your LLM connector and desired structured output behavior.
  • Included Columns
    • Definition: Limits extraction to a subset of Data Columns.
    • Remarks: Reduces prompt size and narrows focus. If left empty, all visible, non‑computed descendant Data Columns are included.
    • Purpose/Use case: Improve accuracy and performance by extracting only needed columns (e.g., Invoice Date and Total Amount).
  • Instructions
    • Definition: Optional guidance for the LLM appended to the prompt.
    • Remarks: Clarify expected formats, business rules, or special handling (e.g., "Extract dates in MM/DD/YYYY" or "Ignore handwritten notes").
    • Purpose/Use case: Boost accuracy with concise, explicit instructions tailored to the document type.
  • Document Quoting
    • Definition: The quoting method used to select and format document content provided to the LLM.
    • Remarks: Controls scope and formatting of quoted text (e.g., region‑based, layout‑aware). If blank, the entire input region is quoted with a standard prefix.
    • Purpose/Use case: Reduce noise, focus the LLM on relevant text, and improve consistency across varied layouts.
  • Row Alignment
    • Definition: Controls how AI‑extracted rows align to document lines or regions.
    • Remarks: When not "None", additional options become visible (header/footer detection and multiline rows).
    • Purpose/Use case: Improve mapping accuracy by aligning the AI’s rows with page content.
  • Detect Header
    • Definition: Instructs the LLM to detect the table header.
    • Remarks: Visible when "Row Alignment" is not None.
    • Purpose/Use case: Helps anchor row alignment and column placement for documents with recognizable header lines.
  • Detect Footer
    • Definition: Instructs the LLM to detect the table footer row.
    • Remarks: Visible when "Row Alignment" is not None.
    • Purpose/Use case: Establishes the table end and optionally supports footer value capture/validation downstream.
  • Multiline Rows
    • Definition: Settings to handle multi‑line row content.
    • Remarks: Visible when "Row Alignment" is not None. Allows wrapped or stacked line items to be treated as a single logical row.
    • Purpose/Use case: Ensure row completeness for descriptions or notes that span multiple lines.
  • Alignment
    • Definition: Default alignment behaviors used when mapping extracted values into rows/cells.
    • Remarks: Applies standard alignment preferences that influence geometric mapping of the AI’s JSON to document rows.
    • Purpose/Use case: Provide consistent alignment defaults to reduce per‑table fine‑tuning.
  • Please see the Parameters article for information on parameter settings.
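The default behavior described for Included Columns (when left empty, all visible, non‑computed Data Columns are included) can be expressed as a small selection rule. The Python sketch below is an illustration only: the Column class and effective_columns function are hypothetical, and the handling of an explicit selection (keep the named columns in table order) is an assumption rather than documented behavior.

```python
from dataclasses import dataclass

@dataclass
class Column:
    """Hypothetical stand-in for a Data Column."""
    name: str
    visible: bool = True
    computed: bool = False

def effective_columns(all_columns, included=()):
    """Return the column names that would be sent to the LLM."""
    if not included:
        # Empty selection: fall back to all visible, non-computed columns.
        return [c.name for c in all_columns if c.visible and not c.computed]
    wanted = set(included)
    # Assumption: an explicit selection keeps the table's own column order.
    return [c.name for c in all_columns if c.name in wanted]

# Made-up columns for illustration.
columns = [
    Column("OrderDate"),
    Column("OrderId"),
    Column("Margin", computed=True),
    Column("InternalNote", visible=False),
]

default_selection = effective_columns(columns)
explicit_selection = effective_columns(columns, ["OrderId"])
print(default_selection)
print(explicit_selection)
```

Fewer columns mean a smaller schema in the prompt, which is the reason given above for starting with a minimal selection.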

Related table and column properties

While not specific to AI Table Reader, these Data Table and Data Column properties often affect configuration and results:

  • Data Table "Extract Method"
    Select AI Table Reader to enable AI‑based extraction.
  • Data Table "Initial Row Count"
    Initialize a minimum number of rows in new instances. Useful for forms that always display N rows.
  • Data Table "Row Count Range"
    Enforce min/max rows for validation.
  • Data Table "Dynamic Column Ordering"
    Display columns based on detected document order.
  • Data Table "Generate Footer Row"
    Create a blank footer row if the Extract Method does not capture one.
  • Data Column "Header Extractor"
    Optionally find column header labels for alignment and post‑processing.
  • Data Column "Propagation"
    Fill empty cells from nearest values above/below.
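Two of these properties are easy to picture as post‑processing on extracted rows. The Python sketch below shows a fill from the nearest value above or below (the idea behind "Propagation") and a min/max row check (the idea behind "Row Count Range"). The function names, the from_below parameter, and the sample data are hypothetical, and the sketch does not reproduce Grooper's actual options.

```python
def propagate(values, from_below=False):
    """Fill empty cells with the nearest non-empty value above (or below)."""
    ordered = list(reversed(values)) if from_below else list(values)
    filled, last = [], None
    for value in ordered:
        if value is None or value == "":
            filled.append(last)  # stays None if nothing has been seen yet
        else:
            filled.append(value)
            last = value
    return list(reversed(filled)) if from_below else filled

def within_row_count(rows, minimum, maximum):
    """Return True when the number of rows falls inside the allowed range."""
    return minimum <= len(rows) <= maximum

# Made-up column of Salesperson values with gaps.
salespeople = ["Lee", None, "", "Kim", None]

filled_from_above = propagate(salespeople)
filled_from_below = propagate(salespeople, from_below=True)
print(filled_from_above)
print(filled_from_below)
print(within_row_count(salespeople, 1, 10))
```

Filling from below leaves the last cell empty in this sample because no value follows it.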