Delimited Extract (Table Extract Method): Difference between revisions

From Grooper Wiki
No edit summary
No edit summary
 
Line 8: Line 8:
|
|
You may download the ZIP(s) below and upload it into your own Grooper environment (version 2025). The first contains one or more Batches of sample documents.  The second contains one or more Projects with resources used in examples throughout this article.  
You may download the ZIP(s) below and upload it into your own Grooper environment (version 2025). The first contains one or more Batches of sample documents.  The second contains one or more Projects with resources used in examples throughout this article.  
* [[Media:2025_Wiki_Delimited Extract.zip]]
* [[Media:2025_Wiki_Delimited-Extract_Batch.zip]]
* [[Media:2025_Wiki_Delimited Extract.zip]]
* [[Media:2025_Wiki_Delimited-Extract_Project.zip]]
|}
|}



Latest revision as of 14:36, 18 December 2025

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025 20232021

Delimited Extract is a Table Extract Method that extracts tabular data from a delimiter-separated text file, such as a CSV file.

You may download the ZIP(s) below and upload it into your own Grooper environment (version 2025). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

Introduction

Delimited Extract is a Table Extract Method that reads and parses delimited text files (for example CSV, TSV, or pipe-delimited) and maps fields into a Data Table's Data Columns. Instead of reading from page layout or OCR, it ingests a text file attached to the corresponding Batch Folder and converts each parsed line into a table row.

How it differs from other table extract methods:

  • Tabular Layout / Grid Layout operate on page layout, headers, and geometric alignment; Delimited Extract does not use document layout.
  • Row Match and Fixed Width parse text in the document region; Delimited Extract reads from an external delimited file.
  • Fluid Layout switches between Tabular Layout and Row Match based on labels; Delimited Extract is purpose-built for delimited files.
  • AI Table Reader uses a large language model to interpret semi-structured tables; Delimited Extract is rules-based, deterministic parsing of delimited inputs.

What it is for

Use Delimited Extract when your tabular data arrives as a delimited text file (e.g., exports, system feeds, or data interchanges) stored on the Batch Folder. Typical scenarios:

  • Accounting exports: InvoiceNo,Date,Amount
  • System logs or reports delivered as CSV/TSV
  • Partner data feeds using pipe (|) or semicolon delimiters

Benefits

  • Simple, deterministic parsing for consistent, well-formed files
  • Flexible column mapping: by header name, by position, or via a mapping lexicon
  • Robust quoting and optional whitespace trimming
  • Minimal configuration for standard CSV

Drawbacks

  • Requires a delimited file stored on the Batch Folder; it does not analyze page layout or OCR
  • Sensitive to delimiter and quoting configuration—misconfiguration leads to misaligned columns
  • No label-based header/footer detection (unlike page-layout methods)

How to add and configure Delimited Extract

Following are instructions for general setup of the Delimited Extract Table Extract Method.

FYI

Please see the demos below for example setups with screenshots and highlighted instructions.

  1. Create or open a Data Model and add a Data Table for your repeating records (for example, Line Items).
  2. Add one or more Data Columns under the Data Table (for example, Name, Date, Amount). Name the columns exactly as you want them to appear in the output.
  3. Select the Data Table and set its "Extract Method" to Delimited Extract.
  4. Configure Delimited Extract properties:
    • Choose the correct "Delimiter" (for example, ,, \t, or |).
    • Decide the "Mapping Method":
      • Named if the file has a header row whose values match your Data Column names.
      • Positional if you want to map by column order.
      • Mapped if you need a lexicon to translate CSV headers (or numeric indices) to Data Column names.
    • If using Named or Mapped, confirm "Header Row" (1-based) and adjust "Skip Rows" if introductory lines should be ignored before data rows.
    • Enable or disable "Quoted" and "Trim Whitespace" to match your file's conventions.
    • Ensure "Filename Pattern" matches the file stored on the Batch Folder (default matches .csv).
  5. Attach or drop the delimited file onto the Batch Folder so it is available during extraction.
  6. Classify the document to a Content Type that uses the Data Model that contains the Delimited Extract configured Data Table.
  7. Test your setup:
    • Run a Batch through an Extract step or use the Tester tab of the Data Table or Data Model.
    • If values look misaligned, verify delimiter, header row, mapping strategy, and quoting settings.
    • Use diagnostics to review any parsing errors and row counts.

Tips

  • For TSV files set "Delimiter" to \t.
  • For pipe-delimited files set "Delimiter" to |.
  • If headers in the file do not exactly match your Data Column names, use Mapped and provide a mapping lexicon.
  • Use the "Convert to CSV" object command to convert XLSX files to CSV before attempting to use Delimited Extract on them.

Example: Excel Document - Convert to CSV

When working with CSV files it is common to come across XLSX files that are native to Microsoft Excel. Grooper cannot natively read XLSX files, so as a result, Delimited Extract will not work with this file type. It is easy enough, however, to convert XSLX files to CSV in Grooper. A simple object command will get the job done.

  1. XLSX is a common file type that might be expected to be used for Delimited Extract. Expand the Node Tree and select the provided "Delimited-Extract" Batch, then click the Viewer tab.
  2. Select the third Batch Folder.
    • You'll notice it is an XLSX file type. The XLSX file type cannot be natively read by Grooper, so it's data will seem incomprehensible.
  3. Right-click the XLSX document. In the pop-out menu hover over "Excel Document", then click the "Convert to CSV" object command.
  4. In the "Convert to CSV" window, click the drop-down button to the right of the Save As property. Select "Attachment" from the drop-down menu.
  5. "Children" and "Files" are also options for this command, so your choice may vary depending on your desired output. Click the "Execute" button when done.
  6. The XLSX file will be converted to a CSV and its contents are now viewable.

Example: Mapping Method - Named

The first option for the Mapping Method property is "Named". This allows the Delimited Extract Table Extract Method to read the header row, typically the first line, of a CSV file and attempt to match it to the names of Data Columns. If the names of the headers in the header row of the CSV match exactly the names of the Data Columns, Grooper will automatically know which entry to insert upon return of the data. This Mapping Method is very easy to configure and works will under the aformentioned circumstances, but breaks easily when there aren't exact matches. Adjustments to the names of Data Columns can be made to accommodate differences, however. A strength, however, of the "Named" Mapping Method is that the positions of the "columns" within the CSV have no bearing on its ability to function. As long as the names match, their order does not matter.

  1. Expand the Node Tree and select the "Delimited Extract - Named" Data Table from the provided Project. Notice the Extract Method is set to "Delimited Extract".
  2. Click the drop-down arrow to the left of the Extract Method property set to "Delimited Extract" to expand its sub-properties. Notice the Mapping Method property is set to "Named".
  3. Because the Header Row is on the first line, the Skip Rows property is set to "1" to skip that row for output. This will prevent the header row from being included in the returned results.
  4. Because our file uses commas to separate its values, we'll use the default Delimiter property setting of a comma.
  5. Click the Tester tab.
  6. Click the "Select Batch" button in the Batch Viewer, then be sure to select the "Delimited-Extract" Batch.
  7. In order to use Delimited Extract for extraction, we need to have a classified Batch Folder. Right-click the first Batch Folder, then click the "Assign Document Type" object command from the pop-out menu.
  8. Point the Content Type property at the "Delimited Extract - Document Type" Document Type.
  9. Select the now classified first Batch Folder, then click the "Test Extraction" button.
  10. Notice there is a partial output. Because the "Social" and "Animal" Data Columns names do not match that of the header row, the Named Mapping Method falls short here.
  11. Changing the names of the Data Columns is an option, but we'll leave them alone for now.

Example: Mapping Method - Positional

The second option for the Mapping Method property is "Positional". Unlike "Named", this Mapping Method does not rely on the header row of the CSV at all. Instead, the assumption is that the "columns" of the CSV file, from left to right, match exactly the "position" of the Data Columns from top to bottom. Like the "Named" Mapping Method, this is very easily configured, however, it breaks easily if the columns of the CSV file do not match the order of the Data Columns exactly. Changing the oder of Data Columns is easily done if you know that the CSV files being processed are consistent.

  1. Expand the Node Tree and select the "Delimited Extract - Positional" Data Table from the provided Project. Notice the Extract Method property is set to "Delimited Extract".
  2. Click the drop-down arrow to the left of the Extract Method property set to "Delimited Extract" to expand its sub-properties. Notice the Mapping Method property is set to "Positional".
  3. Because the Header Row is on the first line, the Skip Rows property is set to "1" to skip that row for output. This will prevent the header row from being included in the returned results.
  4. Because our file uses commas to separate its values, we'll use the default Delimiter property setting of a comma.
  5. Click the Tester tab.
  6. Click the "Select Batch" button in the Batch Viewer, then be sure to select the "Delimited-Extract" Batch.
  7. In order to use Delimited Extract for extraction, we need to have a classified Batch Folder. Right-click the second Batch Folder, then click the "Assign Document Type" object command from the pop-out menu.
  8. Point the Content Type property at the "Delimited Extract - Document Type" Document Type.
  9. Select the now classified second Batch Folder, then click the "Test Extraction" button.
  10. We're getting results, but because the first column in the source is an "ID" column, this ends up pushing our results. This shows a shortcoming of the "Positional" Mapping Method.

Example: Mapping Method - Mapped

The third and final option for the Mapping Method property is "Mapped". This is the most robust of the options available as you use key-value pair entries in a Lexicon, or the Local Entries sub-property of this option, to directly map the names of "columns" of the CSV file to Data Columns in Grooper. The number and position of columns within the CSV file will have no impact on this Mapping Method's ability to function.

When mapping entries for this Mapping Method the key, or the value to the left of the = sign, is the name of a Data Column in Grooper, while the value, or the value to the right of the = sign, is the name of the column found in the header row of the CSV. For example:

First Name=First Name
Last Name=Last Name
Email=Email
Social=SSN
Animal=Favorite Animal
  1. Expand the Node Tree and select the "Delimited Extract - Mapped" Data Table from the provided Project. Notice the Extract Method property is set to "Delimited Extract".
  2. Click the drop-down arrow to the left of the Extract Method property set to "Delimited Extract" to expand its sub-properties. Notice the Mapping Method property is set to "Mapped".
  3. Because the Header Row is on the first line, the Skip Rows property is set to "1" to skip that row for output. This will prevent the header row from being included in the returned results.
  4. Because our file uses commas to separate its values, we'll use the default Delimiter property setting of a comma.
  5. Click the drop-down arrow to the left of the Mappings property to expand its sub-properties, then click the ellipsis button to the right of the Local Entries property to open the Local Entries editor.
    • It's worth noting that you could use the Included Lexicons property to point at a Lexicon instead of using Local Entries.
  6. The entries used for mapping are key-value pairs, where the key represents the name of the Data Column, and the value represents the name of the target column from the CSV file.
  7. Click the Tester tab.
  8. Click the "Select Batch" button in the Batch Viewer, then be sure to select the "Delimited-Extract" Batch.
  9. Select the first (classified) Batch Folder in the Batch Viewer, then click the "Test Extraction" button.
  10. Notice all results are returned accurately.
  11. Select the second (classified) Batch Folder in the Batch Viewer, then click the "Test Extraction" button.
  12. Notice all results are returned accurately.

Properties overview

The following properties are available for Delimited Extract. Property names are shown in quotes as they appear in the UI.

  • Filename Pattern
    • Definition: A regular expression that selects which file on the Batch Folder to parse (default: \.csv$).
    • Remarks: Only files whose names match this pattern are processed. Use standard .NET regex.
    • Use cases:
      • Match TSV: \.tsv$
      • Match files containing “export”: export.*\.csv$
  • Mapping Method
    • Definition: Chooses how CSV columns map to Data Columns.
    • Remarks: Options:
      • Named — match CSV header names to Data Column names.
      • Positional — map by order: CSV column 1 → first Data Column, and so on.
      • Mapped — use a lexicon to translate header text or numeric indices to Data Column names.
    • Use cases:
      • Use Named when headers match your column names.
      • Use Positional when there is no header row or order is reliable.
      • Use Mapped when headers differ, have synonyms, or you want to reference numeric indices.
  • Mappings (shown when "Mapping Method" = Mapped)
    • Definition: An embedded lexicon that defines custom column mappings.
    • Remarks: Each entry maps a table column name (key) to a CSV header or a numeric index (value). If the value is a number, it is treated as a zero-based index; otherwise, the CSV header row is searched for that header text.
    • Use cases:
      • Handle synonyms, inconsistent header names, or index-based mapping.
      • Example:
Name=Full Name
Date=Transaction Date
Total=Amount Due
  • Header Row (1-based; hidden when "Mapping Method" = Positional)
    • Definition: The row number that contains the column names for header-based mapping.
    • Remarks: If blank, row 1 is used by default. Ignored for Positional mapping.
    • Use cases:
      • Files with a title row above the header require "Header Row" > 1.
  • Skip Rows
    • Definition: Number of rows to skip at the beginning of the file before parsing data.
    • Remarks: Applied after processing the header (if applicable). Default is 0.
    • Use cases:
      • Ignore banner rows or multi-line descriptions preceding data.
  • Delimiter
    • Definition: The character (or string) that separates fields (default: ,).
    • Remarks: Examples: comma (,), tab (\t), pipe (|).
    • Use cases: Required for correct parsing; must match the file's delimiter.
  • Quoted
    • Definition: Specifies whether fields may be enclosed in double quotes to escape delimiters (default: true).
    • Remarks: Enable for standard CSV where fields like "ACME, Inc." should be treated as one value.
    • Use cases: Disable if the file never uses quotes.
  • Trim Whitespace
    • Definition: Removes leading and trailing whitespace from parsed fields (default: true).
    • Remarks: Helps normalize values; disable if whitespace is significant.
    • Use cases: Keep enabled for most CSV/TSV data to reduce cleanup.

Configuration patterns and troubleshooting

  • Files not found: Confirm "Filename Pattern" matches the file in the Batch Folder.
  • Wrong columns: Verify "Delimiter", "Quoted", and "Trim Whitespace" align with the file's format.
  • Header mismatches: With "Mapping Method" = Named, ensure the file's header text exactly matches your Data Column names; otherwise use Mapped.
  • Mapped indices off by one: Mapped numeric values are zero-based indices; confirm positions accordingly.
  • Extra leading rows: Increase "Skip Rows" to start at the first data line after the header row.