Delimited Extract (Table Extract Method)
Delimited Extract is one of the Extract Method options for Data Table objects in a Data Model. This method extracts tabular data from a delimiter-separated text file, such as a CSV file.
This is the fastest, simplest and most effective method of extracting data from character delimited files, such as comma delimited CSV files or TXT files delimited by commas or other characters.
FYI |
Delimited Extract is new to version 2021. In older versions, this functionality was achieved with the CSV Extract table extraction method. Delimited Extract expands the CSV Extract functionality to extract tabular data from TXT files that delimiter characters besides a comma, such as semicolons or pipe characters. |
About
A delimited text file is, effectively, a table of information saved to a simple text format. Rows in the table are easy to represent in a text file. Each row is a new line in the file. Columns are little trickier. They are represented by a "delimiter" character. This character is used to define where the table columns are in the text file.
For example, CSV (comma-separated values) files use a comma (,
) as a delimiter. When opened in a spreadsheet program, such as Microsoft Excel or Google Sheets, these files look like standard spreadsheets. So, a CSV file with a table with some generic information collected for individuals might look something like the table below.
First Name | Last Name | SSN | Favorite Animal | |
Aaron | Aaronson | aaron@example.com | 987-65-4321 | aardvark |
Billy | Billiams | 999-88-7777 | billy goat | |
Cathy | Catherton | cat@example.com | 900-00-1234 | dog |
But a delimited file is just text data. There are no lines dividing the rows and columns. Instead, the vertical line boundaries between each column would be replaced with the delimiter character (a ,
for CSV files) and each row would be a new line in the file. So, it would look like the text below.
First Name,Last Name,Email,SSN,Favorite Animal
Aaron,Aaronson,aaron@example.com,987-65-4321,aardvark
Billy,Billiams,,999-88-7777,billy goat
Cathy,Catherton,cat@example.com,900-00-1234,dog
TXT files can also be formatted as delimiter-separated tables. They may use a comma (,
) or a different separator such as a semicolon (;
) or a "pipe" character (|
). For example, a "pipe-delimited" TXT file version of the data above would look like the text below.
First Name|Last Name|Email|SSN|Favorite Animal
Aaron|Aaronson|aaron@example.com|987-65-4321|aardvark
Billy|Billiams||999-88-7777|billy goat
Cathy,Catherton|cat@example.com|900-00-1234|dog
The only difference is one file is a CSV file, using a comma (,
) to distinguish the columns and the other is a TXT file, using the pipe character (|
).
The Delimited Extract table extraction method will parse these kinds of files and return their data to a Data Table object in a Data Model.
|
⚠ |
Currently, Grooper does not support "tabs" as a delimiter character. At the time of writing this article, the Delimited Extract method cannot extract data from a tab-delimited TXT file. Alternate Extract Methods, such as the Row Match method, must be used instead. |
How To
Prereqs
⚠ |
If you are testing Delimited Extract in Grooper Design Studio and are not returning any results or receiving an error message, read this first. This will address the most common issues when testing the Delimited Extract method. |
There are three critical perquisites in order for the Delimited Extract method to work properly, whether testing its configuration in Grooper Design Studio or collecting the data during the Extract activity of a Batch Process.
- The Batch Folder must have an attached character-delimited file of a CSV or TXT format.
- Excel files (XLSX) are capable of utilizing the Delimited Extract method. However, they must be converted to a CSV format first.
- For more information on converting Excel files to CSV files, visit the Microsoft Office Integration article.
- If imported sparsely, the Batch Folder's attached CSV or TXT file must be loaded prior to extraction.
- The Batch Folder must be classified.
1. The Batch Folder must have an attached file of a CSV or TXT format.
When you import digital files into Grooper (ie PDFs, Word documents, Excel spreadsheets, TXT files, etc), they are stored as attachments on the created Batch Folder in the Batch. When inspecting a Batch, the attached files are listed at the bottom of the Batch Folder icon.
|
|
From Grooper Design Studio , you can convert a single XLSX file quickly and easily if you're trying to test out your Data Table's Delimited Extract configuration. You can perform the following steps from the Batch Viewer at any node in the Node Tree. If you can get to a Batch Viewer, and select a batch using the node's UI, you can do the following to convert an Excel file to CSV.
|
|
|
FYI |
This is a very "ad hoc" method of converting Excel files to CSV files. You can also perform this conversion using the Execute activity, selecting Excel Document as the Object Type and choosing Excel to CSV as the Command. You will use the Execute activity to automate converting Excel files in a Batch Process. For more information, visit the Microsoft Office Integration article. |
2. If imported sparsely, the Batch Folder's attached CSV or TXT file must be loaded prior to extraction.
Digital files are imported into Grooper either "sparsely" or "fully".
- Fully imported files are fully loaded into the Grooper Repository. Their metadata properties (file name and extension, file dimensions, MIME type information, etc) are stored in the Grooper Database and their content (the full file itself) is copied to the Grooper File Store.
- Sparsely imported files are only partially loaded. Their metadata properties are stored in the Grooper Database, but their content is not copied to the Grooper File Store. Instead the content is accessed with a link to the file's location (for example, if you import a file from your Windows hard drive, it would be its Windows folder location).
Typically, a sparsely imported document is described as a "usable" document in Grooper. You can classify it. You can extract data from it. You can apply Activities to it and it will behave as if it were fully loaded. Just instead of using a local copy in the Grooper File Store to process the document, Grooper travels the document link to access the file.
However, this is not the case for Delimited Extract. The attached file must be loaded (copied into the Grooper File Store) in order to extract any data.
|
|
From Grooper Design Studio , you can load file quickly and easily if you're trying to test out your Data Table's Delimited Extract configuration. You can perform the following steps from the Batch Viewer at any node in the Node Tree. If you can get to a Batch Viewer, and select a batch using the node's UI, you can do the following to load a Batch Folder's attached file (be it a CSV file, a TXT file, an XLSX file, a PDF file, or any other file).
|
|
|
FYI |
This too is a very "ad hoc" method of loading files. You do the same thing using the Execute activity, selecting File System Link as the Object Type (or CMIS Document Link, FTP Link, Mail Link or one of the other Link options depending how the file was imported from what external storage platform) and choosing Load Content as the Command. You will use the Execute activity to automate foading files in a Batch Process. |
3. The Batch Folder must be classified.
When building a Batch Process you will always add an Extract activity after a Classify activity or some other operation classifies the Batch Folders in a Batch. This is because the Extract activity looks for the Content Type assigned to the document folder to access the Data Model configured to collect data from the document.
Typically, document folders do not need to be classified when performing unit testing in Grooper Design Studio. You can navigate to a Data Model or its child Data Elements in the Node Tree, select an unclassified Batch Folder object in a Batch, press the "Test Extraction" button and a preview of the extracted data will be presented in the UI.
This is not the case when testing the Delimited Extract method! The Batch Folder must be classified (assigned a Document Type) in order to test extraction.
|
|
|
Delimited Extract Configuration
Enable Delimited Extract
To use Delimited Extract you will choose this as the Extract Method property of a Data Table
|
Choose the Mapping Method
With Delimited Extract enabled, arguably the most important consideration is how you configure the Mapping Method property. This property determines how the "columns" in the delimiter-separated file correspond to the Data Columns in your Data Table. This can be one of three options:
- Named
- Positional
- Mapped
Named
The Named option will use column headers in the delimiter-separated file to map columns to the Data Columns in our Data Table. As long as the names match, Grooper will populate the Data Column with the data in the correspondingly named column in the CSV or TXT file.
For example, the Data Table in our example has five Data Columns.
|
|
The CSV files in our batch also have five columns. |
|
However, some of the names are the same and some are different. Only the "First Name", "Last Name" and "Email" columns match. The "Social" and "Animal" Data Columns are labeled "SSN" and "Favorite Animal" in the CSV file. |
If we were to test extraction of this CSV file using the Named method, only part of the delimiter-separated values will be collected. Those in the columns whose names match will be returned, but the columns whose names do not will not.
If we changed the name of our Data Columns to match (both be named "SSN" and "Favorite Animal" respectively), we would get the full table populated. However, let's completely ignore that and look at the next Mapping Method option, Positional. |
FYI |
|
Positional
The document from the previous example is actually a good candidate for the Positional Mapping Method.
This method does not rely on how columns are named but instead how they are ordered. It presumes the first column in the delimiter-separated file is the Data Table's first child Data Column.