2023:Delimited Extract (Table Extract Method)

From Grooper Wiki
Revision as of 11:01, 26 August 2024 by Randallkinard (talk | contribs)

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


The Delimited Extract Table Extract Method extracts tabular data from a delimiter-separated text file, such as a CSV file.

This is the fastest, simplest, and most effective method of extracting data from character-delimited files, such as CSV files or TXT files delimited by commas or other characters.

About

You may download and import the file(s) below into your own Grooper environment (version 2023). There is a Batch with the example document(s) discussed in this tutorial, as well as a Project configured according to its instructions.
Please upload the Project to your Grooper environment before uploading the Batch. This will allow the documents within the Batch to maintain their classification status.


A delimited text file is, effectively, a table of information saved to a simple text format. Rows in the table are easy to represent in a text file. Each row is a new line in the file. Columns are a little trickier. They are marked by a "delimiter" character. This character is used to define where the table columns are in the text file.

For example, CSV (comma-separated values) files use a comma (,) as a delimiter. When opened in a spreadsheet program, such as Microsoft Excel or Google Sheets, these files look like standard spreadsheets. So, a CSV file with a table with some generic information collected for individuals might look something like the table below.

First Name   Last Name   Email               SSN          Favorite Animal
Aaron        Aaronson    aaron@example.com   987-65-4321  aardvark
Billy        Billiams                        999-88-7777  billy goat
Cathy        Catherton   cat@example.com     900-00-1234  dog

But a delimited file is just text data. There are no lines dividing the rows and columns. Instead, the vertical line boundaries between each column would be replaced with the delimiter character (a , for CSV files) and each row would be a new line in the file. So, it would look like the text below.

First Name,Last Name,Email,SSN,Favorite Animal
Aaron,Aaronson,aaron@example.com,987-65-4321,aardvark
Billy,Billiams,,999-88-7777,billy goat
Cathy,Catherton,cat@example.com,900-00-1234,dog

TXT files can also be formatted as delimiter-separated tables. They may use a comma (,) or a different separator such as a semicolon (;) or a "pipe" character (|). For example, a "pipe-delimited" TXT file version of the data above would look like the text below.

First Name|Last Name|Email|SSN|Favorite Animal
Aaron|Aaronson|aaron@example.com|987-65-4321|aardvark
Billy|Billiams||999-88-7777|billy goat
Cathy|Catherton|cat@example.com|900-00-1234|dog

The only difference is that one file is a CSV file, using a comma (,) to distinguish the columns, and the other is a TXT file, using the pipe character (|).
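The point above can be demonstrated with a short sketch: the same parsing logic handles either file, and only the delimiter changes. This is an illustration using Python's standard csv module (inline strings stand in for real files), not Grooper's internal implementation.

```python
import csv
import io

# The comma-delimited and pipe-delimited versions of the article's sample table.
csv_text = """First Name,Last Name,Email,SSN,Favorite Animal
Aaron,Aaronson,aaron@example.com,987-65-4321,aardvark
Billy,Billiams,,999-88-7777,billy goat
Cathy,Catherton,cat@example.com,900-00-1234,dog
"""
txt_text = csv_text.replace(",", "|")  # same data, pipe-delimited

def parse_delimited(text, delimiter=","):
    """Return the table as a list of rows, each row a list of cell values."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))

comma_rows = parse_delimited(csv_text, ",")
pipe_rows = parse_delimited(txt_text, "|")

# Both files yield identical data: one header row plus three data rows,
# with Billy's empty Email cell preserved as an empty string.
assert comma_rows == pipe_rows
assert comma_rows[2] == ["Billy", "Billiams", "", "999-88-7777", "billy goat"]
```

Note that the blank Email cell for Billy survives as an empty value rather than shifting the remaining columns, which is why delimited formats are so reliable to parse.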


The Delimited Extract table extraction method will parse these kinds of files and return their data to a Data Table object in a Data Model.

  1. For example, here we have a Data Table named "People Data" configured to extract tabular data from a CSV file.
  2. You can see the attached file to this Batch Folder is a CSV file named "PEOPLE_DATA_COMMA.csv".
  3. In the Document Viewer, we can see this is a simple CSV file with the data described above.
    • Each column ("First Name", "Last Name", "Email", "SSN", and "Favorite Animal") is separated by a single , character.
  4. The Data Table's Extract Method property is set to Delimited Extract.
  5. The Delimited Extract sub-properties are configured according to the extracted file's type and delimited format.
    • These properties dictate what the imported document's filetype is (i.e. CSV or TXT), what delimiter character is used, and how the data is mapped to the Data Table object's Data Columns.
    • These settings will be discussed in more depth in the How To section of this article.
  6. Upon extraction, the Data Table returns the CSV file's data.

Currently, Grooper does not support "tabs" as a delimiter character. At the time of writing this article, the Delimited Extract method cannot extract data from a tab-delimited TXT file. Alternate Extract Methods, such as the Row Match method, must be used instead.

How To

Prereqs

There are three critical prerequisites in order for the Delimited Extract method to work properly, whether testing its configuration in Design or collecting the data during the Extract activity of a Batch Process.

  1. The Batch Folder must have an attached character-delimited file of a CSV or TXT format.
    • Excel files (XLSX) are capable of utilizing the Delimited Extract method. However, they must be converted to a CSV format first.
    • For more information on converting Excel files to CSV files, visit the Microsoft Office Integration article.
  2. If imported sparsely, the Batch Folder's attached CSV or TXT file must be loaded prior to extraction.
  3. The Batch Folder must be classified.

1. Attached File Considerations

THE BATCH FOLDER MUST HAVE AN ATTACHED FILE OF A CSV OR TXT FORMAT

When you import digital files into Grooper (i.e. PDFs, Word documents, Excel spreadsheets, TXT files, etc.), they are stored as attachments on the created Batch Folder in the Batch. When inspecting a Batch, the attached files are listed at the bottom of the Batch Folder icon.

  1. The documents here have a CSV file and a character-delimited TXT file attached to their Batch Folders. These documents are suitable for Delimited Extract.
  2. Microsoft Excel files can also take advantage of Delimited Extract. However, they must first be converted to the CSV file format.
  3. Excel files (XLSX) are not natively readable by Grooper. Without converting them to a CSV file (which is natively readable), you will see this message in the document viewer.


From Design, you can convert a single XLSX file quickly and easily if you're trying to test your Data Table's Delimited Extract configuration. You can perform the following steps from the Batch Viewer at any node in the Node Tree: wherever you can get to a Batch Viewer and select a Batch using the node's UI, you can do the following to convert an Excel file to CSV.

  1. Right-click the document folder with the attached Excel file.
  2. Select "Excel Document".
  3. Select "Convert to CSV..."
  4. This will bring up the following "Convert to CSV" configuration window.
  5. Using the Save As property choose how you want to convert the file.
    • Children - Each worksheet will be saved as a child document.
    • Files - Each worksheet will be saved as a named file.
    • Attachment - The Excel attachment will be replaced with a .csv file containing all worksheets.
  6. Press the "Execute" button to convert the Excel file to a CSV file.


  1. Since we chose the Children option, the converted file is created as a .csv child object of the parent .xlsx document.
  2. CSV files are natively readable by Grooper. Now we can view the file in the Document Viewer and Grooper can process the document.

FYI

This is a very "ad hoc" method of converting Excel files to CSV files. You can also perform this conversion using the Execute activity, selecting Excel Document as the Object Type and choosing Excel to CSV as the Command. You will use the Execute activity to automate converting Excel files in a Batch Process. For more information, visit the Microsoft Office Integration article.

2. Load the File

IF IMPORTED SPARSELY, THE BATCH FOLDER'S ATTACHED CSV OR TXT FILE MUST BE LOADED PRIOR TO EXTRACTION.

Digital files are imported into Grooper either "sparsely" or "fully".

  • Fully imported files are fully loaded into the Grooper Repository. Their metadata properties (file name and extension, file dimensions, MIME type information, etc) are stored in the Grooper Database and their content (the full file itself) is copied to the Grooper File Store.
  • Sparsely imported files are only partially loaded. Their metadata properties are stored in the Grooper Database, but their content is not copied to the Grooper File Store. Instead the content is accessed with a link to the file's location (for example, if you import a file from your Windows hard drive, it would be its Windows folder location).

Typically, a sparsely imported document is described as a "usable" document in Grooper. You can classify it. You can extract data from it. You can apply Activities to it and it will behave as if it were fully loaded. However, instead of using a local copy in the Grooper File Store to process the document, Grooper follows the document link to access the file.

However, this is not the case for Delimited Extract. The attached file must be loaded (copied into the Grooper File Store) in order to extract any data.


  1. A sparsely imported file will be indicated by a gold chain link on the Batch Folder icon.
  2. If you attempt to extract the data using Delimited Extract, nothing will return at this point.


From Design, you can load a file quickly and easily if you're trying to test your Data Table's Delimited Extract configuration. You can perform the following steps from the Batch Viewer at any node in the Node Tree: wherever you can get to a Batch Viewer and select a Batch using the node's UI, you can do the following to load a Batch Folder's attached file (be it a CSV file, a TXT file, an XLSX file, a PDF file, or any other file).


  1. Right-click the document folder with the attached file.
  2. Select "File System Link".
  3. Select "Load Content..."
  4. This will bring up the following "Load Content" configuration window.
  5. Press the "Execute" button to load the file into the Grooper File Store.


  1. Upon loading the content, the chain link will change color from gold to silver.
    • The silver link indicates the content is loaded but Grooper retains a link to the file's location. This link can be used in the future if the content needs to be reloaded or otherwise accessed from its external location. Alternatively the link can be removed if no longer needed once the content is fully loaded into Grooper.
  2. With the content loaded, the data can now be extracted using Delimited Extract.

FYI

This too is a very "ad hoc" method of loading files. You can do the same thing using the Execute activity, selecting File System Link as the Object Type (or CMIS Document Link, FTP Link, Mail Link, or one of the other Link options, depending on how and from what external storage platform the file was imported) and choosing Load Content as the Command. You will use the Execute activity to automate loading files in a Batch Process.

3. Classify the Document Folder

THE BATCH FOLDER MUST BE CLASSIFIED.

When building a Batch Process you will always add an Extract activity after a Classify activity or some other operation classifies the Batch Folders in a Batch. This is because the Extract activity looks for the Content Type assigned to the document folder to access the Data Model configured to collect data from the document.

Typically, document folders do not need to be classified when performing unit testing in Design. You can navigate to a Data Model or its child Data Elements in the Node Tree, select an unclassified Batch Folder object in a Batch, press the "Test Extraction" button and a preview of the extracted data will be presented in the UI.


This is not the case when testing the Delimited Extract method! The Batch Folder must be classified (assigned a Document Type) in order to test extraction.


  1. These documents are not classified. They do not have a Document Type assigned to them. They are generically listed as "Document".
  2. Upon testing the Data Table (using the Delimited Extract method), no data returns.


  1. These documents are classified. They have either the "CSV Report" or "TXT Report" Document Type (added to this Content Model) assigned to them.
  2. As a classified document, upon testing the Data Table (using the Delimited Extract method), the table data is returned.

Delimited Extract Configuration

Enable Delimited Extract

To use Delimited Extract, you will choose it as the Extract Method property of a Data Table.

  1. Add or select a Data Table in the Node Tree.
    • We've selected the Data Table named "People Data". It is a Data Element of the "CSV Report" Document Type's Data Model.
  2. Select the Extract Method property.
  3. Choose Delimited Extract from the list of table extraction methods.

Choose the Mapping Method

With Delimited Extract enabled, arguably the most important consideration is how you configure the Mapping Method property. This property determines how the "columns" in the delimiter-separated file correspond to the Data Columns in your Data Table. This can be one of three options:

  1. Named
  2. Positional
  3. Mapped

Named

The Named option will use column headers in the delimiter-separated file to map columns to the Data Columns in our Data Table. As long as the names match, Grooper will populate the Data Column with the data in the correspondingly named column in the CSV or TXT file.


For example, the Data Table in our example has five Data Columns.

  • First Name
  • Last Name
  • Email
  • Social
  • Animal


The CSV files in our batch also have five columns.


However, some of the names are the same and some are different.

Only the "First Name", "Last Name" and "Email" columns match.

The "Social" and "Animal" Data Columns are labeled "SSN" and "Favorite Animal" in the CSV file.


If we were to test extraction of this CSV file using the Named method, only part of the delimiter-separated values will be collected. Values in columns whose names match will be returned; those in columns whose names do not match will not.

  1. The default Mapping Method option is Named
  2. Upon extraction, Grooper will search the first row of the file for the column header labels.
    • The column headers must be present on the first row in order for this method to work.
  3. If a column header matches the name of one of the Data Table's child Data Columns, that Data Column will return that column's data in the delimiter-separated file.
    • In our case, the names for the "First Name", "Last Name", and "Email" columns match. So that data is returned to the correspondingly named Data Column.
    • This is also order independent. If "Email" was the first Data Column instead of the third, it would still return the data from the "Email" column on the document because their names still match.
  4. The "Social" and "Animal" Data Columns are blank because their names do not match.

If we changed the name of our Data Columns to match (both be named "SSN" and "Favorite Animal" respectively), we would get the full table populated.
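The Named behavior described above can be sketched as a header-name lookup. This is an illustration of the behavior under the article's example column names, not Grooper's actual implementation:

```python
import csv
import io

# Sample CSV data from the article. The header labels "SSN" and
# "Favorite Animal" do NOT match the Data Column names below.
csv_text = """First Name,Last Name,Email,SSN,Favorite Animal
Aaron,Aaronson,aaron@example.com,987-65-4321,aardvark
Billy,Billiams,,999-88-7777,billy goat
"""

# The Data Table's child Data Columns, in order.
data_columns = ["First Name", "Last Name", "Email", "Social", "Animal"]

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)  # the labels must be on the first row

# Map each header label to its position; order independence falls out
# of looking columns up by name rather than by position.
index = {label: i for i, label in enumerate(header)}

rows = []
for record in reader:
    # Data Columns with no matching header ("Social", "Animal") stay blank.
    rows.append({col: record[index[col]] if col in index else ""
                 for col in data_columns})

assert rows[0]["Email"] == "aaron@example.com"  # name matched
assert rows[0]["Social"] == ""                  # no "Social" header in the file
```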

However, let's completely ignore that and look at the next Mapping Method option, Positional.


FYI

Before we do that, you probably noticed the first line of the table (the one with the column headers) is returned along with the actual table data. This is probably "junk data" you don't ultimately want collected at time of extraction.

  1. This is what the Skip Rows property is for.
  2. Set this to 1 and it will skip the first row in the file.
  3. This will remove that header row from our extracted results, extracting only the second and beyond.
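The Skip Rows behavior amounts to dropping the first N parsed rows before returning results. A minimal sketch (not Grooper internals):

```python
# Parsed rows, header included.
rows = [
    ["First Name", "Last Name"],  # header row
    ["Aaron", "Aaronson"],
    ["Billy", "Billiams"],
]

skip_rows = 1               # Skip Rows = 1 drops the header line
data = rows[skip_rows:]     # only the second row and beyond survive

assert data[0] == ["Aaron", "Aaronson"]
```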

Positional

The document from the previous example is actually a good candidate for the Positional Mapping Method. This method does not rely on how columns are named but instead how they are ordered. It presumes the first column in the delimiter-separated file is the Data Table's first child Data Column.


Such is the case for the CSV file we looked at earlier. The first Data Column corresponds to the first column in the CSV file, the second corresponds to the second, the third the third, and so on.

In this case, it doesn't matter what they're named, just how the Data Columns are positioned in Grooper (child order from top to bottom) and how the columns are positioned in the delimiter-separated file (column order from left to right). If this lines up, you can just marry the Data Columns with the document's columns, returning the first column's data to the first Data Column and so on.



  1. Here, the Mapping Method property is set to Positional.
  2. In this case, the full table extracts successfully.
    • Even though the Data Column names and the column labels in the document do not match, their positions do. The first child Data Column logically matches the data in the left-most column on the document, the next Data Column matches the next column on the document, and so on.
    • The Positional Mapping Method simply returns the first column on the document to the first Data Column, the second the second, and so on.

FYI

Notice this method doesn't need a row of column labels at all. If your CSV or delimited TXT file does not have column labels, you can still use the Positional Mapping Method to collect the data.
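Positional mapping, as described above, is effectively a zip of the Data Columns with the file's columns in order; names are never consulted, and no header row is required. A sketch of that behavior:

```python
import csv
import io

# The Data Table's child Data Columns, in child order (top to bottom).
data_columns = ["First Name", "Last Name", "Email", "Social", "Animal"]

# A pipe-delimited file with NO header row: positions alone decide the mapping.
txt_text = "Aaron|Aaronson|aaron@example.com|987-65-4321|aardvark\n"

rows = []
for record in csv.reader(io.StringIO(txt_text), delimiter="|"):
    # First file column -> first Data Column, second -> second, and so on.
    rows.append(dict(zip(data_columns, record)))

assert rows[0]["Social"] == "987-65-4321"  # 4th column maps to the 4th Data Column
```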


The (probably obvious) drawback to this method is the positions must match all the time for every document. If the columns are out of order, or there are extra columns in the document, this Mapping Method can start to break down.

  1. This document has an "ID" column as its first table column.
  2. Since this is now the first column, it maps directly to the first Data Column.
    • But that's not what we want! Now the column data is shifted. The first names are in the "Last Name" Data Column. The last names are in the "Email" Data Column and so on.

In this case, the last Mapping Method option, Mapped, may be more suitable.


Mapped

The Mapped Mapping Method gives you the most control, letting you dictate which Data Column matches which column in the delimited file. You can use a Lexicon or a Local Entries text editor and specific syntax to dictate this.

You will map the Data Column to the column label in the delimited file using the following format:

{Data Column name}={table column label name}

For example, using the Data Column names and table column labels from our example documents, our mappings will look like this:

First Name=First Name
Last Name=Last Name
Email=Email
Social=SSN
Animal=Favorite Animal


To enter these mappings, do the following.

  1. Change the Mapping Method property to Mapped.
  2. Expand the Mappings sub-properties.
  3. To add a local list of mappings, select the Local Entries property and press the ellipsis button at the end.
  4. This will bring up a list editor window.
  5. Using the syntax described above, enter a list of Data Column to table column label mappings.
  6. Press the "OK" button when finished.
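The {Data Column name}={table column label name} entries amount to a simple lookup table from Data Columns to header labels. A sketch of that logic, using mapping entries drawn from the article's example columns (an illustration, not Grooper's implementation):

```python
import csv
import io

# Mapping entries in the "{Data Column name}={table column label name}" syntax.
mappings_text = """First Name=First Name
Last Name=Last Name
Email=Email
Social=SSN
Animal=Favorite Animal"""

# Parse each entry into a Data Column -> header label pair.
mappings = dict(line.split("=", 1) for line in mappings_text.splitlines())

# A document with an extra, unlisted "ID" column in the first position.
csv_text = """ID,First Name,Last Name,Email,SSN,Favorite Animal
1,Aaron,Aaronson,aaron@example.com,987-65-4321,aardvark
"""
reader = csv.reader(io.StringIO(csv_text))
index = {label: i for i, label in enumerate(next(reader))}

rows = []
for record in reader:
    # Columns are found by their mapped label, so extra or reordered
    # file columns (like "ID") are simply ignored.
    rows.append({col: record[index[label]] for col, label in mappings.items()})

assert rows[0]["Social"] == "987-65-4321"  # "Social" pulled from the "SSN" column
assert "ID" not in rows[0]                 # unlisted column never returned
```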

FYI

You can also use the Included Lexicons property to point to a Lexicon object in the Node Tree instead of using a local list.

You will use the exact same syntax as described above in both cases.


Now, with these mappings listed out using the Mapped Mapping Method, Delimited Extract will successfully extract the data for both CSV files, which share the same column labels but have those columns in different positional order.

  1. For this document...
  2. ...the listed mappings match up the columns labeled here...
  3. ...with their corresponding Data Columns, and the table's data is returned.


  1. This document extracts correctly as well!
  2. Even though these column labels have different positions (due to there being an additional "ID" column).
  3. The Mapped method ignores the unlisted "ID" column's data and the table is returned, mapping the listed columns with the listed Data Columns.


Filename Pattern and Delimiter

Prior to Grooper 2021, the Delimited Extract method was called CSV Extract. Given the name CSV Extract, it unsurprisingly only extracted tabular data from CSV files. In version 2021, this functionality was expanded to include delimited TXT files as well as delimiter characters other than commas.


For example, in our example Batch we have a TXT file that uses a pipe delimiter (|). To get this delimited file's table to extract correctly, we will need to configure two additional properties.

  1. The Filename Pattern property
    • This will allow us to dictate what file format (or formats) are present and should be extracted.
  2. The Delimiter property
    • This will allow us to dictate what delimiter character is used in the delimiter-separated file (a comma, a pipe, a semicolon, etc).


  1. The Filename Pattern is a regular expression pattern that runs against the attached file's name. If the pattern returns a match, Delimited Extract executes and attempts to return data. If it doesn't match, no extraction is even attempted.
    • The default pattern is \.csv$. This will match any filename ending in ".csv", or in other words, any CSV file.
    • If we want Delimited Extract to return tabular data from delimited TXT files, we will need a different pattern.
      • The pattern \.txt$ would match any filename ending in ".txt". In other words, any TXT file.
      • FYI: What if you have a mix of TXT and CSV files? The pattern (\.csv|\.txt)$ would match any filename ending in ".csv" or ".txt".
  2. The Delimiter property indicates what delimiter character the Delimited Extract method expects to find between column values.
    • The default character is a comma (,) because CSV files are "comma-separated".
    • If we want Delimited Extract to return data from a "pipe-delimited" file, we just need to enter a pipe character here (|).
  3. With these two properties adjusted, the pipe-delimited TXT file's table data can be returned.
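The Filename Pattern behavior can be verified with any regular expression engine. The patterns below are the ones quoted above; the filenames are examples:

```python
import re

# The combined pattern from the FYI above: match CSV or TXT filenames.
pattern = r"(\.csv|\.txt)$"

assert re.search(pattern, "PEOPLE_DATA_COMMA.csv")      # CSV file: extraction runs
assert re.search(pattern, "people_data_pipe.txt")       # TXT file: extraction runs
assert not re.search(pattern, "people_data.xlsx")       # no match: no extraction attempted
```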


ONLY one delimiter may be used. If you have a mix of delimiter characters, you will likely need to take advantage of Grooper's hierarchical data modeling capabilities.

Such is actually the case here. We have two Document Types in this Content Model: "CSV Report" and "TXT Report". Each has its own Data Model with its own child Data Table named "People Data". The "CSV Report" Document Type's Data Table uses the Delimited Extract method configured to return data from comma-delimited CSV files, while the "TXT Report" Document Type's Data Table uses the Delimited Extract method to return data from pipe-delimited TXT files.

(Alternatively, we could have created a single parent Data Model and taken advantage of a Document Type's Data Element Override properties. Both approaches accomplish the same task. The point is we would need to classify the different types of files as different Document Types and use different configurations of Delimited Extract, one applicable to CSV files and one to pipe-delimited TXT files.)

Glossary

Batch Folder: The Batch Folder is an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or Batch Page nodes as children.

  • Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Batch Process: Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the step-by-step processing instructions given to a Batch. Each step is comprised of a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps, which are added as child nodes.
  • A Batch Process is often referred to as simply a "process".

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Classify: Classify is an Activity that "classifies" Batch Folders in a Batch by assigning them a Document Type.

  • Classification is key to Grooper's document processing. It affects how data is extracted from a document (during the Extract activity) and how Behaviors are applied.
  • Classification logic is controlled by a Content Model's "Classify Method". These methods include using text patterns, previously trained document examples, and Label Sets to identify documents.

CMIS: CMIS (Content Management Interoperability Services) is an open standard allowing different content management systems to "interoperate", sharing files, folders, and their metadata, as well as programmatic control of the platform over the internet.

Content Model: Content Model nodes define a classification taxonomy for document sets in Grooper. This taxonomy is defined by the Content Categories and Document Types they contain. Content Models serve as the root of a Content Type hierarchy, which defines Data Element inheritance and Behavior inheritance. Content Models are crucial for organizing documents for data extraction and more.

Content Type: Content Types are a class of node types used to classify Batch Folders. They represent categories of documents (Content Models and Content Categories) or distinct types of documents (Document Types). Content Types serve an important role in defining Data Elements and Behaviors that apply to a document.

Data Column: Data Columns represent columns in a table extracted from a document. They are added as child nodes of a Data Table. They define the type of data each column holds along with its data extraction properties.

  • Data Columns are frequently referred to simply as "columns".
  • In the context of reviewing data in a Data Viewer, a single Data Column instance in a single Data Table row is most frequently called a "cell".

Data Element: Data Elements are a class of node types used to collect data from a document. These include: Data Models, Data Sections, Data Fields, Data Tables, and Data Columns.

Data Model: Data Models are leveraged during the Extract activity to collect data from documents (Batch Folders). Data Models are the root of a Data Element hierarchy. The Data Model and its child Data Elements define a schema for data present on a document. The Data Model's configuration (and its child Data Elements' configuration) define data extraction logic and settings for how data is reviewed in a Data Viewer.

Data Table: A Data Table is a Data Element specialized in extracting tabular data from documents (i.e. data formatted in rows and columns).

  • The Data Table itself defines the "Table Extract Method". This is configured to determine the logic used to locate and return the table's rows.
  • The table's columns are defined by adding Data Column nodes to the Data Table (as its children).

Delimited Extract: The Delimited Extract Table Extract Method extracts tabular data from a delimiter-separated text file, such as a CSV file.

Document Type: Document Type nodes represent a distinct type of document, such as an invoice or a contract. Document Types are created as child nodes of a Content Model or a Content Category. They serve three primary purposes:

  1. They are used to classify documents. Documents are considered "classified" when the Batch Folder is assigned a Content Type (most typically, a Document Type).
  2. The Document Type's Data Model defines the Data Elements extracted by the Extract activity (including any Data Elements inherited from parent Content Types).
  3. The Document Type defines all "Behaviors" that apply (whether from the Document Type's Behavior settings or those inherited from a parent Content Type).

Document Viewer: The Grooper Document Viewer is the portal to your documents. It is the UI that allows you to see a Batch Folder's (or a Batch Page's) image, text content, and more.

Execute: Execute is an Activity that runs one or more specified object commands. This gives access to a variety of Grooper commands in a Batch Process for which there is no Activity, such as the "Sort Children" command for Batch Folders or the "Expand Attachments" command for email attachments.

Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

File Store: File Store nodes are a key part of Grooper's "database and file store" architecture. They define a storage location where file content associated with Grooper nodes is saved. This allows processing tasks to create, store, and manipulate content related to documents, images, and other "files".

  • Not every node in Grooper will have files associated with it, but if it does, those files are stored in the Windows folder location defined by the File Store node.

FTP: FTP is a connection option for CMIS Connections. It connects Grooper to FTP directories for import and export operations.

Grooper Repository: A Grooper Repository is the environment used to create, configure and execute objects in Grooper. It provides the framework to "do work" in Grooper. Fundamentally, a Grooper Repository is a connection to a database and file store location, which store the node configurations and their associated file content. The Grooper application interacts with the Grooper Repository to automate tasks and provide the Grooper user interface.

Lexicon: Lexicons are dictionaries used throughout Grooper to store lists of words, phrases, weightings for Fuzzy RegEx, and more. Users can add entries to a Lexicon, Lexicons can import entries from other Lexicons by referencing them, and entries can be dynamically imported from a database using a Data Connection. Lexicons are commonly used to aid in data extraction, with the "List Match" and "Word Match" extractors utilizing them most commonly.

Microsoft Office Integration: Grooper's Microsoft Office Integration allows the platform to easily convert Microsoft Word and Microsoft Excel files into formats that Grooper can read natively (PDF and CSV).

Node Tree: The Node Tree is the hierarchical list of Grooper node objects found in the left panel in the Design Page. It is the basis for navigation and creation in the Design Page.

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects, such as Content Models, Batch Processes, and profile objects, are stored. This makes resources easier to manage, easier to save, and simplifies how node references are made in a Grooper Repository.

Row Match: The Row Match Table Extract Method uses regular expression pattern matching to determine a table's structure based on the pattern of each row and extract cell data from each column.

Table Extract Method: A Table Extract Method defines the settings and logic for a table Data Table to perform extraction. It is set by configuring the Extract Method property of the Data Table.