Email Processing

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

Grooper is capable of processing documents imported from email. Both email attachments and the email body itself can be processed just like any other imported document in Grooper, but there are a few differences in processing.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2025). The first contains one or more Batches of sample documents. The second contains one or more Projects with resources used in examples throughout this article.

About

Grooper can ingest email messages, condition them for further processing, and process the email's body and/or any attachments (just like any document in Grooper).

However, there are several considerations when processing documents that come in from an email source:

  1. Import considerations - Are you importing emails manually or do you want an Import Watcher service to periodically poll an import source for new emails coming in or bring them in at scheduled times?
  2. Attachment considerations - Does the email have attachments that need to be processed?
  3. Body considerations - Do you want to process the email body? If so, do you need to just process the body's text? Do you need to process the rendered HTML seen in an email client? Does the email have any images you need to process?
  4. Changes to the Batch Process - Based on your answers to questions 2 and 3, the Batch Process will need to be adjusted to accommodate the scenario.

How To

The following sections explain how importing and processing emails differs from importing documents through another kind of connection, such as NTFS.

Import considerations

When importing emails, it is recommended to use one of two CMIS Import providers: Import Descendants or Import Query Results.

Of the two, Import Query Results is more common for importing email messages. This article will focus on using this Import Provider.

Main import considerations

When configuring email import, there are three main considerations:

  • How do you want to connect to the email source? Grooper will use a "CMIS Connection" to do this.
  • Do you want to perform a user-directed (ad-hoc) import? Users will perform the import from the Imports Page.
  • Do you want to perform automated (scheduled) imports? The import will be performed by an Import Watcher service.


Common secondary considerations

After resolving the main import considerations, ask yourself:

  • Are you going to filter the import by message properties like the sender or sent date or text in the subject line? This is where Import Query Results really shines. It will selectively import files that match criteria set by the search query. Searches are defined by a SQL-like query called a "CMISQL Query" that uses file metadata and folder hierarchies to set search parameters.
  • Are you going to dispose of the emails after importing them? If so, how? Disposing of imported files is particularly important for automated imports. Files can be "disposed" by deleting them, moving them, or updating their metadata.
  • Is "sparse import" right for you? Sparse imports can speed up import time. Instead of copying file content from the import source into the Grooper Repository, sparse import establishes links to file content when adding document folders to the Batch. However, there are other considerations besides import speed you will need to evaluate when performing a sparse import which we will expand on later in the article.
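
To make the filtering idea concrete, here is a minimal Python sketch that assembles a SQL-like query string for the "Message" content type. It only illustrates the shape of such a query. The property names `From` and `Subject` are assumptions (check the properties your CMIS Repository actually exposes), and `build_message_query` is a hypothetical helper, not part of Grooper.

```python
# Hypothetical property names; verify them against your own CMIS Repository.
SENDER_PROPERTY = "From"
SUBJECT_PROPERTY = "Subject"


def build_message_query(sender=None, subject_contains=None):
    """Assemble a SQL-like query string that filters email messages."""
    for value in (sender, subject_contains):
        # Quote escaping rules are not covered here, so reject such values.
        if value is not None and "'" in value:
            raise ValueError("values containing quotes are outside this sketch")

    conditions = []
    if sender is not None:
        conditions.append(f"{SENDER_PROPERTY} = '{sender}'")
    if subject_contains is not None:
        conditions.append(f"{SUBJECT_PROPERTY} LIKE '%{subject_contains}%'")

    query = "SELECT * FROM Message"
    if conditions:
        query += " WHERE " + " AND ".join(conditions)
    return query


print(build_message_query(sender="billing@example.com", subject_contains="Invoice"))
```

Building the string from optional filters keeps the unfiltered case ("import everything") and the filtered case in one place.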

Creating a CMIS Connection

To import from your email, you will first need a CMIS Connection. A CMIS Connection is what Grooper uses to connect to external content management systems, including email clients. There are two "CMIS Connection Types" that can be used for email imports.

  • Exchange - Connects Grooper to Exchange email servers. This is used to connect to Outlook inboxes.
  • IMAP - Connects Grooper to any email client using the IMAP protocol.

The Exchange connection is both more common and more fully featured. For these reasons, we will focus on importing emails from an Outlook inbox in this article.

Step 1: Create a new CMIS Connection

  1. Right click a Project in your Grooper Repository.
  2. Select "Add", then "CMIS Connection..."
  3. In the pop-up window, name your CMIS Connection.
  4. Click "Execute" to finish.

Step 2: Configure the CMIS Connection

  1. Select the CMIS Connection and make sure you're on the "CMIS Connection" tab.
  2. Select the Connection Settings property and press its dropdown list button.
  3. Select "Exchange" from the list.
  4. Expand the Connection Settings properties and enter the Exchange server's host name or IP address in the Host Name property.
    • For Microsoft 365 Outlook users, enter outlook.office365.com
  5. Configure the Authentication Method you are using to log into the email client.
    • Exchange OAuth is the easiest and most common method.
  6. Use the Mailbox List editor to enter at least one mailbox (even if this is simply your own email address).

FYI

The Use Search Folder property will enable an Exchange "Search Folder". This will enhance the query capabilities of Import Query Results and the imported CMIS Repository's "Search" tab.

  • Grooper will automatically create a Search Folder named "Grooper Search" the first time a CMIS query is executed from Grooper.
  • The Search Folder will only be used when:
    • The content type being queried is "Message"
    • The query applies to the entire inbox (no IN_FOLDER or IN_TREE predicates are used)
    • And, the WHERE clause does not include a CONTAINS predicate.
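
The three conditions above can be sketched as a small Python check. This is only a model of the rule as stated in this note, not Grooper's actual implementation. `uses_search_folder` is a hypothetical helper, and it uses naive keyword matching that does not account for keywords appearing inside quoted string values.

```python
import re


def uses_search_folder(content_type, query):
    """Return True when all three Search Folder conditions hold."""
    # Condition 1: the content type being queried is "Message".
    if content_type != "Message":
        return False
    # Condition 2: no IN_FOLDER or IN_TREE predicates are used.
    if re.search(r"\bIN_(FOLDER|TREE)\b", query, re.IGNORECASE):
        return False
    # Condition 3: the WHERE clause does not include a CONTAINS predicate.
    parts = re.split(r"\bWHERE\b", query, maxsplit=1, flags=re.IGNORECASE)
    where_clause = parts[1] if len(parts) == 2 else ""
    if re.search(r"\bCONTAINS\b", where_clause, re.IGNORECASE):
        return False
    return True
```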

Step 3: Tie the mailbox to Grooper by importing it as a CMIS Repository

  1. With the CMIS Connection selected and configured, click the "List Repositories" button in the upper right corner of the "Repositories" panel.
  2. Select a mailbox from the list.
  3. Click the "Import Repositories" button in the upper right corner of the "Repositories" panel.
  4. This will add a CMIS Repository object to the CMIS Connection (as a child node).
    • The CMIS Repository is a direct representation of the mailbox in Grooper.
    • Using the CMIS Repository, Grooper has total control to interact with messages, much like a user does in an email client.
    • The CMIS Repository is needed to configure the import provider used to import emails into Grooper.

User-directed (ad-hoc) email imports

User-directed imports (a.k.a. "ad-hoc imports") are import jobs submitted manually by a user from the Imports Page. User-directed imports are useful for:

  • Bulk imports: When a large number of files need to be imported into Grooper all at once but only once.
  • Sporadic imports: When files need to be imported into Grooper from time to time, but not at any set schedule.

If these scenarios are right for you, you should follow the advice below to import email messages from the Imports Page.

If, on the other hand, you need to import emails regularly, either at a set schedule or immediately as they come in, you should instead follow the instructions in the #Automated (scheduled) email imports portion of this article.

Automated (scheduled) email imports

Automated imports (a.k.a. "scheduled imports") are import jobs submitted by a Grooper Import Watcher service. An Import Watcher will automatically import file content according to a predefined schedule. This schedule will be executed in one of two ways:

  • Using a "Polling Loop" - Grooper will import files from a location on a continuous loop at a set interval (every 30 seconds, every 5 mins, every 24 hours, etc.)
  • Using "Specific Times" - Grooper will import files from a location at set days and times. For example, this can be used to run the import every Monday and Wednesday at 6:00 AM.

Automated imports are useful for any scenario where files arrive at an import source (like an email inbox) continuously or at regular intervals. The Import Watcher lets Grooper check that import source on a schedule and process incoming content as it arrives.
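
The difference between the two scheduling styles can be sketched in Python as follows. This is an illustrative model only. The function names and parameters are hypothetical and do not correspond to Import Watcher settings.

```python
from datetime import datetime, timedelta


def next_polling_run(last_run, interval):
    """Polling Loop: the next run is a fixed interval after the last one."""
    return last_run + interval


def next_specific_time_run(now, weekdays, hour, minute=0):
    """Specific Times: the next run is the first matching day and time after now.

    weekdays uses Python's convention, where 0 is Monday and 6 is Sunday.
    """
    for offset in range(8):  # today plus the following seven days
        day = now + timedelta(days=offset)
        candidate = day.replace(hour=hour, minute=minute, second=0, microsecond=0)
        if candidate.weekday() in weekdays and candidate > now:
            return candidate
    raise ValueError("weekdays must contain at least one value from 0 to 6")


# Every 30 seconds versus every Monday and Wednesday at 6:00 AM.
now = datetime(2025, 1, 7, 9, 0)  # a Tuesday
print(next_polling_run(now, timedelta(seconds=30)))
print(next_specific_time_run(now, {0, 2}, hour=6))
```

A polling loop depends only on when the last run happened, while a specific-times schedule depends only on the calendar.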

Attachment considerations

After you import your emails into a Batch in Grooper, you have a couple of options. The first is to process any documents that are attached to the email and run those documents through the Batch Process.

Before you can begin working with email attachments, you must first expand the attachments in your Batch. To do this, you will need a Batch Process Step assigned the Execute Activity. The Execute Activity can be configured to perform a variety of Activities or Commands. The Command we're interested in for this tutorial is the Mail Message - Expand Attachments command.

Expanding the Attachment of the Email

  1. Add a Batch Process Step assigned the Execute Activity.
  2. Add the Mail Message - Expand Attachments Command.
  3. In the Command sub properties, set the Expand Attachments property to True.
  4. Assign the correct Scope to the Batch Process Step.
  5. If needed, move your Execute Step up in your Batch Process.
    • The Execute Step should come before any processing is performed on documents such as Split Pages or Recognize.
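
Conceptually, expanding attachments means pulling each attached file out of the email message so it can be handled as its own document. The following Python sketch shows the idea using only the standard library's `email` package. It runs outside Grooper and is not how the Expand Attachments command is implemented. The sample message, addresses, and file name are made up for illustration.

```python
from email import policy
from email.message import EmailMessage
from email.parser import BytesParser


def build_sample_message():
    """Create a small email with one attachment (illustrative data only)."""
    msg = EmailMessage()
    msg["From"] = "Jane Doe <jane@example.com>"
    msg["To"] = "ap@example.com"
    msg["Subject"] = "Invoice 1001"
    msg.set_content("Please see the attached invoice.")
    msg.add_attachment(
        b"%PDF-1.4 sample bytes",
        maintype="application",
        subtype="pdf",
        filename="invoice.pdf",
    )
    return msg


def expand_attachments(raw_bytes):
    """Return (file name, content) pairs for every attachment in the message."""
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    return [(part.get_filename(), part.get_content()) for part in msg.iter_attachments()]


raw = build_sample_message().as_bytes()
for name, content in expand_attachments(raw):
    print(name, len(content), "bytes")
```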

Body considerations

If you need information from the body of the email itself rather than from an attachment, start by expanding the body instead of the attachments.

Expanding the Body of the Email

  1. Add a Batch Process Step assigned the Execute Activity.
  2. Add the Mail Message - Expand Attachments Command.
  3. In the Command sub properties, set the Body Expansion property to one of the following three options:
    • Prefer_Text - The body will be expanded as plain text if available. If not, it will expand as HTML.
    • Prefer_HTML - The body will be expanded as HTML if available. If not, it will expand as plain text.
    • Text_HTML - The body will be expanded as plain text if available. If not, the HTML of the body will be written out as text.
  4. Assign the correct Scope to the Batch Process Step.
  5. If needed, move your Execute Step up in your Batch Process.
    • The Execute Step should come before any processing is performed on documents such as Split Pages or Recognize.

If you expand the body of the email as plain text, the structure of the document will not necessarily remain intact. You will only have access to the text and not the code that formats the email. There are many cases in which the structure of a document is important, such as when the document contains tabular information.

If you want to keep the original structure of the email, select the Prefer_HTML option when expanding the body.
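
The fallback behavior of the three Body Expansion options can be modeled with a short Python sketch. This models the descriptions in this article rather than Grooper's code. `expand_body` is a hypothetical function, and it reads the Text_HTML fallback as writing the raw HTML markup out as text, which is one interpretation of that option's description.

```python
def expand_body(mode, plain_text=None, html=None):
    """Return a (kind, content) pair for the expanded body, or None if empty."""
    if mode not in ("Prefer_Text", "Prefer_HTML", "Text_HTML"):
        raise ValueError(f"unknown Body Expansion mode: {mode}")

    if mode == "Prefer_HTML":
        # HTML first, plain text as the fallback.
        if html is not None:
            return ("html", html)
        if plain_text is not None:
            return ("text", plain_text)
        return None

    # Prefer_Text and Text_HTML both try plain text first.
    if plain_text is not None:
        return ("text", plain_text)
    if html is not None:
        # Prefer_Text falls back to an HTML file; Text_HTML (as assumed here)
        # writes the HTML markup out as text.
        return ("html", html) if mode == "Prefer_Text" else ("text", html)
    return None


print(expand_body("Prefer_HTML", plain_text="Total: 10", html="<table>...</table>"))
```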

Rendering the Email Body

When expanding the body of an email, you may end up with a .htm or .html file. These are not files that Grooper can natively understand, so we will need to render these files to a PDF for Grooper to be able to work with the files.

If you haven't already, you will need to install and set up the Render Printer. Follow the instructions on the Render wiki page to do so. Then follow the instructions below:

  1. Create a new Processing Queue if you do not already have one for Rendering. Set the Concurrency Mode to PerMachine.
  2. Make sure you have an Activity Processing Service running that will just be dedicated to running the Render Activity.
    • The Queue Name property should be set to the dedicated Processing Queue.
    • The Number of Threads property should be set to 1.
    • For more information about creating a Processing Queue and installing an Activity Processing Service for the Render Activity, visit the Render wiki page.
  3. In your node tree, select the Batch that holds the files you want to render.
  4. In a Batch Viewer, right click on the Batch Folder with the .htm file attached.
  5. Hover over "Run Activity", hover over "Transform", and then select "Render..." from the menu.
  6. When the Render window appears, scroll through the properties and configure them to your preference.
    • The Render Attachments property should be set to False.
  7. When satisfied, click "Execute" on the Render window.
  8. Wait until the Render is complete.
  9. To see the rendered PDF, click on the "Renditions" icon in the top right corner of the Document Viewer and select "PDF".

Adding a Render Batch Process Step

Now that we've learned how to run the Render Activity manually, let's talk about how to add Render as a Batch Process Step to your Batch Process.

  1. Navigate to your Batch Process in your Node Tree.
  2. Right-click on the Batch Process.
  3. Hover over "Add Activity", hover over "Transform", and then select "Render..." from the menu.
  4. When the Add Activity window appears, change the Step Name if desired and then click "EXECUTE".
  5. With the Render Step selected in your node tree, click the hamburger icon "☰" to the right of the Queue Name property.
  6. Select the Processing Queue to assign to the Render Activity from the drop down.
    • This should be the Processing Queue that has been assigned to the Activity Processing Service dedicated to the Render Activity.
  7. Feel free to go through the Activity Properties and adjust per your specific use case. For rendering the email body, turn the Render Attachments property from True to False.
  8. Click the save icon to save your changes to the Render Batch Process Step.
  9. Finally, move your Render Batch Process Step to the top of your Batch Process right under the Execute Step configured to expand the body of the email.

Obtaining metadata from an email

Metadata is any information attached to a file or an email that does not necessarily include the text or image you see when you open the file/email. For a file on your computer, this might include the name of the file or the date the file was created. For an email it might include details such as the name and email address of the sender, the date received, or the subject line.
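
To see what email metadata looks like in practice, here is a minimal Python sketch that reads the sender, subject, and date from a raw message's headers using the standard library. It runs outside Grooper and only illustrates the kind of values involved. The sample message and the `read_metadata` helper are made up for this example.

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

# Illustrative raw message; the headers above the blank line are the metadata.
RAW_MESSAGE = """From: Jane Doe <jane@example.com>
To: ap@example.com
Subject: Invoice 1001
Date: Mon, 06 Jan 2025 09:30:00 -0600

Please see the attached invoice.
"""


def read_metadata(raw):
    """Collect a few header values that describe the message."""
    msg = message_from_string(raw)
    sender_name, sender_address = parseaddr(msg["From"])
    return {
        "sender_name": sender_name,
        "sender_address": sender_address,
        "subject": msg["Subject"],
        "sent": parsedate_to_datetime(msg["Date"]),
    }


for key, value in read_metadata(RAW_MESSAGE).items():
    print(f"{key}: {value}")
```

None of these values appear in the body text, which is why they have to be read from the message itself rather than from an expanded body or attachment.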

We can actually extract metadata from emails. There are two ways to do this. We can extract metadata upon import or we can configure the Data Field with a Read Metadata extractor that will populate the Field with metadata on extract.

Extracting metadata on import

To extract metadata on import, you first need to configure an Import Behavior on your Content Type and then configure the Read Mappings property on your Import Behavior. The Read Mappings property specifically allows you to read metadata from the import and assign it to populate a Data Field or other items such as the name of the document that's imported.

Configuring your Import Behavior

  1. On your Content Type, typically the Content Model, locate the Behaviors property and open the editor by clicking the "..." icon.
  2. In the editor, click the "+" icon and select "Import Behavior" from the drop down.
  3. Click the "..." to the right of the Import Definition property.
  4. Click the "+" icon to add a new Import Definition. Select "CMIS Import".
  5. Click the "☰" to the right of the CMIS Repository property. Navigate to and select the email repository you want to import from.
  6. The CMIS Content Type property should appear. Click the "☰" to the right of the property and select "Message" from the drop down since you are importing an email.
  7. The Read Mappings and Write Mappings properties should appear. Expand the Read Mappings sub properties.
  8. Set your Data Fields to the type of metadata you want to populate that Data Field. When finished, click "OK" on the window.
  9. Click "OK" on the Behaviors window.
  10. Save your changes.

Testing your Import Behavior

  1. Import an email to test your Import Behavior.
  2. Select the Data Model and click over to the "Tester" tab.
  3. Use the Batch Selector to select the imported Batch.
  4. Without any further processing, the Data Fields should be populated with the metadata as outlined by the Import Behavior.

Extracting metadata with the Read Metadata extractor

We have discussed how to extract metadata from emails on import. You can also extract metadata from emails during the Extract Step in your Batch Process. It does take a little more time to set up and requires some understanding of folder levels and Content Model hierarchy, but it can at times be more reliable than the Import Behavior when extracting information from documents.

When extracting information from an email, normally you will import the email and then have to either expand the body or the attachments of the email to get at the text information you want to extract. However, the metadata for the email will always be contained within that first file before the body or attachments have expanded. In order to get at the metadata, we will need to set up two different Data Models to run at different folder levels.

Let's take a look at an example.

Setting up the Model

  1. Add two Document Types to your Content Model, one for the email and one for the body or attachment.
    • The Document Type for the email will be configured at Level 1 and the body or attachment Document Type will be configured at Level 2.
  2. Manually assign each Document Type to the corresponding Folder Level in your test Batch.
  3. Add a Data Model to both Document Types.

Configuring Extraction at Level 1

  1. Add Data Fields to the Level 1 Document Type designed to extract the metadata, such as the Sender Name or Date of the email.
  2. Using the example of a Sender Data Field, set the Value Extractor property to "Read Metadata".
  3. Expand out the Read Metadata sub properties.
  4. Set the Source property to "Mail Message".
  5. Set the Property Name to the metadata you want to extract (in our case, we would choose "From" since we want the sender's name and email address).
  6. Save your changes to the Data Field.
  7. Repeat the process for all Data Fields.

FYI

Read Metadata has fewer metadata source fields to select than extracting metadata with an Import Behavior. Keep this in mind when deciding which method to use.

Configuring Extraction at Level 2

  1. Navigate to the Data Model under the Level 2 Document Type.
  2. Add all Data Fields for data you want extracted from the body or attachment.
  3. Add additional fields for each of the Data Fields you configured to extract metadata in the Level 1 Document Type.
    • For now, leave these fields at defaults without extractors set on them.

The Export Behavior

  1. Configure an Export Behavior on the Level 2 Document Type node.
    • In the example below we use a File Export Definition, but configure any type you like.

Child Of and the metadata Data Fields on Level 2

  1. Select the Level 2 Document Type.
  2. Locate the Child Of property in the property grid.
  3. Set the Child Of property to the Level 1 Document Type.
    • This allows you to access the Data Fields that are under the Level 1 Document Type.
  4. Select the Data Field in the Level 2 Document Type that has the same name as the corresponding metadata Data Field in the Level 1 Document Type.
  5. Scroll down to the Calculated Value property. Using an expression, we can tell Grooper to copy the extracted value from Folder Level 1 to Folder Level 2.
  6. Enter in the following Calculated Value Expression:

<Level 1 Document Type>.<Level 1 Data Field>

So in our example, the expression would be:

Email.Sender

  7. Set the Calculate Mode property to AlwaysSet.
  8. Test your extraction for both Data Models to make sure the correct information is extracted.
    • Note that the Data Fields configured with the Calculated Value above will not be populated during a test, because nothing has actually been extracted from Level 1 yet for Grooper to copy to Level 2.


Once you have set up your model, you then need to set up your Batch Process to properly extract all the information. The two different folders need to be classified differently, so we will need two Classify Batch Process Steps. Also, since we are extracting information from two different levels, we need two different Extract Batch Process Steps.

A typical Batch Process for extracting emails might look something like:

  • Execute - Expand Attachments
  • Split Pages
  • Recognize
  • Classify
  • Extract
  • Export

A Batch Process for obtaining metadata from the Level 1 Document Type and text data from the Level 2 Document Type might look like this instead:

  • Execute - Expand Attachments
  • Split Pages
  • Recognize
  • Classify Level 1
  • Classify Level 2
  • Extract Level 1
  • Extract Level 2
  • Export

You might also have some Review steps in there to make sure your Batch Process is working as you expect it to.

You also want to make sure your Export Batch Process Step is set to run at Folder Level 2 where your Export Behavior has been set.


Changes to the Batch Process

Now that we have our attachments or body of our emails expanded, we can go ahead and process them through Grooper. However, the act of expanding out the attachments or email body has changed what folder level contains the document we want to process. We need to make sure all Batch Process Steps following the email expansion steps are configured at the correct Folder Level.

For example, the Split Pages Batch Process Step is set to a Folder Level of 1 by default. However, after expanding attachments, our documents will be located at Folder Level 2. We would need to adjust the Batch Process Step's Scope to split pages at the appropriate folder level.
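
The shift in folder level can be pictured with a small Python model of a Batch as a tree. This is a conceptual sketch only. The `Folder` class, the `folders_at_level` helper, and the folder names are hypothetical and do not reflect Grooper's object model.

```python
from dataclasses import dataclass, field


@dataclass
class Folder:
    name: str
    children: list = field(default_factory=list)


def folders_at_level(batch_root, level):
    """Return the names of folders at a given depth (the Batch root is level 0)."""
    current = [batch_root]
    for _ in range(level):
        current = [child for folder in current for child in folder.children]
    return [folder.name for folder in current]


# Right after import: the email message sits at Folder Level 1.
email_folder = Folder("Email message")
batch = Folder("Batch", [email_folder])
print(folders_at_level(batch, 1))  # the email itself
print(folders_at_level(batch, 2))  # nothing yet

# After expansion: the attachment and body become children at Folder Level 2.
email_folder.children = [Folder("Attachment"), Folder("Body")]
print(folders_at_level(batch, 2))
```

A step scoped to Folder Level 1 would still see only the email message, which is why steps such as Split Pages need their Scope moved to Folder Level 2 after expansion.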