CMIS Import

From Grooper Wiki

CMIS Import is an Import Provider used to import content over a CMIS Connection, allowing users to import from various on-premise and cloud-based storage platforms.

Documents are imported from CMIS Connections using either the Import Descendants or Import Query Results providers. These can be used in two ways:

  • To perform manual "ad-hoc" imports when creating a new Batch in Grooper Dashboard or Grooper Design Studio.
  • To perform automated, scheduled imports using one or more Import Watcher Grooper services.

Import Descendants will import all documents within a designated folder location of a CMIS Repository. Import Query Results allows you to use a query syntax similar to a SQL query (called a CMISQL query) to set conditions for import based on the item's available metadata, such as a document's name, file type, creation date, archive status, or other variables.


About CMIS

CMIS stands for "Content Management Interoperability Services". It is an open standard that allows different content management systems to interoperate over the Internet. This standard protocol allows Grooper to use many different platforms for importing and exporting documents and their contents. Once a CMIS Connection object is created, Grooper can exchange documents with these platforms. "Interoperability" means Grooper has the same access to control the system as a human being does. It is a "one-to-one" connection to the platform, allowing full and total control.

Upon connecting to an external content management system, Grooper will be able to see the "repositories" associated with it.  A repository, in computer science, is a general term for a location where data lives. Different systems refer to "repositories" in different ways.  An email inbox could be a repository. A folder in Windows could be a repository. A cabinet in ApplicationXtender could be a repository. It's a place to put things. We standardize the various terms used by various storage platforms to simply "repository".

These repositories are "imported" into Grooper as a CMIS Repository object, as a child of the CMIS Connection object. This doesn't import data into Grooper in the traditional sense of importing documents into a batch. "Importing" here is more like bringing the repository into a framework Grooper can use (creating the CMIS Repository object). Upon importing the repository, Grooper has full file access to that location in the storage platform.

For our purposes, repositories are like filing cabinets full of documents. Once a connection is established, it's like giving Grooper a key to that cabinet. You can open the various drawers of that cabinet. You can pull out files and put files into it. The storage platform or content management system is like the cabinet. The CMIS Connection object is like the key. The CMIS Repository object is like a drawer in the cabinet. You "connect" to the cabinet by turning the key. You "import" the repository by opening the drawer. Now you can see there are documents in there! You can take them out. You can read them and put them back in. You can put new ones in. You can use this "open" connection to the "drawer" however you need.

CMIS+ Architecture

Grooper expanded on this idea in version 2.72 to create our CMIS+ architecture. CMIS+ unifies all content platforms under a single framework as if they were traditional CMIS endpoints. Prior to version 2.72, there was only one type of CMIS Connection, a true CMIS connection using CMIS 1.0 or CMIS 1.1 servers. Now, connections to additional non-CMIS document storage platforms can be made via "CMIS Bindings". This provides standardized access to document content and metadata across a variety of external storage platforms.

Using this architecture, Grooper is able to create a simpler and more efficient import and export workflow, using a variety of storage platforms. You now use the CMIS Import and CMIS Export providers, regardless of the storage platform. They connect to a CMIS Repository imported from a CMIS Connection and use that as Grooper's import or export path.

How you create a CMIS Connection differs only from one CMIS Binding to the next, as each binding connects to its platform in a different way. You don't connect to an Outlook inbox the same way you connect to a Windows file folder, for example.

CMIS Bindings

A CMIS Binding provides connectivity logic for external storage platforms, allowing CMIS Connection objects to import and export content. Grooper's CMIS+ architecture expands connectivity from traditional CMIS servers to a variety of on-premise and cloud-based storage platforms by exposing connections to these platforms as CMIS Bindings. Each individual CMIS Binding contains the settings and logic required to exchange documents between Grooper and each distinct platform. For example, the AppXtender Binding contains all the information Grooper uses to connect to the ApplicationXtender content management system.

CMIS Bindings are used when creating a CMIS Connection object. The first step in creating a CMIS Connection is to configure the Connection Type property. Here, the user selects which CMIS Binding to use, and therefore which storage platform to connect to. The second step is to enter the connection settings for that binding, such as the login information many bindings require.

Current CMIS Bindings

Grooper can connect to the following storage platforms using CMIS Bindings:

About CMIS Import

The CMIS Import provider is split into two different Import Providers:

  • Import Descendants
  • Import Query Results

These providers are designed to import files from a folder structure of an on-premise or cloud-based document storage platform. This is the primary method of Batch creation when importing digital documents into Grooper to process them with a Batch Process.

In order to do this, a few requirements must be met first.

  1. A CMIS Connection object must be made and configured. This will connect Grooper to the document storage platform.
    • This may be a connection to a Windows folder, an email inbox, a true CMIS content management system, or other document storage platforms. What the CMIS Connection connects to is determined by the CMIS Binding selected when configuring the Connection Type property of the CMIS Connection object.
  2. A CMIS Repository must be imported. This will create an object Grooper can use to import documents from the folders in the document storage platform.
    • This acts as a "go-between" or a "hub" for Grooper to pull in documents from the content's source. Or, you may think of this as Grooper's representation of a folder location in the document storage platform.

For more information on adding a CMIS Connection and importing a CMIS Repository, visit the CMIS Connection article.

As for the difference between the Import Descendants and Import Query Results providers, you can think of Import Query Results as a more specialized version of Import Descendants.

  • Import Descendants is intended to import the full contents of a folder location. It imports the "descendant" files of a parent folder.
  • Import Query Results allows you to selectively import files using a SQL-like query (called a CMISQL query). Only files returned by the query will be imported. For example, using an Exchange or IMAP CMIS Connection, you could query an inbox for emails from a specific sender and only import those emails.
    • Note: There are some import filtering capabilities available to Import Descendants as well using a SQL-like query. However, the CMISQL query Import Query Results uses is much more robust. That said, only certain CMIS Bindings can take advantage of this increased CMISQL query functionality.
    • The following CMIS Bindings are not currently suitable for the Import Query Results provider.
      • NTFS
      • FTP
      • SFTP
      • OneDrive

Import Descendants

Configuration Panel

This is the configuration screen for the Import Descendants provider. This example uses a simple configuration to import a few PDFs from a local Windows folder. Configuration is divided into four sections:

  • General
  • Processing Options
  • Disposition
  • Batch Creation

Cmis-import-import-decendants-1.png

General Settings

At bare minimum, you will need to tell Grooper where to look for the imported files. That is, mostly, what the General property settings are for.

In this case, we want to import the PDF files in this folder on the local hard drive.

Cmis-import-import-decendants-2.png

As discussed earlier, there are some minimum requirements before configuring Import Descendants.

  1. Here, a CMIS Connection has been made and a child CMIS Repository has been imported.
    • In this case, the NTFS binding was used for the CMIS Connection's Connection Type. The folder named "Import and Export" was imported as the CMIS Repository.
  2. The folder named "Grooper Import Folder" is where we want to import from.

Cmis-import-import-decendants-3.png

Back to the Import Descendants configuration screen, the CMIS Repository object is used to point Grooper to this folder location for import.

  • The Repository property is configured to assign the CMIS Repository where the documents are located.
    • Here, it is set to the CMIS Repository named "Import and Export", which connects to the "Import and Export" folder of the local drive.
  • The Base Folder property is configured to traverse the folder structure of the CMIS Repository.
    • Here, we don't want to import all documents from every folder in the "Import and Export" folder. We just want to import from the "Grooper Import Folder".
  • The Import Filter property allows you to perform some basic import filtering to selectively choose which documents you want to import.
    • SELECT * FROM File is the default filter. It will import all files from the selected folder location.
    • This is a SQL-like query to specify conditions for document import. However, the Import Query Results provider was created to expand on this functionality and provides more filtering options as well as a simpler interface to perform the query (for the CMIS Bindings capable of utilizing this functionality).
  • The Content Type property allows you to optionally assign the incoming documents with a Document Type.
    • You can use this property to assign a default classification for all incoming documents.

Cmis-import-import-decendants-4.png

Processing Options Settings

The most important part of the Processing Options property section is the Import Mode property.

The Import Mode property allows control over the connections Grooper makes and/or retains to the imported documents.

For importing, documents contain two important sets of information:

  • Content - Images and native text data
  • Properties - Metadata associated with the file. Digital information, such as the document's filename, file type, creation date, and more.

Depending on the Import Mode selected, all, some, or none of this information will be copied to your Grooper Repository's file store (in the case of the document's content) and database (in the case of the document's properties). See below for a more in-depth explanation of each of the Import Mode options.


Cmis-import-import-decendants-5.png

Full

  • Both properties and content will be loaded. This is a total duplication of the document from its source to your Grooper Repository's local file store. This is the slowest import mode, because the full content of each document is copied during a single-threaded import process. As such, this mode is not well-suited for high-volume imports, but provides some useful advantages in low-volume import scenarios.
  • For example, Full mode allows items to be deleted immediately on import. Also, Full mode avoids the need for any follow-up content loading operations in the Batch Process.

Sparse

  • Properties will be loaded, but content will not. This mode is much faster than a Full import, because no content files are copied into your local Grooper file store. Instead, a link is saved on each Grooper document, and content is retrieved on demand directly from the CMIS Repository. This type of document is often referred to as a "sparse" document. Sparse documents can be used just like any other document, with the caveat that display and processing speeds may be reduced. Grooper has to traverse the document link in order to display or process the document's image.
  • However, after a Sparse import, document content can be loaded multi-threaded using the Execute activity in a Batch Process. Overall, this can load a document's content faster than a Full import.
    • Choose CMIS Document Link as the Object Type and Load Content as the Command.

Link Only

  • No content or properties will be loaded, making this the fastest import mode. It imports nothing more than a link to each document, and offloads all property and content loading to parallel operations in the Batch Process.
  • However, this does not produce a usable document in Grooper. After a Link Only import, document content must be loaded using the Execute activity in a Batch Process.
    • Choose CMIS Document Link as the Object Type and Load Content as the Command.
  • You can think of the Link Only option as an even sparser sparse import.


See the table below for a summary of the Import Mode options.

Import Mode | Speed | Comments
Full | Slow | Full import of content and their properties. Required if deleting content from the source on import.
Sparse | Fast | Imports a link to the document's source and its properties but not their content. This produces a usable document in Grooper without copying the full content into Grooper, saving time upon import. This mode is the same as enabling the old Sparse Import property in previous versions.
Link Only | Fastest | Only imports a link to the document's source. Does not produce a usable document. The document's properties must be loaded in a step in a Batch Process.
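
The three modes can be summarized as a small lookup table. The following Python sketch is only an illustration of the descriptions above; the class and function names are hypothetical and are not part of Grooper.

```python
from dataclasses import dataclass
from enum import Enum


class ImportMode(Enum):
    FULL = "Full"
    SPARSE = "Sparse"
    LINK_ONLY = "Link Only"


@dataclass(frozen=True)
class ModeBehavior:
    copies_properties: bool  # metadata is stored in Grooper at import time
    copies_content: bool     # file content is copied into Grooper at import time


# Hypothetical summary of what each mode copies, per the descriptions above.
BEHAVIOR = {
    ImportMode.FULL: ModeBehavior(copies_properties=True, copies_content=True),
    ImportMode.SPARSE: ModeBehavior(copies_properties=True, copies_content=False),
    ImportMode.LINK_ONLY: ModeBehavior(copies_properties=False, copies_content=False),
}


def can_delete_on_import(mode: ImportMode) -> bool:
    """Deleting the source is only safe when the content was fully copied."""
    return BEHAVIOR[mode].copies_content


def requires_load_step(mode: ImportMode) -> bool:
    """A link-only document is not usable until a later load step runs."""
    behavior = BEHAVIOR[mode]
    return not (behavior.copies_properties or behavior.copies_content)
```

Under these assumptions, only Full allows deleting the source on import, and only Link Only requires a follow-up load step before the document is usable. Sparse documents remain usable because their content is retrieved on demand through the link.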

Disposition Settings

The Disposition property settings allow you to do something with the source documents after importing them into Grooper, namely delete them, move them, or do nothing and just leave them alone where they came from. This is often leveraged with the Import Watcher Grooper service to prevent repeatedly importing the same document.

In our example here, the Move to Folder property is configured to move the PDF documents to a folder named "Imported Documents".

  • The folder location you're moving documents to must be accessible via the connected CMIS Repository.

If using the Full Import Mode, you can enable the Delete Item property to delete each document after it is imported into the Grooper Batch.

  • This property is ONLY available when choosing the Full Import Mode. A sparsely imported document needs to call to the import storage location in order to load the document's image for display or processing. If you deleted the document upon import, you wouldn't be able to view it or do anything with it.

The Update Properties property allows you to alter the document's property values upon import. Property values are updated using a list of "key-value pairs" where the "key" is the name of the property and the "value" is what change you want to make to that property. You can type one entry per line in the format key=value.

  • Examples:
    • Archive=true: Sets the archive attribute on a file.
    • Status=PENDING: Sets the "Status" field on ApplicationXtender documents.
    • Imported=true: Sets the "Imported" field on SharePoint documents.
    • IsRead=true: Sets the "IsRead" flag on an Exchange message.
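
To make the one-entry-per-line key=value format concrete, here is a minimal Python sketch that parses such a list into a dictionary. It is not how Grooper itself reads the property; the function name and the error handling are assumptions made for illustration.

```python
def parse_update_properties(text: str) -> dict[str, str]:
    """Parse one 'key=value' entry per line into a dict (illustrative only)."""
    updates: dict[str, str] = {}
    for line_number, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between entries
        key, separator, value = line.partition("=")
        if not separator or not key.strip():
            raise ValueError(
                f"line {line_number}: expected key=value, got {line!r}"
            )
        updates[key.strip()] = value.strip()
    return updates


example = "Archive=true\nStatus=PENDING\n"
print(parse_update_properties(example))
# {'Archive': 'true', 'Status': 'PENDING'}
```

Splitting on the first "=" only means a value may itself contain an equals sign, which seems the safer reading of the format.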

Cmis-import-import-decendants-6.png

Batch Creation Settings

It's likely you're importing documents because you want to run them through a Batch Process. The Batch Creation property settings allow you to define which Batch Process you wish to use to process the imported documents.

This is done using the Starting Step property, selecting a Batch Process Step in a Batch Process from the published Batch Processes in the Grooper Repository. Upon import, a new Batch is created with each document as a Batch Folder, and the selected Batch Process assigned to the Batch.

There are also further properties to control Batch creation. You can limit the number of documents imported per Batch using the Maximum Items per Batch property. By default, new Batches are named with a date/time stamp. However, the Batch Name Prefix allows you to tack on a prefix to the Batch's name for easier identification. The Start Paused property will automatically trigger the Batch Process if set to False.
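
As a rough illustration of what a per-Batch item limit implies, the Python sketch below partitions a list of imported items into groups no larger than the limit. The function name is hypothetical, and how Grooper actually splits and names Batches internally is not described here.

```python
def split_into_batches(items: list, max_items_per_batch: int) -> list[list]:
    """Partition items into consecutive groups of at most max_items_per_batch."""
    if max_items_per_batch < 1:
        raise ValueError("max_items_per_batch must be at least 1")
    return [
        items[start:start + max_items_per_batch]
        for start in range(0, len(items), max_items_per_batch)
    ]


documents = ["a.pdf", "b.pdf", "c.pdf", "d.pdf", "e.pdf"]
print(split_into_batches(documents, 2))
# [['a.pdf', 'b.pdf'], ['c.pdf', 'd.pdf'], ['e.pdf']]
```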

Cmis-import-import-decendants-7.png

Import Query Results

Configuration Panel

The Import Query Results provider's configuration panel is almost identical to the Import Descendants provider's configuration panel. Both providers share the same Processing Options, Disposition, and Batch Creation property settings. See the Import Descendants section for brief descriptions of these property sections.

The big difference between the two providers is the highlighted CMIS Query property. This allows users to enter a SQL-like query (called a CMISQL query) to selectively import documents from their source, based on certain metadata properties. Only files returned by the query will be imported. For example, you may want to only import documents of a certain file type. You could include the file extension as the query condition. You can use CMISQL queries to easily filter email messages when importing from an inbox. If you only wanted to import messages from a certain sender, you could include the sender's email address as the query condition.

Only certain external storage platforms are currently queryable with the CMIS Query property. The following CMIS Binding sources cannot be queried currently. As such, they are not suitable for Import Query Results. You should instead use Import Descendants for the following CMIS Bindings.
  • NTFS
  • FTP
  • SFTP
  • OneDrive
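
The rule above can be written as a simple lookup. This Python sketch is only an illustration: the binding names are plain strings taken from the list above, the function name is hypothetical, and it assumes any binding not on the list is queryable.

```python
# Bindings listed above as not currently queryable.
NON_QUERYABLE_BINDINGS = frozenset({"NTFS", "FTP", "SFTP", "OneDrive"})


def available_providers(binding: str) -> list[str]:
    """Return the CMIS Import providers suitable for a given binding name."""
    providers = ["Import Descendants"]  # usable with every binding
    if binding not in NON_QUERYABLE_BINDINGS:
        providers.append("Import Query Results")
    return providers


print(available_providers("NTFS"))      # ['Import Descendants']
print(available_providers("Exchange"))  # ['Import Descendants', 'Import Query Results']
```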

Cmis-import-import-query-results-8.png

Just like with Import Descendants, there are some minimum requirements before configuring Import Query Results. A CMIS Connection object must be created and a CMIS Repository must be imported.

  1. Here, a CMIS Connection has been made and a child CMIS Repository has been imported.
    • In this case, the Exchange binding was used for the CMIS Connection's Connection Type. This binding is used to connect Grooper to Microsoft Exchange email servers.
  2. As you can see, all the folders in this email inbox are accessible to Grooper.

Cmis-import-import-query-results-9.png

  1. To enter the CMISQL query, first use the Repository property to select the CMIS Repository you are importing from.
  2. Select the CMIS Query property.
    • Note: If you select a CMIS Repository that is not queryable (such as NTFS repositories), this property will not be displayed.
  3. Press the ellipsis button at the end to bring up the CMIS Query editor window.

Cmis-import-import-query-results-10.png

CMIS Query Configuration

Upon pressing the ellipsis button at the end of the CMIS Query property, the following window will appear.

This interface allows you to configure the CMISQL query based on available metadata from the CMIS Binding. For example, the Exchange binding has a selection of queryable metadata for email messages, such as the email's subject, sender and date the message was received.

The query example here selectively filters an email inbox based on the following conditions:

  1. Only email messages are to be imported.
    • This is controlled by the Content Type property, here set to Message. Different CMIS Bindings have different Content Types, depending on the storage platform. Some platforms are simpler and only have File and Folder, corresponding to files and folders in the storage platform. Some, such as Exchange, have additional Content Types for different types of content. The Message Content Type corresponds to email messages. By limiting the Content Type to Message, we aren't going to import other content available to the Exchange binding, such as appointments or contacts.
  2. All properties are going to be searched.
    • This is filtered by the Select Elements property. You can choose to limit which metadata properties are queried using this property. The * character indicates all properties are queried.
  3. Only messages in the "Wiki" folder in the inbox will be imported.
    • The Search Scope property allows you to control where in the storage platform's folder hierarchy you wish to search. If you leave this property blank, the entire repository's folder structure will be queried.
  4. Only messages with certain properties are to be imported. Only emails sent by "[email protected]", with "Wiki Vitals" in the title, sent after 12/01/2020 that have not been read should be imported.
    • These properties are filtered by the property search grid. Here, all queryable metadata for the CMIS Binding and selected Content Type are displayed. For each property, you can use the operator column and search value column to indicate what conditions must be met for import.
    • Note: For text searching (as we did to query the "Subject" and "Sender" properties), use the "LIKE" operator and place percentage symbols (%) before and after your search string.
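
The text-search note above amounts to wrapping the search string in percentage symbols inside a LIKE condition. Here is a minimal Python sketch of that step. The helper name is hypothetical, and rejecting quotes and percent signs is a simplifying assumption rather than a statement about how Grooper escapes values.

```python
def like_contains(property_name: str, search_text: str) -> str:
    """Build a 'contains' style LIKE condition for a CMISQL WHERE clause."""
    if "'" in search_text or "%" in search_text:
        # Escaping rules are not covered here, so refuse ambiguous input.
        raise ValueError("search text must not contain quotes or % characters")
    return f"{property_name} LIKE '%{search_text}%'"


print(like_contains("Subject", "Wiki Vitals"))
# Subject LIKE '%Wiki Vitals%'
```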

Cmis-import-import-query-results-11.png

This configuration editor writes the CMISQL query for you. You can verify the query using the "CMISQL" tab.

  1. Switch to the "CMISQL" tab.
  2. In the text editor here, you can see the full CMISQL query. You can also use this editor to manually type out full queries yourself.

Cmis-import-import-query-results-12.png

Whether in the "Basic Search" tab or the "CMISQL" tab, you can verify the results of the query using the "Execute Query" button. This will display a list of items that will be imported by the Import Query Results provider.

  1. Press the "Execute Query" button.
  2. A list of items satisfying the query conditions will populate in the list below.
  3. Press the "OK" button when finished configuring your query.

Cmis-import-import-query-results-13.png

Example Queries

These are samples of a query string. They take the following general form.

SELECT * FROM <ContentType> WHERE <Criteria> ORDER BY <Sort>

Let's break down each component, or "clause", to get a better idea of how this works.

SELECT

This specifies which properties are to be returned with query results.

If you are querying all properties, the asterisk (*) indicates that all properties should be returned.

For example: SELECT *

Otherwise, you will list them out separated by commas.

For example: Let's say you are querying an Exchange repository and you only want to search the sender and recipients properties. You'd type SELECT Sender, ToRecipients, CcRecipients, BccRecipients to limit your query to only those four properties.

FROM

This clause indicates the type of object to search for. This will be a content type defined in the CMIS Repository. If the content type is document based, the query result will be a CMIS Document. If it is folder based, it will be a CMIS Folder.

The content type specified in the FROM clause has two jobs. One, it defines what properties are available to the other clauses. Two, it limits the scope of the search to only objects of the type specified in the clause.

For example: Let's say you are querying an Exchange repository and want to search email messages and not contacts or tasks or appointments. You'd type FROM Message to limit your query to just the Message content type.

WHERE

This is how you define what search conditions must be met for an item to be included in your set of results. Multiple conditions can be joined with the AND, OR, or NOT operators. You can change the order of operations by using nested parentheses. Each condition is expressed using a predicate. The following is a list of predicates. Note that not every property type can use every predicate. For example, the Subject property on the Exchange binding cannot use the "=" operator.

Predicate | Description | Example
Comparison Predicate | Specifies a condition for an individual property using comparisons, such as "equals to" or "less than". The LIKE and IS operators are also comparison predicates. | invoice_date<'12/31/2007'
IN Predicate | Specifies a list of allowed values for a property. This list is separated by commas. | FileExtension IN ('.pdf', '.docx', '.xlsx')
CONTAINS Predicate | Specifies a full-text query. You can use AND, OR and NOT operators. | CONTAINS('mortgage AND payment AND NOT vehicle')
Scope Predicate | Restricts the search scope to children or descendants of a folder. | IN_FOLDER(/Inbox)

Note: The NOT operator cannot be used with the IN_FOLDER or IN_TREE predicates.

For example: Let's say you are querying an Exchange repository and want to find an email in the inbox folder that contains the words "cake" and "free" but not "birthday", was not received last Christmas Day, and has attachments. That would look something like WHERE IN_FOLDER(/Inbox) AND CONTAINS('cake AND free AND NOT birthday') AND (DateTimeReceived<>'12/25/2018') AND (HasAttachments=True)

ORDER BY

This is an optional clause which allows you to specify the order in which results are returned.  You can sort by multiple properties using a comma separated list.  Optionally, each property name may be followed by ASC or DESC to indicate ascending or descending sort direction.  The default sort direction is ascending.

For example: If you wanted to sort the results of an Exchange repository query first by whether the messages have attachments and then by size in descending order, you would type ORDER BY HasAttachments, Size DESC
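
To show how the sort direction applies per property, the Python sketch below emulates ORDER BY HasAttachments, Size DESC on a few made-up message records. The sample data is hypothetical, and it assumes False sorts before True in ascending order, as it does in Python; the ordering of boolean values on a given platform may differ.

```python
# Hypothetical sample rows standing in for query results.
messages = [
    {"Subject": "a", "HasAttachments": True, "Size": 500},
    {"Subject": "b", "HasAttachments": False, "Size": 200},
    {"Subject": "c", "HasAttachments": False, "Size": 900},
    {"Subject": "d", "HasAttachments": True, "Size": 1200},
]


def order_by_attachments_then_size_desc(rows: list[dict]) -> list[dict]:
    """Emulate ORDER BY HasAttachments, Size DESC.

    HasAttachments has no direction keyword, so it sorts ascending.
    DESC applies only to Size, so size is negated to reverse its order.
    """
    return sorted(rows, key=lambda row: (row["HasAttachments"], -row["Size"]))


ordered = order_by_attachments_then_size_desc(messages)
print([row["Subject"] for row in ordered])
# ['c', 'b', 'd', 'a']
```

Note that DESC binds only to the property it follows. The messages without attachments come first, and within each group the largest message is listed first.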

Putting It All Together

Let's mash all our examples together and search for email messages in the Inbox that have the words "cake" and "free" but not "birthday" in the body, received any day besides Christmas Day. I'm going to go ahead and search all the properties available to me, and I want to sort the results by whether the message has attachments and by size in descending order. This would be the resulting query:

SELECT * FROM Message WHERE IN_FOLDER(/Inbox) AND CONTAINS('cake AND free AND NOT birthday') AND (DateTimeReceived<>'12/25/2018') ORDER BY HasAttachments, Size DESC
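
The same query can be assembled clause by clause. The following Python sketch is a hypothetical helper, not a Grooper API. It simply joins the four clauses in the general form shown earlier and combines the WHERE conditions with AND.

```python
from typing import Sequence


def build_cmisql(
    content_type: str,
    conditions: Sequence[str] = (),
    select: Sequence[str] = ("*",),
    order_by: Sequence[str] = (),
) -> str:
    """Assemble SELECT ... FROM ... [WHERE ...] [ORDER BY ...] as one string."""
    query = f"SELECT {', '.join(select)} FROM {content_type}"
    if conditions:
        query += " WHERE " + " AND ".join(conditions)
    if order_by:
        query += " ORDER BY " + ", ".join(order_by)
    return query


query = build_cmisql(
    "Message",
    conditions=[
        "IN_FOLDER(/Inbox)",
        "CONTAINS('cake AND free AND NOT birthday')",
        "(DateTimeReceived<>'12/25/2018')",
    ],
    order_by=["HasAttachments", "Size DESC"],
)
print(query)
```

The printed string matches the query above character for character. The helper only concatenates text, so each condition must already be valid CMISQL for the binding being queried.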

More Examples

Filter | Description
SELECT * FROM File | Import all descendant files. This will import all files in the repository without any foldering.
SELECT * FROM File WHERE AT_LEVEL(1) | Import files which are immediate children. This will only import files at that level, not from subsequent levels.
SELECT * FROM Folder | Import folders which are immediate children. This will import both files and their foldering.
SELECT * FROM File WHERE cmis:name MATCHES '^\d{4}-\d{2}-\d{2}' | Import files with a specific naming pattern, using regular expression.
SELECT * FROM File WHERE cmis:name LIKE 'ca%' | Import files with a name starting with ca.
SELECT * FROM File WHERE cmis:contentStreamLength > 10000 | Import files larger than 10,000 bytes.
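
To get a feel for what the last three filters select, the Python sketch below applies equivalent checks to a few made-up file records. It is a local emulation only. The sample files are hypothetical, LIKE is treated as a case-sensitive prefix match, and Python's regular expression engine stands in for whatever engine evaluates MATCHES.

```python
import re

# Hypothetical sample records using the property names from the filters above.
files = [
    {"cmis:name": "2021-03-15 invoice.pdf", "cmis:contentStreamLength": 48213},
    {"cmis:name": "cake recipe.docx", "cmis:contentStreamLength": 9120},
    {"cmis:name": "catalog.pdf", "cmis:contentStreamLength": 250000},
    {"cmis:name": "notes.txt", "cmis:contentStreamLength": 300},
]

DATE_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2}")


def names(rows: list[dict]) -> list[str]:
    """Return just the file names, for easier reading."""
    return [row["cmis:name"] for row in rows]


# cmis:name MATCHES '^\d{4}-\d{2}-\d{2}'
matches_pattern = [f for f in files if DATE_PREFIX.search(f["cmis:name"])]

# cmis:name LIKE 'ca%'
starts_with_ca = [f for f in files if f["cmis:name"].startswith("ca")]

# cmis:contentStreamLength > 10000
larger_than_10000 = [f for f in files if f["cmis:contentStreamLength"] > 10000]

print(names(matches_pattern))    # ['2021-03-15 invoice.pdf']
print(names(starts_with_ca))     # ['cake recipe.docx', 'catalog.pdf']
print(names(larger_than_10000))  # ['2021-03-15 invoice.pdf', 'catalog.pdf']
```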

Version Differences

Box Integration (2.90)

Grooper 2.9 sees the addition of the Box.com document storage platform into the CMIS fold via the Box (CMIS Binding).

Legacy Providers (2.72)

Old import and export providers should be replaced with this new functionality. While Grooper's older import and export providers are available as "Legacy Import" and "Legacy Export" providers, these components are deprecated. They will still function but will no longer be upgraded in future versions of Grooper.

Grooper can import documents using CMIS Connections via Import Descendants and Import Query Results. Grooper can export via the CMIS Export providers, Mapped Export and Unmapped Export.

New Connection Types (2.72)

By creating the CMIS+ architecture, we have been able to create new connections between Grooper and content management systems. Grooper can now connect to Microsoft OneDrive, SharePoint, and Exchange via new CMIS Bindings. Since these were created as CMIS Bindings, they can be used by the CMIS Import and CMIS Export providers. Instead of having to create three new import providers and three new export providers for a total of six brand new components, we can use the already established CMIS import and export providers in the CMIS+ framework. A user can create a CMIS Connection using the OneDrive, SharePoint or Exchange bindings, and use the same import and export providers for them as any of the other CMIS Bindings.

This will also allow Grooper to create CMIS Bindings for currently unavailable content management systems much more quickly and easily in the future.

Import Mode (2.72)

In version 2.72 the Import Mode property replaces previous versions' Sparse Import property.

Import Disposition (2.72)

Version 2.72 adds the Import Disposition property to CMIS Import. This allows you to change your documents' disposition upon importing them into Grooper. You can delete them, move them to a folder, or update one or more properties on the document itself. This can be leveraged with Import Watcher to prevent repeatedly importing the same document.