2023:Export (Activity): Difference between revisions
Revision as of 12:07, 1 November 2023
The Export activity exports processed document content to an external storage platform.
Export is an Unattended Activity, typically added as one of the last steps (if not the last step) of a Batch Process. It allows Grooper users to deliver processed Batch content to an external system. Whether exporting Batch Folders as PDF files to a Windows folder, exporting extracted Data Model fields to a SQL database, exporting to a content management system, or some combination of multiple exports to multiple systems, the Export activity handles how document Batch Folders in a Batch ultimately leave Grooper after they have been classified and had their data extracted.
How documents are exported (what gets exported, where they go, and what format the exported content takes) is all controlled by Export Behaviors. An Export Behavior is a set of properties that controls how Batch Folder content is exported based on its Document Type classification. Export Behaviors can be configured locally, as part of the Export activity's property configuration, or for a particular Content Type, by configuring the Behaviors property of a Content Model and/or its descendant Content Categories or Document Types.
About
So you've ingested some documents into a Batch. You've obtained their full text data with the Recognize activity, either through OCR or extracting their native embedded text. You've classified these documents, assigning the Batch Folders a Document Type from a Content Model during the Classify activity. You've collected the data you want from these documents during the Extract activity. Now what?
You need to get these documents and that data out of Grooper!
Enter the Export activity. Grooper is designed to be a document processing platform. It is a powerful tool to model document sets and their data (according to a Content Model) and put unprocessed pages or files through a step by step list of processing instructions (according to a Batch Process) to ultimately organize them and collect information from them. However, Grooper is not designed to be a content management system or a storage platform. Once your documents are organized and Grooper has extracted the data you want from them, you generally want to put those files and data in an external endpoint, such as a file system, a database, a content management system or some combination thereof.
The Export activity's job is to get document content out of Grooper, according to your specifications. Using one or more Export Behavior configurations, you can control how processed document content is exported, how it's indexed in each storage location, what data goes where, what file format certain content should take, and more.
FYI: How you export documents in Grooper underwent some serious changes in version 2021. In previous versions, there were two separate export activities: Document Export and Database Export.
To simplify things, we combined these two Activities into the singular Export activity. Whether you're exporting document files or data to a database, you use the Export activity and Export Behavior configurations in either case.
Just What Is "Document Content"?
We're going to talk a lot about "document content" throughout this article. Ultimately, the Export activity controls what content is exported and how it is exported. So, what do we mean by "document content"?
In terms of its content, you can break up a document processed by Grooper into (at least) three meaningful components:
- The document's image
- The document's full text
- The document's extracted data
Each of these kinds of content is another layer of a whole document (represented as a Batch Folder in a Batch, along with its child Batch Pages and/or files attached to the Batch Folder). Grooper's job is to take source material (scanned pages or imported files), derive the content you desire (such as extracting Data Elements from a Data Model) and, using the Export activity, recombine this content into deliverable files or data sent to one or more storage endpoints.
Image Content
The document's image is simply what the viewer physically sees when viewing the document. Whether scanned pages or a digital file, like a PDF, this content comprises the pixels on the screen you're looking at when reading a document. This content can be altered in a Batch Process by the Image Processing activity, which is a typical part of processing scanned documents to clean up the image before OCR. Upon Export, Grooper can build a new file from these images, or just export whatever image content was originally imported.
Full Text Content
A good deal of document processing automation requires machine readable text to parse words, phrases and other text data. Grooper obtains a document's full text data through the Recognize activity, OCRing images or extracting embedded digital text. These results can then be embedded into the exported file as another part of its content during Export.
Extracted Data Content
Last but not least, the Extract activity in a Batch Process will collect information from the document, according to its classified Document Type and Data Model. This may be simple indexing data, even just the Document Type assigned during the Classify activity. This may be every meaningful data point on the document, obtained from a Data Model with hundreds of extracted Data Elements. Regardless, this needs to be stored somewhere and somehow, such as in a SQL database, content management system, or as a separate data file, like an XML or CSV file.
How you merge this content into new files, define what storage platform it goes to, and how extracted data can drive indexing considerations is all controlled by the Export activity's Export Behavior configuration.
Export Behaviors
The Export activity exports documents according to an Export Behavior. This is a set of export property configurations based on the Content Type (i.e. Document Type of a Content Model) assigned to a document Batch Folder during document classification. Once a Batch Folder is assigned a Document Type, you have something you can point to that controls the flow of traffic out of Grooper.
For documents "A", build a PDF file and put them in folder "A" in a file system, for example. For documents "B", put them in folder "B" and export their data to a database while you're at it. For document "C", you might do something entirely different. Or, you might perform essentially the same export for all Document Types in a Content Model. Export Behavior configurations are how you tell Grooper what to do for one Document Type or another upon export.
Export Behaviors can be configured for any Content Type object. This includes a parent Content Model or any of its descendant Document Types or Content Categories.
This allows you to use the Content Model's hierarchy to determine how you want to export documents of a certain Document Type.
- If you want to perform the same, generic export for all Document Types in a Content Model, you can configure a single Export Behavior solely for the Content Model applying to all its child Document Types.
- If a group of Document Types under a single Content Category all should be exported in the same manner, you can configure an Export Behavior for the Content Category. Those settings will apply to any of its child Document Types.
- If every Document Type or certain Document Types have their own specific export configuration, you can configure individual Export Behaviors for one or more Document Types (or all of them!).
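One way to picture how the Content Model's hierarchy drives export is a lookup that walks from a Document Type up through its ancestors. The sketch below is only an illustration in Python, not Grooper's implementation: the `ContentType` class, its `export_behavior` attribute, and the "nearest configured ancestor wins" rule are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContentType:
    """Hypothetical stand-in for a Content Model, Content Category, or Document Type."""
    name: str
    parent: Optional["ContentType"] = None
    export_behavior: Optional[str] = None  # placeholder for a configured Export Behavior


def resolve_export_behavior(content_type: ContentType) -> Optional[str]:
    """Return the behavior on the node itself or, failing that, on its nearest ancestor.

    Assumption: the closest configured behavior wins. The article does not
    spell out what happens when several levels are configured at once.
    """
    node: Optional[ContentType] = content_type
    while node is not None:
        if node.export_behavior is not None:
            return node.export_behavior
        node = node.parent
    return None


# Hypothetical hierarchy: one model, one category, two Document Types.
model = ContentType("Invoices Model", export_behavior="generic PDF export")
category = ContentType("Vendor Invoices", parent=model)
doc_a = ContentType("Vendor A", parent=category)
doc_b = ContentType("Vendor B", parent=category,
                    export_behavior="PDF plus database export")

print(resolve_export_behavior(doc_a))  # falls back to the Content Model's behavior
print(resolve_export_behavior(doc_b))  # uses the behavior set on the Document Type
```

Here "Vendor A" has nothing configured, so it picks up the generic export from the Content Model, while "Vendor B" uses its own configuration.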
Export Behaviors can be configured in one of two ways:
- Using the Behaviors property of a Content Type object
- A Content Model
- A Content Category
- Or, a Document Type
- As part of the Export activity's property configuration
In either case, export settings are added as one or more Export Definitions of the Export Behavior. Once a document is classified and it is assigned a Document Type its Export Behavior's configured Export Definition(s) will define how the document content is exported. The main difference is how you get to the Export Behavior property.
Content Type Export Behaviors
An Export Behavior configuration can be added to any Content Type object (i.e. Content Models, Content Categories, and Document Types) using its Behaviors property. Doing so will control how a Document Type "behaves" upon export.
Export Activity Export Behaviors
Export Behaviors can also be configured as part of the Export activity's configuration. These are called "local" Export Behaviors. They are local to the Export activity in the Batch Process.
Export Definitions
Regardless of whether the Export Behavior is set up directly on the Content Type object or with the Export activity's local property grid, how document content is exported is defined using one or more Export Definitions.
Export Definitions functionally determine three things:
- Location - Where the document content ends up upon export. In other words, the storage platform you're exporting to.
- Content - What document content is exported: image content, full text content, and/or extracted data content.
- Format - What format the exported content takes, such as a PDF file or XML data file.
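As a rough mental model, the three things an Export Definition determines can be written down as a small record. The Python sketch below is purely illustrative: `ExportDefinition`, `Content`, the field names, and the sample folder path are hypothetical and do not correspond to Grooper's actual property names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class Content(Enum):
    IMAGE = "image"
    FULL_TEXT = "full text"
    EXTRACTED_DATA = "extracted data"


@dataclass(frozen=True)
class ExportDefinition:
    """Hypothetical record of what one Export Definition decides."""
    location: str                 # where the content ends up (folder, database, ...)
    content: FrozenSet[Content]   # which kinds of document content are exported
    file_format: Optional[str]    # e.g. "PDF" or "XML"; None when no file is built

    def describe(self) -> str:
        kinds = ", ".join(sorted(c.value for c in self.content))
        fmt = self.file_format or "no file"
        return f"{kinds} -> {self.location} ({fmt})"


# A searchable PDF (image plus full text) delivered to a folder.
pdf_to_folder = ExportDefinition(
    location="C:/Exports/Invoices",
    content=frozenset({Content.IMAGE, Content.FULL_TEXT}),
    file_format="PDF",
)
print(pdf_to_folder.describe())
```

Thinking in these three terms makes it easier to decide how many Export Definitions a given Export Behavior needs.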
Your primary consideration is Location. Where do you want these files and/or data to end up? Are you exporting files to a Windows file system? Are you exporting data to a database? Are you exporting content to a content management system, like Box.com? When configuring an Export Definition the first thing you will add is an Export Type. This determines what export endpoint you're using to export document content. The Export activity will deliver document content to the storage platform determined by the Export Type.
Export Types
Each Export Type defines its connection to the endpoint storage location slightly differently.
CMIS Export
For CMIS Export, document content is exported over a CMIS Connection.
For more information, please visit the CMIS Repository and CMIS Export articles.
Data Export
For Data Export, extracted data content is exported to a SQL database or ODBC compliant database, using a Data Connection object.
File Export
For File Export, document content is exported to a Windows file system folder.
IMAP Export
For IMAP Export, document content is exported to email servers using the IMAP protocol.
FTP Export
For FTP Export, document content is exported to an FTP site using the FTP protocol.
SFTP Export
For SFTP Export, document content is exported to an SFTP site using the SFTP protocol.
When choosing an Export Type, you should be asking yourself "Do I want to export files, data, or both?". What content you want to export will inform which location (or locations) you export to. Your answer to this question will impact which Export Type you choose and how you configure it to export document Batch Folder content.
Data Only
If you're purely exporting document data content (values collected from the Extract activity) and nothing else, you're likely looking to export data to a database.
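To make the idea of a data-only export concrete, the sketch below writes one document's extracted field values as a row in a database table. It is a minimal illustration, assuming an in-memory SQLite database in place of a real SQL or ODBC database; the `invoices` table, the field names, and the `export_fields` helper are hypothetical, and Grooper's Data Connection object is not modeled here.

```python
import sqlite3
from typing import Dict


def export_fields(conn: sqlite3.Connection, table: str, fields: Dict[str, object]) -> None:
    """Insert one document's extracted fields as a single row.

    Values are passed as bound parameters. Table and column names are
    interpolated into the statement, so they must come from trusted
    configuration, never from document content.
    """
    columns = ", ".join(f'"{name}"' for name in fields)
    placeholders = ", ".join("?" for _ in fields)
    conn.execute(
        f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})',
        list(fields.values()),
    )
    conn.commit()


# In-memory database used only for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (invoice_number TEXT, vendor TEXT, total REAL)")

export_fields(conn, "invoices",
              {"invoice_number": "INV-1001", "vendor": "Acme", "total": 125.50})

rows = conn.execute("SELECT invoice_number, vendor, total FROM invoices").fetchall()
print(rows)
```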
Files Only
If you're looking to export files, such as PDFs, TIFs, and text files, you have more options depending on the storage location you want your document files to wind up in. Use any of the Export Types that export files (CMIS Export, File Export, FTP Export, SFTP Export, or IMAP Export), depending on where you want to export.
All of these Export Types have a configurable Export Format property, which allows you to build an export file of a given format out of Batch Folder content.
Both Data and Files
When exporting document content, there are a variety of ways to export both data and files.
Export Formats
When exporting content to an export location, you must determine what format that content takes. There are many different types of files out there, and some are better suited to certain types of content than others. XML files are great for storing data, but not so much for image content. TIF files are great for image content, but not so much for full text data.
Export Formats can be configured for any of the Export Types that export files:
- CMIS Export
- File Export
- FTP Export
- SFTP Export
- IMAP Export
Below we will briefly describe each Export Format to give you a better idea of what content you can export with each format.
PDF Format
This will export a PDF file, according to the PDF Format's property grid settings.
XML Format
JSON Metadata
Simple Metadata
Data Fields are exported to a text file as a simple list of key-value pairs.
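To give a feel for what a "simple list of key-value pairs" might look like, here is a short Python sketch. The exact layout Grooper writes is not shown in this article, so the `Name: Value` separator, the one-pair-per-line layout, and the `to_simple_metadata` helper are assumptions for illustration only.

```python
from typing import Dict


def to_simple_metadata(fields: Dict[str, str]) -> str:
    """Render field names and values as one 'Name: Value' pair per line (assumed layout)."""
    return "\n".join(f"{name}: {value}" for name, value in fields.items())


# Hypothetical Data Field values for one document.
text = to_simple_metadata({"Invoice Number": "INV-1001", "Vendor": "Acme"})
print(text)
```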
Delimited Metadata
TIF Format
TIF (or TIFF) is a format used to store high quality raster graphics for graphic design or publishing. However, keep in mind this is an image only format. If you want text-behind embedded in your files, you must use the PDF Format.
Text Format
Upon export, this will generate a text file from the Batch Folder's raw OCR text data, generated from the Recognize activity.
Attachment
For document files imported from a digital source, the Attachment format will output the Batch Folder's attachment file.
Shared Behavior Modes
Remember, Export Behaviors can be configured in one of two ways:
- Using the Behaviors property of a Content Type object
- As part of the Export activity's property configuration
In general, most users will choose to do one or the other. This may just be as simple as what your preference is. Personally, I prefer to set up Export Behaviors on the Content Model and/or its child Document Types and Content Categories. You may prefer to configure one or more Export Behaviors local to the Export activity's property panel. Either way, the Export Behavior (or set of Export Behaviors) will export documents as you configure it.
However, what happens if you have both? For example, Export Behaviors configured on one or more Content Type objects as well as local to the Export activity.
Grooper needs to understand which one should take priority, or whether both should execute in one way or another. This can accommodate more complex exports, but there are different ways you can define how Export Behaviors are shared between Content Types and local Export activity configurations.
This is what the Shared Behavior Mode property of the Export activity is for. It defines how "local" and "shared" Export Behaviors are executed when the Export activity exports Batch Folders in a Batch.
- "Local behaviors" are Export Behaviors configured in an Export activity's local property grid.
- "Shared behaviors" are Export Behaviors configured for a Content Type object (Content Models, Content Categories, and Document Types), using its Behaviors property.
The Shared Behavior Mode property is configured in the Export activity's property grid. It can be set to one of the following values: None, SharedOrLocal, LocalOrShared, LocalAndShared, or SharedAndLocal.
Imagine you have both "shared" and "local" Export Behaviors for two Document Types: "Orange" and "Red"
Shared Behaviors (Configured with the "Orange" and "Red" Document Types' Behaviors properties)
Local Behaviors (Configured with the Export activity's local property grid)
Depending on how you configure the Shared Behavior Mode property, you're going to end up with different results for your export.
With the Export activity left unconfigured, no local Export Behaviors are applied. Only the Document Types' behaviors will execute.
With local Export Behaviors and the Shared Behavior Mode set to None, only the local Export Behaviors will execute.
With the Shared Behavior Mode set to SharedOrLocal, shared behaviors will execute first. Local behaviors will only execute if no shared behavior is present. Only our shared Export Behaviors execute in this instance. Grooper doesn't even bother to look at the local Export Behavior for the "Orange" and "Red" Document Types because shared Export Behaviors are present for these Document Types.
With the Shared Behavior Mode set to LocalOrShared, local behaviors will execute first. Shared behaviors will only execute if no local behavior is present. In our case, no local Export Behavior is present for "Red" documents, but there is a shared Export Behavior configured on the "Red" Document Type.
For the And modes (LocalAndShared and SharedAndLocal) both local and shared Export Behaviors execute. The only difference is which one executes first.
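The rules described for each Shared Behavior Mode can be summarized as a small decision function. The Python sketch below restates the prose rules for a single Document Type; it is not Grooper code. The `SharedBehaviorMode` enum and `behaviors_to_run` function are hypothetical names, and the behaviors themselves are reduced to plain strings.

```python
from enum import Enum
from typing import List, Optional


class SharedBehaviorMode(Enum):
    NONE = "None"
    SHARED_OR_LOCAL = "SharedOrLocal"
    LOCAL_OR_SHARED = "LocalOrShared"
    LOCAL_AND_SHARED = "LocalAndShared"
    SHARED_AND_LOCAL = "SharedAndLocal"


def behaviors_to_run(mode: SharedBehaviorMode,
                     shared: Optional[str],
                     local: Optional[str]) -> List[str]:
    """Return the behaviors that execute for one Document Type, in execution order.

    `shared` and `local` are None when no behavior of that kind is configured.
    """
    if mode is SharedBehaviorMode.NONE:
        ordered = [local]                                   # local only
    elif mode is SharedBehaviorMode.SHARED_OR_LOCAL:
        ordered = [shared] if shared is not None else [local]
    elif mode is SharedBehaviorMode.LOCAL_OR_SHARED:
        ordered = [local] if local is not None else [shared]
    elif mode is SharedBehaviorMode.LOCAL_AND_SHARED:
        ordered = [local, shared]                           # both, local first
    else:
        ordered = [shared, local]                           # both, shared first
    return [behavior for behavior in ordered if behavior is not None]


# A Document Type with both kinds of behavior configured.
print(behaviors_to_run(SharedBehaviorMode.SHARED_OR_LOCAL, "shared PDF", "local PDF"))
print(behaviors_to_run(SharedBehaviorMode.SHARED_AND_LOCAL, "shared PDF", "local PDF"))
# A Document Type with only a shared behavior configured.
print(behaviors_to_run(SharedBehaviorMode.LOCAL_OR_SHARED, "shared PDF", None))
```

The order of the returned list matters for the two And modes, since the behavior that runs second is the one that can overwrite a file written by the first.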
Now, you should be asking yourself "If the result was the same for both LocalAndShared and SharedAndLocal, why even have two different options?"
That's not always going to be the case. And this is where things can get tricky with Shared Behavior Modes.
Issues can occur if you are exporting the same file type, with the same name, to the same folder with both shared and local Export Behaviors. If both Export Behaviors are configured to export a PDF Format file, for example, but with different File Format configurations, you could end up with a situation where one behavior overwrites the other's export. This may be what you want to do. It may not. Just be aware since both Export Behaviors execute, either the local or shared behavior can potentially overwrite whichever one exported a file first.
Imagine our shared and local behavior configurations were a little bit different.
Shared Behaviors
Local Behaviors
With these configurations and Export's Shared Behavior Mode set to SharedAndLocal, we would end up overwriting a file. First, the shared behaviors would execute, then the local behaviors would execute. For the "Red" documents, the shared behavior would export its version of a PDF, then the local behavior would export its version of a PDF. If the two PDF files share the same name and export folder, the second export replaces the first, because the default configuration overwrites existing files in the folder.
Be aware of this possibility when configuring SharedAndLocal or LocalAndShared exports. If the file names, types and export folder locations are the same, you may end up overwriting a file. If this is your intention, great! If not, you will need to ensure the file names for the files generated by shared and local behaviors are unique to avoid one file being overwritten.
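A quick way to reason about this risk is to list the files each behavior would write and look for identical targets. The Python sketch below shows one possible check plus a simple renaming scheme; it is an illustration only. The folder paths, file names, and the `find_collisions` and `with_source_suffix` helpers are hypothetical, and in Grooper itself unique names would be set through the export configuration rather than through code like this.

```python
from collections import defaultdict
from pathlib import PureWindowsPath
from typing import Dict, List, Tuple

# (behavior source, export folder, file name)
PlannedExport = Tuple[str, str, str]


def find_collisions(planned: List[PlannedExport]) -> Dict[str, List[str]]:
    """Map each target path written by more than one behavior to the behaviors writing it."""
    by_target = defaultdict(list)
    for source, folder, file_name in planned:
        by_target[PureWindowsPath(folder) / file_name].append(source)
    return {str(target): sources
            for target, sources in by_target.items() if len(sources) > 1}


def with_source_suffix(source: str, file_name: str) -> str:
    """Make a file name unique per behavior by adding the source before the extension."""
    path = PureWindowsPath(file_name)
    return f"{path.stem}_{source}{path.suffix}"


planned = [
    ("shared", "C:/Exports/Red", "Invoice 0001.pdf"),
    ("local", "C:/Exports/Red", "Invoice 0001.pdf"),
    ("shared", "C:/Exports/Orange", "Invoice 0002.pdf"),
]
collisions = find_collisions(planned)
print(collisions)

# Renaming by source removes the collision.
renamed = [(s, folder, with_source_suffix(s, name)) for s, folder, name in planned]
print(find_collisions(renamed))
```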
Thread Pool Guidance
When automating Export steps in a Batch Process, you may need to execute the activity single threaded.
Unattended Activities in a Batch Process can be automated using an Activity Processing Grooper service. The Activity Processing service will act like a Windows service and automatically start tasks in a Batch, as processing threads in your system's resources become available. This is one of the ways Grooper leverages your system resources for parallel processing.
Imagine you're running Grooper on a machine with eight (8) processing threads. If you have a Batch with five (5) Batch Folders, and each one is on the Recognize step of the Batch Process, there's no need for your system to process each Batch Folder sequentially (with each Batch Folder waiting to be processed until the one before it is finished).
- You have 8 threads and 5 Batch Folders in this scenario.
- Each one of those threads can process one Batch Folder as a single task.
- With 8 available threads, all 5 Batch Folders could be processed concurrently by 5 individual threads.
- This is multi-threaded Activity processing.
However, depending on which external storage system you're exporting to, you may run into errors if you attempt to run the Export activity multi-threaded. Particularly when it comes to cloud-based systems, like SharePoint Online or Box.com, their file transfer protocol expects users to upload files one at a time. If you have 5 threads all attempting to upload 5 different Batch Folders from the same machine, 4 of those Batch Folders are going to kick back to Grooper in an error state.
Instead, you must run the activity single-threaded, ensuring only one Batch Folder is processed at a time. As well as automating Batch Processing activities, Activity Processing services allow you to control thread resources by assigning activities a Thread Pool and limiting the number of maximum threads available for that Thread Pool.
Next, we will show you how to create a single threaded Thread Pool for an Export activity, and set up an Activity Processing service that utilizes it. This will effectively throttle your export, so Batch Folders are indeed only exported one at a time, avoiding any issues with external platforms that cannot handle multi-threaded exports.
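The effect of limiting a Thread Pool to one thread can be illustrated with an ordinary thread pool in Python. This is only an analogy: `ThreadPoolExecutor` stands in for a Grooper Thread Pool, the `peak_concurrency` helper is hypothetical, and the short sleep stands in for uploading one Batch Folder.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def peak_concurrency(max_threads: int, folder_count: int) -> int:
    """Run `folder_count` simulated exports and report how many ran at the same time."""
    lock = threading.Lock()
    active = 0
    peak = 0

    def export_folder(_index: int) -> None:
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        time.sleep(0.05)  # stand-in for uploading one Batch Folder
        with lock:
            active -= 1

    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        list(pool.map(export_folder, range(folder_count)))
    return peak


# 8 threads, 5 Batch Folders: several exports can overlap.
print(peak_concurrency(8, 5))
# 1 thread, 5 Batch Folders: exports run strictly one at a time.
print(peak_concurrency(1, 5))
```

With a single worker, an export can never start until the previous one has finished, which is the behavior an endpoint that expects one upload at a time needs.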
How To: Assign a Thread Pool to an Activity Processing Service to Run Export Single-Threaded
Add a Thread Pool
The first thing you'll need to do is add a Thread Pool object. A Thread Pool defines the "bucket" of threads available to one step or another in a Batch Process. In our case, this will allow us to limit the number of threads the Export step uses to a single thread.
To add a Thread Pool:
Assign the Thread Pool
Next, we need to tell our Batch Process which step should use our new Thread Pool.
We want to tell the Export step of this Batch Process to use a different Thread Pool, the new one we just created.
Configure an Activity Processing Service
On to Grooper Config! Grooper services are installed and edited using Grooper Config. Open Grooper Config to install a new Activity Processing service.
⚠ Grooper Config must be run as an administrator to install and edit services.