2023:Microsoft Office Integration (Concept)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.

Native text for Microsoft Office applications is a powerful data integration tool in Grooper.

Grooper's Microsoft Office integration gives you easier access to the contents of files from the world's most-used business application suite by converting Word and Excel files into formats Grooper can read.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains a Project with resources used in examples throughout this article. The second contains one or more Batches of sample documents.

Glossary

Activity: Grooper Activities define specific document processing operations performed on a Batch, Batch Folder, or Batch Page. In a Batch Process, each Batch Process Step executes a single Activity (determined by the step's "Activity" property).

  • Batch Process Steps are frequently referred to by the name of their configured Activity followed by the word "step". For example: "Classify step".

Batch Folder: The Batch Folder is an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents. Batch Folder nodes serve two purposes in a Batch. (1) Primarily, they represent "documents" in Grooper. (2) They can also serve more generally as folders, holding other Batch Folders and/or Batch Page nodes as children.

  • Batch Folders are frequently referred to simply as "documents" or "folders" depending on how they are used in the Batch.

Batch Process Step: Batch Process Steps are specific actions within a Batch Process sequence. Each Batch Process Step performs an "Activity" specific to some document processing task. Each Activity is either a "Code Activity" or a "Review" activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Process Steps are frequently referred to as simply "steps".
  • Because a single Batch Process Step executes a single Activity configuration, they are often referred to by their referenced Activity as well. For example, a "Recognize step".

Batch Process: Batch Process nodes are crucial components in Grooper's architecture. A Batch Process is the set of step-by-step processing instructions given to a Batch. Each step consists of either a "Code Activity" or a Review activity. Code Activities are automated by Activity Processing services. Review activities are executed by human operators in the Grooper user interface.

  • Batch Processes by themselves do nothing. Instead, they execute Batch Process Steps, which are added as child nodes.
  • A Batch Process is often referred to as simply a "process".

Batch: Batch nodes are fundamental in Grooper's architecture. They are containers of documents that are moved through workflow mechanisms called Batch Processes. Documents and their pages are represented in Batches by a hierarchy of Batch Folders and Batch Pages.

Document Viewer: The Grooper Document Viewer is the portal to your documents. It is the UI that allows you to see a Batch Folder's (or a Batch Page's) image, text content, and more.

Execute: Execute is an Activity that runs one or more specified object commands. This gives access to a variety of Grooper commands in a Batch Process for which there is no Activity, such as the "Sort Children" command for Batch Folders or the "Expand Attachments" command for email attachments.

Microsoft Office Integration: Grooper's Microsoft Office Integration allows the platform to easily convert Microsoft Word and Microsoft Excel files into formats that Grooper can read natively (PDF and CSV).

Project: Projects are the primary containers for configuration nodes within Grooper. The Project is where various processing objects such as Content Models, Batch Processes, and profile objects are stored. This makes resources easier to manage and save, and simplifies how node references are made in a Grooper Repository.

Recognize: Recognize is an Activity that obtains machine-readable text from Batch Pages and Batch Folders. When properly configured with an OCR Profile, Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also reference an IP Profile to collect "layout data" like lines, checkboxes, and barcodes. Other Activities then use this machine-readable text and layout data for document analysis and data extraction.

About

Microsoft Office integration allows a Grooper user to leverage the native text of files generated in the Microsoft Office Suite such as Microsoft Word documents and Microsoft Excel spreadsheets. This feature can pull the native text from and perform type-specific activities on these files.

Supported File Types

  • Microsoft Word documents (.doc and .docx)
    • For Word documents, you can generate a Grooper-usable document with the Execute activity, using the Word to PDF command for the Word Document object type. The PDF will contain all the native text from the Word document, obtainable for further Grooper processing using the Recognize activity.
  • Microsoft Excel spreadsheets (.xls and .xlsx)
    • For Excel documents, you can generate a Grooper-usable document with the Execute activity, using the Excel to CSV command for the Excel Document object type. CSV files are natively readable by Grooper in version 2.90. The Recognize activity is not required.
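The list above can be summarized as a small lookup from file extension to conversion command. The following Python sketch is only an illustration and is not part of Grooper or its API: the `CONVERSIONS` table and the `conversion_for` function are hypothetical names, and the values simply restate the list.

```python
from pathlib import Path

# Hypothetical lookup table restating the list above; not a Grooper API.
# extension -> (Execute command, output format, Recognize needed afterward?)
CONVERSIONS = {
    ".doc": ("Word to PDF", "PDF", True),
    ".docx": ("Word to PDF", "PDF", True),
    ".xls": ("Excel to CSV", "CSV", False),
    ".xlsx": ("Excel to CSV", "CSV", False),
}


def conversion_for(filename):
    """Return (command, output format, needs_recognize), or None if unsupported."""
    return CONVERSIONS.get(Path(filename).suffix.lower())


print(conversion_for("report.DOCX"))  # Word file: convert to PDF, then Recognize
print(conversion_for("scan.tif"))     # not an Office file: no conversion applies
```

The extension match is case-insensitive here as a convenience; how Grooper itself identifies Office files is not covered by this sketch.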

How to Use

To make use of this feature, ensure that Microsoft Office is installed on the machine running Grooper Design Studio.

Furthermore, the bit version of Grooper and Microsoft Data Access Components (MDAC) must match.

  • If you are running the 64-bit version of Grooper, you must use the 64-bit MDAC components.
  • If you are running the 32-bit version of Grooper, you must use the 32-bit MDAC components.

FYI

If you have completed these prerequisites and both Office and the appropriate MDAC components are installed, you may still see an error message when you select a Batch Folder in a Batch with an attached Excel file.

If this is the case, try closing Grooper Design Studio and reopening it as an administrator. This will create and/or provide access rights to the required directory indicated by the error message.

Ad Hoc Execution: Testing in Grooper Design Studio

Like any Activity, the Execute activity can be applied to a document in an "ad hoc" manner in Grooper Design Studio. This is typical for Grooper architects testing and designing solutions before building a Batch Process.

Microsoft Word needs to be installed on the computer/server where Grooper is installed for the conversion to work. This only applies to Microsoft Word. Grooper will still convert Excel to CSV, even if Excel is not installed on the machine.

Getting a Result with Microsoft Word Documents

  1. In Grooper Design Studio, select a Batch that contains the desired documents.
  2. Right-click the document whose native text you wish to view.
    • Notice imported Word documents will have the Word icon next to the native file's file name on the Batch Folder. This lets you know the document folder's native file (the one imported into Grooper) is a Word file.
  3. Select "Word Document".
  4. Select "Word to PDF".

This will create a PDF copy of the Word document, stored on the document folder. This document is now viewable in Grooper's Document Viewer and contains all the native text data from the Word file.

  1. To view the PDF, click on the icon in the top right corner of the Batch Viewer.
  2. From the drop-down, select "PDF".

This document folder can now be processed by the Recognize activity to extract that native text for further document processing (classification, data extraction, etc.).

Excel Spreadsheets

  1. In Grooper Design Studio, select a Batch that contains the desired documents.
  2. Right-click the document whose native text you wish to view.
    • Notice imported Excel documents will have the Excel icon next to the native file's file name on the Batch Folder. This lets you know the document folder's native file (the one imported into Grooper) is an Excel file.
  3. Select "Excel Document".
  4. Click "Convert to CSV...".

  1. When the "Convert to CSV" window pops up, select the Save As property using the dropdown menu and select one of the following options:
    • Children
    • Files
    • Attachment

Children

The Children option will convert the Excel file to CSV and save the results as child object(s).

This is the most typical configuration option. If the Excel file has multiple sheets, each sheet is saved as a separate child object of the document folder. In this example, the native Excel file had two sheets, so the result is two child CSV files.
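The "one CSV per sheet" behavior of the Children option can be pictured with a short Python sketch. This is not how Grooper performs the conversion: the `sheets_to_children` function, the in-memory `workbook` dictionary, and the `<sheet name>.csv` naming are all assumptions made for illustration.

```python
import csv
import io


def sheets_to_children(sheets):
    """Render each sheet (name -> list of rows) as its own CSV text."""
    children = {}
    for name, rows in sheets.items():
        buffer = io.StringIO()
        csv.writer(buffer, lineterminator="\n").writerows(rows)
        # Hypothetical naming scheme: one child per sheet, named after the sheet.
        children[name + ".csv"] = buffer.getvalue()
    return children


# A made-up two-sheet workbook standing in for the native Excel file.
workbook = {
    "Sheet1": [["id", "name"], [1, "Ann"]],
    "Sheet2": [["id", "city"], [2, "Tulsa"]],
}
children = sheets_to_children(workbook)
print(sorted(children))  # two sheets in, two CSV children out
```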

Files

The Files option will convert the Excel file to CSV and save the result as a new file. The new file is stored on the Batch Folder alongside the native file. (More specifically, it is stored in the file store location associated with the Batch Folder.)

  1. To see the files, click on the document in the Batch in the node tree.
  2. Click on the "Advanced" tab.
  3. Here we can see we have two .csv files in addition to the .xlsx file.

Attachment

The Attachment option will convert the Excel file to a CSV file and replace the native file.

This is a "true" conversion. Rather than making a CSV copy of the Excel file in one way or another, the original Excel file is transformed into a CSV file.

In this example, the original native file has been converted from an XLSX file to a CSV file.

  • The file's name has changed from "MOCK_DATA_EXCEL_MULTISHEET.xlsx" to "MOCK_DATA_EXCEL_MULTISHEET.csv"

FYI

The original Excel file in this case had two sheets. The conversion combines the rows from all sheets into a single CSV file, one sheet after the other, with a blank row inserted between sheets.
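The combined layout described above (all sheets in one file, separated by a blank row) can be sketched in Python as follows. This is an illustration of the resulting file shape only, not Grooper's implementation: `combine_sheets` and the sample rows are hypothetical.

```python
import csv
import io


def combine_sheets(sheets):
    """Concatenate the rows of every sheet, with one blank row between sheets."""
    combined = []
    for index, rows in enumerate(sheets):
        if index > 0:
            combined.append([])  # blank separator row between consecutive sheets
        combined.extend(rows)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(combined)
    return buffer.getvalue()


# Two made-up sheets standing in for a multi-sheet Excel file.
sheet_one = [["id", "name"], [1, "Ann"]]
sheet_two = [["id", "city"], [2, "Tulsa"]]
combined_csv = combine_sheets([sheet_one, sheet_two])
print(combined_csv)
```

Because the sheets may have different columns, downstream processing of a combined file would likely need to treat the blank row as a boundary between sheets.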

Batch Processing Execution: Automating the Conversions

When automating a Word to PDF or Excel to CSV conversion, you will add the Execute activity to a Batch Process. Once added to a Batch Process, its configuration to convert Word files into PDFs and Excel files into CSVs is the same as described above.

To add the Execute activity to a Batch Process:

  1. Add an Execute Batch Process Step to the Batch Process.
  2. Configure the Execute step as described above to convert native Word or Excel files for subsequent processing.
    • Add a Word to PDF command for the Word Document object type to convert Word files to PDFs.
    • Add an Excel to CSV command for the Excel Document object type to convert Excel files to CSVs.