Convert Image Based Documents to a Text Searchable PDF (Simple Functionality)

From Grooper Wiki

This article is about the current version of Grooper (2025).

Note that some content may still need to be updated.

You may download the ZIP and PDF below for use in your own Grooper environment (version 2025). There is a Project ZIP file, as well as a PDF to be used when creating a Batch.

Introduction

This article demonstrates one of the most foundational document processing workflows in Grooper: transforming non-searchable, image-only PDF or TIFF files into fully text-searchable PDFs using OCR and a basic three-step Batch Process.

The intention of this article is to provide a clear, practical example of how Grooper’s core components—an Activity Processing service, an OCR Profile configured with Azure Computer Vision, and a simple Batch Process consisting of Split Pages, Recognize, and Merge—work together to produce a searchable output file. Rather than focusing on complex extraction or data modeling, this walkthrough highlights the minimum required configuration to achieve a common and highly valuable outcome.

The article also demonstrates two ways to test and execute the process within Grooper:

  • Using Batch Process Step Testers from the Design page for controlled, step-by-step validation.
  • Using the Upload Documents button on the Batches page for a more streamlined, user-driven experience.

By the end of this guide, you will understand how Grooper performs OCR, embeds recognized text into a newly generated PDF, and replaces the original image-based file with a searchable version—illustrating the essential mechanics behind many more advanced Grooper solutions.
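
To make the three steps concrete before opening Grooper, here is a minimal Python sketch of the same idea: split a file into pages, recognize text on each page, then merge the pages into one searchable output. This is a toy model only. The `Document` and `Page` classes, the function names, and the `fake_ocr` stand-in are all hypothetical and are not part of Grooper or of any OCR library.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Page:
    """One page image plus the text OCR recovers from it (None until recognized)."""
    image: bytes
    text: Optional[str] = None


@dataclass
class Document:
    """A source file and, after processing, its pages and searchable flag."""
    name: str
    page_images: List[bytes]
    pages: List[Page] = field(default_factory=list)
    searchable: bool = False


def split_pages(doc: Document) -> None:
    """Step 1: create one Page object per page image in the source file."""
    doc.pages = [Page(image=img) for img in doc.page_images]


def recognize(doc: Document, ocr: Callable[[bytes], str]) -> None:
    """Step 2: run OCR on every page and store the recognized text with it."""
    for page in doc.pages:
        page.text = ocr(page.image)


def merge(doc: Document) -> str:
    """Step 3: combine the pages into one output whose text can be searched."""
    if any(page.text is None for page in doc.pages):
        raise ValueError("every page must be recognized before merging")
    doc.searchable = True
    return "\n".join(page.text for page in doc.pages)


def fake_ocr(image: bytes) -> str:
    """Stand-in OCR engine: pretends the image bytes are the printed text."""
    return image.decode("ascii")


doc = Document(name="scan.pdf", page_images=[b"Invoice 1001", b"Total due: 42.00"])
split_pages(doc)
recognize(doc, fake_ocr)
text_layer = merge(doc)
print("Total due" in text_layer)  # prints True
```

The order matters in the sketch for the same reason it matters in the Batch Process: there are no pages to recognize until the file has been split, and there is no text to embed until the pages have been recognized.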

Test using Batch Process Step Testers

This portion of the article walks through executing the Batch Process manually from the Design page, using the Activity Tester tab of each Batch Process Step. This approach is ideal for understanding exactly what each step in the process does and for validating configuration during development.

Users create a test Batch, import an image-based PDF or TIFF file, and then execute each Activity (Split Pages, Recognize, and Merge) individually. By selecting each step in the Batch Process and running it from its Activity Tester tab (either via Test Activity or Submit Job), users can observe how:

  • The document is separated into Batch Pages.
  • OCR is performed using the configured Azure OCR Profile.
  • A new searchable PDF is generated and attached to the Batch Folder.

This method emphasizes transparency and control. It allows developers and administrators to verify service connectivity, OCR configuration, and output behavior at each stage before deploying the process for general use. It is especially useful for troubleshooting and learning how Grooper’s Activity Processing architecture executes tasks.
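
The difference between Test Activity and Submit Job is easier to see as a small model. The sketch below assumes that Split Pages and Merge work at the folder scope and that Recognize works at the page scope, which matches the selections used in the steps of this walkthrough. The classes and functions are hypothetical illustrations, not Grooper's API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Folder:
    name: str
    pages: List[str] = field(default_factory=list)


@dataclass
class Batch:
    name: str
    folders: List[Folder] = field(default_factory=list)


def objects_in_scope(batch: Batch, scope: str) -> List[str]:
    """List the objects a step acts on: Batch Folders or Batch Pages."""
    if scope == "folder":
        return [folder.name for folder in batch.folders]
    if scope == "page":
        return [page for folder in batch.folders for page in folder.pages]
    raise ValueError(f"unknown scope: {scope!r}")


def submit_job(batch: Batch, step: str, scope: str) -> List[Dict[str, str]]:
    """Submit Job model: queue one task per in-scope object; no selection needed."""
    return [{"step": step, "target": target, "state": "queued"}
            for target in objects_in_scope(batch, scope)]


def run_test_activity(batch: Batch, step: str, scope: str,
                      selection: List[str]) -> List[Dict[str, str]]:
    """Test Activity model: process only the selected objects, one at a time."""
    in_scope = set(objects_in_scope(batch, scope))
    wrong_scope = [item for item in selection if item not in in_scope]
    if wrong_scope:
        raise ValueError(f"{step} cannot run on {wrong_scope}: wrong scope")
    return [{"step": step, "target": item, "state": "done"} for item in selection]


batch = Batch("Make Searchable", [Folder("Folder 1", ["Page 1", "Page 2", "Page 3"])])

print(len(submit_job(batch, "Split Pages", "folder")))  # one folder -> 1 task
print(len(submit_job(batch, "Recognize", "page")))      # three pages -> 3 tasks
print(len(run_test_activity(batch, "Recognize", "page", ["Page 2"])))  # 1 selected
```

In this model, queued tasks stay queued until something picks them up, which is the role the Activity Processing service plays. Test Activity does not depend on the service, but it rejects a selection at the wrong scope.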

  1. First, let's start by importing the Project ZIP file provided for this exercise. Select the Projects folder from the Node Tree, then click the "Upload ZIP" button.
  2. In the "ZIP File Import" dialogue that opens, click the "Choose File" button.
  3. An Explorer window will open allowing you to select the provided Project ZIP file. Click the "Open" button.
  4. Click the "UPLOAD" button in the "ZIP File Import" dialogue.
  5. The provided Project is now added in the Projects folder.
  6. Next, let's make sure we have a running Activity Processing service. Click the Machines folder from the Node Tree. Notice there is a running Activity Processing service for this repository. This is required if you choose to use the "Submit Job" command moving forward. It will also be required to automate processing in a Batch Process. Please refer to the "How to: Grooper services" section of the "Grooper Command Console" article of the Grooper Wiki for more information on installing Grooper Services.
  7. Moving on, we'll now configure the "Azure OCR" OCR Profile that the Recognize activity will use to add electronic text to our image-based PDF. Select the "Azure OCR" OCR Profile, then insert your Azure OCR API key into the API Key property.
  8. Next, click the drop-down button to the right of the API Region property and select the appropriate API Region for your API key from the drop-down menu.
  9. Click the "Save" button to save changes made to the OCR Profile.
  10. Let's now add a test Batch. Expand the Node Tree and right-click the Batches Test folder, then select "Add Batch" from the pop-out menu.
  11. In the "Add" dialogue, provide a name for the Batch in the Name property, then click the "Execute" button. Here we've named it "Make Searchable".
  12. Select the newly created Batch from the Node Tree, then click the Viewer tab.
  13. From an Explorer window, drag the provided PDF onto the root Batch Folder of the Batch.
  14. Now we will begin to test the Batch Process Steps of our Batch Process to affect the contents of our Batch, starting with "Split Pages". Expand the Node Tree and select the "Split Pages" Batch Process Step, then click the Activity Tester tab.
  15. Click the "Select Batch" button in the Batch Viewer, then be sure to select the newly created Batch, and click the "OK" button.
  16. Select the Folder Level 1 Batch Folder from the Batch Viewer, then click the "Test Activity" button. To use this command, you must select objects at the scope configured in the Batch Process Step's properties. This will use local system resources to process the activity one task at a time. In this case, there's one Folder Level 1 Batch Folder, so it's one task.
  17. Using the "Submit Job" button is also an option. You do not need to make a selection in the Batch Viewer to use this command. This will create a Job with a number of tasks from the configured scope. Once again, in this case, one Folder Level 1 Batch Folder will create one Task in the Job. The task, or tasks, of the Job will be picked up by an active Activity Processing service to be processed.
  18. Once completed, a number of child Batch Page objects will be created by the Split Pages activity. You can view these in the Batch Viewer.
  19. We can also see the Batch Pages as nodes in the Node Tree. Expand the Node Tree and select the "Make Searchable" Batch, then click the "Refresh" button.
  20. Expand the contents of the Batch to see the Batch Page object nodes in the Node Tree.
  21. Next we'll test the Recognize activity. Select the "Recognize" Batch Process Step, then click the Activity Tester tab.
  22. With no selection in the Batch Viewer, you can click the "Submit Job" button.
  23. Alternatively, you can select all the Batch Pages, then click the "Test Activity" button. Once completed, each Batch Page will have an associated "CharacterData.txt" file that contains the newly recognized electronic text.
  24. Finally, select the "Merge" Batch Process Step. Notice the Merge Format property is set to "PDF Format".
  25. Notice also that the Searchable sub-property of the Build Options property is set to "True", and the Deduplicated sub-property is also set to "True".
  26. The Always Build property is set to "True" as well. The Attachment Name property is using a simple expression to name the created PDF with a randomly generated GUID.
  27. Click the Activity Tester tab.
  28. Select the Folder Level 1 Batch Folder from the Batch Viewer, then click the "Test Activity" button.
  29. Alternatively, you can simply click the "Submit Job" button.
  30. A PDF attachment with a new GUID will be added as a file to the Batch Folder.
  31. Double-click the attachment of the Batch Folder.
  32. Click the "OK" button in the "Confirmation" dialogue.
  33. Open the saved PDF from your browser's download location.
  34. You now have a PDF file whose text can be highlighted and searched.
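
Steps 7 through 9 above supply an API key and an API Region to the OCR Profile. As a rough illustration of why both values are needed, the sketch below builds (but never sends) a request to the Azure Computer Vision Read REST API, version 3.2, using the regional endpoint form. Whether Grooper calls this exact endpoint or API version internally is an assumption. The function name and the placeholder values are hypothetical.

```python
from urllib.request import Request


def build_read_request(region: str, api_key: str, image_bytes: bytes) -> Request:
    """Build (but do not send) a Read request for one page image."""
    # The region selects the host; the key authenticates against that resource.
    url = f"https://{region}.api.cognitive.microsoft.com/vision/v3.2/read/analyze"
    headers = {
        "Ocp-Apim-Subscription-Key": api_key,
        "Content-Type": "application/octet-stream",
    }
    return Request(url, data=image_bytes, headers=headers, method="POST")


# Placeholder region, key, and image bytes; replace with real values to experiment.
request = build_read_request("westus", "<your-api-key>", b"...page image bytes...")
print(request.get_method(), request.full_url)
```

Because the region is part of the host name, a key paired with a region other than the one its Azure resource was created in will usually be rejected. If the Recognize step fails with an authorization error, the API Region property is a sensible first thing to check.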

Test with the "Upload Documents" button

This portion of the article demonstrates running the same Batch Process through the Batches page using the Upload Documents button. After publishing the Batch Process, users can select a file and immediately associate it with the “Searchable PDF Process” from a dropdown list.

Once started, the Batch runs automatically through all three steps—Split Pages, Recognize, and Merge—without requiring manual interaction at each stage. As long as the Activity Processing service is running, the process completes end-to-end and replaces the original image-based document with a newly generated text-searchable PDF.

This approach highlights the user-facing experience. It reflects how a deployed solution behaves in production, where users simply upload documents and receive processed output without needing to understand the underlying configuration. It demonstrates Grooper’s ability to encapsulate complex processing logic into a simple, accessible workflow.
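
The automated run can be pictured as a simple loop: each step queues its tasks, a running service drains the queue, and only then does the next step begin. The sketch below is a hypothetical model of that sequencing, not Grooper code. It assumes Recognize produces one task per page and the other two steps produce one task per folder.

```python
from collections import deque
from typing import Deque, List, Tuple

STEPS = ["Split Pages", "Recognize", "Merge"]


def run_batch(page_count: int, service_running: bool) -> List[str]:
    """Return the ordered log of tasks completed for one uploaded document."""
    log: List[str] = []
    if not service_running:
        return log  # nothing picks up the queued tasks, so the Batch just waits
    for step in STEPS:
        # Assumed scopes: Recognize works per page, the other steps per folder.
        task_count = page_count if step == "Recognize" else 1
        queue: Deque[Tuple[str, int]] = deque(
            (step, number) for number in range(1, task_count + 1)
        )
        while queue:  # the next step starts only after this queue is drained
            name, number = queue.popleft()
            log.append(f"{name} task {number}/{task_count}")
    return log


for line in run_batch(page_count=3, service_running=True):
    print(line)
```

With `service_running=False` the log stays empty, which mirrors what you would see on the Batches page if no Activity Processing service were available: the Batch exists, but no step makes progress.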

  1. First, we need to publish our Batch Process. Expand the Node Tree and select the "Searchable PDF Process" Batch Process from the provided Project, then click the "Publish" button to publish this Batch Process to the Processes folder.
  2. Click the "Execute" button in the "Publish" dialogue.
  3. A published version of the Batch Process now exists in the Processes folder. This Batch Process will now be available to select for production Batches.
  4. Next, we'll use the "Upload Documents" button from the Batches Page to create a new Batch and leverage our published Batch Process to automate the processing. Click the "Batches Page" button to go to the Batches Page.
  5. Click the "Upload Documents" button in the upper right of the Batches Page. This allows us to quickly create and begin processing a small, ad hoc Batch.
  6. Click the "Choose Files" button in the dialogue that appears.
  7. Select the provided PDF document from the Explorer window that opens, then click the "Open" button.
  8. Set the Process property to the published "Searchable PDF Process" Batch Process, then click the "OK" button.
  9. Select the newly imported Batch. If the Batch started paused, you may need to click the "Resume" button to begin processing.
  10. On the Jobs tab of the Batch Info Viewer, you will see bars indicating the status of the processing of the Batch Process Steps as the running Activity Processing service completes the Tasks of each Job.
  11. Once processing is complete, double-click the Batch.
  12. A Review tab will open in your browser, where you can interact with your Batch using its three main views: "Folder View", "Data View", and "Thumbnail View".
  13. On the "Folder View", double-click the attachment of the Batch Folder.
  14. Click the "OK" button in the "Confirmation" dialogue that opens.
  15. Open the saved PDF from your browser's download location.
  16. You now have a PDF file whose text can be highlighted and searched.

For more information

Please review the following articles for more information on these specific topics: