Generate PDF will create a PDF document from one or more image-based pages, optionally including text-behind and/or annotations.
The Generate PDF activity makes PDF documents out of images in a Batch. It takes Batch Page images inside a Batch Folder, generates a PDF from them, and saves the PDF file on the folder object. If any text data was obtained from OCR during a Recognize activity, it can be included as searchable text in the PDF.
Generate PDF can also add native-PDF elements to the generated PDFs, including highlighted fields and signature, checkbox, textbox, and radio button widgets. These are referred to as “Field Annotations” in Grooper. When you add an annotation in the Generate PDF activity, it will reference one or more Data Fields in a Data Model. For example, if you have a Data Field for a Social Security Number on a document, you can set up a Field Annotation to highlight the Social Security Number on the page after Grooper extracts it.
A number of native-PDF elements can be added to PDF documents on export as a part of the Generate PDF activity. The PDF elements that can be added are:
- Signature widgets
- Checkbox widgets
- Textbox widgets
- Radio group widgets
Multiple widgets may be added using the annotation editor. Each widget is positioned relative to the location on the page where its data element (such as a Data Field) was extracted. Padding, which increases the widget's size in any of the four directions, may also be specified for each annotation.
|A portion of an invoice with a non-padded Highlight Annotation on a field extracting the invoice number.||A portion of an invoice with a padded Highlight Annotation on a field extracting the invoice number. Each side was padded by 0.05in for this example.|
Generate PDF is a new activity in 2.80. Prior to 2.80, a PDF version of a document could be created using the Document Export activity, but that activity could not include native-PDF elements. Generate PDF also differs in that it saves a .pdf file on the document folder object in Grooper, so the file persists through the remaining steps in the batch until it is exported. PDFs created by Document Export are instead created at the time of export. Note that because the PDF files created by Generate PDF stay on the folder object in a batch, they increase the size of your batch in memory.
In both Generate PDF and Document Export, version 2.80 gives users the option to "linearize" PDF documents. Linearization optimizes documents for web viewing, speeding up their load time. Linearization is to PDF documents what streaming is to videos.
PDF annotation is particularly useful in cases where PDFs must be handled electronically after they exit Grooper (e.g. by being signed or having additional information added). Highlighting can be used to increase the prominence of particular parts of documents for review, discovery, or other reasons.
How To: Configure the Activity
Before you begin
If you want to include text from OCR in the PDF, you will also need to run the batch through a Recognize activity first to obtain the OCR text data.
If you want to include PDF annotations, it is critically important that you first extract the data elements in your Data Model. You must run your batch through the Extract activity before Generate PDF if you want to use highlighting, signature, checkbox, or any other annotation. Generate PDF only "knows" where to place an annotation if it knows where the field is on the page, and Grooper has no way of determining that until the field is extracted.
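The dependency on extraction can be pictured with a small Python sketch. It is a conceptual model only and does not use any Grooper API: the `FieldResult` type and the `split_by_location` function are hypothetical names introduced for illustration. The idea is simply that a field without a known page location cannot receive an annotation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FieldResult:
    """Hypothetical stand-in for a data element after the Extract activity."""
    name: str
    value: Optional[str] = None
    page: Optional[int] = None  # page the value was found on, if any
    bounds: Optional[Tuple[float, float, float, float]] = None  # left, top, right, bottom


def split_by_location(
    fields: List[FieldResult],
) -> Tuple[List[FieldResult], List[FieldResult]]:
    """Separate fields that have a page location from those that do not.

    Only fields in the first list could be annotated, because an annotation
    needs a page and a rectangle to be drawn on.
    """
    placeable = [f for f in fields if f.page is not None and f.bounds is not None]
    skipped = [f for f in fields if f.page is None or f.bounds is None]
    return placeable, skipped


fields = [
    FieldResult("Invoice Number", "12345", page=1, bounds=(100.0, 200.0, 180.0, 212.0)),
    FieldResult("Social Security Number"),  # never extracted, so no location
]
placeable, skipped = split_by_location(fields)
print([f.name for f in placeable])
print([f.name for f in skipped])
```

In this model, a field that was never extracted ends up in the skipped list, which mirrors why the Extract activity has to run before Generate PDF.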
Adding a Field Annotation
1. In the Generate PDF property panel, select "Annotations" and press the ellipsis button at the end of the property.
2. A new window will appear. Press the "Add" button.
3. Select which annotation you wish to add.
We will select the Highlight Annotation here. More information on each annotation will be detailed in the next section.
4. Select the "Fields" property and expand the dropdown list. Select which data element you wish to apply the annotation to. Annotations may be applied to Data Fields or Data Columns from a data model. You can select multiple data elements for each annotation by checking multiple boxes.
5. Optionally, any annotation can be "padded". Padding increases the dimensions of the annotation. For example, the highlight annotation highlights the smallest rectangle that can be drawn around the extracted text. You may want to increase the size of the highlight for the viewer's sake. Most people will pad each side by an equal amount, like the invoice example from earlier in the article. However, you may also increase each side by an individual amount as seen in the image below.
Inches (in), points (pt), millimeters (mm), and centimeters (cm) are available as units.
6. When finished adding annotations and configuring them, press the "OK" button.
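The padding arithmetic described in step 5 can be sketched in a few lines of Python. This is an illustration only, not Grooper's implementation: the `Rect` type, the `pad_rect` function, and the assumption that coordinates are measured in PDF points from the top-left corner of the page are all hypothetical. The unit conversions themselves are standard (72 points per inch, 25.4 millimeters per inch).

```python
from dataclasses import dataclass

# Points per unit. One inch is 72 points; one inch is 25.4 mm or 2.54 cm.
POINTS_PER_UNIT = {
    "pt": 1.0,
    "in": 72.0,
    "mm": 72.0 / 25.4,
    "cm": 72.0 / 2.54,
}


@dataclass(frozen=True)
class Rect:
    """Hypothetical rectangle in points, origin assumed at the top-left."""
    left: float
    top: float
    right: float
    bottom: float


def to_points(value: float, unit: str) -> float:
    """Convert a length in 'in', 'pt', 'mm', or 'cm' to points."""
    return value * POINTS_PER_UNIT[unit]


def pad_rect(rect: Rect, left: float = 0.0, top: float = 0.0,
             right: float = 0.0, bottom: float = 0.0, unit: str = "in") -> Rect:
    """Grow a rectangle outward by a separate amount on each side."""
    return Rect(
        rect.left - to_points(left, unit),
        rect.top - to_points(top, unit),
        rect.right + to_points(right, unit),
        rect.bottom + to_points(bottom, unit),
    )


# Smallest rectangle around the extracted text (illustrative numbers).
text_box = Rect(100.0, 200.0, 180.0, 212.0)

# Equal padding of 0.05in on every side, as in the invoice example.
print(pad_rect(text_box, 0.05, 0.05, 0.05, 0.05, unit="in"))

# Unequal padding: extra room on the right and bottom only.
print(pad_rect(text_box, right=2.0, bottom=1.0, unit="mm"))
```

Padding of 0.05in works out to 3.6 points per side, so an 80 pt by 12 pt rectangle grows to roughly 87.2 pt by 19.2 pt.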
Configure the remaining properties
The "JPEG Compression Quality" property sets how heavily page images are compressed in the PDF, expressed as a percentage of the original image quality. Lower percentages produce smaller PDF files at the cost of image quality.
The "Make Searchable" property controls whether OCR text data is included with the PDFs. Set this to “True” to add searchable text to each page. This text will be the OCR data Grooper produced during the Recognize activity.
The "Linearized" property controls whether or not the PDF is linearized. Linearizing a PDF makes viewing it over the web faster, because the first pages can be displayed while the rest of the file is still downloading. If the PDF is not linearized, the entire document must be downloaded first. Linearization is to PDFs what streaming is to videos.
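If you want to spot-check whether a PDF saved outside Grooper was linearized, a rough heuristic is to look for the linearization parameter dictionary, which the PDF specification requires to sit within the first 1024 bytes of the file. The Python sketch below is not part of Grooper. The function name `looks_linearized` and the file path are hypothetical, and the check only detects the `/Linearized` marker without validating the rest of the file's structure.

```python
def looks_linearized(path: str) -> bool:
    """Heuristic check for a linearized PDF.

    Returns True when the file starts with a PDF header and the
    /Linearized key appears in the first 1024 bytes. This does not
    validate hint tables or object ordering.
    """
    with open(path, "rb") as f:
        head = f.read(1024)
    return head.startswith(b"%PDF-") and b"/Linearized" in head


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python check_linearized.py exported_document.pdf
    for pdf_path in sys.argv[1:]:
        print(pdf_path, looks_linearized(pdf_path))
```

A PDF that has been edited and saved incrementally after linearization may still contain the marker while no longer being valid for fast web view, so treat a True result as an indication rather than proof.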
Setting the "Delete Source Pages" property to “True” will delete the pages inside the folder used to create the PDF. This can help save memory space, if your system is running low on resources.
Verify the result
You can verify the activity created a .pdf file by going to the "Files" tab in the "Advanced" tab on a document folder object in the Node Tree.
1. After running the Generate PDF activity, navigate to a document folder object in a batch in the Node Tree.
2. Navigate from the "Batch Folder" tab to the "Advanced" tab.
3. Navigate from the "General Info" tab to the "Files" tab. You will see a .pdf file with the document folder's name. Select it and you will see the resulting PDF from the activity.
The Highlight Annotation highlights the text extracted by a Data Field.
The example below is a portion of a document where the countries listed were extracted. The Highlight Annotation is highlighting the text with 0.05 padding.
You can also adjust the appearance of the highlight using the "Appearance" properties, including its color, border, and transparency.
The Signature Widget creates a signature box for electronically signing PDFs.
In the example below, a "Signature" data field was created using Zonal Extract as its Value Extractor. The box created as the extraction zone becomes the signature box for the annotation.
The Textbox Widget creates an editable textbox around extracted values.
In the example below, the extracted date is used as the annotated data field. As you can see, the value has been populated in a now editable textbox.
Using the "Appearance" properties, you can change the font and font size of the text in the box.
The Radio Group Widget allows you to create clickable radio button choices out of OMR boxes.
In the example below, the words "Yes" and "No" were used as the labels for the Anchored OMR extraction method. On the original document image, these were checkboxes.
Using the Radio Group Widget annotation, their checkboxes have been turned into radio buttons with the "Yes" button pressed, as seen below.
|Property||Default Value||Description|
|Annotations||(0 Field Annotation objects)||Press the ellipsis button at the end of this property to bring up a new window. Use the “Add” button to add new Field Annotations. Each annotation must be assigned a Data Field in a Data Model. The size of the annotation on the PDF can be altered by configuring its “Padding” property.|
|JPEG Compression Quality||75%||Sets how heavily page images are compressed in the PDF, expressed as a percentage of the original image quality. Lower percentages produce smaller files at the cost of image quality.|
|Make Searchable||False||Set this to “True” to add searchable text to each page. This text will be the OCR data Grooper produced during the Recognize activity.|
|Linearized||False||Linearizing a PDF makes viewing it over the web faster. If not linearized, the entire PDF document must be downloaded first. Linearization is to PDFs what streaming is to videos.|
|Delete Source Pages||False||Setting this to “True” will delete the pages inside the folder used to create the PDF.|