2023:PDF Data Mapping (Behavior)

From Grooper Wiki

This article is about an older version of Grooper.

Information may be out of date and UI elements may have changed.


PDF Data Mapping is a Content Type Behavior designed to enhance PDF files generated by the Merge or Export activities with metadata, bookmarks, annotations and/or different kinds of widgets.

You may download the ZIP(s) below and upload them into your own Grooper environment (version 2023). The first contains a Project with resources used in examples throughout this article. The second contains one or more Batches of sample documents.

About

The PDF Data Mapping behavior allows Grooper users to more fully leverage the capabilities of the PDF file type. The standard PDF Export Format in Grooper will use the page image files and their text data to create a multipage PDF file for each document folder upon Export. However, this is just the "display information" required to open and read the document. There's a lot more to what a PDF can be than just a multipage document with page images and machine readable text. PDF content can also include metadata, keywords, bookmarks, annotations, and more!

PDF Data Mapping creates an exportable PDF file that includes some of this additional content available to the PDF format. This is part of Grooper's evolving "Smart PDF Architecture". This is a design philosophy striving to more fully utilize the capabilities of the PDF file type and merge them with Grooper's own document processing capabilities.

The expanded PDF Data Mapping functionality can be divided into three categories:

  • Annotations
  • Bookmarks
  • Metadata

Annotations

Annotations are additional objects you can add to PDF documents. Grooper uses information from Data Elements in a Data Model collected during the Extract activity to add these annotations (also called "widgets"). These annotations can improve readability and give the reader components to interact with, such as checkboxes and signature boxes.

The kinds of annotations you can add are:

  1. Highlighting
  2. Radio group buttons
  3. Checkboxes
  4. Signature boxes
  5. Editable text boxes

Grooper uses the data instance information from extracted Data Fields to insert these annotations. For example, here we set up a Content Model with a Data Field named "Last Name". After the document's data was collected during the Extract activity, Grooper has a data instance it can associate with the "Last Name" Data Field, including its size and location coordinates on the document. We then used the Highlight Annotation to highlight the extracted last name on the document in yellow.

The size of all these annotations can also be adjusted using a Padding property if the size of the extracted data instance is too small for your needs.

Bookmarks

Bookmarks allow easy navigation for multipage PDF documents. When exporting a single PDF comprised of multiple child sub-documents, you can create bookmarks for each child document. This way, you can keep all the documents together in a single PDF file, easily navigating from one section of the document to another.

For example, this document is an application packet for a study abroad program. Each document in the packet was separated and classified as a child document folder of one Document Type or another. PDF Data Mapping was used to export the packet as a single PDF and a bookmark was inserted for each sub-document and named after its Document Type.

Grooper can create bookmarks from extracted Data Fields in the document as well.

Metadata

Metadata refers to a PDF file's content beyond the information required to display the document (the page images and encoded text data). Prior to implementing the PDF Data Mapping functionality, Grooper only had access to edit minimal PDF metadata, notably the file's name upon export. PDF Data Mapping allows Grooper to alter and store additional collected metadata as well, including Data Field values collected during the Extract activity. This means Grooper can now create a viewable document with all the extracted data associated with the document itself, independent of that data being stored elsewhere (such as a database table or content management system).

This metadata can be accessed by opening a PDF in a PDF viewer application, such as Adobe Acrobat, and opening the "Document Properties" window from the File menu.

There are several pieces of metadata Grooper has access to.

  1. All of the fields highlighted here can be created from Grooper, using an expression based syntax to access data extracted from the document and system information.
  2. Note this gives Grooper the capability to generate and insert keywords into the PDF's "Keywords" field.
    • In this case, Grooper has created a keyword based on the word count length of the essay in this study abroad application packet.
  3. Extracted Data Field values can also be exported as PDF metadata. This information can be viewed either using the "Custom" tab or the "Additional Metadata..." window.

  1. In the "Custom" tab...
  2. You can see all the Data Fields Grooper extracted and their values as custom metadata for this document.


Be aware the PDF file format has metadata fields already named "Title", "Author", "Subject", "Keywords", "Creator", "Producer", "CreationDate", "ModDate" and "Trapped".

You may run into an issue upon export if you have Data Fields in your Data Model that share one of these names. If using the Metadata creation capabilities of PDF Data Mapping, consider these names "taken" and adjust the name of the Data Field to be something different. For example, in this case a Data Field returning the title of the proposal listed on the application was changed from "Title" to "Title of Proposal".
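The naming rule above can be checked before export. The following is a minimal Python sketch of that check, not Grooper's implementation: the function names and the dictionary input are hypothetical, and the comparison is assumed to be case-sensitive.

```python
# Standard PDF metadata field names listed in this article.
RESERVED_PDF_KEYS = {
    "Title", "Author", "Subject", "Keywords", "Creator",
    "Producer", "CreationDate", "ModDate", "Trapped",
}


def find_name_collisions(field_names):
    """Return the Data Field names that match a reserved PDF metadata name.

    Assumption: names are compared case-sensitively.
    """
    return [name for name in field_names if name in RESERVED_PDF_KEYS]


def build_custom_metadata(fields):
    """Build a custom-metadata dictionary from Data Field names and values.

    `fields` is a hypothetical mapping of Data Field name -> extracted value.
    Raises ValueError if any name is already "taken" by the PDF format.
    """
    collisions = find_name_collisions(fields)
    if collisions:
        raise ValueError(f"Rename these Data Fields before export: {collisions}")
    return {name: str(value) for name, value in fields.items()}


# Example: "Title" collides, "Title of Proposal" does not.
print(find_name_collisions(["Title", "Title of Proposal", "Last Name"]))
```

Running a check like this against a Data Model's field names shows which ones need renaming, as was done for "Title of Proposal" in this article.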

As a Behavior, PDF Data Mapping is configured on a Content Type object, commonly a Content Model or a Document Type.

  1. Here, we have selected a Content Model in the Node Tree.
  2. To add a Behavior, select the Behaviors property and click the ellipsis button at the end.
  3. This will bring up a dialogue window to add various behaviors to the Content Model, including PDF Data Mapping.
  4. Add PDF Data Mapping to the list by clicking on the "+" button.
  5. Select PDF Data Mapping from the listed options.

  1. Once added, you will see a PDF Data Mapping item added to the Behaviors list.
  2. Selecting this Behavior, you will see property options to configure PDF creation.


The expanded PDF Data Mapping functionality can be divided into four categories:

  • Metadata
  • Bookmarks
  • Piece Info
  • Annotations


Before we get into what these properties do, how to configure them, and how they affect the exported PDF, there's one key thing to keep in mind when using PDF Data Mapping.

Along with the PDF Data Mapping Behavior, you will also need an Export Behavior configured to export a PDF formatted file. The PDF Data Mapping Behavior does the job of configuring all the extra content (metadata, bookmarks and/or annotations) you want to add to the exported PDF. The Export Behavior does the job of actually creating the PDF (with the content configuration information supplied by the PDF Data Mapping) and sending it off to an external storage platform.

Export Behaviors can be added to Content Types, such as the Content Model here.

  1. To add an Export Behavior, press the "+" button in a Behaviors list collector.
  2. Select Export Behavior.


FYI

Export Behaviors can also be configured on the Export activity as local Export Behaviors to the activity configuration.

The benefit to adding it to a Content Model is you will often use information collected from a Content Model upon exporting your documents, such as a document folder's classified Document Type or collected data from a Data Model for field mapping purposes. You might as well do it now, adding it to the Content Model while you're adding the PDF Data Mapping.

There are many different ways to configure an Export Behavior. See the Export (Activity) - 2023 article for setting up Export.

For PDF Data Mapping, there are two things that are important to have on your Export Behavior:

  1. Make sure you have "PDF Format" in your "List of Export Formats".
  2. Turn Always Build on your "PDF Format" to True.

We will explain why these are important later in the article.

How To

The following tutorials use a mock UNESCO Laura W. Bush Traveling Fellowship application to detail a more specific set up for a PDF Data Mapping. This is a packet of documents from a single applicant containing five different kinds of documents.

Application

This document consists of two pages. The first is a coversheet for the whole application packet. The second is the application form itself.

Primarily, this document will allow us to demonstrate the different kinds of annotations available when using a PDF Data Mapping to generate a PDF file (using its Annotations property configuration). We will see how to set up one example of each of the following annotation types available in Grooper:

  • Highlight Annotation
  • Checkbox Widget
  • Radio Group Widget
  • Signature Widget
  • Textbox Widget

Importantly for any annotation type, a Data Field must be extracted in order to place the annotation. How does Grooper know what you want to highlight? It uses the extraction result of a Data Field, which includes information about where that value is located on the page. Even if the extraction result is just a blank zone without returning any actual information, Grooper needs some kind of coordinates to know where to place the annotation.

Since we're going to end up extracting some data in order to place these annotations, this will also give us the opportunity to see some of the collected data inserted as PDF metadata as well.

Essay

This application also includes an essay from the student. This document will demonstrate how to add keywords to the PDF's metadata.

We will use an extractor to count the number of words in the essay and configure the PDF Data Mapping's Metadata properties to insert a keyword of "long essay", "medium essay", or "short essay" depending on the essay's length.
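The keyword logic described above can be sketched in Python. This only illustrates the decision, not Grooper's expression syntax: the function name and both word-count thresholds are hypothetical, since the article does not state the cutoffs used.

```python
def essay_keyword(text, short_max=500, medium_max=1500):
    """Return a keyword describing essay length.

    Assumptions: words are whitespace-separated, and the thresholds
    (500 and 1500 words) are placeholders chosen for illustration.
    """
    word_count = len(text.split())
    if word_count <= short_max:
        return "short essay"
    if word_count <= medium_max:
        return "medium essay"
    return "long essay"


# Example with a synthetic 1,000-word essay.
sample = " ".join(["word"] * 1000)
print(essay_keyword(sample))
```

In Grooper, the word count would come from an extractor result rather than from raw text, but the branching is the same idea.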

Other Documents

This packet contains three other kinds of documents as well:

  • a proposal summary
  • the applicant's resume
  • and a letter of recommendation.

These documents (as well as the rest) will allow us to see how to insert bookmarks into the generated PDF, using the PDF Data Mapping's Bookmarking property configuration.

The original document, imported as a single multipage PDF file, has been processed a bit to facilitate this.

  1. Notice this document folder in the Batch is classified as a "UNESCO Application Packet" Document Type. This Batch Folder was created upon importing the original application packet file, named "UNESCO Packet.pdf".
  2. The PDF document's pages were split out using the Split Pages activity to create child Batch Page objects. This allowed us to separate the pages into a child document folder for each of the documents inside the imported application packet.
  3. PDF Data Mapping can create a bookmark in the generated PDF for each of these five sub documents using the Bookmarking property. Each bookmark will be named after its classified Document Type (i.e. "Application", "Proposal Summary", "Resume", etc.).

This means we can process the full imported application packet document, and export a single file with easily navigable bookmarks for its component documents. There's no need to export individual documents for each component document and figure out a way to index them, or put them in their own folder, or any other method you may come up with to relate them to each other in their final storage location. With the PDF Data Mapping's bookmarking capabilities, you can export just one file with each child Document Type bookmarked.
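To make the bookmarking idea concrete, here is a small Python sketch of how bookmark targets could be derived from child documents. It is an illustration only: the ChildDocument class and plan_bookmarks function are hypothetical, the document order is assumed, and every page count except the two-page Application is made up.

```python
from dataclasses import dataclass


@dataclass
class ChildDocument:
    document_type: str  # classified Document Type, used as the bookmark title
    page_count: int     # number of pages in this child document


def plan_bookmarks(children):
    """Return (title, first_page_index) pairs in packet order.

    Page indexes are zero-based positions in the single merged PDF.
    """
    bookmarks = []
    next_page = 0
    for child in children:
        bookmarks.append((child.document_type, next_page))
        next_page += child.page_count
    return bookmarks


# Hypothetical packet; only the Application's two pages come from this article.
packet = [
    ChildDocument("Application", 2),
    ChildDocument("Proposal Summary", 1),
    ChildDocument("Essay", 3),
    ChildDocument("Resume", 1),
    ChildDocument("Recommendation Letter", 1),
]
for title, page in plan_bookmarks(packet):
    print(f"{title} -> page index {page}")
```

Each bookmark simply points to the first page of its child document, which is why one exported file can stand in for five separately indexed ones.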

Configure PDF Generation for Annotations

BE AWARE: PDF Data Mapping cannot insert annotations on PDF pages with form fields.

If a PDF page is form-fillable, it is ill advised to insert annotations and widgets on top of these form fields. This can result in a corrupted PDF when it is generated by Merge or Export. PDF Data Mapping will not allow you to insert annotations and widgets on PDF pages with form fields.

About

PDF Data Mapping has the capability of inserting various annotations and native PDF widgets into the generated PDF. This increases the document's readability and adds functionality for the reader to interact with the document through widgets such as radio group buttons, checkboxes and signature fields.

We will demonstrate how to configure one example for each of the Annotation Types.

  1. Highlight Annotation
    • We will use Grooper to highlight the extraction result for the applicant's name on the document.
  2. Radio Group Widget
    • Radio buttons are useful for documents when you have a collection of choices listed and can only select one option. Such is the case for the "US Citizen" field on this document. You either are or are not a US Citizen and can answer "Yes" or "No". We will insert a radio group widget into this document to allow the user to toggle between these choices.
  3. Checkbox Widget
    • It seems every standard form uses checkboxes for one thing or another. This annotation will allow us to insert checkable checkboxes into the PDF file if located using OMR based extraction techniques. For example, the checkboxes here next to each checklist item for the application packet.
  4. Signature Widget
    • With the Signature Widget we can create a form-fillable signature box for the generated PDF. Notice the document as imported is not signed. With the PDF Data Mapping Behavior we can add a signature box to the processed file. This way you could send the application back to the applicant and have them sign the document digitally.

We will also use the Textbox Widget to insert editable text boxes into the document's coversheet. These text boxes will also be populated with some corresponding information from the rest of the document.

  1. A textbox will be created for the "Candidate" on the coversheet and populated with the applicant's first name, middle initial and last name (Dog O Doggerson).
  2. A textbox will be created for the "Title" on the coversheet and populated with the proposal title (Who's a Good Boy?)
  3. A textbox will be created for the "Country of Travel" on the coversheet and populated with the proposed travel country for the study abroad program (Japan).
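As a rough illustration of how the "Candidate" text box value could be assembled from other extracted fields, consider the Python sketch below. The function name is hypothetical and this is not how Grooper populates the field internally; it only shows the concatenation being described.

```python
def compose_candidate(first_name, middle_initial, last_name):
    """Join name parts with single spaces, skipping any that are empty."""
    parts = [first_name, middle_initial, last_name]
    return " ".join(p.strip() for p in parts if p and p.strip())


# Values from this article's sample applicant.
print(compose_candidate("Dog", "O", "Doggerson"))
```

Skipping empty parts avoids a doubled space when an applicant has no middle initial.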

Prereqs - Data Fields and Extracted Data

Before a PDF annotation can be generated, a document's data must be extracted. Put another way, the Extract activity must run before the Export activity (when the PDF Data Mapping ultimately builds the PDF and exports it).

Each of the Annotation Types points to a Data Field in a Data Model as part of its configuration. If the Data Field does not collect data during the Extract activity, the PDF Data Mapping won't know where to place the annotation.

  1. We will ultimately configure PDF Data Mapping using the Behaviors property of this Content Model which we've named "PDF Data Mapping - UNESCO Packet".
    • Before we do that, we will need to ensure we have Data Fields that correspond to the annotations we want to place.
  2. We've added the necessary Data Fields to the Content Model's Data Model.
  3. The "Candidate", "Title of Proposal", and "Country of Travel" Data Fields will be used to place the Textbox Widget annotations.
  4. The "Last Name", "First Name", and "Middle Initial" Data Fields will be used to place the Highlight Annotation annotations.
  5. The "US Citizen" Data Field will be used to place the Radio Group Widget annotation.
  6. The "Application", "Proposal Summary", "Essay", "Resume" and "Recommendation Letter" Data Fields will be used to place the Checkbox Widget annotations.
  7. The "Signature" Data Field will be used to place the Signature Widget annotation.

Add the Behavior

Annotations are one of the configuration options for the PDF Data Mapping Behavior. A Content Type Behavior can tell an activity (specifically the Export activity, in the case of PDF Data Mapping) how to use the Content Type to do something (how to use the Content Model's collected Data Fields to insert additional content when generating a PDF upon export, in this case).

  1. All Behaviors are added to a Content Type object.
    • We will add the PDF Data Mapping Behavior to this Content Model named "PDF Data Mapping - UNESCO Packet".
  2. All Behaviors are added using the Behaviors property. Select the Behaviors property and press the ellipsis button at the end to add PDF Data Mapping.
  3. This will bring up the Behaviors editor window.
  4. Click the "+" button to add a Behavior.
  5. Choose PDF Data Mapping from the list.

  1. Once added, you will see PDF Data Mapping added to the list on the left. Select it to add an Annotation.
  2. In the right panel, select the Annotations property and click the ellipsis button at the end.
  3. This will bring up an Annotations collection editor.

We will detail collection and configuration of the various Annotation Types in the next tabs of this tutorial.

Highlight Annotation

We will look at the Highlight Annotation first. This annotation is what it sounds like. You can use it to highlight portions of a PDF.

In this example, we will use the Highlight Annotation to highlight the extracted "Last Name", "First Name" and "Middle Initial" fields from the application form.

Before Annotation

After Annotation

  1. In the Annotations collection editor, press the "+" button to add the Highlight Annotation annotation.
    • Refer to the previous tab if you are unclear how we got to this window.
  2. Select Highlight Annotation from the list.

  1. This will add a Highlight Annotation to the Annotations list.
  2. The only configuration that is strictly required is to indicate which Data Fields you wish to highlight. Click the ellipsis icon next to the Fields property to select which Data Fields you wish to highlight.
    • Whatever result is returned by the selected Data Fields will be used to create the highlighted annotation.
  3. In the window that pops up, mark the checkboxes next to the Data Fields you wish to highlight.
    • In our case, we are choosing the "Last Name", "First Name", and "Middle Initial" Data Fields. Once collected by the Extract activity, Grooper will know where these results are located on the document. The Highlight Annotation annotation will then highlight the document as seen in the "After Annotation" image above.

Optionally, you can control how the highlight looks: its color, size, opacity, and whether or not there's a stroke around the highlighted rectangle.

  1. For instance, we set the Padding property to 0.1in
    • This will increase the size of the highlight rectangle by 0.1 inches on all sides.
    • All annotations have the ability to be padded to increase their size, not just Highlight Annotation.
    • You can also expand the Padding property's sub properties to adjust specific configurations for padding the Left, Top, Right, and Bottom edges.
  2. While we did not choose to do so, you can add a colored border around the highlighted rectangle by choosing a Border Style (such as Solid for a solid border or Dashed for a dashed line border)
    • The Border Color and Border Width properties will further help you configure the border produced.
    • Note: While the Border Color and Border Width properties are configured to 64, 64, 64 and 1pt by default, the Border Style is set to None by default. With no border produced, these properties are ignored. They will not be used to create a border until you choose a Border Style.
  3. We also set the Fill Color to Yellow.
    • Grooper defaults to green. This is the same green you see extraction results highlighted when you're testing out extractors in Grooper.
    • You can select colors using a dropdown list or use comma-separated values in the RGB color space. For example, "yellow" is also 255, 255, 128 in the RGB color space.
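The Padding and Fill Color settings above amount to simple geometry and parsing. The Python sketch below illustrates both under stated assumptions: the Rect class and both functions are hypothetical, coordinates are in inches with the origin at the top-left of the page, and Grooper's own handling may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """A rectangle in inches; assumes a top-left origin with y increasing downward."""
    left: float
    top: float
    right: float
    bottom: float


def pad(rect, left=0.0, top=0.0, right=0.0, bottom=0.0):
    """Grow a rectangle outward by the given amount on each edge."""
    return Rect(rect.left - left, rect.top - top,
                rect.right + right, rect.bottom + bottom)


def parse_rgb(text):
    """Parse a comma-separated RGB string such as "255, 255, 128"."""
    parts = [int(p.strip()) for p in text.split(",")]
    if len(parts) != 3 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError(f"Expected three values from 0 to 255, got: {text!r}")
    return tuple(parts)


# A hypothetical extracted data instance, padded 0.1in on all sides.
instance = Rect(left=1.0, top=2.0, right=3.0, bottom=2.25)
print(pad(instance, 0.1, 0.1, 0.1, 0.1))
print(parse_rgb("255, 255, 128"))
```

Padding all four edges by the same amount corresponds to setting the top-level Padding property, while the per-edge arguments mirror the Left, Top, Right, and Bottom sub properties.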

Radio Group Widget



The Radio Group Widget annotation allows you to add radio buttons to the document. Radio buttons are common PDF elements used to indicate a single choice from multiple options in a list. This Annotation Type uses OMR extraction techniques (such as Labeled OMR and Zonal OMR) to find existing checkboxes on the document. A group of radio buttons is then overlaid on top of the checkboxes when the PDF Data Mapping behavior builds the PDF file.

For example, we will create a Radio Group Widget annotation from the "US Citizen" Data Field's result. We have two choices, either "Yes" or "No". Only one or the other can be chosen. So, this is well suited for a radio button group.

Before Annotation

After Annotation

  1. In the Annotations collection editor, click the "+" button to add the Radio Group Widget annotation.
    • Refer to the "Add the Behavior" tab if you are unclear how we got to this window in Grooper Design Studio.
  2. Select Radio Group Widget from the list.

  1. This will add a Radio Group Widget to the Annotations list.
  2. The only configuration that is strictly required is to indicate which Data Fields you wish to use to create the radio buttons. Click the ellipsis icon to the right of the Fields property to select these Data Fields.
    • Whatever result is returned by the selected Data Fields will be used to draw and insert the radio buttons.
    • You may use the Padding property to adjust the size of the radio button if you desire.
    • These Data Fields must use an OMR based extraction method (Labeled OMR, Ordered OMR, or Zonal OMR) to insert the radio buttons.
  3. In the "Fields" window that pops up, click the checkbox next to the Data Fields you wish to use to create the group of radio buttons.
    • In our case, we are choosing the "US Citizen" Data Field. Once collected by the Extract activity, Grooper will know which results you want to use to create the radio buttons. This will include the checkbox locations and check states stored in the document's layout data. The Radio Group Widget annotation will then insert radio buttons into the generated PDF as seen in the "After Annotation" image above.

Let's briefly look at this "US Citizen" Data Field and see what's happening behind the scenes when PDF Data Mapping creates the radio buttons.

  1. We have selected the "US Citizen" Data Field in the Grooper Node Tree.
  2. This Data Field uses the Labeled OMR extractor to return its result, looking for checkboxes next to the labels "Yes" and "No" on the document.
  3. We're going to test the extraction by going to the "Tester" tab.

  1. Click the play icon to test the extraction.
  2. The box next to "Yes" is checked. This is ultimately the result returned to the "US Citizen" Data Field.
    • This is how the Radio Group Widget annotation knows where to place the radio button. The data instance used to insert the PDF radio button is drawn around the detected box (in this case highlighted in green in the Document Viewer).
    • Since this is the detected checked result, the radio button is configured as "pressed" upon outputting the generated PDF.
  3. Labeled OMR on this document is returning "Yes" as the result of extraction.
  4. The box next to "No" is not checked. The Radio Group Widget will also create radio buttons for the unchecked boxes next to labels on the document.
    • The alternate candidate data instances are used to insert the other PDF radio buttons in the group (in this case highlighted in red in the Document Viewer).
    • The unchecked boxes must be detected from a Box Detection or Box Removal IP Command in order to be inserted in the generated PDF. They must be present in the document's layout data file before the Extract activity runs.
    • Since this is detected as an unchecked result, the radio button is not pressed upon outputting the generated PDF.
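The mapping from detected boxes to radio button states can be sketched in Python as follows. This is an illustration only: the OmrBox class and radio_group_states function are hypothetical, and raising an error when more than one box is checked is an assumption, since the article does not say how Grooper handles that case.

```python
from dataclasses import dataclass


@dataclass
class OmrBox:
    label: str     # label next to the detected box, e.g. "Yes"
    checked: bool  # check state stored in the layout data


def radio_group_states(boxes):
    """Map each box label to True (pressed) or False (not pressed).

    A radio group allows only one pressed button, so more than one
    checked box is treated as an error in this sketch.
    """
    checked = [box.label for box in boxes if box.checked]
    if len(checked) > 1:
        raise ValueError(f"More than one option is checked: {checked}")
    selected = checked[0] if checked else None
    return {box.label: box.label == selected for box in boxes}


# The "US Citizen" example: "Yes" is checked, "No" is not.
print(radio_group_states([OmrBox("Yes", True), OmrBox("No", False)]))
```

Both the checked result and the unchecked alternates produce a button; only the checked one is pressed.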

FYI

In the case of every Annotation Type, PDF Data Mapping inserts the annotation by overlaying it on top of the document. This can be important to keep in mind for all annotations but is often particularly relevant when inserting radio buttons using the Radio Group Widget.

Notice the original image for this document used checkboxes, not radio buttons. We see an "X" inside of a square box.

The radio button annotations are simply overlaid on the page's image. You can actually see the edges of the square box persist in the generated PDF (Here, highlighted in yellow for your viewing pleasure).

In this case, the boxes were stored in the layout data using the Box Detection IP Command. This will find and store the checkbox locations and check states, but not actually alter the image in any way.

Maybe you care about this, and maybe you don't. If you do, you may consider using the Box Removal IP Command instead. Box Removal will also find and store the checkbox locations and their check states, but it will also digitally remove the checkboxes from the document's image.

In this case, the boxes were stored in the layout data using the Box Removal IP Command. Since the boxes are removed before the Export activity, the edges of the boxes are not present on the final image. The radio button annotations are placed on blank pixels.

Checkbox Widget

WIP

The Checkbox Widget documentation needs to be finalized after getting some guidance from dev. If it seems incomplete or images don't match up with text, that is why.



PDF Data Mapping also has the capability to insert form-fillable checkboxes, using the Checkbox Widget Annotation Type. This Annotation Type also uses OMR extraction techniques (such as Labeled OMR and Zonal OMR) to find existing checkboxes on the document. It works a lot like the Radio Group Widget annotation, except that editable checkboxes, rather than radio buttons, are overlaid on the document.

For example, we will create a Checkbox Widget annotation for the checkboxes in the "Checklist" section of this document, the "Application", "Proposal Summary", "Essay", "Resume" and "Recommendation Letter" Data Fields. These are Boolean OMR checkboxes, returning "true" if the box next to the corresponding label is checked, and "false" if unchecked. In either case, checked or not, the Checkbox Widget will insert an editable checkbox element into the generated PDF.

Before Annotation

After Annotation

  1. In the Annotations collection editor, click the "+" button to add the Checkbox Widget annotation.
    • Refer to the "Add the Behavior" tab if you are unclear how we got to this window in Grooper Design Studio.
  2. Select Checkbox Widget from the list.

  1. This will add a Checkbox Widget to the Annotations list.
  2. The only configuration that is strictly required is to indicate which Data Fields you wish to use to create the checkboxes. Click the ellipsis icon to the right of the Fields property to select these Data Fields.
    • Whatever result is returned by the selected Data Fields will be used to draw and insert the checkboxes.
    • You may use the Padding property to adjust the size of the checkboxes if you desire.
    • These Data Fields must use an OMR based extraction method (Labeled OMR, Ordered OMR, or Zonal OMR) to insert the checkboxes.
  3. In the window that pops up, check the boxes next to the Data Fields you wish to use to create the checkboxes.
    • In our case, we are choosing the "Application", "Proposal Summary", "Essay", "Resume" and "Recommendation Letter" Data Fields. Once collected by the Extract activity, Grooper will know which results you want to use to create the checkboxes. This will include the checkbox locations and check states stored in the document's layout data. The Checkbox Widget annotation will then insert checkboxes into the generated PDF as seen in the "After Annotation" image above.

Signature Widget



Form-fillable signature boxes can be inserted using the Signature Widget annotation. This Annotation Type uses a zonal extraction type (such as Detect Signature or Highlight Zone) to draw the boundaries of the inserted signature widget. This allows you to create a document that can be digitally signed straight from Grooper upon exporting the generated PDF.

For example, we will create a Signature Widget annotation for the signature line on the application form, using the "Signature" Data Field of our Data Model. The Signature Widget will insert an interactable signature element into the generated PDF.

Before Annotation

After Annotation

  1. In the Annotations collection editor, click the "+" button to add the Signature Widget annotation.
    • Refer to the "Add the Behavior" tab if you are unclear how we got to this window in Grooper Design Studio.
  2. Select Signature Widget from the list.

  1. This will add a Signature Widget to the Annotations list.
  2. The only configuration that is strictly required is to indicate which Data Fields you wish to use to create the signature box. Click the ellipsis icon to the right of the Fields property to select these Data Fields.
    • Whatever result is returned by the selected Data Fields will be used to draw and insert the signature box widget.
    • You may use the Padding property to adjust the size of the signature box if you desire.
    • Zonal based extraction methods (such as Detect Signature and Highlight Zone) are typically used as the Data Field's extractor type.
  3. When the window pops up, check the boxes next to the Data Fields you wish to use to create the signature box.
    • In our case, we are choosing the "Signature" Data Field. Once collected by the Extract activity, Grooper will be supplied the size and location of the Data Field's extraction zone, which will form the size and location of the PDF signature widget. The Signature Widget annotation will then insert the form-fillable signature box into the generated PDF as seen in the "After Annotation" image above.

Just like any Annotation Type, the extraction result from the Data Field is critical for placing the signature annotation on the generated PDF. Let's look at the "Signature" Data Field's result to understand a little better how these results are used to create the signature widget.

In our case, we're using the Detect Signature extractor type to supply these results. The Detect Signature extractor is perfectly suited for the Signature Widget Annotation Type.

  • It actually combines both Zonal and OMR based extraction techniques to determine if a signature is present in the zone. It sets the boundaries of where you expect to find a signature using Zonal based methods and detects if the signature is present by counting the percentage of filled pixels in the zone, which is the basis of OMR based extraction methods. You can then output different values if the zone is filled above or below a certain percentage. In this case, the extractor returns "Not Signed" because there aren't enough pixels present in the extraction zone to count as filled. If there were a signature present, there'd be more pixels present, accounting for a higher filled percentage.

This is great for our purposes because it gives us the exact information we need for the Signature Widget, which is an extraction zone. Grooper needs a data instance indicating the size and location for the generated signature widget.

  • But wait there's more! We also get some bonus information about whether or not there's a signature present. Does the Signature Widget Annotation Type need to know if there's a signature present? No. It does not. It will place the widget no matter what the result is. But might that information be otherwise useful to you? Probably.
  1. We have selected the "Signature" Data Field in our Data Model.
  2. This Data Field uses the Detect Signature extractor to draw the extraction zone used to insert the signature widget.
  3. This extractor uses the Text Region Location option.
  4. This gives us the ability to anchor the extraction zone to an extractable text anchor, using the Text Extractor property.
    • In this case we've anchored the zone to the word "Signature" outlined in blue in the document viewer. Where do we want to place the extraction zone (and ultimately the signature widget)? On the signature line. How do we know where that line is? It's above the text label "Signature".
  5. The extraction zone itself is drawn using the Translation and Adjustment properties.
    • This allows us to set the size (Adjustment) and location (Translation) of the extraction zone (and ultimately the signature widget) relative to the Text Extractor's result.
    • The extraction zone will be the green rectangle in the document viewer.
  6. Click over to the "Tester" tab and test the extraction.

  1. When the PDF Data Mapping behavior builds the PDF using the Signature Widget annotation, the extraction zone's size and location form the size and location of the inserted signature widget.
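As a rough sketch of how a zone can be positioned relative to a text anchor, the Python below offsets a rectangle from the anchor and gives it its own size. The `Rect` type, the `zone_from_anchor` helper, and all coordinates are hypothetical; the exact semantics of Grooper's Translation and Adjustment properties may differ from this simplified model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    width: float
    height: float


def zone_from_anchor(anchor, dx, dy, width, height):
    """Place a zone of the given size, offset from the anchor's top-left corner.

    dx/dy stand in for a translation; width/height stand in for an
    adjustment. This is an assumed model, not Grooper's actual arithmetic.
    """
    return Rect(anchor.left + dx, anchor.top + dy, width, height)


# Hypothetical coordinates (in inches) for the word "Signature" on the page.
label = Rect(left=1.0, top=9.0, width=0.8, height=0.2)

# The signature line sits above the label, so the zone moves up (negative dy).
zone = zone_from_anchor(label, dx=0.0, dy=-0.5, width=3.0, height=0.5)
print(zone)
```

Because the zone is defined relative to the anchor, it follows the label if the label shifts from one scanned page to the next.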

Textbox Widget



The Textbox Widget Annotation Type will insert editable text boxes into the generated PDF. One simple way to use this functionality is to use the Highlight Zone extractor type to place a blank zone where you want to place an empty text box on the PDF. However, any extractor type can be used to define the textbox's location. Furthermore, if the Data Field used to create the annotation collects a value during the Extract activity, not only will a textbox be inserted into the generated PDF, but it will be prefilled with the Data Field's extracted value upon export.

For example, we will use the Textbox Widget functionality to fill out the blank coversheet on the first page of our application packet. We will use a Highlight Zone extractor to define the size and location of each text box. However, we're going to go one step further and populate the Data Fields used with information from other Data Fields in our Data Model. By the end of it, PDF Data Mapping will not only insert editable textboxes into the generated PDF, but also fill them in with text, so the blank coversheet is automatically populated with information collected during the Extract activity.

Before Annotation

After Annotation

  1. In the Annotations collection editor, click the "+" button to add the Textbox Widget annotation.
    • Refer to the "Add the Behavior" tab if you are unclear how we got to this window in Grooper Design Studio.
  2. Select Textbox Widget from the list.

  1. This will add a Textbox Widget to the Annotations list.
  2. The only configuration that is strictly required is to indicate which Data Fields you wish to use to create the textboxes. Click the ellipsis icon to the right of the Fields property to select these Data Fields.
    • Whatever result is returned by the selected Data Fields will be used to draw and insert the textbox widget. If that Data Field collected a value during the Extract activity, it will also be filled with the returned value.
  3. In the window that pops up, check the box next to the Data Fields you wish to use to create the textboxes.
    • In our case, we are choosing the "Candidate", "Title of Proposal" and "Country of Travel" Data Fields. Once collected by the Extract activity, Grooper will be supplied the sizes and locations of the Data Fields' data instances for each result. These will form the size and location of each textbox widget. The Textbox Widget annotation will then insert the form-fillable textboxes into the generated PDF as seen in the "After Annotation" image above. These boxes will also be prefilled with the extraction results from each Data Field.

The Textbox Widget annotation has some additional configuration options as well.

  1. As with all Annotation Types, you can optionally adjust the size of the annotation using the Padding property.
  2. You can also change the font and font size of the editable text in the textbox using the Font Name and Font Size properties.

Looking behind the scenes, there are at least two things going on with how we've set up these Data Fields' extraction, ultimately supplying the result used to insert the Textbox Widget annotation.

First, we used the Highlight Zone extractor type to draw the textbox, defining the size and location of the annotation upon generating the PDF.

  1. We have selected the "Candidate" Data Field in our Data Model.
  2. Each Data Field's Value Extractor is set to Highlight Zone.
  3. We used the Relative Region Location option to anchor an extraction zone to the box next to the label "Candidate".
    • This will form the size and location of the inserted textbox annotation.

Second, we used an expression to return a value, using the results of other Data Fields in our Data Model.

  1. We've used the Calculated Value property (in Calculate Mode Always Set) to return the full name of the candidate extracted by the "Last Name", "First Name", and "Middle Initial" Data Fields.
    • The full expression is as follows: Applicant_Information.First_Name + " " + Applicant_Information.Middle_Initial + " " + Applicant_Information.Last_Name
  2. This will take the extraction results of these three Data Fields and concatenate them with space characters in between.
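To make the concatenation concrete, here is the same logic as a small Python sketch. It only mirrors the string arithmetic of the expression; the `full_name` helper and the way the example name is split into first, middle, and last parts are assumptions for illustration.

```python
def full_name(first_name, middle_initial, last_name):
    # Mirrors: First_Name + " " + Middle_Initial + " " + Last_Name
    return first_name + " " + middle_initial + " " + last_name


# Assumed split of the article's example candidate name into its three parts.
applicant = {"First_Name": "Dog", "Middle_Initial": "O", "Last_Name": "Doggerson"}

candidate = full_name(
    applicant["First_Name"],
    applicant["Middle_Initial"],
    applicant["Last_Name"],
)
print(candidate)  # Dog O Doggerson
```

Note that a plain concatenation like this leaves two consecutive spaces when the middle initial is empty, so a real expression may need extra handling for that case.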

  1. However, if we go to the "Tester" tab...
  2. ... and test extraction, we're going to get an error.
    • We're in the wrong scope! We need to go up to the Data Model's level and test extraction there. We need the full Data Model's results to do what we're trying to do here. When testing extraction on just the "Candidate" Data Field, Grooper can't "see" the "Last Name", "First Name" and "Middle Initial" Data Fields' results to combine them.

  1. Once we test extraction on the Data Model you'll see what results are actually collected by the Extract activity.
  2. Make sure you're on the "Tester" tab and test the extraction.
  3. The Calculated Value expression we configured forms one result for the "Candidate"...
  4. ...using the results of the "Last Name", "First Name" and "Middle Initial" Data Fields.
  5. With a result returned and zone drawn upon extract, the Textbox Widget annotation has all the information it needs to place the form-fillable textbox and fill it with the results.


FYI

This certainly isn't the only way to set up a Data Field for a Textbox Widget. This is just how we did it for the point of illustrating the Textbox Widget functionality. You are not required to use the Highlight Zone extractor type. You can use whatever extractor type best suits your document's needs. Often Grooper users will use the Reference extractor to point to a Data Type's results and adjust the size of the Textbox Widget using its Padding property.

Configure PDF Data Mapping for Bookmarks

About

Bookmarks in PDFs aid readers when navigating through multipage documents. PDF Data Mapping can insert bookmarks into the generated PDF to take advantage of this functionality. This can be done in one of two ways (or both):

  1. Using a Batch Folder's child document folders.
  2. Using the document's extracted Data Fields.

We will focus on the first bookmarking method, using child document folders (as it is more common). Often it is the case you will import a file into Grooper that has multiple documents inside you want to separate and classify, but otherwise all belong together in one way or another.

Such is the case with our study abroad application packet. The application packet as a whole consists of five separate and distinguishable documents.

  1. The application itself (and a coversheet)
  2. A proposal summary
  3. The student's resume
  4. A letter of recommendation
  5. An essay

Our goal is to create a bookmark in the generated PDF file for each of these component documents (or child documents as we will come to call them).

Rather than exporting five separate PDF files for each component document, we will export a single PDF for the whole packet with navigable bookmarks corresponding to each component document.

  1. Application - For the application itself (and its coversheet)
  2. Proposal Summary - For the proposal summary
  3. Resume - For the student's resume
  4. Rec Letter - For the letter of recommendation
  5. Essay - For the essay

Prereqs - Split Pages, Separation, and Classification

In order to accomplish this goal, we're going to have to do some things to this application packet before we configure PDF Data Mapping.

By the end of it, we're looking for a Batch whose documents have a structure like this. The documents in this batch consist of two Batch Folder levels.

  1. Folder Level 1: This is the parent document folder. It is the container for the full document. All seven pages of the application packet in this case.
  2. Folder Level 2: These are the child document folders for the parent document. They are the containers for each component document of the full application packet.

This is what we want to end up with. How did we get there? Long story short, we have some document separation and classification requirements before we can insert bookmarks in the generated PDF. The bookmarks are inserted for each child document folder and named after their classified Document Type's name. In order to do that, we need to split out the pages of the imported document, separate them into child document folders, and classify them first.

The full application document came into Grooper like this. A 7 page PDF file with each of these 5 component documents was imported into a new Batch. This is now the parent document folder at Folder Level 1.

But there's documents in them there document! How do we get them out?

First, we need to use the Split Pages activity to create child Batch Page objects.

This will split out the pages of the imported PDF file, creating one child Batch Page for each page in the PDF on the parent document folder. Now we have page objects we can manipulate in our Batch.

Now that we have Batch Page objects in our Batch, we can use the Separate activity to insert the second folder level. This is the first step in organizing these pages into child documents. We need to distinguish between one collection of pages as a document and another collection of pages as a document. Creating folders is the first part of that equation.

Now, we have child document folders for this parent document folder, but they are just blank folders. There is nothing to distinguish one folder from the next.

By default, the Separate activity runs on the Batch level scope, inserting folders at Folder Level 1. When separating child documents like this, you will need to change the Scope property of the Separate activity to run it at the Folder Level 1 scope. This will separate the loose pages of folders at Level 1, inserting child document folders at Level 2 below the parent folder at Level 1.

And, that's the second part of the organization equation, classification. Next, these folders will be assigned a Document Type from our Content Model using the Classify activity.

By default, the Classify activity runs on the Folder Level 1 scope, classifying document folders at the first folder level in the Batch hierarchy. We want to classify the child document folders at Folder Level 2. When classifying child document folders like this, you will need to change the Scope property of the Classify activity to run at the Folder Level 2 scope.

Furthermore, that parent document folder would need a Document Type assigned to it at some point as well. The Batch Process for this Batch might have two Classify activities. One running on Folder Level 1 to classify the parent document folder and another running on Folder Level 2 to classify the child document folders.

Now, we have everything we need to configure the bookmarking functionality of PDF Data Mapping. Bookmarks will be created every time a new child document is encountered and named after the Document Type assigned to that folder.

When the full PDF is generated, a bookmark named "Application" will be inserted at the first page of the PDF. That child document is two pages long. The third page of the full PDF will be the proposal summary. So a bookmark named "Proposal Summary" will be inserted at page three. A "Resume" bookmark will be inserted at page four. And so on.
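The page arithmetic behind these bookmarks is a running total of each child document's page count. The Python sketch below illustrates that idea only; the `bookmark_pages` helper is hypothetical, and the page counts for the last three documents are assumed (the article only establishes that the packet has seven pages, the application has two, and the proposal summary has one).

```python
def bookmark_pages(child_documents):
    """Map each child document to the 1-based page where it starts."""
    bookmarks = []
    next_page = 1
    for doc_type, page_count in child_documents:
        bookmarks.append((doc_type, next_page))
        next_page += page_count
    return bookmarks


packet = [
    ("Application", 2),        # from the article: two pages long
    ("Proposal Summary", 1),   # from the article: page three only
    ("Resume", 1),             # assumed page count
    ("Rec Letter", 1),         # assumed page count
    ("Essay", 2),              # assumed page count
]

for name, page in bookmark_pages(packet):
    print(name, "-> page", page)
# Application -> page 1
# Proposal Summary -> page 3
# Resume -> page 4
# Rec Letter -> page 5
# Essay -> page 6
```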

FYI

There are many ways to separate and classify documents, including ESP Auto Separation which both separates and classifies documents with a single activity (just Separate). But this is the general idea to get us where we need to go.

One way or another, create classified child document folders from a parent document folder. That way when we generate the PDF for the parent document folder upon export, bookmarks will be created for the classified child document folders.

Add the Behavior and Configure It for Bookmarking

Bookmarking is one of the configuration options for the PDF Data Mapping Behavior. A Content Type Behavior can tell an activity (specifically the Export activity, in the case of PDF Data Mapping) how to use the Content Type to do something (in this case, how to use the Content Model's Document Types to insert bookmarks into the PDF upon export).

  1. All Behaviors are added to a Content Type object.
    • We will add the PDF Data Mapping behavior to this Content Model named "PDF Data Mapping - UNESCO Packet".
  2. All Behaviors are added using the Behaviors property. Select the Behaviors property and press the ellipsis button at the end to add PDF Data Mapping.
  3. In the Behaviors editor window that pops up, click the "+" button to add a Behavior.
  4. Choose PDF Data Mapping from the list.

  1. Once added, you will see PDF Data Mapping added to the list on the left. Select it.
  2. To enable the bookmarking functionality, in the right panel, click the checkbox next to the Bookmarking property.
  3. Open up the subproperties and we see we have two Label properties. Here you can change the Label Style and Label Color to your preference.

For our purposes, this is all we need to configure at this point. However, be aware of the Bookmarking configuration options.

  1. Click the ellipsis icon to the right of the Data Elements property.
  2. In the new "Data Elements" window that pops up, click the check boxes next to the elements you want bookmarked.
    • You can add bookmarks to any of the data elements. You can expand the Data Sections to add individual Data Fields within those sections as well if you like. However, if you add a Data Field that is a child of a Data Section, then that Data Section must be added too.
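The parent-section rule can be expressed as a simple check over a selection. The sketch below is an illustration in Python, not part of Grooper: `missing_parent_sections` and the `parent_of` mapping are hypothetical, and the element names are borrowed from expressions used elsewhere in this article.

```python
def missing_parent_sections(selected, parent_of):
    """Return the Data Sections that must also be selected.

    selected:  set of selected element names
    parent_of: dict mapping a child element name to its parent Data Section
               (top-level elements are absent from the dict)
    """
    missing = set()
    for name in selected:
        parent = parent_of.get(name)
        # Walk up the tree in case sections are nested.
        while parent is not None:
            if parent not in selected:
                missing.add(parent)
            parent = parent_of.get(parent)
    return sorted(missing)


parent_of = {
    "Essay Word Count": "Essay Information",
    "First Name": "Applicant Information",
}

print(missing_parent_sections({"Essay Word Count"}, parent_of))
# ['Essay Information']
print(missing_parent_sections({"Essay Word Count", "Essay Information"}, parent_of))
# []
```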

Configure PDF Data Mapping for Metadata

About

The PDF Data Mapping behavior has the ability to create and insert additional metadata into the generated PDF as well, using information collected during Grooper's document processing. The metadata you are able to create falls into one of three categories:

  1. Editing the PDF's default metadata fields.
    • This includes the following metadata fields that are standard to every PDF file:
      • Title
      • Author
      • Subject
      • Created Date
      • Modified Date
      • Application (Used to establish the "creator" application which created the original file. This can be useful if the original file was created in a different application, like Microsoft Word, and converted to a PDF format by Grooper with a PDF Data Mapping behavior.)
  2. Creating custom metadata fields
    • This is done using extracted Data Field values collected during the Extract activity.
  3. Adding "Keywords" to the PDF metadata
    • This can be done using expression based or extraction based methods.

Notice what's not included in this list is the exported document's filename (e.g. "Im_a_file.pdf"). Filename mappings are always configured using an Export Behavior.


Prereqs - Data Extraction

If we're going to insert some metadata into these PDFs, that data has to come from somewhere. In broad terms, the metadata creation is done in one of two ways (or a combination of the two):

  1. Using expression based creation
    • In the case of the default PDF metadata fields and keywords, expressions can be used to populate the metadata. This gives you access to system data, classification information, extracted Data Field results, and various .NET functions to manipulate it.
  2. Using Data Field results
    • In the case of the custom PDF metadata, the custom fields are generated from Data Fields in the document's Data Model and their collected results from the Extract activity.
    • This means the document must be processed by the Extract activity in order to create and populate these custom fields.

Add the Behavior and Enable Metadata

Metadata is one of the configuration options for the PDF Data Mapping behavior. A Content Type Behavior can tell an activity (specifically the Export activity, in the case of PDF Data Mapping) how to use the Content Type to do something (how to use the Content Model's collected Data Fields and other information to edit the generated PDF's metadata, in this case).

  1. All Behaviors are added to a Content Type object.
    • We will add the PDF Data Mapping behavior to this Content Model named "PDF Data Mapping - UNESCO Packet".
  2. All Behaviors are added using the Behaviors property. Select the Behaviors property and press the ellipsis button at the end to add the PDF Data Mapping behavior.
  3. In the Behaviors editor window that pops up, click the "+" button to add a Behavior.
  4. Choose PDF Data Mapping from the list.

  1. Once added, you will see PDF Data Mapping added to the list on the left. Select it.
  2. To enable the metadata functionality, in the right panel, click the checkbox next to the Metadata property.

Edit Default PDF Metadata

Once enabled, the first six Metadata sub-properties all pertain to the default PDF metadata fields Grooper can edit: Title, Author, Subject, Creation Date, Modified Date, and Creator.

These are edited with code expressions.

  1. The Title property corresponds to the PDF's "Title" field.
    • By default, this expression is set to CurrentDocument.ContentTypeName
      • This will make the title whatever the document's Document Type classification is.
      • In our case, these document folders are assigned the "UNESCO Application Packet" Document Type of our Content Model.
  2. The Author property corresponds to the PDF's "Author" field.
    • By default, this expression is set to LDAP.CurrentUserDisplayName
      • This will make the author the display name of whatever user is logged into the machine exporting the documents.
    • We've changed this to Candidate
      • This will make the author the result of the "Candidate" Data Field (which is "Dog O Doggerson" for our example document).
  3. The Creator property corresponds to the PDF's "Application" field.
    • This field is intended to be used when generating PDFs from different file types. For example, if the file was originally a Microsoft Word document, you might enter "Microsoft Word" to fill this field.
    • This field is blank by default, and we have left it so.
  4. The Subject property corresponds to the PDF's "Subject" field.
    • This field is blank by default.
    • We've decided to populate this field with the extracted proposal title, using the results of the "Title of Proposal" Data Field and the expression Title_of_Proposal
      • Note: Spaces in Data Fields must be replaced with underscores in expressions.
  5. The Creation Date and Modification Date properties correspond to the PDF's "Created" and "Modified" fields.
    • By default, these both use the expression DateTime.Now
      • This will return the current system time of your machine at the time of export.

  1. When we open the document in Adobe Acrobat and view these fields using the "Document Properties" window, you can see the metadata this configuration generated for the PDF.
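Taken together, the configuration above amounts to a small mapping from document information to the PDF's standard metadata entries. The Python sketch below models that mapping as a plain dictionary. It is not Grooper code: `build_info` is a hypothetical helper, the dictionary keys follow the standard PDF entry names, and the proposal title value is a placeholder.

```python
from datetime import datetime


def build_info(content_type_name, fields, now):
    """Assemble default PDF metadata the way the walkthrough configures it."""
    return {
        "Title": content_type_name,                      # Document Type name
        "Author": fields.get("Candidate", ""),           # "Candidate" Data Field
        "Subject": fields.get("Title_of_Proposal", ""),  # "Title of Proposal" Data Field
        "Creator": "",                                   # left blank in the walkthrough
        "CreationDate": now,                             # time of export
        "ModDate": now,                                  # time of export
    }


fields = {
    "Candidate": "Dog O Doggerson",
    "Title_of_Proposal": "Example Proposal",  # placeholder value
}
info = build_info("UNESCO Application Packet", fields, datetime.now())
print(info["Title"], "|", info["Author"], "|", info["Subject"])
```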

Add Keywords

Grooper can add keywords into the PDF's "Keywords" field in one of two ways, either using an expression or a referenced extractor's results.

In our case, we're going to use an expression to determine if the word count of the "Essay" document in the application packet is "Long", "Short", or "Normal".

  1. We will use the results of the "Essay Word Count" Data Field of our Data Model to do this.
  2. This Data Field's extraction is configured to count the number of words in the essay.

If the word count is above 600 words, we'll call that a long essay. If it's below 400 words, we'll call that a short essay. And if it's anywhere in between, we'll call it a normal essay.

The expression below uses a series of nested conditional statements using the IIf() function to accomplish this.

IIf(Essay_Information.Essay_Word_Count > 600, "Long Essay", IIf(Essay_Information.Essay_Word_Count > 400, "Normal Essay", "Short Essay"))

If the word count is greater than 600, the keyword evaluates to "Long Essay". Otherwise, if it is greater than 400, the keyword evaluates to "Normal Essay". If neither condition is met, the keyword evaluates to "Short Essay".
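For readers more comfortable with if/else chains, the nested IIf() expression above is equivalent to the following Python sketch. The function name `essay_keyword` is made up for illustration; the thresholds and labels are taken directly from the expression.

```python
def essay_keyword(word_count):
    # Outer IIf: anything above 600 words is long.
    if word_count > 600:
        return "Long Essay"
    # Inner IIf: above 400 (and at most 600) is normal.
    if word_count > 400:
        return "Normal Essay"
    # Everything else, including exactly 400, is short.
    return "Short Essay"


print(essay_keyword(485))  # Normal Essay
print(essay_keyword(650))  # Long Essay
print(essay_keyword(300))  # Short Essay
```

Because both comparisons are strict, a count of exactly 600 yields "Normal Essay" and a count of exactly 400 yields "Short Essay".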

To use this expression to add the keyword to the generated PDF's metadata, we will configure the Keywords property.

  1. In the Metadata sub-properties, select the Keywords property and click the ellipsis button at the end.
  2. In the expression editor that pops up, enter the expression you wish to use to create the keywords.
    • As is the case with any expression editor, Grooper's IntelliSense code completion will aid you when writing your code expressions.
  3. Click "OK" when finished.

  1. When we open the generated PDF in Adobe Acrobat and view the "Document Properties" window, you can see the metadata this configuration generated for the PDF.
    • The keyword "Normal Essay" has been added to the keywords list.
    • The extracted value for the "Essay Word Count" field was 485, which is less than 600 and greater than 400. Evaluated by our Keywords expression, that returns a value of "Normal Essay".

Add Custom Metadata

Last but not least, you can add custom metadata fields to the generated PDF using extraction results from the document's Data Model. A custom metadata field is generated for every Data Field you choose in the Content Type's Data Model.

  1. Remember, we add Behaviors to Content Types (typically a Content Model or a Document Type). In this case we're adding the PDF Data Mapping behavior to the Content Model.
  2. Content Models and Document Types can have their own Data Models as one of their children. Configuring PDF Data Mapping on the Content Model, we will utilize its Data Model to export this custom metadata.
  3. This Data Model is configured with several Data Fields. These Data Fields will collect information about the "UNESCO Application Packet" and its component documents, such as the applicant's name and information about the proposal.
    • This will be done during the Extract activity. Once collected, PDF Data Mapping can insert the results into the generated PDF, creating one custom metadata field and corresponding result for each Data Field and its extracted result.

To do this, we will use the Export Data Fields option of PDF Data Mapping's Metadata properties.

  1. In the Metadata sub-properties, click the check box next to the Export Data Fields property to change it from False to True
  2. By default, once you enable this property, Grooper will export all Data Fields available to the Content Type on which PDF Data Mapping is configured.
    • You can be more selective about what you want to include using the Field Filter property.
    • This will give you a drop down list of all the Data Field nodes available for custom PDF metadata creation. You can check the box next to which ones you wish to include, leaving those Data Fields you wish to exclude unchecked.

  1. When we open the generated PDF in Adobe Acrobat and view the "Document Properties" window, you can see the custom metadata generated in the "Custom" tab.
  2. The Data Fields' names show up in the "Names" column.
    • Note: A Data Field inside a Data Section has its name prefixed with the Data Section's name. For example, the "Proposal Title" Data Field in the "Proposal Information" Data Section becomes "Proposal_Information.Proposal_Title".
  3. The Data Fields' results, collected by the Extract activity, show up in the "Value" column.
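The naming pattern for custom metadata fields can be sketched as follows. This Python helper is hypothetical and only reproduces the pattern visible in the example above (spaces become underscores, section and field joined by a period). Applying the same underscore rule to top-level Data Fields is an assumption.

```python
def metadata_name(field_name, section_name=None):
    """Build a custom metadata field name from Data Model element names."""

    def clean(name):
        # Spaces are replaced with underscores, as in Grooper expressions.
        return name.replace(" ", "_")

    if section_name:
        return clean(section_name) + "." + clean(field_name)
    return clean(field_name)


print(metadata_name("Proposal Title", "Proposal Information"))
# Proposal_Information.Proposal_Title
print(metadata_name("Candidate"))
# Candidate
```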

Be aware the PDF file format has metadata fields already named "Title", "Author", "Subject", "Keywords", "Creator", "Producer", "CreationDate", "ModDate" and "Trapped".

You may run into an issue upon export if you have Data Fields in your Data Model that share one of these names. If using the Metadata creation capabilities of PDF Data Mapping, consider these names "taken" and adjust the name of the Data Field to be something different. For example, in this case a Data Field returning the title of the proposal listed on the application was changed from "Title" to "Title of Proposal".
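A quick way to catch such collisions before export is to compare your Data Field names against the reserved list. The Python sketch below is an illustration only: `conflicting_fields` is a hypothetical helper, and the exact-match, case-sensitive comparison is an assumption about how the collision occurs.

```python
# Standard PDF metadata names listed in this article.
RESERVED_NAMES = {
    "Title", "Author", "Subject", "Keywords", "Creator",
    "Producer", "CreationDate", "ModDate", "Trapped",
}


def conflicting_fields(field_names):
    """Return the Data Field names that collide with a reserved PDF name."""
    return [name for name in field_names if name in RESERVED_NAMES]


print(conflicting_fields(["Title", "Candidate", "Country of Travel"]))
# ['Title']
print(conflicting_fields(["Title of Proposal", "Candidate"]))
# []
```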

  1. You can also access this data using the "Additional Metadata..." button in the "Description" tab.
  2. Select the "Advanced" item.
  3. You'll see all the generated custom metadata listed under the "http://ns.adobe.com/pdfx/1.3/" node.

Export the Generated PDF

There's one last crucial step to using the PDF Data Mapping behavior: Exporting the generated PDF.

There's no point in generating the PDF with all this additional metadata, bookmarks and annotations if you don't get it out of Grooper and into some kind of external storage platform. That's the job of the Export activity. To properly export the PDF generated by PDF Data Mapping there are some specific requirements to keep in mind.

Add an Export Behavior

In order to export any document from Grooper, you need to configure an Export Behavior for the Export activity to know how you want to export document folders in a Batch and what external storage platform you're exporting them to. The Export activity is also what generates documents built from Batch Folder content. The PDF Data Mapping behavior gives the Export activity additional information about how to generate it (utilizing the additional metadata, bookmarking, and annotation elements).

There are two ways to configure an Export Behavior:

  1. On a Content Type (A Content Model, Content Category or Document Type)
    • This is the most common and preferred way of setting up an Export Behavior.
    • The export settings are then "shared" with the Export activity based on each document's assigned Document Type.
  2. Locally on the Export activity itself.

Content Type Export Behavior Configuration

  1. Select the Content Type whose Export Behavior you want to configure.
    • For example, we have selected our "PDF Data Mapping - UNESCO Packet" Content Model we've been working with in the Grooper Node Tree.
    • Any child Document Type will use the export settings we configure here.
  2. The Export Behavior is added to the Content Model using the Behaviors property. Clicking the ellipsis button at the end of the Behaviors property will bring up the Behaviors list editor.
  3. Add the Export Behavior by Clicking the "+" button.
  4. Select Export Behavior from the list.
    • If you need more information on how to set up an Export Behavior, visit the Export article.

Export Activity Behavior Configuration

  1. Select the Export Batch Process Step in the Grooper Node Tree.
  2. You configure a local Export Behavior using the Export Behaviors property.
    • If you need more information on how to set up an Export Behavior, visit the Export article.

Configure the Export Behavior to Export PDFs

Whether you elect to use a local or shared Export Behavior, the next step is to configure it to export the document folders in the Batch as PDF files.

  1. First, you will need to add an Export Definition. This will define where and how the documents are exported. Click the ellipsis icon to the right of the property.
  2. Click the "+" button.
  3. Select the Export Type. This controls where the documents are going, that is, which external storage platform will store the generated PDF.
    • This is up to you and your needs. The option you select just needs to be a storage platform that supports the PDF file format.

The last piece of the puzzle is just telling the Export Behavior what file format you want to use for the exported documents. To take advantage of PDF Data Mapping, we will want to tell it to export the documents as PDFs.

  1. Next, you will want to find the Export Formats property.
    • Depending on the Export Type selected, this property may be in a different order in the property grid. Or for Export Types that don't support different formats, this property will not be present (such as the Data Export which just exports data to a database, not files).
  2. Click the ellipsis button at the end to bring up the Export Formats list editor.

  1. Click the "+" button to add an Export Format.
  2. Select PDF Format from the list.

Additional Formatting Considerations

  1. Depending on how your documents were sourced on import, you may need to enable the Always Build mode as well.
    • Such is actually the case in the workflow we've simulated in this tutorial. We processed these UNESCO study abroad application packets from an imported PDF, which we split out into individual page objects, so we could separate out the component Document Types that comprised the full file. The original PDF file from import lives on the parent document's Batch Folder object. If you leave Always Build set to False, that imported file living on the parent document folder is what will get exported, not the PDF built by the PDF Data Mapping with additional metadata, bookmarking and annotation elements.
    • If you run into a situation where the output PDF does not reflect the PDF Data Mapping configurations you've set up, a good first troubleshooting step is changing Always Build to True to ensure the exported file is the PDF Data Mapping built PDF and not the original (also called "native") pre-processed PDF.
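The effect of Always Build can be summarized as a simple decision. The Python sketch below is a simplified mental model based on the description above, not Grooper's actual export logic; the `file_to_export` function and its return labels are hypothetical.

```python
def file_to_export(has_native_pdf, always_build):
    """Decide which file is exported for a document folder.

    has_native_pdf: the folder still holds the originally imported PDF
    always_build:   the Always Build setting on the PDF Format
    """
    if has_native_pdf and not always_build:
        # The untouched imported file is exported as-is.
        return "native"
    # Otherwise a new PDF is built, including PDF Data Mapping content.
    return "built"


print(file_to_export(has_native_pdf=True, always_build=False))  # native
print(file_to_export(has_native_pdf=True, always_build=True))   # built
```

In the tutorial's workflow the parent folder does hold the imported PDF, which is why Always Build needs to be True there.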

The remaining PDF Format property configurations apply more generally to PDF file creation. While they may be important to your end goals, they are independent from PDF Data Mapping concerns.