2021:Data Export (Export Definition)

Data Export is one of the Export Types available when configuring an Export Behavior. It exports extracted document data over a Data Connection, allowing users to export data to a SQL or ODBC compliant database.

About

You may download and import the file below into your own Grooper environment (version 2021). It contains Batches with the example documents and the Content Model discussed in this article.

  • [[]]

Grooper's most important goal is to deliver accurate data to line-of-business systems, where it can drive impactful business decisions. To this day, database tables remain one of the main vessels for storing this information, and Data Export is one of the main ways to deliver data collected in Grooper.


There are three important things to understand when using and configuring Data Export to export data to a database:

  • The Export activity
  • Data Elements
  • Data Connections

The Export Activity

Grooper's Export activity is the mechanism by which Grooper-processed document content is delivered to an external storage platform. Export configurations are defined by adding Export Type definitions to Export Behaviors. Data Export is the Export Type designed to export Batch Folder document data collected by the Extract activity to a Microsoft SQL Server or ODBC-compliant database server.

For more information on configuring Export Behaviors, please visit the full Export activity article.

Data Elements

Data Export is the chief delivery mechanism for collected data. Data is collected in Grooper by executing the Extract activity, which extracts values from a Batch Folder according to its classified Document Type's Data Model.

A Data Model in Grooper is a digital representation of document data targeted for extraction, defining the data structure for a Content Type in a Content Model. Data Models are composed of Data Element objects, including:

  • Data Fields used to target single field values on a document.
  • Data Tables and their child Data Columns used to target tabular data on a document.
  • Data Sections used to divide a document into sections to simplify extraction logic and/or target repeating sections of extractable Data Elements on a single document.

With Data Models and their child Data Elements configured, Grooper collects values using the Extract activity.

Depending on the Content Type hierarchy in a Content Model and/or the Data Element hierarchy in a Data Model, a fully extracted Data Model yields a collection, or "set", of values at varying levels of data scope. That may be the full scope of the Data Model, including any Data Elements inherited from parent Data Models. It may be a narrower scope, such as a child Data Section composed of its own child Data Fields.

Understanding this will be important as Data Export has the ability to take full advantage of Grooper's hierarchical data modeling to flatten complex and inherited data structures. Understanding Data Element hierarchy and scope will also be critical when exporting data from a single document to multiple different database tables to ensure the right data exports to the right places.
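As a quick illustration of this hierarchy (the inherited "Document Date" field here is hypothetical; the other elements mirror the example model used later in this article):

  Content Model
  ├── Data Model
  │   └── Data Field: "Document Date"    (hypothetical inherited element)
  └── Document Type: "Employee Report"
      └── Data Model
          ├── Data Field: "Employee ID"
          └── Data Table: "Earnings"
              ├── Data Column: "Code Desc"
              └── Data Column: "YTD"

Selecting the "Earnings" Data Table as the Data Element scope would expose its Data Columns plus every Data Field above it in the hierarchy ("Employee ID" and the inherited "Document Date").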

Data Connections

Data Export uses a configured Data Connection object to establish a link to tables in a SQL or ODBC-compliant database and intelligently populate them. Once this connection is established, collected Data Elements can be mapped to corresponding columns in one or more database tables. Much of Data Export's configuration is assigning these data mappings. The Data Connection presents these mappable data endpoints to Grooper and allows data to flow from Grooper to the database tables as the Export activity processes each Batch Folder in a Batch.

Furthermore, not only can Grooper connect to existing databases using a Data Connection, but it can create whole new databases as well as database tables once a connection to the database server is established.

We discuss how to create Data Connections, add a new database from a Data Connection, and add a new database table from a Data Connection in the #Configuring a Data Connection tutorial below.

How To

In the following tutorials, we discuss an example of how to set up an Export Behavior using Data Export.

This example uses a couple of different document formats whose data will be collected by a single Content Model. This will help illustrate two key distinctions mentioned above.

  • The first document (the "Employee Report") will demonstrate the flattening of inherited data.
  • The second (the "Personnel Information Report") will give us an avenue to export to multiple tables at once.


The example documents have already been classified and their data extracted. Before we get into Data Export configuration specifics, there are a few things we need to understand about "the story so far":

  • The forms used in this example and how they are formatted.
  • The Content Model used in this example, its Document Types and how their Data Models represent the data we want to collect.
  • The extracted index data, which will be exported to a database.

Understanding the Forms

Document 1: Employee Report

The thing to understand about this document is that some of its data shares a "one-to-many" relationship.

Some of the data is described as "single instance" data. These are individual fields like "Employee Last Name", "Employee First Name" and "Employee ID". For each document, there is only one value for each of these fields. These values are only listed once, and hence only collected once during extraction.

Some of the data, however, is described as "multi-instance" data. The "Earnings" table displays a dynamic number of rows, so there may be a varying number of values for its columns ("Code Desc", "MTD", "QTD", "YTD") depending on how many rows are in the table. There are multiple instances of the "YTD" value across the whole table (and therefore the whole document).

The single-instance data, as a result of being listed only once on the document, will only be collected once, but it needs to be married to each row of information from the table in one way or another. The "one" "Employee ID" value, for example, pertains to the "many" different table rows.

This document is meant to show how to flatten data structures. While the single-instance data is only collected once, it will be reported many times upon export to a database table.
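To make this concrete, here is a minimal sketch of how the flattened rows might land in a database table (the table name, column types, and values are purely illustrative, not taken from the example documents):

  -- A hypothetical flattened table (illustrative names and types).
  CREATE TABLE [Flattened_Example] (
      [Last Name] nvarchar(50),
      [First Name] nvarchar(50),
      [Employee ID] int,
      [Code Desc] nvarchar(50),
      [YTD] decimal(18, 2)
  );

  -- The "one" single-instance values repeat on every row, married to each
  -- of the "many" rows collected from the "Earnings" table.
  INSERT INTO [Flattened_Example] VALUES ('Smith', 'Jane', 1001, 'Regular',  48000.00);
  INSERT INTO [Flattened_Example] VALUES ('Smith', 'Jane', 1001, 'Overtime',  1200.00);
  INSERT INTO [Flattened_Example] VALUES ('Smith', 'Jane', 1001, 'Bonus',      500.00);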

Document 2: Personnel Information Report

The second document is essentially one big table of personnel information (name, address, email, phone number and the like).

While we ultimately want to collect data from all rows in this table, there are potentially two sets of information here. Some of it is generic personnel information, but some of it is "personally identifiable information" or PII. This information should be protected for legal reasons.

As a result, we will export collected data to two database tables (with the assumption that the second table is "protected").

This document is meant to demonstrate how to export to multiple tables via one Export Behavior.

DISCLAIMER: The author of this article is not a lawyer.

For educational purposes we divided "Personnel Information" into two sets: "non-PII" and "PII".

  • non-PII: Employee ID, Phone Number, E-mail address, IP Address, Gender, ZIP
  • PII: First Name, Last Name, SSN, Street Number, Street Name, City, State

DO NOT TAKE THIS TO MEAN THIS INFORMATION IS OR IS NOT PERSONALLY IDENTIFIABLE INFORMATION.

This division was done purely for educational purposes to demonstrate a concept.

PII gets tricky in the real world. For example, an IP address would not normally qualify as PII by itself, but it could when combined with other personal information. Please consult your own legal department to determine which of the data you're collecting is PII and should be protected more securely.


Understanding the Content Model

The Content Model provided for this tutorial is named "Example Model - Data Export". This Content Model is designed to extract the data for these two different kinds of documents, each represented by its own Document Type.

The Employee Report Document Type

  1. The "Employee Report" Document Type' represents our "Employee Report" document.
    • This is the one we will use to demonstrate flattening a data structure that shares a one-to-many relationship on the document.
  2. It has its own child Data Model with Data Elements already configured for extraction.
  3. For the individual fields, represented once in the document, there are three corresponding Data Field elements:
    • "Last Name"
    • "First Name"
    • "Employee ID"
  4. For the tabular data, a Data Table named "Earnings" is established, with four Data Column elements, corresponding to the columns on the document:
    • "Code Desc"
    • "MTD"
    • "QTD"
    • "YTD"

The Personnel Info Report Document Type

  1. The "Personnel Info Report" Document Type represents our "personnel information" documents.
    • This is the one we will use to demonstrate exporting data from a single document to multiple database tables.
  2. It too has its own child Data Model with Data Elements already configured for extraction.
  3. The "non-PII" Data Table will extract data from each table row on the document for non-protected personnel information, as described by its child Data Column elements:
    • "Employee ID"
    • "Phone Number"
    • "EMail"
    • "Gender"
    • "Zip"
  4. The "PII" Data Table will extract data from each table row on the document for protected personnel information, as described by its child Data Column elements:
    • "Employee ID"
    • "First Name"
    • "Last Name"
    • "SSN"
    • "Street Number"
    • "Street Name"
    • "City"
    • "State"
  5. This setup uses a single table extractor to collect each row on the document, but reports the results to two different Data Table objects in Grooper (the "non-PII" Data Table and the "PII" Data Table). This will ultimately allow us to parse data from these columns and place the PII-related information into a separate, protected database table, as sketched below.
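Here is a minimal sketch of what those two target tables could look like (the table and column names follow the Data Elements above; the table names and column types are assumptions):

  -- Hypothetical DDL for the two target tables. "Employee ID" is carried in
  -- both so the protected rows can later be joined back to the general data.
  CREATE TABLE [Personnel_non-PII] (
      [Employee ID] int,
      [Phone Number] nvarchar(20),
      [EMail] nvarchar(100),
      [IP Address] nvarchar(45),
      [Gender] nvarchar(10),
      [Zip] nvarchar(10)
  );

  CREATE TABLE [Personnel_PII] (
      [Employee ID] int,
      [First Name] nvarchar(50),
      [Last Name] nvarchar(50),
      [SSN] nvarchar(11),
      [Street Number] nvarchar(10),
      [Street Name] nvarchar(100),
      [City] nvarchar(50),
      [State] nvarchar(2)
  );

Because both tables carry the "Employee ID" column, the "protected" PII table can live behind stricter permissions while still being joinable to the general personnel data.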


Verifying Index Data

Before the Export activity can send data, it must have data!

It's easy to get in the habit of testing extraction on a Data Field or a Data Model and feeling good about the results, but understand that the information displayed there is in memory, and therefore temporary. When testing a Data Export configuration, it's a good idea to ensure extracted data is actually present for the document Batch Folders whose data you want to export.

When the Extract activity runs, it executes all extraction logic for the Data Model tied to a Batch Folder's classified Document Type. For each Batch Folder document, it creates "Index Data" and marries it to the Batch Folder via a JSON file called Grooper.DocumentData.json.

A couple of ways to verify its existence are as follows:

Option 1

  1. First, navigate to a document Batch Folder object in the node tree.
    • Not the Batch, not the root Batch Folder, not a Page object, but specifically a document Batch Folder object.
    • Here, we have the first Batch Folder in the root folder of our Batch selected.
    • This is where the extracted index data information lives.
  2. From there, click on the "Index Data" tab.
  3. After doing so you can see the extracted data displayed.

FYI: If the document were classified (the Batch Folder assigned a Document Type) but not successfully extracted (the Extract activity not applied), the data structure would be present but the fields would be empty.

In that case the Batch Folder would have the Document Type's Data Model associated with it, but its extraction logic would not have been applied to collect values (that's what the Extract activity is for).

Option 2

Another means of verifying is to actually view the file created by the Extract activity and stored in the Grooper repository's file store location.

  1. Again, click on a document Batch Folder object in the node tree.
  2. Click the "Advanced" tab.
  3. Click the "Files" tab.
  4. In the List View panel you should see a file named Grooper.DocumentData.json.
    • This is the file the Extract activity generates when it processes a Batch Folder. It serializes all the extracted Data Elements' values for the document's Data Model.
  5. When you click on that file, you should see the stored JSON information of the indexed data displayed in the viewer below.


Configuring a Data Connection

In order for Data Export to run, it first needs an established connection to a database and its table(s).

Grooper can connect to an existing database and import references to its tables using a Data Connection object. Once connected to the database server, you can even have Grooper create a database AND create tables based on Data Model structures present in Grooper!

In the following tutorial we will cover how to:

  1. Create a new Data Connection object.
  2. Connect to a database server using the Data Connection.
  3. Create a new database from the Data Connection.
  4. Create a database table using the Data Models in our example Content Model.

It is worth noting that this article cannot tell you specifics about permissions in your own environment. The configuration for this article uses Microsoft SQL Server and gives the active user's Active Directory account full DB admin privileges in the SQL environment.

Create a Data Connection

New Data Connections are added to the Global Resources folder of the Node Tree.

  1. Right click the Global Resources folder.
  2. Select "Add" then "Data Connection..."
  3. The "Add New Data Connection" window will appear.
  4. Give the new Data Connection a name.
    • We've named ours "DB Export"
  5. Press "OK" when finished.

  1. This will add the new Data Connection to the Global Resources folder.
  2. Next, we will configure connection settings so that Grooper can interoperate with our SQL server.


Configure Connection Settings

Regardless of whether you want to connect to an existing database or create a new one from Grooper, your first step is always the same: you must first connect to a database server.

Grooper can connect to Microsoft SQL servers or any ODBC (Open Database Connectivity) compliant data source. For the purposes of this tutorial, we will connect to a SQL server.

  1. You can choose to connect to either a SQL server or an ODBC data source using the Connection Settings property.
  2. Using the dropdown menu, choose either SQL Server or ODBC.
    • Again, we will be connecting to a SQL server, so SQL Server is selected.

Next, we need to define settings to access the database server. All you really need is the server's name and access rights (a concrete example follows the steps below).

  1. Expand the Connection Settings property.
  2. Using the Server Name property, enter the SQL server name.
    • I was a little on the lazy side for this article and just connected to the SQL Express instance created for the Grooper repository database on my local machine. This is almost assuredly not what you want to do in a production environment, but it will work for testing purposes.
  3. You will also need access rights to the SQL server. Using this Data Connection object, Grooper is going to act as if it were a user, giving Grooper the capabilities to do things like add and drop tables, run queries, and more. Just like a user needs SQL access credentials, so does Grooper.
    • If the active Windows user has Active Directory rights to the SQL server, Windows will pass through your credentials. You can leave the User Name and Password blank in that case.
    • Otherwise, you'll need to enter a User Name and Password to access the database server here.
  4. Go ahead and hit "Save" at this point.
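As a concrete illustration, a connection to a local SQL Express instance using pass-through Windows credentials might look like this (the server name is hypothetical):

  Connection Settings : SQL Server
      Server Name : localhost\SQLEXPRESS    (hypothetical instance name)
      User Name   : (blank; Windows credentials pass through)
      Password    : (blank)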

That's it! You're officially connected to the database server. We can now connect to existing databases, import references to their tables (keep this in the back of your mind; it will be important later), create new databases, and create new database tables.

  1. If you want to verify your connection, press the "Test Connection" button.
  2. The following message will appear if you are successfully connected to the SQL server.


At this point, you have two options:

  1. Connect to an existing database
  2. Create a new database

If you want to connect to an existing database, it's very easy.

  1. Select the Database Name property.
  2. Either type in the database's name or select it from the dropdown menu.
    • FYI: Grooper has to ping the database in order to populate the dropdown list. Depending on the size of your SQL server, it's often quickest to just type in the database name.
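For what it's worth, the lookup behind that dropdown is presumably something along the lines of the standard SQL Server catalog query below (an assumption on our part, not documented Grooper behavior):

  -- sys.databases is SQL Server's standard catalog view listing every
  -- database on the server; presumably similar to what fills the dropdown.
  SELECT name FROM sys.databases ORDER BY name;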

However, now that we're connected to the database server, you can also create a brand new database!

Create a New Database from the Data Connection

  1. To create a new database, press the "Create Database..." button.
  2. This will bring up the "Create Database" window.
  3. Using the Database Name property, name the database whatever you like.
    • We named ours "Export_Example_DB"
  4. Press the "Execute" button to create the database.

  1. You will see the newly created database's name populate the Database Name property.
  2. However, as a newly formed, infant database, it has no database tables. You can see the "Database Tables" panel is empty. Next, we will add some database tables using the data structures of the Data Models in our example Content Model.

FYI: If you connect to the server in Microsoft SQL Server Management Studio, you can verify the database was created.

See here, "Export_Example_DB" is added to the list of databases. Through the Data Connection, Grooper has a direct line to this SQL environment to add and alter databases.
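Under the hood, creating the database amounts to a one-line SQL statement, presumably equivalent to:

  -- Presumed equivalent of the "Create Database" command above.
  CREATE DATABASE [Export_Example_DB];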


Create Database Tables from the Data Connection

We will end up creating three database tables by the end of this section:

  1. A table for the "Employee Report" Document Type's extracted Data Elements with its "one-to-many" related data elements flattened to a single table structure.
  2. A table for the "Personnel Info Report" Document Type's extracted "non-PII" Data Table.
  3. A table for the "Personnel Info Report" Document Type's extracted "PII" Data Table.

Table 1: Employee Report Data

  1. To create a new database table, press the "Create Table..." button.
  2. This will bring up the "Create Table" window.
  3. Database tables are created from Grooper using Data Elements from a Data Model. The Data Model's Data Fields and/or Data Columns will form the columns of the SQL table, housing extracted values from each document folder upon export. First, you must define the Content Type (e.g. a Content Model or one of its Document Types) whose Data Model you want to use to create the database table.
  4. The Content Type property's dropdown menu will present a mini Node Tree view so you can select a Content Type from your Content Models folder.
    • In our case, we will select the "Employee Report" Document Type from our "Example Model - Data Export" Content Model.
    • Why didn't we choose the parent Content Model? It's all about Data Model hierarchy and scope. We want access to all the Data Elements in the "Employee Report" Document Type's Data Model. If we chose the parent Content Model, we would only have access to its Data Model's Data Elements, which would not include any of the Data Elements we want to export.
      • Data Elements are passed from parent Data Model to child Data Model, not the other way around.

  1. Next, you will need to select the data scope in the Content Type's Data Model, using the Data Element Scope property.
    • This property will present a dropdown to select either the parent Data Model or sub-levels of its Data Element scope, branched by Data Section or Data Table elements.
    • Choosing a data scope is all about defining which Data Elements are accessible for database table creation (and ultimately data export to the created table).
  2. In our case, we want to choose the "Earnings" Data Table for our scope.
    • Think back to the notion of the one-to-many relationship. This Data Table can have a dynamic number of rows, and its Data Columns are capable of capturing and reporting back unlimited instances of data, hence multiple rows. The Data Fields within this scope, however, are only capable of capturing and returning a single piece of data.
    • But, given the nature of hierarchy inheritance, at the "Earnings" Data Table scope, the database table that will be created will make columns not just for the Data Columns of the Data Table, but for each of the Data Fields contained within the Data Model's scope as well (i.e. The "Last Name", "First Name" and "Employee ID" Data Fields).
    • FYI: Were there Data Fields further up the inheritance tree, say at the base Data Model of the parent Content Model, the database table would also inherit those as well.
    • In other words, think about what Data Elements you want access to and go down the Data Model's hierarchy to widen the Data Element scope.

  1. The Table Name property will auto-populate with a concatenation of the Content Type and Data Element Scope properties' string values.
    • However, you can change this to whatever you want.
  2. Now, you're ready to create the database table. Press the "Execute" button.

  1. Here's where Grooper is doing the hard work for us. Notice in the "Review/Edit SQL Statement" window the SQL Statement required to create our table is already written for us.
  2. However, you can edit this statement, if need be.
    • For example, at the time of writing this article there was a bug involving the "Employee ID" Data Field. Its Value Type in Grooper is set to Int16, which was not properly converting to a "smallint" SQL data type for the corresponding SQL column. So we added smallint to the SQL statement after [Employee ID] (see the sketch below).
  3. Click the "Run SQL Statement" button to create the table.
