Extract (Activity)

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.

About

Data extraction is configured using Data Model objects in a Content Model. This is where you define the data elements you wish to extract from your documents. Appropriately, you define the data to be extracted by adding Data Element objects to the Data Model. There are three main Data Elements:

Data Field
Data Section
Data Table
- Data Tables are also configured with their own special child Data Element: the Data Column object.

The Data Field object is the simplest Data Element. It allows you to extract a simple list of fields (such as "Invoice Date", "Invoice Number", "Invoice Amount", etc.).

The Data Table object allows you to extract tabular data. Tables are more complex than simple fields, in that they are a repeating series of fields organized into rows and columns. This requires a more robust Data Element to describe this data structure; hence, the addition of the Data Table object along with its child Data Column objects.

The Data Section object allows you to extract Data Fields and/or Data Tables in repeating sections of a document. Data Sections may even have their own child Data Sections. This allows you to divide your document into sections and sub-sections, giving your Data Model its own levels of data hierarchy.

When the Extract activity runs, it will populate the Data Model with values extracted from the document's text data (obtained from the Recognize activity). How this text is located and returned is determined by the extraction configurations set on each Data Element.
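To make the parent/child relationships between Data Elements concrete, here is a minimal Python sketch of the structure described above. It is an illustration only, not Grooper's object model or API: the class names mirror the Grooper terms, but the `children` and `columns` attributes and the `element_paths` helper are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DataField:
    name: str


@dataclass
class DataColumn:
    name: str


@dataclass
class DataTable:
    name: str
    # A table's only children are its columns.
    columns: List[DataColumn] = field(default_factory=list)


@dataclass
class DataSection:
    name: str
    # A section may hold fields, tables, and nested sections.
    children: List[Union[DataField, DataTable, "DataSection"]] = field(default_factory=list)


@dataclass
class DataModel:
    name: str
    children: List[Union[DataField, DataTable, DataSection]] = field(default_factory=list)


def element_paths(node, prefix=""):
    """Return slash-separated paths for the node and everything beneath it."""
    path = f"{prefix}/{node.name}" if prefix else node.name
    paths = [path]
    kids = getattr(node, "children", None) or getattr(node, "columns", [])
    for child in kids:
        paths.extend(element_paths(child, path))
    return paths


# Hypothetical model, for illustration only.
model = DataModel("Invoice", [
    DataField("Invoice Number"),
    DataField("Invoice Date"),
    DataTable("Line Items", [DataColumn("Description"), DataColumn("Quantity")]),
    DataSection("Vendor Information", [DataField("Vendor Name")]),
])
paths = element_paths(model)
```

Walking the tree with `element_paths(model)` lists every element with its position in the hierarchy, for example `Invoice/Line Items/Quantity`.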

What is the Extract Activity?

The Extract Activity in Grooper is a core step in document processing that performs data extraction from documents in a Batch. Its main purpose is to populate the Data Model with extracted information, making it available for review, validation, and export.

Extraction is defined by Value Extractors, which are configured on different elements within the Data Model. These include:

Data Field: Captures a single, specific value from a document, such as an invoice number or date.
Data Table: Captures tabular, repeating data, such as line items on an invoice. Each table consists of columns (fields) and rows.
Data Section: Groups related fields and tables hierarchically, allowing for logical organization and extraction of complex document structures.

When the Extract Activity runs, it uses these Value Extractors to read and extract data from each document in the Batch, populating the corresponding Data Model for each document.

Data Extractors

After defining what Data Elements you want to extract, you need to define how to populate those fields, tables, and sections with data. This is done with Data Extractors, often shortened to just "extractors".

Data Hierarchy

As discussed earlier, you can create hierarchical relationships within a single Data Model using Data Sections and Data Tables. As a direct child of a Data Model, a Data Field will execute against the entire document. However, as a child of a Data Section, a Data Field will only execute against the portion of the document described by that Data Section.
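The scoping behavior can be illustrated with a small Python sketch. This is not how Grooper implements extraction; it is a simplified stand-in that uses regular expressions, a made-up document, and hypothetical section markers to show the same date pattern returning different results depending on whether it runs against the whole document or only a section's text.

```python
import re

# Made-up document text, for illustration only.
document_text = (
    "Claim Summary\n"
    "Date: 2024-01-05\n"
    "--- Patient Section ---\n"
    "Name: Jane Doe\n"
    "Date: 1980-03-14\n"
    "--- End Section ---\n"
)

DATE_PATTERN = re.compile(r"Date: (\d{4}-\d{2}-\d{2})")
# Hypothetical markers standing in for whatever defines the section's extent.
SECTION_PATTERN = re.compile(
    r"--- Patient Section ---\n(.*?)--- End Section ---", re.DOTALL
)


def extract_dates(text):
    """Return every date the pattern finds in the given text."""
    return DATE_PATTERN.findall(text)


# Field as a direct child of the Data Model: runs against the whole document.
model_level_dates = extract_dates(document_text)

# Field as a child of a Data Section: runs only against the section's text.
section_match = SECTION_PATTERN.search(document_text)
section_text = section_match.group(1) if section_match else ""
section_level_dates = extract_dates(section_text)
```

Here `model_level_dates` contains both dates, while `section_level_dates` contains only `1980-03-14`, because the field's extractor never sees text outside the section.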

Data Models also benefit from a Content Model's inheritance structure. For example, the Content Model itself may have a Data Model, but a Document Type may also have its own Data Model. The Document Type, as a child of the Content Model, will inherit all Data Elements from the parent Content Model's Data Model.
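As a rough sketch of this inheritance, the following Python models a Content Model and a Document Type as nodes that each carry their own list of Data Element names. The `ContentType` class and its `effective_data_elements` method are hypothetical names chosen for this example, and the rule for resolving duplicate names is an assumption, not documented Grooper behavior.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentType:
    name: str
    data_elements: List[str] = field(default_factory=list)
    parent: Optional["ContentType"] = None

    def effective_data_elements(self) -> List[str]:
        """Inherited elements first, then this node's own elements."""
        inherited = self.parent.effective_data_elements() if self.parent else []
        # Assumption: an element already inherited is not listed twice.
        own = [e for e in self.data_elements if e not in inherited]
        return inherited + own


# Hypothetical Content Model and child Document Type.
content_model = ContentType("Invoices", ["Invoice Number", "Invoice Date"])
document_type = ContentType("Vendor A Invoice", ["PO Number"], parent=content_model)

effective = document_type.effective_data_elements()
```

The Document Type ends up with the parent's "Invoice Number" and "Invoice Date" plus its own "PO Number", while the Content Model's own list is unchanged.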

Why use the Extract Activity?

The Extract Activity is essential because it is part of the Collect Phase in Grooper’s five-phase processing model. This phase is where data is gathered from documents in a Batch. Without extraction, there would be no data to review, validate, or export in later phases.

Key reasons to use the Extract Activity:

It collects structured data from documents, enabling downstream review and export.
It ensures that the Data Model is populated, making extracted data available for business processes.
It is a required step for any solution that needs to transform unstructured documents into usable data.

How to configure the Extract Activity

Follow these steps to add and configure the Extract Activity in a Batch Process:

Open the desired Batch Process in Grooper Design Studio.
Right-click on the process tree and select Add Activity.
In the activity type list, choose Extract and click OK.
Select the new Extract Activity node.
Configure the following key properties:
1. Mode: Choose how extraction handles existing data (Normal, Additive, or Recalculate).
2. Default Content Type: Set a fallback Content Type for unclassified folders.
3. Content Type Filter: Optionally restrict extraction to specific Content Types.
4. Data Element Filter: Optionally restrict extraction to specific Data Elements.
5. Rules: Add any Data Rules for post-processing or validation.
6. Flag Invalid Items: Enable to flag folders with validation errors.
7. Purge Alternate Candidates: Enable to remove alternate field values before saving.
8. Purge Empty Fields: Enable to remove empty fields before saving.
9. Stats Logging: Set the level of extraction statistics to record.
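The effect of a few of these properties can be sketched in Python. This is a simplified model built only from the one-line descriptions above; it is not Grooper's implementation. The `ExtractSettings` class and the helper functions are hypothetical, and two behaviors are assumptions: an unclassified folder with no Default Content Type is skipped, and the Data Element Filter is modeled as a filter on the results.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ExtractSettings:
    default_content_type: Optional[str] = None
    content_type_filter: List[str] = field(default_factory=list)  # empty = no restriction
    data_element_filter: List[str] = field(default_factory=list)  # empty = no restriction
    purge_empty_fields: bool = False


def resolve_content_type(folder_type: Optional[str], settings: ExtractSettings) -> Optional[str]:
    """Use the folder's own type, falling back to the default for unclassified folders."""
    return folder_type if folder_type is not None else settings.default_content_type


def should_extract(folder_type: Optional[str], settings: ExtractSettings) -> bool:
    resolved = resolve_content_type(folder_type, settings)
    if resolved is None:
        return False  # assumption: nothing to extract without a Content Type
    return not settings.content_type_filter or resolved in settings.content_type_filter


def finalize(values: Dict[str, Optional[str]], settings: ExtractSettings) -> Dict[str, Optional[str]]:
    """Apply the element filter and the empty-field purge to extracted values."""
    if settings.data_element_filter:
        values = {k: v for k, v in values.items() if k in settings.data_element_filter}
    if settings.purge_empty_fields:
        values = {k: v for k, v in values.items() if v not in (None, "")}
    return values


settings = ExtractSettings(
    default_content_type="Invoice",
    content_type_filter=["Invoice"],
    purge_empty_fields=True,
)
result = finalize({"Invoice Number": "INV-1001", "PO Number": ""}, settings)
```

With these settings, an unclassified folder is treated as an "Invoice" and extracted, a folder classified as some other type is skipped, and the empty "PO Number" value is dropped from `result`.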

To test the Extract Activity:

Select the Extract Activity in the process tree.
Go to the Activity Tester tab.
Choose a Batch or Batch Folder to test.
Click Run to execute extraction and review the results.

Extraction example

Suppose you have a Batch of invoices and want to extract key data for review and export. Your Data Model might include:

Data Field: "Invoice Number"
Data Field: "Invoice Date"
Data Table: "Line Items" (with columns for Description, Quantity, Unit Price, and Line Total)
Data Section: "Vendor Information" (with fields for Vendor Name, Address, and Phone)

By configuring the Extract Activity in your Batch Process, Grooper will automatically extract these values from each invoice, populate the Data Model, and make the data available for validation and export.
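To picture the end result, the sketch below builds a populated structure for this example using plain Python regular expressions on a made-up invoice. It stands in for the idea of extraction only: Grooper's extractors are configured on Data Elements rather than written as code like this, and the sample text, patterns, and `extract_invoice` function are all assumptions made for illustration.

```python
import re

# Made-up invoice text, for illustration only.
invoice_text = """\
Vendor: Acme Supply
Address: 12 Main St
Phone: 555-0100
Invoice Number: INV-1001
Invoice Date: 2025-02-01
Widget 2 5.00 10.00
Gasket 4 1.25 5.00
"""

# One line item per line: description, quantity, unit price, line total.
LINE_ITEM = re.compile(r"^(\w+) (\d+) (\d+\.\d{2}) (\d+\.\d{2})$", re.MULTILINE)


def first_match(pattern, text):
    """Return the first captured group for the pattern, or None if absent."""
    match = re.search(pattern, text, re.MULTILINE)
    return match.group(1) if match else None


def extract_invoice(text):
    """Populate a dictionary shaped like the example Data Model."""
    return {
        "Invoice Number": first_match(r"^Invoice Number: (\S+)$", text),
        "Invoice Date": first_match(r"^Invoice Date: (\d{4}-\d{2}-\d{2})$", text),
        "Line Items": [
            {
                "Description": desc,
                "Quantity": int(qty),
                "Unit Price": float(unit),
                "Line Total": float(total),
            }
            for desc, qty, unit, total in LINE_ITEM.findall(text)
        ],
        "Vendor Information": {
            "Vendor Name": first_match(r"^Vendor: (.+)$", text),
            "Address": first_match(r"^Address: (.+)$", text),
            "Phone": first_match(r"^Phone: (.+)$", text),
        },
    }


invoice_data = extract_invoice(invoice_text)
```

The resulting `invoice_data` has two single-value fields, a two-row "Line Items" table, and a nested "Vendor Information" section, which mirrors the field, table, and section shapes of the Data Model.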