Extract (Activity)

From Grooper Wiki
{{AutoVersion}}
<blockquote>{{#lst:Glossary|Extract}}</blockquote>


== What is the Extract Activity? ==
The '''Extract Activity''' in Grooper is a core step in document processing that performs data extraction from documents in a [[Batch]]. Its main purpose is to populate the [[Data Model]] with extracted information, making it available for review, validation, and export.

Extract is part of the Collect phase of Grooper's five-phase process. The phases are as follows:


* Acquire
** This is where documents come into Grooper, either by [[Importing Documents in Grooper|importing]] digital documents or by scanning physical pages.
* Condition
** Where documents are made ready for processing. This includes Batch Process Steps such as Recognize, where Image Processing and OCR are run.
* Organize
** Where documents are separated and classified.
* Collect
** Where data is collected from documents. This is the phase in which the Extract Activity runs.
* Deliver
** Documents are exported from Grooper to wherever the user or organization deems fit.  


Extraction is defined by Data Elements in the Node Tree, which the Data Model organizes into a Data Hierarchy. These Data Elements (child objects of the Data Model) are:


* '''Data Field''': Captures a single value from a document, such as an invoice number or date.
* '''Data Table''': Captures tabular, repeating data, such as line items on an invoice. Each table consists of columns (fields) and rows.
* '''Data Section''': Groups related fields and tables hierarchically, allowing for logical organization and extraction of complex document structures.


When the Extract Activity runs, it uses these elements to extract data from each document in the [[Batch]], populating the corresponding [[Data Model]] for each document.


=== Data Hierarchy ===


As discussed earlier, you can create hierarchical relationships within a single Data Model using Data Sections and Data Tables. As a direct child of a Data Model, a Data Field will execute against the entire document. However, as a child of a Data Section, a Data Field will only execute against the portion of the document described by that Data Section.


Data Models also benefit from a Content Model's inheritance structure. For example, the Content Model itself may have a Data Model, but a Document Type may also have its own Data Model. The Document Type, as a child of the Content Model, will inherit all Data Elements from the parent Content Model's Data Model.
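
The scoping and inheritance rules described here can be pictured with a short Python sketch. It is not Grooper's API (Grooper is configured through its own design interface); the <code>DataModel</code> class, the <code>effective_elements</code> method, the "PO Number" element, and the sample text are all hypothetical and exist only to illustrate the two ideas.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DataModel:
    """Hypothetical stand-in for a Data Model; not a Grooper class."""
    owner: str
    elements: List[str] = field(default_factory=list)
    parent: Optional["DataModel"] = None

    def effective_elements(self) -> List[str]:
        """Inherited Data Elements first, then those defined locally."""
        inherited = self.parent.effective_elements() if self.parent else []
        return inherited + self.elements


# Inheritance: the Document Type sees its parent's Data Elements plus its own.
content_model = DataModel("Content Model", ["Invoice Number", "Invoice Date"])
document_type = DataModel("Document Type", ["PO Number"], parent=content_model)

# Scoping: the same field logic sees different text depending on its parent.
PHONE = re.compile(r"Phone: (\S+)")
document_text = "Phone: 555-0100\nVENDOR\nPhone: 555-0199\nEND VENDOR\n"
section_text = document_text[
    document_text.index("VENDOR"):document_text.index("END VENDOR")
]

whole_document_hits = PHONE.findall(document_text)  # child of the Data Model
section_hits = PHONE.findall(section_text)          # child of a Data Section

print(document_type.effective_elements())
print(whole_document_hits, section_hits)
```

In this sketch the field placed directly under the Data Model finds both phone numbers, while the same field placed under the section finds only the one inside the section's text.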
 
=== Value Extractors ===
 
After defining what Data Elements you want to extract, you need to define how to populate those fields, tables, and sections with data.  This is done with [[Value Extractor]]s, often shortened to just "extractors".
 
These are properties that you can configure on a Data Field, Data Table, and/or Data Section.
 
=== Extractor Nodes ===
[[Data Extractor (Concept) #Extractor Nodes| Extractor Nodes]] are the tools the Extract activity uses to find information in documents. Grooper offers three main types, each suited to a different level of complexity. They are:
 
<big>Value Reader</big>
 
A Value Reader is the simplest option. You pick one extraction method (such as a pattern match, list, barcode read, zone, mark recognition, or a reference to another extractor) and it returns every match it finds.
 
Use it to:
* Prototype a new search.
* Reuse a common rule across multiple fields.
* Extract obvious, unambiguous values (e.g., a clearly formatted invoice number).
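
As a rough analogy for the "one method, every match" behavior, the following Python sketch runs a single regular expression and returns all of its hits. The invoice number format ("INV-" followed by six digits) and the function name <code>read_values</code> are assumptions made for illustration; they are not Grooper settings.

```python
import re
from typing import List

# Assumed format for illustration only: "INV-" followed by exactly six digits.
INVOICE_NUMBER = re.compile(r"\bINV-\d{6}\b")


def read_values(text: str) -> List[str]:
    """Return every match in document order, like a single-method extractor."""
    return INVOICE_NUMBER.findall(text)


sample = "Invoice INV-004217 supersedes INV-004102. Ref: INV-42."
print(read_values(sample))
```

Both well-formed numbers are returned; nothing in this sketch decides which one is "the" invoice number, which is the kind of ambiguity the more capable extractor types address.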
 
<big>Data Type</big>
 
A Data Type gathers results from:
* A single local method (optional).
* Any child extractors beneath it.
* Any referenced extractors you attach.
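
The way a Data Type pools results from several sources can be sketched in Python as a function that runs each source and combines the hits. This is only a conceptual model: the names <code>pattern_extractor</code> and <code>data_type</code> are hypothetical, and simple concatenation is an assumption, since how Grooper collates the combined results is not described here.

```python
import re
from typing import Callable, List, Optional, Sequence

Extractor = Callable[[str], List[str]]


def pattern_extractor(pattern: str) -> Extractor:
    """Build a single-method extractor from a regular expression."""
    compiled = re.compile(pattern)
    return lambda text: compiled.findall(text)


def data_type(local: Optional[Extractor],
              children: Sequence[Extractor],
              references: Sequence[Extractor]) -> Extractor:
    """Pool hits from the optional local method, children, and references."""
    def run(text: str) -> List[str]:
        sources = ([local] if local else []) + list(children) + list(references)
        results: List[str] = []
        for extractor in sources:
            results.extend(extractor(text))
        return results
    return run


iso_dates = pattern_extractor(r"\b\d{4}-\d{2}-\d{2}\b")
us_dates = pattern_extractor(r"\b\d{2}/\d{2}/\d{4}\b")
any_date = data_type(local=None, children=[iso_dates, us_dates], references=[])

print(any_date("Issued 2025-11-21, due 12/21/2025."))
```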
 
<big>Field Class</big>
 
A Field Class is for situations where several similar candidates appear (multiple dates, totals, names) and only one is correct. It uses training: you review examples, mark correct and incorrect instances, and it learns which nearby words or layout patterns signal the right one.
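
The idea of learning from marked examples can be illustrated with a deliberately simple Python sketch that counts which nearby words appear beside correct and incorrect candidates. Grooper's actual training method is not described here, so this word-count scoring, the function names, and the sample data are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def train(examples: List[Tuple[List[str], bool]]) -> Counter:
    """Add 1 for each context word beside a correct example, subtract 1 otherwise."""
    weights: Counter = Counter()
    for context_words, is_correct in examples:
        for word in context_words:
            weights[word.lower()] += 1 if is_correct else -1
    return weights


def score(weights: Counter, context_words: List[str]) -> int:
    """Sum the learned weights of the words surrounding a candidate."""
    return sum(weights[word.lower()] for word in context_words)


# Hypothetical training data: (words near the candidate, marked correct?)
examples = [
    (["Invoice", "Date"], True),
    (["Due", "Date"], False),
    (["Ship", "Date"], False),
    (["Invoice", "Date"], True),
]
weights = train(examples)

# Two date candidates on a new document, each with its nearby words.
candidates: Dict[str, List[str]] = {
    "2025-11-21": ["Invoice", "Date"],
    "2025-12-21": ["Due", "Date"],
}
best = max(candidates, key=lambda value: score(weights, candidates[value]))
print(best)
```

Here the word "Date" ends up with no net weight because it appears beside both correct and incorrect examples, so the decision rests on "Invoice" versus "Due".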
 
== Why use the Extract Activity? ==
Simply put, without extraction there is no data to review, validate, or export in later phases. If you want to gather ''any'' data from a document in Grooper, this Activity is essential.
 
Key reasons to use the Extract Activity:
* It collects structured data from documents, enabling review (if necessary) and export.
* It ensures that the [[Data Model]] is populated, making extracted data available for business processes.
* It is a required step for getting data from documents.
 
== How to add the Extract Activity ==
Follow these steps to add and configure the Extract Activity in a [[Batch Process]]:
 
# Open the desired [[Batch Process]] in Grooper Design Studio.
# Right-click on the process tree and select "Add Activity".
# In the activity type list, choose "Extract" and click "OK".
# Select the new Extract Activity node.
# Configure properties if need be.
 
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.7777777777777777; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmhyzi46i00012o0iz5fsz62j?embed_v=2&utm_source=embed" loading="lazy" title="2025 - How to add the Extract Activity" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
 
=== Configuring the Extract Activity ===
Should you need to configure the Extract Activity beyond the default properties, you will need to be aware of each property and how it affects extraction. These properties are:
 
* '''Mode''': Choose how extraction handles existing data (Normal, Additive, or Recalculate).
** '''Normal''': Existing extraction results will be overwritten.
** '''Additive''': Data Elements that already have a value from a previous extraction will not have those values overwritten. Grooper will only perform extraction for Data Elements that do not already have extracted data.
** '''Recalculate''': All calculated fields will be recalculated, and full validation performed.
* '''Default Content Type''': Set a fallback [[Content Type]] for unclassified folders.
* '''Content Type Filter''': Optionally restrict extraction to specific [[Content Type]]s.
* '''Data Element Filter''': Optionally restrict extraction to specific [[Data Element]]s.
* '''Rules''': Add any [[Data Rule]]s for post-processing or validation.
* '''Flag Invalid Items''': Enable to flag folders with validation errors.
* '''Purge Alternate Candidates''': Enable to remove alternate field values before saving.
* '''Purge Empty Fields''': Enable to remove empty fields before saving.
* '''Stats Logging''': Set the level of extraction statistics to record.
 
These properties can be configured either in the Activity Properties panel or by expanding the Activity property within the Step Properties panel.
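
The difference between the '''Normal''' and '''Additive''' values of '''Mode''' can be sketched in Python as two ways of merging a fresh extraction run into existing values. The function <code>apply_mode</code> and the dictionaries are hypothetical, the sketch assumes both runs cover the same Data Elements, and '''Recalculate''' is left out because it concerns calculated fields and validation.

```python
from typing import Dict, Optional

Values = Dict[str, Optional[str]]


def apply_mode(existing: Values, extracted: Values, mode: str) -> Values:
    """Merge a new extraction run into existing values (conceptual model only)."""
    if mode == "Normal":
        # Existing results are overwritten by the new run.
        return dict(extracted)
    if mode == "Additive":
        # Only Data Elements that have no value yet receive extracted data.
        merged = dict(existing)
        for name, value in extracted.items():
            if merged.get(name) in (None, ""):
                merged[name] = value
        return merged
    raise ValueError(f"Mode not covered by this sketch: {mode}")


existing = {"Invoice Number": "INV-000001", "Invoice Date": None}
extracted = {"Invoice Number": "INV-000002", "Invoice Date": "2025-11-21"}

print(apply_mode(existing, extracted, "Normal"))
print(apply_mode(existing, extracted, "Additive"))
```

In the Additive case the existing invoice number is preserved and only the empty date is filled in.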
 
 
[[file:2025_Extract_(Activity)_How_to_add_the_Extract_Activity_Configuring_the_Extract_Activty_01(1).png]]
 
{|class="fyi-box"
|-
|
'''FYI'''
|
When adding the Extract Activity to your Batch Process, it is ''crucial'' that the step be added AFTER Recognize. This will ensure that there is text and layout data to extract.
|}
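
The ordering requirement in the note above amounts to a simple check on the list of steps. The Python sketch below is purely illustrative: Grooper does not expose a Batch Process as a Python list, and both the function name and the "Classify" step in the sample data are assumptions.

```python
from typing import List


def extract_follows_recognize(steps: List[str]) -> bool:
    """True when every use of Extract has a Recognize step somewhere before it."""
    if "Extract" not in steps:
        return True  # nothing to check
    if "Recognize" not in steps:
        return False  # Extract would have no text or layout data to work with
    return steps.index("Recognize") < steps.index("Extract")


print(extract_follows_recognize(["Recognize", "Classify", "Extract"]))
print(extract_follows_recognize(["Extract", "Recognize"]))
```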
 
===To test the Extract Activity on the Batch Process Step:===
# Open your Batch Process in the Node Tree.
# Select the Extract Step.
# Go to the Activity Tester tab.
# Choose a Batch whose documents you wish to test extraction on.
# Click the play button.
#*<li class="fyi-bullet"> Make sure that the document has been OCR'd and has a Content Type assigned to it before testing extraction.
# To see the results, select the "View Diagnostics" button.
# The diagnostics will open in a new tab.
 
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.7777777777777777; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmhzc2o3a01e00i0i03t2z2vg?embed_v=2&utm_source=embed" loading="lazy" title="2025 - To test the Extract Activity on the Batch Process Step." allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
 
== Extraction example ==
Suppose you have a [[Batch]] of invoices and want to extract key data for review and export. Your [[Data Model]] might include:
 
* '''Data Field''': "Invoice Number"
* '''Data Field''': "Invoice Date"
* '''Data Table''': "Line Items" (with columns for Description, Quantity, Unit Price, and Line Total)
* '''Data Section''': "Vendor Information" (with fields for Vendor Name, Address, and Phone)
 
By configuring the Extract Activity in your [[Batch Process]], Grooper will automatically extract these values from each invoice, populate the [[Data Model]], and make the data available for validation and export.
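
To make the shape of the populated Data Model concrete, here is one way the extracted invoice data could be pictured as a nested Python structure. All values are invented for illustration, the layout is not a Grooper export format, and <code>line_totals_consistent</code> is a hypothetical helper showing the kind of check that later validation might perform.

```python
from typing import Any, Dict

# Invented sample values; the keys mirror the Data Elements listed above.
invoice: Dict[str, Any] = {
    "Invoice Number": "INV-004217",            # Data Field
    "Invoice Date": "2025-11-21",              # Data Field
    "Line Items": [                            # Data Table: one dict per row
        {"Description": "Widget", "Quantity": 2, "Unit Price": 4.50, "Line Total": 9.00},
        {"Description": "Gasket", "Quantity": 10, "Unit Price": 0.25, "Line Total": 2.50},
    ],
    "Vendor Information": {                    # Data Section with child fields
        "Vendor Name": "Example Supply Co.",
        "Address": "100 Example Street",
        "Phone": "555-0199",
    },
}


def line_totals_consistent(record: Dict[str, Any]) -> bool:
    """Check that Quantity times Unit Price matches Line Total on every row."""
    return all(
        abs(row["Quantity"] * row["Unit Price"] - row["Line Total"]) < 0.005
        for row in record["Line Items"]
    )


print(line_totals_consistent(invoice))
```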
 
In this example, we will look at running the Extract Activity on a Batch of invoices. We'll start by testing the activity, and then see it in action during a Batch Process.
 
 
<div style="position: relative; box-sizing: content-box; max-height: 80vh; max-height: 80svh; width: 100%; aspect-ratio: 1.7777777777777777; padding: 40px 0 40px 0;"><iframe src="https://app.supademo.com/embed/cmi3kjrfw04ynwj0iagulffh8?embed_v=2&utm_source=embed" loading="lazy" title="2025 - Extract (Activity) Example: Invoices" allow="clipboard-write" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>
 
== See also ==
* [[Batch Folder]]
* [[Batch Process]]
* [[Content Type]]
* [[Data Model]]
* [[Data Field]]
* [[Data Table]]
* [[Data Section]]
* [[Data Rule]]
* [[Activity Tester]]

Latest revision as of 15:21, 21 November 2025

This article is about the current version of Grooper.

Note that some content may still need to be updated.

2025

Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.
