Extract (Activity)
Revision as of 11:09, 26 August 2024
STUB
This article is a stub. It contains minimal information on the topic and should be expanded. |
Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.
About
Data extraction is configured using Data Model objects in a Content Model. A Data Model defines the data you wish to extract from your documents, and you specify that data by adding Data Element objects to the Data Model. There are three main Data Elements:
- Data Field
- Data Section
- Data Table
  - Data Tables are also configured with their own special child Data Element: the Data Column object.
The Data Field object is the simplest Data Element. It allows you to extract a simple list of fields (such as "Invoice Date", "Invoice Number", "Invoice Amount", etc.).
The Data Table object allows you to extract tabular data. Tables are more complex than simple fields, in that they are a repeating series of fields organized into rows and columns. This requires a more robust Data Element to describe the data structure; hence the Data Table object, along with its child Data Column objects.
The Data Section object allows you to extract Data Fields and/or Data Tables in repeating sections of a document. Data Sections may even have their own child Data Sections. This allows you to divide your document into sections and sub-sections, giving your Data Model its own levels of data hierarchy.
When the Extract activity runs, it will populate the Data Model with values extracted from the document's text data (obtained from the Recognize activity). How this text is located and returned is determined by the extraction configurations set on each Data Element.
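The structure described above can be pictured as a small tree. The following is only an illustrative sketch in Python, not Grooper's API: Grooper objects are configured in its own interface, and every class and function name here (`DataModel`, `DataSection`, `element_paths`, and so on) is a hypothetical stand-in chosen to mirror the terms used in this article.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DataField:
    name: str


@dataclass
class DataColumn:
    name: str


@dataclass
class DataTable:
    name: str
    # A table's only children are its columns.
    columns: List[DataColumn] = field(default_factory=list)


@dataclass
class DataSection:
    name: str
    # Sections may hold fields, tables, and nested sections.
    children: List[Union[DataField, DataTable, "DataSection"]] = field(default_factory=list)


@dataclass
class DataModel:
    name: str
    children: List[Union[DataField, DataTable, DataSection]] = field(default_factory=list)


def element_paths(node, prefix=""):
    """Return a slash-separated path for the node and every element beneath it."""
    path = f"{prefix}/{node.name}" if prefix else node.name
    paths = [path]
    for child in getattr(node, "children", []) + getattr(node, "columns", []):
        paths.extend(element_paths(child, path))
    return paths


# Hypothetical invoice model using the field names mentioned in this article.
invoice_model = DataModel("Invoice", [
    DataField("Invoice Number"),
    DataField("Invoice Date"),
    DataTable("Line Items", [DataColumn("Description"), DataColumn("Amount")]),
    DataSection("Remittance", [DataField("Payee")]),
])

for p in element_paths(invoice_model):
    print(p)
```

The "Line Items" table, the "Remittance" section, and the "Payee" field are made-up examples; only the element kinds themselves come from the article.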
Data Extractors
After defining which Data Elements you want to extract, you need to define how to populate those fields, tables, and sections with data. This is done with Data Extractors, often shortened to just "extractors".
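To make the idea of "what to extract" versus "how to extract it" concrete, here is a minimal sketch that pairs field names with regular expressions. This is an assumption-laden analogy written in plain Python, not how Grooper extractors are actually configured; the `FIELD_PATTERNS` mapping, the `populate_fields` function, and the sample text are all hypothetical.

```python
import re
from typing import Dict, Optional

# Hypothetical "extractors": one regular expression per field name.
FIELD_PATTERNS: Dict[str, str] = {
    "Invoice Number": r"Invoice\s+Number:\s*(\S+)",
    "Invoice Date": r"Invoice\s+Date:\s*(\d{2}/\d{2}/\d{4})",
}


def populate_fields(text: str, patterns: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Run each pattern against the text and keep the first captured value, if any."""
    results: Dict[str, Optional[str]] = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text)
        results[name] = match.group(1) if match else None
    return results


# Stand-in for the text a document would carry after the Recognize activity.
sample_text = "Invoice Number: INV-1001\nInvoice Date: 08/26/2024\n"
values = populate_fields(sample_text, FIELD_PATTERNS)
print(values)
```

A field whose pattern finds nothing is left as `None`, which keeps "no value found" distinct from an empty string.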
Data Hierarchy
As discussed earlier, you can create hierarchical relationships within a single Data Model using Data Sections and Data Tables. As a direct child of a Data Model, a Data Field will execute against the entire document. However, as a child of a Data Section, a Data Field will only execute against the portion of the document described by that Data Section.
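The difference in scope can be sketched as follows. This is a simplified illustration under stated assumptions, not Grooper's implementation: sections are assumed to start at a header line matched by a regular expression, and `find_sections`, `run_field`, and the sample document are hypothetical.

```python
import re
from typing import List


def find_sections(text: str, header_pattern: str) -> List[str]:
    """Split text into spans, each starting at a match of header_pattern."""
    starts = [m.start() for m in re.finditer(header_pattern, text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


def run_field(text: str, pattern: str) -> List[str]:
    """Return every captured value the field's pattern finds in the given text."""
    return re.findall(pattern, text)


document = (
    "Summary Total: 300\n"
    "Account A\nTotal: 100\n"
    "Account B\nTotal: 200\n"
)
total_pattern = r"Total: (\d+)"

# Field as a direct child of the model: it sees the whole document.
document_level = run_field(document, total_pattern)

# Field as a child of a section: it sees one section instance at a time.
section_level = [run_field(s, total_pattern) for s in find_sections(document, r"Account \w")]

print(document_level)
print(section_level)
```

The same pattern returns three candidate values at the document level but exactly one per section instance, which is the practical benefit of scoping a field to a section.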
Data Models also benefit from a Content Model's inheritance structure. For example, the Content Model itself may have a Data Model, but a Document Type may also have its own Data Model. The Document Type, as a child of the Content Model, will inherit all Data Elements from the parent Content Model's Data Model.
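Inheritance of this kind can be sketched as a walk up the parent chain. The `ContentType` class and `effective_elements` method below are hypothetical illustrations, not Grooper objects, and the sketch deliberately ignores questions the article does not address, such as what happens when a parent and child define an element with the same name.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ContentType:
    """Hypothetical stand-in for a Content Model or Document Type."""
    name: str
    data_elements: List[str] = field(default_factory=list)
    parent: Optional["ContentType"] = None

    def effective_elements(self) -> List[str]:
        """Inherited elements first, followed by this node's own elements."""
        inherited = self.parent.effective_elements() if self.parent else []
        return inherited + self.data_elements


# Made-up example: a Content Model with one child Document Type.
content_model = ContentType("Invoices", ["Invoice Number", "Invoice Date"])
vendor_invoice = ContentType("Vendor Invoice", ["PO Number"], parent=content_model)

print(vendor_invoice.effective_elements())
```

The parent keeps only its own elements, while the child sees both sets, matching the one-directional inheritance described above.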
Glossary
Activity: Activity is a property on Batch Process Steps. Activities define specific document processing operations done to a Batch, Batch Folder, or Batch Page. Batch Process Steps configured with specific Activities are frequently referred to by the name of the Activity followed by the word "step". For example: Classify step.
Batch Folder: Batch Folder objects are defined as container objects within a Batch that are used to represent and organize both folders and pages. They can hold other Batch Folders or Batch Page objects as children. The Batch Folder acts as an organizational unit within a Batch, allowing for a structured approach to managing and processing a collection of documents.
- Batch Folders are frequently referred to simply as "documents".
Content Model: Content Model node objects define the taxonomy of document sets in terms of the Document Types they contain. They also house the Data Elements that appear on each Content Category and Document Type within them. Content Models serve as the root of a Content Type hierarchy and are crucial for organizing the different types of documents that Grooper can recognize and process.
Data Column: Data Column node objects are child objects of a Data Table, representing individual columns and defining the type of data each column holds along with its data extraction properties.
Data Element: Data Element refers to the objects in Grooper used to collect data from a document. These include: Data Models, Data Sections, Data Fields, Data Tables, and Data Columns.
Data Extractor: Data Extractor (or just "extractor") refers to all Extractor Types and extractor node objects. Extractors define the logic used to return data from a document's text content, including general data (such as a date) and specific data (such as an agreement date on a contract).
Data Field: Data Field node objects are created as child objects of a Data Model. A Data Field is a representation of a single piece of data targeted for extraction on a document.
- Data Fields are frequently referred to simply as "fields".
Data Model: Data Model node objects serve as the top-tier structure defining the taxonomy for Data Elements and are leveraged during the Extract Activity to extract data from Batch Folders. They are a hierarchy of Data Elements that sets the stage for the extraction logic and review of data collected from documents.
Data Section: Data Section objects are grouping mechanisms for related Data Fields. Data Sections organize and segment child Data Elements into logical divisions of a document based on the structure and semantics of the information the documents contain.
Data Table: Data Table objects are used to extract repeating data that is formatted in rows and columns, allowing for the complex multi-instance data organization found in table-formatted content.
Document Type: Document Type objects represent a distinct type of document, like an invoice or contract. Document Types are created as children of a Content Model or a Content Category and are used to classify individual Batch Folders. Each Document Type in the hierarchy defines the Data Elements and Behaviors that apply to Batch Folders of that specific classification.
Extract: Extract is an Activity that retrieves information from Batch Folder documents, as defined by Data Elements in a Data Model. This is how Grooper locates unstructured data on your documents and collects it in a structured, usable format.
Recognize: Recognize is an Activity that obtains machine-readable text from Batch Pages and Batch Folders. Recognize will selectively perform OCR for images and native-text extraction for digital text in PDFs. Recognize can also be configured to collect "layout data" like lines, checkboxes, and barcodes. Various other Activities then use this machine-readable text and layout data for document analysis and data extraction.