2.90:Data Section (Node Type)

Data Sections are Data Elements of a Data Model. They allow a document's content to be subdivided into smaller portions (or "sections") for further processing, which makes extraction more efficient and more accurate.
Often, they are used to extract repeating sections of a document. For example, if a document had several sections of data for different customers, a Data Section could be used to pull data for each customer. This is especially useful when the data within each section is predictable but the number of sections in the document is not (e.g. one document lists one customer's data, the next lists five, the next lists two, and so on).
Data Sections can also be used to:
- Organize data from complex documents
- Make a hierarchical representation of a document's structure, or
- Reorder content from multiple columns on a page.
Data Sections may have the following as their children:
- Data Fields
- Data Tables
- Their own Data Sections
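The parent/child relationships listed above can be pictured as a small tree. The following is only an illustrative sketch in Python, not Grooper's object model: the class names mirror the element names used in this article, while the example element names ("Customer", "Orders", and so on) and the `outline` helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DataField:
    name: str


@dataclass
class DataTable:
    name: str
    columns: List[str] = field(default_factory=list)


@dataclass
class DataSection:
    name: str
    # Children may be fields, tables, or nested sections.
    children: List[Union[DataField, DataTable, "DataSection"]] = field(default_factory=list)


def outline(node, depth=0):
    """Return one indented line per element, walking the tree depth-first."""
    lines = ["  " * depth + f"{type(node).__name__}: {node.name}"]
    for child in getattr(node, "children", []):
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical section with one child of each kind.
customer = DataSection("Customer", [
    DataField("Customer Name"),
    DataTable("Orders", ["Date", "Amount"]),
    DataSection("Billing Address", [DataField("Zip Code")]),
])

print("\n".join(outline(customer)))
```

Nesting a Data Section inside another is what produces the hierarchical representation of a document's structure mentioned earlier.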
About
As with other Data Elements, Data Sections are created by adding them to a Data Model in a Content Model.
Section Extract Methods
How Grooper subdivides the document into the smaller data instances (or "section instances") is controlled by the Data Section's Extract Method property. Each Extract Method works a little differently to section out the document for subsequent extraction. They are as follows:
- Full Page - This method subdivides the document into full pages. You can use a page filter to define which page or pages establish the section instances (e.g. the first page, or the second and fourth pages, or the fifth through the last pages). You can also use an extractor to select the page or pages on which it returns a result.
- Fixed - With this method, you establish the section instances by drawing a rectangular region on the document. Any text falling inside this rectangular zone forms the section instance. This method is useful for highly structured documents where you want to limit extraction to a specific area of a specific page of the document. This method will only ever return a single section instance.
- Divider - This method uses an extractor similar to the Split Collation Method to establish section instances. A Divider Extractor is used to anchor the sections to an extractible result. The results the extractor returns can be used as the beginning point of the section or ending point. For example, a section header line may be used to indicate where one section begins. If the next section also uses that same section header, another section would be established. Sections can also be established between the Divider Extractor's results or (less commonly) around the results.
- Geometric - This method uses a combination of extractors, positional adjustments, and line detection to establish rectangular regions for the section instances. Similar to the Fixed method, any text falling inside the rectangular zones forms the section instances. However, the Geometric method can produce multiple sections where the Fixed method only produces one. Furthermore, the Geometric method is always anchored to at least one extractor's result (the Main Extractor). The zone is expanded (or contracted) by adjusting the left, right, top and bottom edges of the zone using extractors or manually adjusting the length in inches or another unit. This method is useful for establishing sections from structured and semi-structured documents using columnar or atypical layouts.
- Simple - This method uses a single extractor to return the section instances. One section is created for each result the extractor returns. This method is only "simple" in that it uses a single extractor to return the section. The extractor used to populate the section instances can be as complex as you need, using any configuration of a Data Type extractor and any of the available Collation Providers to return instances. This method is also commonly used in unstructured document processing, using Field Class extractors to create sections out of targeted paragraphs in a document's text.
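The edge adjustments described for the Geometric method can be sketched as simple rectangle arithmetic. This is a conceptual illustration only, not Grooper's implementation. It assumes coordinates in inches with the origin at the top-left of the page, and the `Rect` type, `expand` function, and anchor positions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float


def expand(anchor: Rect, left=0.0, top=0.0, right=0.0, bottom=0.0) -> Rect:
    """Move each edge outward by a positive amount, or inward by a negative one."""
    return Rect(anchor.left - left, anchor.top - top,
                anchor.right + right, anchor.bottom + bottom)


# Two hypothetical results from a main extractor, e.g. two section labels.
anchors = [Rect(1.0, 2.0, 2.5, 2.25), Rect(1.0, 5.0, 2.5, 5.25)]

# One zone per anchor: grow 3 inches to the right and 1.5 inches downward.
zones = [expand(a, right=3.0, bottom=1.5) for a in anchors]
```

Because one zone is produced per anchor result, this approach yields multiple sections, whereas a single hand-drawn rectangle yields only one.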
To choose the Extract Method
Fixed
In this example, we will demonstrate how to make a Data Section that returns a section for the highlighted portion of this document. This limits the Data Section's Data Elements to returning only data that falls within this region. To accomplish this, we will use the Fixed method. In many ways, this sectioning method is the most basic: you simply draw a rectangular box around the portion of the document you want to form the section. All the text falling within this rectangular region forms the Data Section's section instance, and only one section is established per document.
Here, we have selected a Data Section with the Extract Method set to Fixed.
The Fixed extract method also requires you to indicate on which page the zone falls.
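Conceptually, a Fixed section keeps only the text on one page that falls inside one rectangle. The sketch below illustrates that idea in Python and is not how Grooper is implemented. The `Word` type, the `fixed_section` function, the coordinates (inches from the top-left corner), and the sample words are all assumptions made for illustration.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    page: int
    x: float  # horizontal center, inches from the left edge
    y: float  # vertical center, inches from the top edge


def fixed_section(words: List[Word], page: int,
                  left: float, top: float, right: float, bottom: float) -> str:
    """Return the text of the words on `page` whose centers fall inside the zone."""
    inside = [w for w in words
              if w.page == page and left <= w.x <= right and top <= w.y <= bottom]
    return " ".join(w.text for w in inside)


words = [
    Word("Company", 1, 1.0, 1.0),
    Word("Reporting", 1, 1.8, 1.0),
    Word("Number", 1, 2.6, 1.0),
    Word("Footer", 1, 1.0, 10.5),   # same page, outside the zone
    Word("Company", 2, 1.0, 1.0),   # inside the zone, but on another page
]

section_text = fixed_section(words, page=1, left=0.5, top=0.5, right=4.0, bottom=2.0)
print(section_text)
```

Both the page number and the rectangle are needed: without the page, the same coordinates would match text on every page.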
Grooper also gives you ways to verify the section instance (or instances) established by the Data Section. For more information on viewing the Data Section's section instances, visit the How To: Viewing the Section Instances section of this article.
For information on how to add Data Elements to a Data Section and how their extraction differs from standard full document extraction, visit the How To: Adding Data Elements to a Data Section section of this article.
Full Page
Divider
The Divider, Geometric, and Simple section extraction methods get into the "meat and potatoes" functionality of Data Sections. As well as being able to target single-instance sections, they have increased functionality to target multiple repeating sections containing the same data.
We will target the repeating sections on this "Gross Production Monthly Tax Report" (henceforth called "Reporting Sections"). These sections can be targeted in different ways using any of these three section extraction methods. Their configurations are a little different, but at the end of the day, each of them can easily intuit the five reporting sections and their general boundaries.
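The idea of starting a new section at each divider result can be sketched as follows. This is a conceptual illustration rather than Grooper's Divider implementation: it works on plain text lines, and the `divide` function, the regular expression, and the "Reporting Section N" header text are hypothetical stand-ins for whatever the Divider Extractor would actually match on the report.

```python
import re
from typing import List


def divide(lines: List[str], divider: str) -> List[List[str]]:
    """Start a new section at every line matching `divider`.

    Lines that appear before the first match belong to no section and are dropped.
    """
    pattern = re.compile(divider)
    sections: List[List[str]] = []
    for line in lines:
        if pattern.search(line):
            sections.append([line])      # a divider result begins a section
        elif sections:
            sections[-1].append(line)    # everything else joins the open section
    return sections


report = [
    "Gross Production Monthly Tax Report",
    "Reporting Section 1",
    "Volume 100",
    "Reporting Section 2",
    "Volume 250",
    "Tax 12",
]

sections = divide(report, r"^Reporting Section \d+$")
for section in sections:
    print(section)
```

The number of sections is not configured anywhere; it simply follows from how many divider results the document contains, which is what makes this approach suit repeating sections.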
Geometric
Simple
How To
Viewing the Section Instances
When configuring any of the section extraction methods, it can be useful to verify what section instances are created: where they are physically located on the document and what text data they contain. The "Instance View" tab is extremely helpful for doing just this when testing your Data Section configurations.
Adding Data Elements to a Data Section
Data Sections can have Data Fields, Data Tables, and even Data Sections as their child Data Elements. You add these Data Elements to the Data Section just like you do with a Data Model. For this example, we will add a Data Field for the "Company Reporting Number" located in the Data Section we created.
We're going to use a very general pattern to illustrate this point. You can see here the results of the configured Text Pattern extractor used for the Data Field.
Now that the Data Field is added and configured to return a result, we can verify it only executes against the section instance created by the Data Section.
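The effect of scoping a very general pattern to a section instance can be shown with ordinary regular expressions. The following is a sketch only: the sample text, the seven-digit pattern, and the variable names are invented for illustration, and Python's `re` module stands in for the Text Pattern extractor.

```python
import re

# Hypothetical full-document text with several seven-digit numbers.
document_text = (
    "Account 1111111\n"
    "Company Reporting Number 2222222\n"
    "Phone 3333333\n"
)

# Hypothetical text of the single section instance.
section_text = "Company Reporting Number 2222222"

# A deliberately loose pattern: any run of seven digits.
general_pattern = re.compile(r"\d{7}")

full_document_hits = general_pattern.findall(document_text)
section_hits = general_pattern.findall(section_text)

print(full_document_hits)
print(section_hits)
```

Run against the whole document, the loose pattern returns every seven-digit number. Run against the section instance alone, it returns only the one inside the section, which is why child Data Elements can often use simpler patterns than they would need at the document level.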