2.90:Data Section (Node Type)

Data Sections are Data Elements of a Data Model. They allow a document's content to be subdivided into smaller portions (or "sections") for further processing, which makes extraction more efficient and more accurate.

Often, they are used to extract repeating sections of a document. For example, if a document had several sections of data for different customers, a Data Section could be used to pull data for each customer. This is especially useful when the data within each section is predictable but the number of sections in the document is not (e.g. one document lists one customer's data, the next lists five, the next lists two, and so on).

Data Sections can also be used to:

Organize data from complex documents
Make a hierarchical representation of a document's structure, or
Reorder content from multiple columns on a page.

Data Sections may have, as their children:

Data Fields
Data Tables
Their own Data Sections

About

Sometimes a Data Field by itself just doesn't cut it when it comes time to extract data. Data Fields are the smallest building blocks of your Data Models. They are designed to return a single piece of data. For example, most report-style documents will have a single date on which the report was made. A single "Report Date" Data Field is well suited for this data.

However, what about repeated data across a single document? Say you have a document like this one. This is a standard reporting form oil and gas companies have to fill out and return to the Oklahoma Tax Commission for wells in production. One piece of information one might want to extract is the "Production Unit Number", which is essentially a tracking number relating to an oil and gas lease. But, there's not just one "Production Unit Number". There are five different ones. There's actually a set of information repeated in the sections of the document labeled "A", "B", "C" and so on.

It would be cumbersome to create a separate set of Data Fields for each of the five sections.

Furthermore, for more unstructured or semi-structured documents, you may not reliably know how many sections are present per document. There might be one. There might be twenty. There could be variations of this form that have an "F", "G", and "H" section, for example. If you can't predict the number of Data Fields, how are you going to include them all in your Data Model?

This is exactly what Data Sections are for! Data Sections allow you to divide a document's content into smaller sections for further processing.

With a Data Section you can target these repeating portions of a document, creating five distinct sections out of them.

Then, all you need is a single Data Field for the repeating value you want to extract from each section.

Data Sections subdivide the larger document into smaller data instances. Data instances are an encapsulation of text data within the document. The largest data instance would be the document itself. Individual pages would be smaller sub-instances of the document level data instance. If you want to execute an extractor on a page and not the whole document, you effectively execute it on the page instance of the document instance. Data Sections allow Grooper users to define how the document is subdivided to execute an extractor on a section instance of the document instance.

Rather than the Data Field (or other Data Element objects) executing against the whole document, it executes against each data instance. It's like it creates smaller sub-documents or document chunks, ignoring all the text data outside of that chunk. Extractors used to populate Data Elements added to the Data Section will only execute against the text data contained in the Data Section. The rest of the document's text data is filtered out, narrowing the Data Elements' field of vision.

You can even subdivide a Data Section's data instance with another Data Section. This way you can create a hierarchy of data instances by adding child Data Sections to parent Data Sections in a Data Model. The parent Data Section is a subdivision of the document's data instance. The child Data Section is a subdivision of the parent Data Section's data instance. A Data Section added to that child would, in turn, be a subdivision of the child Data Section's data instance. It's like making a Russian nesting doll out of the document's text data.
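The nesting described above can be pictured with a small sketch. This is a conceptual model only and not Grooper's API: the `DataInstance` class, its `subdivide` method, and the sample text are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DataInstance:
    """Hypothetical stand-in for a data instance: a chunk of text plus its sub-instances."""
    name: str
    text: str
    children: List["DataInstance"] = field(default_factory=list)

    def subdivide(self, name: str, start: int, end: int) -> "DataInstance":
        # A child instance only ever sees a slice of its parent's text.
        child = DataInstance(name, self.text[start:end])
        self.children.append(child)
        return child


# The document is the largest data instance.
document = DataInstance("document", "HEADER | A: 111 222 | B: 333 444 | FOOTER")

# A section instance is a subdivision of the document instance.
start = document.text.index("A:")
end = document.text.index(" | B:")
section_a = document.subdivide("section A", start, end)

# A child section is a subdivision of its parent section, not of the document.
first_value = section_a.subdivide("first value", 3, 6)

print(section_a.text)    # A: 111 222
print(first_value.text)  # 111
```

Anything that runs against `section_a` cannot see "333" or "444", because that text sits outside its slice of the document.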

As with other Data Elements, Data Sections are created by adding them to a Data Model in a Content Model.

To add a Data Section, right-click a Data Model.
Select "Add" then "Data Section..."
Name the Data Section on the subsequent popup window.
Press the "OK" button to finalize.

Section Extract Methods

How Grooper subdivides the document into the smaller data instances (or "section instances") is controlled by the Data Section's Extract Method property. Each Extract Method works a little differently to section out the document for subsequent extraction. They are as follows:

Full Page - This method subdivides the document into full pages. You can use a page filter to define which page or pages establish the section instances (i.e. the first page, or the second and fourth pages, or the fifth through the last pages). You can also use an extractor to select pages: every page on which the extractor returns a result becomes a section instance.
Fixed - With this method, you establish the section instances by drawing a rectangular region on the document. Any text falling inside this rectangular zone forms the section instance. This method is useful for highly structured documents where you want to limit extraction to a specific area of a specific page of the document. This method will only ever return a single section instance.
Divider - This method uses an extractor similar to the Split Collation Method to establish section instances. A Divider Extractor is used to anchor the sections to an extractible result. The results the extractor returns can be used as the beginning point of the section or ending point. For example, a section header line may be used to indicate where one section begins. If the next section also uses that same section header, another section would be established. Sections can also be established between the Divider Extractor's results or (less commonly) around the results.
Geometric - This method uses a combination of extractors, positional adjustments, and line detection to establish rectangular regions for the section instances. Similar to the Fixed method, any text falling inside the rectangular zones forms the section instances. However, the Geometric method can produce multiple sections where the Fixed method only produces one. Furthermore, the Geometric method is always anchored to at least one extractor's result (the Main Extractor). The zone is expanded (or contracted) by adjusting the left, right, top and bottom edges of the zone using extractors or manually adjusting the length in inches or another unit. This method is useful for establishing sections from structured and semi-structured documents using columnar or atypical layouts.
Simple - This method uses a single extractor to return the section instances. One section is created for each result the extractor returns. This method is only "simple" in that it uses a single extractor to return the section. The extractor used to populate the section instances can be as complex as you create it, using any configuration of a Data Type extractor with the multitude of possibilities to return instances using any of the Collation Providers available. This method is also commonly used in unstructured document processing using Field Class extractors to create sections out of targeted paragraphs in a document's text.
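To make the Simple method's "one section per extractor result" behavior concrete, here is a minimal sketch using Python's `re` module in place of a Grooper extractor. The sample text and the paragraph pattern are hypothetical, chosen only to show the idea.

```python
import re

text = (
    "Intro paragraph.\n\n"
    "TERMINATION. Either party may terminate with notice.\n\n"
    "PAYMENT. Fees are due in thirty days.\n\n"
    "Closing paragraph."
)

# Hypothetical extractor: a line that starts with an all-caps heading and a period.
paragraph = re.compile(r"^[A-Z]{2,}\. .*$", re.MULTILINE)

# One section instance per result the extractor returns.
sections = [match.group(0) for match in paragraph.finditer(text)]

for number, section in enumerate(sections, start=1):
    print(number, section)
```

Here the extractor returns two results, so two sections are produced. The introductory and closing paragraphs never match, so they are not part of any section.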

To choose the Extract Method

Select a Data Section.
Select the Extract Method property.
Using the dropdown list, choose one of the available methods.

Fixed

In this example, we will demonstrate how to make a Data Section that returns a section for the highlighted portion of this document. This will limit the Data Section's Data Elements to return only data falling within this region.

To accomplish this we will use the Fixed method. In many ways, this sectioning method is the most basic. You simply draw a rectangular box around the portion of the document you want to form the section. All the text falling within this rectangular region will form the Data Section's section instance.

Furthermore, the Fixed method is the most basic in that only one section is established per document.

Here, we have selected a Data Section with the Extract Method set to Fixed.

Expand the Fixed Extract Method sub-properties.
Select the Bounds property to draw the rectangular boundaries of the zone for the section.
Press the ellipsis button at the end to bring up the zone editor.

Press the "Select Region" button to draw the zone.
Lasso the document with your cursor to place the zone. When finished it will appear as a green rectangle seen here.
- You can use the transform controls (the white boxes in the corners and edges) to edit the zone's dimensions.
- You can also use the General properties to manually enter the zone's position and the Size properties to manually enter the zone's size.
Press the "OK" button to finish placing the zone.

The Fixed extract method also requires you to indicate on which page the zone falls.

Using the Page Number property, enter the page on which the zone falls.
- This is a two page document, and the section we've established falls on the first page. So, we've entered 1.
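The effect of a fixed zone plus a page number can be sketched as a simple filter over positioned words. This is an illustration under stated assumptions, not how Grooper stores or evaluates zones: the `Word` and `Zone` classes, the use of inches, and the choice to treat each word as a single point (its top-left corner) are all simplifications made up for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    page: int
    x: float  # left edge of the word, in inches (assumed unit)
    y: float  # top edge of the word, in inches


@dataclass
class Zone:
    page: int
    left: float
    top: float
    width: float
    height: float

    def contains(self, word: Word) -> bool:
        # Simplification: a word is "inside" if its top-left corner is inside.
        return (
            word.page == self.page
            and self.left <= word.x <= self.left + self.width
            and self.top <= word.y <= self.top + self.height
        )


def fixed_section(words: List[Word], zone: Zone) -> str:
    """Return the text of the single section instance a fixed zone produces."""
    return " ".join(word.text for word in words if zone.contains(word))


words = [
    Word("Company", 1, 1.0, 2.0),
    Word("55097", 1, 2.5, 2.0),
    Word("73102", 1, 6.0, 9.0),   # same page, outside the zone
    Word("99999", 2, 2.5, 2.0),   # same position, wrong page
]
zone = Zone(page=1, left=0.5, top=1.5, width=4.0, height=1.0)

print(fixed_section(words, zone))  # Company 55097
```

Both conditions matter: text at the right position on the wrong page is excluded, just as text on the right page outside the rectangle is.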

Grooper also gives you ways to verify the section instance (or instances) established by the Data Section. For more information on viewing the Data Section's section instances, visit the How To: Viewing the Section Instances section of this article.

For information on how to add Data Elements to a Data Section and how their extraction differs from standard full document extraction, visit the How To: Adding Data Elements to a Data Section section of this article.

Full Page

As its name implies, the Full Page section extraction method creates section instances out of full pages in a document. This can be useful to limit data extraction to a single page.

This can be as simple as indicating which page number you wish to create a section instance out of. You can also create multiple sections using the Full Page method by indicating multiple pages or a span of pages. This would create a section instance for every full page in the span. Note this means the Full Page section extraction method cannot create sections that span pages. You get a single full page per section instance, top to bottom. No more, no less.

In this example, we will demonstrate how to create a Data Section returning the first page of this "Gross Production Monthly Tax Report" as the section instance.

Here, we have selected a Data Section with the Extract Method set to Full Page.

Expand the Full Page Extract Method sub-properties.
Select the Page Filter property to enter page numbers for each section instance.
- Here, we just want to create a single section instance out of the first page of each document. So, we have entered 1.

And that's it for the most basic Full Page configuration. In this case, a single section instance will be created encapsulating the first page of each document.

FYI

The Full Page section extraction method can also create multiple sections if multiple pages are listed in the Page Filter. Below are some examples of page filters you may use to create multiple sections. One section will be established for each page.

Page Filter	Description
1 to 5	Five sections will be created from the first five pages of the document.
-1	One section will be created from the last page of the document.
-5 to -1	Five sections will be created from the last five pages of the document.
1, 2, 5	Three sections will be created from the first, second, and fifth pages.
1, 3 to 5	Four sections will be created from the first page and third through fifth pages.
1, -2	Two sections will be created from the first and next to last page.
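The page filter examples in the table can be read as a small grammar: comma-separated entries, each either a single page or a "first to last" span, with negative numbers counting back from the last page. The sketch below resolves a filter into page numbers under that reading. The function name `resolve_page_filter` is hypothetical, and dropping pages that fall outside the document is an assumption, since the table does not say how Grooper handles them.

```python
from typing import List


def resolve_page_filter(page_filter: str, page_count: int) -> List[int]:
    """Turn a filter such as "1, 3 to 5" into 1-based page numbers."""

    def resolve(token: str) -> int:
        number = int(token)  # int() tolerates surrounding spaces
        # Negative numbers count from the end: -1 is the last page.
        return number if number > 0 else page_count + 1 + number

    pages: List[int] = []
    for part in page_filter.split(","):
        if " to " in part:
            first, last = (resolve(token) for token in part.split(" to "))
            pages.extend(range(first, last + 1))
        else:
            pages.append(resolve(part))
    # Assumption: pages outside the document are silently dropped.
    return [page for page in pages if 1 <= page <= page_count]


# One section instance would be created per resolved page of a 10-page document.
for example in ["1 to 5", "-1", "-5 to -1", "1, 2, 5", "1, 3 to 5", "1, -2"]:
    print(example, "->", resolve_page_filter(example, page_count=10))
```

For a 10-page document, "1, -2" resolves to pages 1 and 9, and "-5 to -1" resolves to pages 6 through 10.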

Very straightforward. However, this configuration assumes two things: your documents are highly structured, so you know what you're looking for is on a particular page (or pages), and the document's pages are in order.

What if this document was scanned in out of order? The first page would be last and last would be first. Using the method described above, a section instance would be created from the wrong page.

There is a potential solution using the Full Page method's Extractor property. This allows you to target a page or pages with an extractor. If the extractor produces a result on a page, a section instance will be created out of the full page. This could be a referenced extractor (a Data Type or less commonly a Field Class) or an internal text pattern local to the Extractor property.

We can easily solve the page order problem described above with a simple extractor looking for the document's title "Gross Production Monthly Tax Report". Configured correctly, the regex will only match the actual first page of the document, even if it's out of order.

Here, the Extractor property is configured as an Internal Type extractor, using the regex pattern below:
- \nGross Production Monthly Tax Report
Notice we don't use the Page Filter property in this case.
- You can choose to use either the Page Filter property or the Extractor property to produce section instances.
- You can also use both properties. However, both conditions must be met in order to produce a section instance. If you use a Page Filter and an Extractor and the extractor fails to produce a result on the listed page number, no section instance will be created.
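The interaction between the Page Filter and Extractor properties can be sketched as two optional conditions applied page by page. This is an approximation written with Python's `re` module, not Grooper's implementation: the `full_page_sections` function and the two sample pages are hypothetical, and the filter is passed in as already-resolved page numbers.

```python
import re
from typing import List, Optional

# A two-page document scanned out of order: the real first page is last.
pages = [
    "Page 2 of 2\nTotals and signatures",
    "Oklahoma Tax Commission\nGross Production Monthly Tax Report\nPage 1 of 2",
]

title = re.compile(r"\nGross Production Monthly Tax Report")


def full_page_sections(
    pages: List[str],
    extractor: Optional[re.Pattern] = None,
    page_filter: Optional[List[int]] = None,
) -> List[int]:
    """Return the 1-based numbers of the pages that become section instances."""
    selected = []
    for number, text in enumerate(pages, start=1):
        # When a page filter is given, the page must be listed in it.
        if page_filter is not None and number not in page_filter:
            continue
        # When an extractor is given, it must produce a result on the page.
        if extractor is not None and not extractor.search(text):
            continue
        selected.append(number)
    return selected


print(full_page_sections(pages, page_filter=[1]))                   # [1]
print(full_page_sections(pages, extractor=title))                   # [2]
print(full_page_sections(pages, extractor=title, page_filter=[1]))  # []
```

The filter alone picks the wrong page, the extractor alone finds the title page wherever it landed, and using both yields nothing because the title is not on page 1 of this scan.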

The Divider, Geometric, and Simple section extraction methods get into the "meat and potatoes" functionality of Data Sections. As well as being able to target single-instance sections, they have increased functionality to target multiple repeating sections containing the same data.

We will target the repeating sections on this "Gross Production Monthly Tax Report" (henceforth called "Reporting Sections"). These sections can be targeted in different ways using any of these three section extraction methods. Their configurations are a little different, but at the end of the day, each of them can easily intuit the five reporting sections and their general boundaries.

Divider

The Divider method uses a functionality similar (identical even) to how a Data Type using the Split Collation Provider returns results to establish the section instances. There are two parts of the Divider method's operation.

An extractor is used to find a piece of text anchoring the section's position on the page.
- Generally, there is something that can be used to identify a section. There might be some kind of header labeling the section. There might be field labels you can use. There is something that identifies the section as a section, where it starts and/or stops. The extractor homes in on this point, anchoring a reference point for the section instance.
A "Split Position" is used to indicate how this extractor's anchoring result should be used to consume more text data as the section (or sections).
- Perhaps you have a header label for the section and everything after that label should be included in the section. Maybe you have a piece of text on one line and another ten lines below it and everything in between them should be included in the section. The split position allows you to control how the document is split into sections based on the extractor's results.

For example, each section here starts with a letter followed by "8. Production Unit Number". Once you see that piece of text, you can keep on going down the document until you see another letter and "8. Production Unit Number" at which point a new section should start. We will use an extractor to match this text anchoring the start of each section and the Begin split position to indicate this text indicates where a section begins.

Here, we have selected a Data Section with the Extract Method set to Divider.

Expand the Divider Extract Method sub-properties.
The first thing you'll need to do is configure the Divider Extractor.
- This can be a Reference or Internal extractor Type. We will use a simple Internal extractor.

This is the pattern we've configured for our Divider Extractor.

The Value Pattern is configured to return the text data at the start of each section.
- [A-Z] 8\. Production Unit Number
We get five results returned.
Each result will anchor our section instances to the beginning of each section.

Now that we have the Divider Extractor configured, we need to decide how to configure the Split Position property. This can be one of four options:

Begin - The extractor's result marks the beginning of each section. Starting at the extractor's result, the section will consume all text in the document until the next result.
End - The extractor's result marks the end of each section. Starting at the top of the document, the section will consume all text until the extractor's result. The next section will consume all text after that result until the next result.
Between - This split position requires at least two results from the extractor. The section consumes all text between the extractor's results. Importantly, the Begin and End split positions are inclusive of the extractor's result where Between is exclusive. That means the extractor's results will not be included in the section when using the Between split position.
Around - This is a less common split position. It will create sections on either side of the extractor's results. Imagine you had a single result extracting a line in the dead middle of the document. You would end up with two sections, one encapsulating everything before the result and one everything after. If your extractor produces two results, you'll end up with three sections: one from the top of the document up to the start of the first result, one from the end of the first result to the start of the second result, and one from the end of the second result to the end of the document. You will always end up with the number of results your extractor returns plus one. The Around split position is also exclusive. Results will be excluded from the sections produced.
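The four split positions can be compared side by side with a short sketch that slices a string at the divider's matches. It follows the descriptions above (Begin and End inclusive of the result, Between and Around exclusive) but is only an approximation built on Python's `re` module. The `divide` function and the sample text are hypothetical, and how Grooper treats edge cases such as text after the last result with End is not something this sketch claims to reproduce.

```python
import re
from typing import List


def divide(text: str, divider: re.Pattern, split_position: str) -> List[str]:
    """Split text into sections around the divider's results."""
    hits = [match.span() for match in divider.finditer(text)]
    if not hits:
        return []
    starts = [start for start, _ in hits]
    ends = [end for _, end in hits]

    if split_position == "Begin":      # result starts a section (inclusive)
        bounds = zip(starts, starts[1:] + [len(text)])
    elif split_position == "End":      # result ends a section (inclusive)
        bounds = zip([0] + ends[:-1], ends)
    elif split_position == "Between":  # text between results (exclusive)
        bounds = zip(ends[:-1], starts[1:])
    elif split_position == "Around":   # text on either side of results (exclusive)
        bounds = zip([0] + ends, starts + [len(text)])
    else:
        raise ValueError(f"unknown split position: {split_position}")
    return [text[a:b] for a, b in bounds]


text = (
    "Header\n"
    "A 8. Production Unit Number 111\n"
    "B 8. Production Unit Number 222\n"
    "Footer"
)
divider = re.compile(r"[A-Z] 8\. Production Unit Number")

for position in ["Begin", "End", "Between", "Around"]:
    print(position, divide(text, divider, position))
```

With two results, Begin and End each give two sections, Between gives one, and Around gives three (results plus one). Note that the last Begin section runs on to the end of the text and picks up "Footer".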

To choose the split position, select the Split Position property.
Using the dropdown menu, choose one of the four split position options.
- In our case, the Divider Extractor's results match the start of each section. So, we've selected Begin.

Inspecting the section instances created, you can see this Data Section now establishes five section instances.
The sections start at our Divider Extractor's results and end when another result is returned.

For more information on viewing the Data Section's section instances, visit the How To: Viewing the Section Instances section of this article.

As a side note, check out the last section instance created by this configuration of the Divider method.
In the "Text View" tab, you can see it has quite a bit more text in this data instance than just what is in the section on the document. This is due to the nature of the Begin Split Position. This section instance was created using the last result returned by the Divider Extractor. Without another result after it, a section was created with this result as its starting point and then just kept on going until the end of the document.

Is this a big deal? Maybe. Maybe not. This will largely depend on your particular documents and their structure. In this case, it's probably not going to have much of an impact. We can still extract the data in the actual section. Having the extra text data doesn't really do anything one way or the other for us.

However, if it does impact your extraction results, you may need to use a different Split Position and/or Divider Extractor configuration or even section extraction method.

Geometric

The Geometric method uses a variety of tools in Grooper's toolbox to create a logical rectangular region around the sections you wish to extract. Any and all text falling within these rectangular zones form the section instances. The basic process to draw these zones is two-fold.

A Main Extractor is configured to place the initial zones.
- This serves as the starting point for the zone's location. This could encapsulate the entire section. Or, it could serve as an anchor similar to how we used the label "8. Production Unit Number" to anchor the beginning of each section using the Divider method.
The zones' sizes are adjusted using the Top, Bottom, Left, and Right Adjustment properties.
- You can configure these properties to expand or contract the size of the initial zones established by the Main Extractor. You can use extractors to anchor the edges of the zones to text data or manually adjust the zone's length in inches, centimeters, millimeters, or points.

Optionally, if your sections are encased in lines, such as they are on this document, you can use Grooper's line detection to expand the zone's boundaries to nearby lines.

Here, we have selected a Data Section with the Extract Method set to Geometric.

Expand the Geometric Extract Method sub-properties.
The first thing you'll need to do is configure the Main Extractor.
- In this case, we used a simple internal pattern, matching the label "8. Production Unit Number" and the section letter.
- [A-Z] 8\. Production Unit Number
This gives us a starting point for the sections' locations. But if we were to execute this Data Section as is, we'd just end up with five different sections with only "A 8. Production Unit Number", "B 8. Production Unit Number" and so on as their text data. We need to expand out the zones' borders to fully cover each section.

We will expand out the zones' sizes using the various Adjustment properties. There are three options for the borders' Adjustments:

Absolute - This will expand or contract the border to a specific, fixed point on the page.
Edge of Page - This will expand the border to the top, bottom, left or right edge of the page (depending on whether you choose the Top, Bottom, Left, or Right Adjustment property).
Anchor - With this option, you can use an extractor to expand or contract the border to an extraction result.

All of these options also have a Manual Adjustment property to expand or contract the border even further by a set length in inches or another unit.

In this case, the Top and Left Adjustments are fine as they are.

The Main Extractor's result is itself at the very top line of the section. There's no need to configure a Top Adjustment.
The left edges of the results are also aligned with the left edge of each section. There's also no need to configure a Left Adjustment.

For the Right Adjustment we will choose Edge of Page.
This will expand the section instances all the way to the right edge of the page.
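The border adjustments can be pictured as operations on a rectangle. The sketch below shows only the right border and is a loose illustration under several assumptions: the `Rect` class, the `adjust_right` function, the letter-size page in inches, and the sign convention for the manual offset are all invented for this example, and for Anchor the caller simply supplies the x position of the anchoring result.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Rect:
    left: float
    top: float
    right: float
    bottom: float


PAGE = Rect(0.0, 0.0, 8.5, 11.0)  # assumed letter-size page, in inches


def adjust_right(zone: Rect, mode: str, value: float = 0.0, manual: float = 0.0) -> Rect:
    """Move the right border of a zone.

    "Absolute": value is a fixed x position on the page.
    "Edge of Page": the border moves to the page's right edge.
    "Anchor": value is the x position of an anchoring extraction result.
    manual is an extra offset applied afterwards (assumed: positive moves right).
    """
    if mode == "Edge of Page":
        right = PAGE.right
    elif mode in ("Absolute", "Anchor"):
        right = value
    else:
        raise ValueError(f"unknown adjustment: {mode}")
    return replace(zone, right=right + manual)


# Initial zone: just the Main Extractor's result for one section label.
label_zone = Rect(left=0.5, top=3.0, right=2.9, bottom=3.2)

section_zone = adjust_right(label_zone, "Edge of Page")
print(section_zone)
```

Only the right border changes. The top and left borders stay where the Main Extractor's result put them, which matches leaving the Top and Left Adjustments unconfigured.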

Simple

How To

Viewing the Section Instances

When configuring any of the section extraction methods, it can be useful to verify what section instances are created: where they are physically on the document and what text data they contain. The "Instance View" tab is extremely helpful for doing just this when testing out your Data Section configurations.

Switch to the "Instance View" tab.
Press the "Test Extraction" button.
Expand the Data Section to view the created section instances.
Select a section instance to view its results.
- In this case we used the Fixed section extraction method described above. With the Fixed method only creating one instance, we only see one section instance as a child of the Data Section. For other methods that create multiple section instances, you will see one child for each section instance created here.
- If you see no child section instances here, your Extract Method configuration has failed to produce any sections.
In the "Image View" tab, you can visually verify where the section instance is on the document. A red outlined border will surround the selected section instance.
- Note: This border is somewhat faint and can be difficult to see in Grooper. It is exaggerated in this image for the purpose of illustration.

If you switch to the Instance View's "Text View" tab, you can see the section instance's text data.
- All extraction performed by the section's Data Elements will run against this text data.

Adding Data Elements to a Data Section

Data Sections can have Data Fields, Data Tables, and even Data Sections as their child Data Elements. You add these Data Elements to the Data Section just like you do with a Data Model.

For this example, we will add a Data Field for the "Company Reporting Number" located in the Data Section we created.

Right-click the Data Section.
Select "Add" then the Data Element you wish to add.
- In this case, we will choose "Data Field..." to add a Data Field.
Name the Data Element as you wish.
Press "OK" to finalize the Data Element's creation.

This adds the Data Element as a child of the Data Section.
- In this case, we added a Data Field named "Company Reporting Number".
From here, the Data Element is configured as you would if it were added as a direct child of a Data Model.
- We're going to use a simple Text Pattern extractor to return the five digit Company Reporting Number.
The difference is that, as a child of the Data Section, it will run against the section instance (or instances) established by the Data Section rather than the full document.

We're going to use a very general pattern to illustrate this point.

You can see here the results of the configured Text Pattern extractor used for the Data Field.
- The Value Pattern regex `\d{5}` matches a string of five digits.
- The Prefix and Suffix Patterns `\s` will only return results where the string of five digits is bookended by a whitespace character on either end.
This does give us the value we want, the "Company Reporting Number" of "55097".
But it gives us all kinds of junk too, like five digit zip codes.
- Furthermore, run at the document level, it's not the first result. This extractor would fail to produce the right result if it was run outside the Data Section.
- But as we'll see next, with this extractor running inside the Data Section, this extractor will work just fine. It will execute against the section instance, not the document instance.

Now that the Data Field is added and configured to return a result, we can verify it only executes against the section instance created by the Data Section.

Select the Data Section level in the Data Model's hierarchy to test extraction for its child Data Elements (in this case just the "Company Reporting Number" Data Field).
Press the "Test Extraction" button.
With the Scope property set to the default of MultiInstance, you will see the total number of section instances created by the Data Section.
- You can also use the arrows here to navigate through the sections. In this case, though, only a single section instance was created.
Each section is populated with the results of its child Data Elements. Here, you can see our lone child Data Field named "Company Reporting Number".
The Data Elements execute against the section instance. Now, the extractor built for the "Company Reporting Number" Data Field returns the right result.
- The extractor only executes against the section instance, ignoring all other text data outside of it. The only matching result is the one returned in this case.
You can also see the section instance's location outlined in blue. The returned result is the only thing the extractor matches within this data instance. Data that was being returned by the extractor at the document level, such as the five digit zip code, falls outside of this data instance and is therefore not returned.
- Note: This border is somewhat faint and can be difficult to see in Grooper. It is exaggerated in this image for the purpose of illustration.
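The difference between running the same general pattern against the document and against a section instance can be reproduced in a few lines. This sketch uses Python's `re` module and models the Prefix and Suffix Patterns as lookarounds, which is an approximation and not a statement about how Grooper evaluates them. The sample text and the way the section is sliced out are made up for illustration.

```python
import re

# Value Pattern \d{5}, with \s required before and after (modeled as lookarounds).
five_digits = re.compile(r"(?<=\s)\d{5}(?=\s)")

document = (
    "Acme Oil Co\n"
    "Oklahoma City, OK 73102 \n"
    "Company Reporting Number 55097 \n"
    "Tulsa, OK 74103 \n"
)

# Stand-in for the section instance: only the text inside the section's zone.
start = document.index("Company Reporting Number")
end = document.index("Tulsa")
section = document[start:end]

print(five_digits.findall(document))  # ['73102', '55097', '74103']
print(five_digits.findall(section))   # ['55097']
```

At the document level the first result is a zip code, so a Data Field taking the first result would be wrong. Inside the section instance the only match is the Company Reporting Number.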