2.90:Data Section (Node Type)

From Grooper Wiki

Data Sections are Data Elements of a Data Model. They allow a document's content to be subdivided into smaller portions (or "sections") for further processing, making extraction more efficient and more accurate.

Often, they are used to extract repeating sections of a document. For example, if a document had several sections of data for different customers, a Data Section could be used to pull data for each customer. This is especially useful when the data within each section is predictable but the number of sections in the document is not (e.g. one document lists one customer's data, the next lists five, the next lists two, and so on).

Data Sections can also be used to:

  • Organize data from complex documents
  • Make a hierarchical representation of a document's structure, or
  • Reorder content from multiple columns on a page.

A Data Section may have other Data Elements as its children, such as Data Fields and other Data Sections.


About

Sometimes a Data Field by itself just doesn't cut it when it comes time to extract data. Data Fields are the smallest building blocks of your Data Models. They are designed to return a single piece of data. For example, most report-style documents will have a single date on which the report was made. A single "Report Date" Data Field is well suited for this data.


However, what about repeated data across a single document? Say you have a standard reporting form that oil and gas companies have to fill out and return to the Oklahoma Tax Commission for wells in production. One piece of information you might want to extract is the "Production Unit Number", which is essentially a tracking number relating to an oil and gas lease. But there's not just one "Production Unit Number". There are five different ones. In fact, a whole set of information is repeated in the sections of the document labeled "A", "B", "C" and so on.

It would be cumbersome to create five separate sets of Data Fields, one set for each section.

Furthermore, for more unstructured or semi-structured documents, you may not reliably know how many sections are present per document. There might be one. There might be twenty. There could be variations of this form that have an "F", "G", and "H" section, for example. If you can't predict the number of Data Fields, how are you going to include them all in your Data Model?

This is exactly what Data Sections are for! Data Sections allow you to divide a document's content into smaller sections for further processing.

With a Data Section you can target these repeating portions of a document, creating five distinct sections out of them.

Then, all you need is a single Data Field for the repeating value you want to extract from each section.


Data Sections subdivide the larger document into smaller data instances. A data instance is an encapsulation of text data within the document. The largest data instance would be the document itself. Individual pages would be smaller sub-instances of the document-level data instance. If you want to execute an extractor on a page and not the whole document, you effectively execute it on the page instance of the document instance. Data Sections allow Grooper users to define how the document is subdivided, so that an extractor executes on a section instance of the document instance.

Rather than the Data Field (or other Data Element objects) executing against the whole document, it executes against each data instance. It's like creating smaller sub-documents or document chunks and ignoring all the text data outside of each chunk. Extractors used to populate Data Elements added to the Data Section will only execute against the text data contained in the Data Section. The rest of the document's text data is filtered out, narrowing the Data Elements' field of vision.
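As a conceptual illustration only (this is plain Python, not Grooper's API or configuration), the idea of running one field extractor against each section instance can be sketched as follows. The sample text, the section-letter pattern, and the function names are all assumptions invented for this example.

```python
import re

# Hypothetical sample text; the labels and values are invented for illustration.
DOCUMENT = """\
Report Date: 01/15/2020
A. Production Unit Number: 111-22222
B. Production Unit Number: 333-44444
C. Production Unit Number: 555-66666
"""

def split_into_sections(text):
    """Return one text chunk per line that starts with a section letter such as 'A. '."""
    return [line for line in text.splitlines() if re.match(r"[A-Z]\. ", line)]

def extract_unit_number(section_text):
    """Run a single 'field extractor' against one section's text only."""
    match = re.search(r"Production Unit Number:\s*([\d-]+)", section_text)
    return match.group(1) if match else None

# One field definition, applied once per section instance.
sections = split_into_sections(DOCUMENT)
unit_numbers = [extract_unit_number(section) for section in sections]
```

Here a single extractor definition yields one value per section, however many sections the document happens to contain, and text outside the sections (the report date line) is never seen by the extractor.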

You can even subdivide a Data Section's data instance with another Data Section. This way you can create a hierarchy of data instances by adding child Data Sections to parent Data Sections in a Data Model. The parent Data Section is a subdivision of the document's data instance. The child Data Section is a subdivision of the parent Data Section's data instance. A child of that child Data Section would, in turn, be a subdivision of the child Data Section's data instance. It's like making a Russian nesting doll out of the document's text data.
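The nesting described above can be pictured as a small tree. The following is a minimal sketch in plain Python, not a representation of how Grooper stores data instances; the DataInstance class, its methods, and the sample text are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class DataInstance:
    """A chunk of text plus any sub-instances carved out of it (hypothetical model)."""
    name: str
    text: str
    children: List["DataInstance"] = field(default_factory=list)

    def subdivide(self, name, chunk):
        """Create a child instance; the chunk must come from this instance's own text."""
        if chunk not in self.text:
            raise ValueError("a child instance must come from its parent's text")
        child = DataInstance(name, chunk)
        self.children.append(child)
        return child

    def depth(self):
        """Number of nesting levels at and below this instance."""
        return 1 + max((c.depth() for c in self.children), default=0)

document = DataInstance("document", "Customer: Ann | Order: 7 | Item: pen")
parent = document.subdivide("customer section", "Customer: Ann | Order: 7")
child = parent.subdivide("order section", "Order: 7")
```

Each level can only see the text of the level above it, so the child section here cannot reach "Item: pen", which lies outside its parent.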

As with other Data Elements, Data Sections are created by adding them to a Data Model in a Content Model.

  1. To add a Data Section, right-click a Data Model.
  2. Select "Add" then "Data Section..."
  3. Name the Data Section on the subsequent popup window.
  4. Press the "OK" button to finalize.


Section Extract Methods

How Grooper subdivides the document into the smaller data instances (or "section instances") is controlled by the Data Section's Extract Method property. Each Extract Method works a little differently to section out the document for subsequent extraction. They are as follows:

  • Full Page - This method subdivides the document into full pages. You can use a page filter to define which page or pages establish the section instances (e.g. the first page, or the second and fourth pages, or the fifth through the last pages). You can also use an extractor to select the page or pages on which the extractor returns a result.
  • Fixed - With this method, you establish the section instances by drawing a rectangular region on the document. Any text falling inside this rectangular zone forms the section instance. This method is useful for highly structured documents where you want to limit extraction to a specific area of a specific page of the document. This method will only ever return a single section instance.
  • Divider - This method uses an extractor, similar to the Split Collation Method, to establish section instances. A Divider Extractor is used to anchor the sections to an extractible result. The results the extractor returns can be used as either the beginning point or the ending point of a section. For example, a section header line may be used to indicate where one section begins. If the next section also uses that same section header, another section would be established. Sections can also be established between the Divider Extractor's results or (less commonly) around the results.
  • Geometric - This method uses a combination of extractors, positional adjustments, and line detection to establish rectangular regions for the section instances. Similar to the Fixed method, any text falling inside the rectangular zones forms the section instances. However, the Geometric method can produce multiple sections where the Fixed method only produces one. Furthermore, the Geometric method is always anchored to at least one extractor's result (the Main Extractor). The zone is expanded (or contracted) by adjusting the left, right, top and bottom edges of the zone using extractors or manually adjusting the length in inches or another unit. This method is useful for establishing sections from structured and semi-structured documents using columnar or atypical layouts.
  • Simple - This method uses a single extractor to return the section instances. One section is created for each result the extractor returns. This method is only "simple" in that it uses a single extractor to return the section. The extractor used to populate the section instances can be as complex as you make it, using any configuration of a Data Type extractor and any of the available Collation Providers to return instances. This method is also commonly used in unstructured document processing, using Field Class extractors to create sections out of targeted paragraphs in a document's text.
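To make the Divider idea more concrete, here is a rough sketch in plain Python of sectioning text by a repeating header line, where each match marks the beginning of a section. It is not Grooper's implementation; the divide_sections function, the regular expression, and the sample lines are assumptions for illustration.

```python
import re

def divide_sections(lines, divider_pattern):
    """Start a new section at every line the divider pattern matches."""
    sections = []
    for line in lines:
        if re.search(divider_pattern, line):
            sections.append([line])       # a divider result begins a new section
        elif sections:
            sections[-1].append(line)     # other lines join the current section
        # lines before the first divider result belong to no section
    return sections

lines = [
    "Quarterly Report",
    "Customer: Ann",
    "Balance: 10.00",
    "Customer: Bob",
    "Balance: 25.50",
    "Customer: Cy",
    "Balance: 0.00",
]
sections = divide_sections(lines, r"^Customer:")
```

The number of sections follows from the number of divider results, so the same configuration works whether a document contains one customer or twenty.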

To choose the Extract Method:

  1. Select a Data Section.
  2. Select the Extract Method property.
  3. Using the dropdown list, choose one of the available methods.

Fixed

In this example, we will demonstrate how to make a Data Section that returns a section for the highlighted portion of this document. This will limit the Data Section's Data Elements to return only data falling within this region.

To accomplish this, we will use the Fixed method. In many ways, this sectioning method is the most basic. You simply draw a rectangular box around the portion of the document you want to form the section. All the text falling within this rectangular region will form the Data Section's section instance.

Furthermore, the Fixed method is the most basic in that only one section is established per document.
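The idea behind the Fixed method, keeping only the text that falls inside one rectangle, can be sketched in plain Python as follows. This is an illustration only and does not reflect Grooper's internals; the Word and Zone classes, the use of inches as the unit, and the sample coordinates are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    x: float  # horizontal position of the word, in inches (assumed unit)
    y: float  # vertical position of the word, in inches

@dataclass
class Zone:
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        """True when the word's position falls inside the rectangle."""
        return self.left <= word.x <= self.right and self.top <= word.y <= self.bottom

def fixed_section(words, zone):
    """Return the single section instance: the text of every word inside the zone."""
    return " ".join(w.text for w in words if zone.contains(w))

words = [
    Word("Invoice", 1.0, 0.5),
    Word("Total:", 1.0, 3.0),
    Word("42.00", 2.0, 3.0),
    Word("Page", 4.0, 10.5),
]
zone = Zone(left=0.5, top=2.5, right=3.0, bottom=3.5)
section_text = fixed_section(words, zone)
```

Because there is exactly one rectangle, the function returns exactly one chunk of text, mirroring the one-section-per-document behavior described above.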

Here, we have selected a Data Section with the Extract Method set to Fixed.

  1. Expand the Fixed Extract Method sub-properties.
  2. Select the Bounds property to draw the rectangular boundaries of the zone for the section.
  3. Press the ellipsis button at the end to bring up the zone editor.

Full Page

Divider

Geometric

Simple