2023:Labeling Behavior (Behavior)

From Grooper Wiki
Revision as of 09:51, 28 September 2023 by Rpatton (talk | contribs)

WIP

This article is a work-in-progress or created as a placeholder for testing purposes. This article is subject to change and/or expansion. It may be incomplete, inaccurate, or stop abruptly.

This tag will be removed upon draft completion.

The Labeling Behavior is a Content Type Behavior designed to collect and utilize a document's field labels in a variety of ways. This includes functionality for classification and data extraction.

Previous Versions

Grooper 2021

The Labeling Behavior functionality allows Grooper users to quickly onboard new Document Types for structured and semi-structured forms, utilizing labels as a thumbprint for classification and data extraction purposes. Once the Labeling Behavior is enabled, labels are identified and collected using the "Labels" tab of Document Types. These "Label Sets" can then be used for the following purposes:

  • Document classification - Using the Labelset-Based Classification Method
  • Field based data extraction - Primarily using the Labeled Value Extractor Type
  • Tabular data extraction - Primarily using a Data Table object's Tabular Layout Extract Method
  • Sectional data extraction - Primarily using a Data Section object's Transaction Detection Extract Method
FYI The Labeling Behavior and its functionality discussed in this article are often referred to as "Label Set Behavior" or simply "Label Sets".


About

Labels serve an important function on documents. They give the reader critical context to understand where data is located and what it means. How do you know the difference between the date on an invoice document indicating when the invoice was sent and the date indicating when you should pay the invoice? It's the labels. The labels are what distinguishes one type of date from another. For example, "Invoice Date" for the date the invoice was sent and "Due Date" for the date you need to pay by.

Labels can be a way of classifying documents as well. What does one individual label tell you about a document? Well, maybe not much. However, if you take them all together, they can tell you quite a bit about the kind of document you're looking at. For example, a W-4 employee withholding form is going to use different labels than an employee healthcare enrollment form. These are two very different documents collecting very different information. The labels used to collect this information are thus different as well.

Furthermore, you can even tell the difference between two very closely related documents using labels as well. For example, two different invoices from two different vendors may share some similarity in the labels they use to detail information. But there will be some differences as well. These differences can be useful identifiers to distinguish one from the other. Put all together, labels can act as a thumbprint Grooper can use to classify a document as one Document Type or another.

Even though these two invoices share some labels (highlighted in blue), there are others that are unique to each one (highlighted in yellow). This awareness of how one kind of invoice from one vendor uses labels differently from another can give you a method of classifying these documents using their label sets.
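The "thumbprint" idea can be sketched in a few lines of Python. This is only an illustration of the concept, not how Grooper's Labelset-Based Classification Method is implemented. The label sets, the scoring rule (fraction of labels found) and the function names below are all assumptions made for the example.

```python
# Hypothetical label sets; the labels are illustrative, not a real Grooper export.
LABEL_SETS = {
    "Factura": {"invoice number", "po number", "remit to:", "amount due"},
    "Lasku": {"invoice number", "po number:", "remit to:", "total"},
}


def score_document(text: str, labels: set[str]) -> float:
    """Return the fraction of a Document Type's labels found in the text."""
    lowered = text.lower()
    hits = sum(1 for label in labels if label in lowered)
    return hits / len(labels) if labels else 0.0


def classify(text: str, label_sets: dict[str, set[str]]) -> str:
    """Return the Document Type whose label set best matches the text."""
    return max(label_sets, key=lambda name: score_document(text, label_sets[name]))


sample = (
    "Invoice Number 1001\nPO Number 77\n"
    "Remit To: Factura Technology Corp\nAmount Due 12.50"
)
best = classify(sample, LABEL_SETS)
```

Both label sets share "invoice number" and "remit to:", so those labels alone cannot separate the two vendors. The labels unique to each set are what tip the score toward one Document Type.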


The Labeling Behavior is built on these concepts, collecting and utilizing labels for Document Types in a Content Model for classification and data extraction purposes.

As a Behavior, the Labeling Behavior is enabled on a Content Type object in Grooper.

While you can enable Labeling Behavior on any Content Type, in almost all cases, you will want to enable this Behavior on the Content Model.

Typically, you want to collect and use label sets for multiple Document Types in the Content Model, not just one Document Type individually. Enabling the Behavior on the Content Model will enable the Labeling Behavior for all child Document Types, allowing you to collect and utilize labels for all Document Types.

  1. Here, we have selected a Content Model in the Node Tree.
  2. To add a Behavior, select the Behaviors property and press the ellipsis button at the end.

  1. This will bring up a dialogue window to add various behaviors to the Content Model, including the Labeling Behavior.
  2. Add the Labeling Behavior using the "Add" button.
  3. Select Labeling Behavior from the listed options.

  1. Once added, you will see a Labeling Behavior item added to the Behaviors list.
  2. Selecting the Labeling Behavior in the list, you will see property configuration options in the right panel.
    • The configuration options in the property panel pertain to fuzzy matching collected labels as well as constrained and vertical wrapping capabilities to target stacked labels.
    • By default, Grooper presumes you will want to use some fuzzy matching and enable constrained and vertical wrapping. These defaults work well for most use cases. However, you can adjust these properties here as needed.
  3. Press the "OK" button to finish adding the Labeling Behavior and exit this window.
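To give a feel for why fuzzy matching matters when labels come from OCR text, here is a minimal sketch using Python's standard `difflib` module. It is not Grooper's matching algorithm. The 0.8 threshold, the word-window approach and the function names are assumptions chosen for the example.

```python
from difflib import SequenceMatcher
from typing import Optional


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_find(label: str, ocr_text: str, threshold: float = 0.8) -> Optional[str]:
    """Return the run of words in ocr_text most similar to label.

    Returns None when no candidate reaches the (assumed) threshold.
    """
    words = ocr_text.split()
    size = len(label.split())
    best, best_score = None, 0.0
    # Slide a window with the same word count as the label across the text.
    for i in range(len(words) - size + 1):
        candidate = " ".join(words[i:i + size])
        score = similarity(label, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best if best_score >= threshold else None


# "lnvoice Nurnber" imitates common OCR confusions (I/l and m/rn).
ocr_text = "lnvoice Nurnber 1001 Due Date 01/02/2023"
match = fuzzy_find("Invoice Number", ocr_text)
```

An exact string comparison would reject the misread label. A similarity threshold lets the collected label still anchor extraction when a few characters are wrong.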

  1. Now on the Content Model tab you should see a Behavior is set.
  2. Save your changes.

Once the Labeling Behavior is enabled, the next big step is collecting label sets for the various Document Types in your Content Model.

  1. With the Labeling Behavior enabled, you will now see a "Labels" tab present for the Content Model.
    • This tab is also now present for each individual Document Type as well.
  2. Label sets are collected in this tab for each Document Type in the Content Model.

Each Document Type has its own set of labels used to define information on the document. For example, the "Factura" Document Type in this Content Model uses the label "PO Number" to call out the purchase order number on this invoice document. A different Document Type, corresponding to a different invoice format, might use a different label such as "Purchase Order Number" or "PO #".

  1. Ultimately, this is the data we want to collect using the Content Model's Data Model.
  2. We use the "Labels" tab to collect labels corresponding to the various Data Elements (Data Fields, Data Tables, and Data Sections) of the Data Model.
    • This provides a user interface to enter a label identifying the value you wish to collect for the Data Elements.
  3. For example, the label "PO Number" identifies the purchase order number for this invoice.
  4. Therefore, the label "PO Number" is collected for the "Purchase Order Number" Data Field in the Data Model.
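Conceptually, a label set is a per-Document-Type mapping from Data Elements to the label text found on that document. The sketch below models that mapping and uses a label to pick up the value printed after it. It is a simplified illustration, not the Labeled Value extractor itself. The "Other Vendor" Document Type, the sample text and the "value follows the label on the same line" rule are assumptions.

```python
import re
from typing import Optional

# Data Element -> label, per Document Type. "Other Vendor" is hypothetical.
LABELS = {
    "Factura": {"Purchase Order Number": "PO Number"},
    "Other Vendor": {"Purchase Order Number": "PO #"},
}


def extract_after_label(text: str, label: str) -> Optional[str]:
    """Return the first token following the label, or None if the label is absent."""
    pattern = re.escape(label) + r"\s*:?\s*(\S+)"
    found = re.search(pattern, text, flags=re.IGNORECASE)
    return found.group(1) if found else None


factura_text = "PO Number 4500123\nInvoice Date 09/28/2023"
po_number = extract_after_label(
    factura_text, LABELS["Factura"]["Purchase Order Number"]
)
```

The same Data Field ("Purchase Order Number") is reused across Document Types. Only the label that locates it changes.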

For more information on collecting label sets for the Document Types in your Content Model see the How To section of this article.

In Grooper 2023, for labels to show up in the Labels tab of the Content Model, a Label Set aware Value Extractor (such as Labeled Value or Tabular Layout) must be set on your Data Fields and Data Tables.

Once label sets are collected for each Document Type, they can be used for classification and data extraction purposes.

For example, labels were used in this case to:

  1. Classify the documents, assigning each document the appropriate Document Type.
  2. Extract all the Data Fields seen here, collecting field based data from the document.

  1. Extract the "Line Items" Data Table, collecting the tabular data seen here.

For more information on how to use labels for these purposes, see the How To section of this article.

How To

The Labeling Behavior (often referred to as "Label Set Behavior" or just "Label Sets") is well suited for structured and semi-structured document sets. Label Sets are particularly useful for situations where you have multiple variations for one kind of document or another. While the information you want to extract from the document set may be the same from variation to variation, how the data is laid out and labeled may be very different from one variation of the document to another. Label Sets allow you to quickly onboard new Document Types to capture new form structures.

We will use invoices for the document set in the following tutorials.

In a perfect world, you'd create a Content Model with a single "Invoice" Document Type whose Data Model would successfully extract all Data Elements for all invoices from all vendors every time no matter what.

This is often not the case. You may find you need to add multiple Document Types to account for variations of an invoice from multiple vendors. Label Sets give you a method of quickly adding Document Types to model new variations. In our case, we will presume we need to create one Document Type for each vendor.

We will start with five Document Types for invoices from five vendors.

  • Factura
  • Lasku
  • Envoy
  • Rechnung
  • Arve

You may download and import the file below into your own Grooper environment (version 2021). This contains the Batch(es) with the example document(s) discussed in this tutorial and the Content Model(s) configured according to the instructions.

Collect Label Sets

Navigate to the Labels UI

Collecting labels for the Document Types in your Content Model will be the first thing you want to do after enabling the Labeling Behavior. Labels for each Data Element in the Document Type's Data Model are defined using the "Labels" tab of the Content Model.

  1. Select the desired Content Model.
  2. Navigate to the "Labels" tab.
  3. With a Batch selected in the "Test Batch" window panel, select a document folder.

  1. Right click on the folder.
  2. Click "Assign Document Type..."

  1. When the "Assign Document Type" window pops up, click the hamburger button on the right side of the Content Type property.
  2. From the drop down, find the Content Model you are using and select the correct Document Type for the document.

  1. Click "Execute" to assign the Document Type.

FYI If you haven't added a Document Type for the selected document folder yet, you can click the "+" (plus sign) button at the top of the Labels UI to both create a Document Type and assign it to the document that is currently selected.

  1. Now that the Document Type has been set on the document, we have a couple of elements showing in our "LABELS" UI. However, we are not seeing any Data Fields or the Data Table objects.
  2. The eye icon at the top of the "Labels" tab hides all fields that do not have a label set on them by default. Click the eye button to show these fields.

  1. Now all objects are showing in the "LABELS" UI.


Collect Field Labels

Now that this document has been classified (assigned a Document Type from our Content Model), we can collect labels for its Document Type. This can be done in one of three ways:

  1. Lassoing the label in the "Document Viewer".
  2. Double-clicking the label in the Document Viewer.
  3. Typing the label in manually.
Going forward, this tutorial presumes you have obtained machine readable text from these documents, either OCR'd text or native text, via the Recognize activity.

Generally the quickest way is by simply lassoing the label in the "Document Viewer".

  1. Select the Data Element whose label you wish to collect.
    • Here, we are selecting the "Invoice Number" Data Field.
  2. Click the "Rubberband Label" button.
  3. With your cursor, lasso around the text label on the document.

  1. Upon lassoing the label in the Document Viewer, the OCR'd or native text behind the selected region will be used to populate the Data Element's label.
    • At this point, the label for the "Invoice Number" Data Field is now "Invoice Number" because that's the text data we selected. Whatever text characters you lasso with your cursor will be assigned as the label.

If you choose, you may also manually enter a label for a Data Element by simply typing it into the text box.

  1. Here we've selected the "Purchase Order Number" Data Field and entered "PO Number".
  2. This will correspond to the label "PO Number" on the document itself.

  1. If you type in the label incorrectly with typos and the label is unable to match anything on the document, an error icon will appear next to the label rather than a check mark.

An error icon will appear any time the text in the label field does not match any OCR'd or native text on the document. This can be due to a typo in the label or an OCR error.

  1. Continue lassoing or manually entering labels until all are collected.
  2. Next, we will focus on collecting labels from tables and table columns (the Data Table and Data Column elements in a Data Model). The process is essentially the same, but bears some extra explanation.


Collect Table and Column Labels

Table and column labels can be used for tabular data extraction as well, by setting a Data Table object to use the Tabular Layout Extract Method.

When collecting labels for this method of table extraction, keep in mind you must collect the individual column headers, and may optionally collect the full row of column header labels as well.

While it is optional, it is generally regarded as best practice to capture the full row of column header labels. This will generally increase the accuracy of your column label extraction. We will do both in this tutorial.

  1. We will collect the full row of column header labels for the Data Table object's label.
  2. We will collect each individual column header label for each individual Data Column object's label.

This may seem like you are duplicating your efforts but it is often critical to do both in order for the Tabular Layout Extract Method to map the table's structure and ultimately collect the table's data.

  • In particular, if you are dealing with OCR text that contains character recognition errors, establishing the full header row for the table will boost the fuzzy matching capabilities of the Labeling Behavior.

  1. To collect the Data Table's label, select the Data Table object in the Labels tab.
    • Here, we've selected the Data Table named "Line Items".
  2. Lasso the entire header row for the table.
    • You may notice there are more columns on this table than we are collecting. As it is on the document, the table has six columns. But we're only collecting four, the "Quantity", "Description", "Unit Price", and "Line Total" Data Columns.
    • Generally, you should collect the whole row of column headers, even if there are extra columns whose data you are not collecting.

  1. Next, collect each child Data Column's header label.
    • Here, we've selected the "Quantity" Data Column.
  2. Lasso the individual column header for the selected Data Column.
    • Here, the stacked label, "Qty. Ord.".

  1. Continue collecting labels for the remaining Data Columns.
  2. We have four Data Columns for this Data Table. Therefore, we collect four header labels from the document.
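The relationship between the Data Table's label (the full header row) and the Data Column labels can be pictured with a short sketch. This is an illustration only and makes no claim about how Tabular Layout maps columns internally. The header row text and every column label other than "Qty. Ord." are hypothetical, and the stacked label is flattened onto one line for simplicity.

```python
from typing import Optional

# Hypothetical header row with six columns; only four are collected below.
HEADER_ROW = "Qty. Ord. Item No. Description Unit Price Disc. Line Total"

# Data Column -> collected header label (all but "Qty. Ord." are assumed).
COLUMN_LABELS = {
    "Quantity": "Qty. Ord.",
    "Description": "Description",
    "Unit Price": "Unit Price",
    "Line Total": "Line Total",
}


def locate_columns(
    header_row: str, column_labels: dict[str, str]
) -> dict[str, Optional[int]]:
    """Return each column label's character offset in the header row.

    A value of None means the label was not found in the header row.
    """
    offsets = {}
    for column, label in column_labels.items():
        index = header_row.find(label)
        offsets[column] = index if index >= 0 else None
    return offsets


offsets = locate_columns(HEADER_ROW, COLUMN_LABELS)
left_to_right = sorted(offsets, key=lambda column: offsets[column])
```

The uncollected columns ("Item No." and "Disc." here) stay in the header row text. That is why the full row is collected even when only some columns are extracted.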


Auto Map Labels

As you add labels for each Document Type, you may find some documents have labels in common. For example, there are only so many ways to label an invoice number. It might be "Invoice Number", "Invoice No", "Invoice #" or even just "Invoice". Some invoices are going to use one label, others another.

When collecting labels for multiple Document Types you can use the "Auto Map" feature to automatically add labels you've previously collected on another Document Type.

  1. So far, we've only collected labels for one Document Type, the "Factura" Document Type.
  2. Now, we're collecting labels for the "Lasku" Document Type.
  3. Press the "Auto Map" button to automatically assign previously collected labels.

Grooper will search the document's text for labels matching those previously collected on other Document Types.

  1. For example, we collected the label "Remit To:" for the "Remit Address" Data Field for the "Factura" Document Type. The "Auto Map" feature found a match for this label on the document and assigned the "Lasku" Document Type's "Remit Address" Data Field the same label.

If a match is not found, the Data Element's label is left blank.

  1. For example, the label for the "Invoice Amount" Data Field for the "Factura" Document Type was "Amount due".
  2. This label was nowhere to be found on this document. The invoice amount is labeled "Total" on the "Lasku" documents. So, the label is left blank for you to collect.

As you keep collecting labels for more and more Document Types, the Auto Map feature will pick up more and more labels, allowing you to quickly onboard new Document Types.
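The Auto Map behavior described above can be approximated as: for each Data Element, try the labels already collected on other Document Types and keep the first one found in the new document's text. The sketch below is an assumption about the general idea, not Grooper's implementation. The function name, the exact-substring matching and the sample text are illustrative.

```python
def auto_map(text: str, known_labels: dict[str, list[str]]) -> dict[str, str]:
    """Map each Data Element to the first previously collected label found in text.

    Elements with no matching label are mapped to an empty string,
    mirroring a label left blank for manual collection.
    """
    lowered = text.lower()
    mapped = {}
    for element, candidates in known_labels.items():
        mapped[element] = next(
            (label for label in candidates if label.lower() in lowered), ""
        )
    return mapped


# Labels previously collected for the "Factura" Document Type.
known = {
    "Remit Address": ["Remit To:"],
    "Invoice Amount": ["Amount due"],
}

# Hypothetical text from a "Lasku" document, which labels the amount "Total".
lasku_text = "Remit To: Lasku Oy\nTotal 99.00"
mapped = auto_map(lasku_text, known)
```

Once "Total" is collected manually for "Lasku", it would join the candidate list for "Invoice Amount". This is why the feature finds more labels as more Document Types are onboarded.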

Be aware, you may still need to validate the auto mapped values and make adjustments.

  1. For example, the label "Date" is very generic.
  2. This label does actually correspond to the invoice date on the "Lasku" Document Type in this case.
  3. However, that could label some other date on another Document Type. Even on this document, the label "Date" is returning the "Date" portion of "Ship Date" and other instances where "Date" is found in the text.
    • As a side note, there are ways to make simple labels like "Date" more specific to the data they pertain to using "Custom Labels". More on that in the next section.
  4. You can also make minor adjustments to the mapped labels.
    • The mapped label for the "Purchase Order Number" Data Field was "PO Number" (as it was collected for the "Factura" Document Type), but it is more specifically "PO Number:" on the "Lasku" documents. We can just add the colon at the end of the label manually.


Collect Custom Labels

It's important to keep in mind labels are collected for corresponding Data Elements in a Data Model. You collect one label per Data Element (Data Field, Data Section, Data Table or Data Column). What if you want to collect a label that is distinct from a Data Element, one that doesn't necessarily have to do with a value collected by your Data Model? And why would you even want to?

That's what "Custom Labels" are for. Custom labels serve two primary functions:

  1. Providing additional labels for classification purposes.
  2. Providing context labels when a Data Element's label matches multiple points on a document.

Custom Labels may only be added to Data Model, Data Section or Data Table objects' labels. Put another way, any Data Element in the Data Model's hierarchy that can have child Data Elements can have custom labels.

When used for classification purposes, custom labels are typically added to the Data Model itself.

  1. First select the Data Element in the Data Model's hierarchy to which you wish to add the label.
    • In our case, we're selecting the Data Model itself.
  2. Right-click either the Header or Footer tab.
  3. Press the "Add Custom Label..." button.
  4. The following dialogue box will appear.
  5. You may enter a name for the custom label, or use the default "Custom ##" naming convention.
  6. Press the "OK" button when finished.

  1. This will add a new label tab, named whatever you named it in the previous step.
    • Here, we kept with the default "Custom 01" name.
    • Notice the red "X" next to the name "Custom 01" as well. This indicates the label is not matching anything on the document. Currently the label is "Custom 01", which doesn't appear anywhere on the document. We need to change that by collecting a new label.
  2. Collect the custom label by either lassoing the text using the Document Viewer or manually typing in the label.
    • For example, the word "Invoice" might be a useful label for classification purposes. This label isn't used to collect anything in our Data Model, but might be helpful to identify this and other invoices from the Factura Technology Corp as "Factura" Document Types. Collecting the label "Invoice" as a Custom Label will allow us to use it as a feature of this Document Type for classification.

You may add more Custom Labels to the selected Data Element by repeating the process described above.

  1. Right-click any of the label tabs.
  2. Add a new label with the "Add Custom Label..." button.

Custom Labels as Context Labels

Some labels are more specific than others. The label "Invoice Date" is more specific than the label "Date". If you see the label "Invoice Date" you know the date you're looking at is the date the invoice was generated. The label "Date" may refer to the invoice's generation date or it could be part of another label like "Due Date". However, some invoice formats will label the invoice date as simply "Date".

  1. For example, the label "Date" on this "Factura" Document Type does indeed correspond to the invoice date for the "Invoice Date" Data Field.
  2. However, this label pops up as part of other labels too, such as the "Date" in "Due Date" or "Order Date".

This can present a challenge for data extraction. The possibilities for false-positive results tend to crop up the more generic the label used to identify a desired value. There are three separate date values identified by the word "Date" (in full or in part) on this document.

This is the second reason Custom Labels are typically added for a Document Type, to provide extra context for generic labels, especially when they produce multiple results on a document, leading to false-positive data extraction.

There are two steps to adding and using a Custom Label for this purpose:

  1. Add the Custom Label.
  2. Marry the Custom Label with the Data Element's label.

We will refer to this type of a Custom Label as a "Context Label" from here out.

The only "trick" to this is adding the Context Label to the appropriate level of the Data Model's hierarchy.

Remember, a Custom Label may only be added to a Data Model, Data Section or Data Table object. We cannot add a Custom Label to a Data Field, such as the "Invoice Number" Data Field.

To add a Context Label a Data Field can use, we must add the Custom Label to its direct parent Data Element.

  1. In the case of the "Invoice Date" Data Field its direct parent Data Element is the Data Model itself.
  2. Right-click the "Header" or "Footer" tab and select "Add Custom Label..." to add the Custom Label.

  1. The Custom Label we added was "Date Page".
  2. This will provide the simple label "Date" some extra context.
    • Which of the three results for the label "Date" do we want to accept? The one falling within this zone.

Now that we've added the label, we need to marry the Custom Label with the Data Field it is giving extra context to. This is done with the Parent property of a Data Field label.

  1. In our case, the Custom Label provides extra context for the "Invoice Date" Data Field's label. We've selected the "Invoice Date" Data Field.
  2. Select the Parent property.
    • Note: This property is only present for Data Field and Data Column labels.
  3. Using the drop down list, select the Custom Label you wish to use for the Context Label.

  1. Notice with this Context Label added...
  2. ...We only return a single result for the "Invoice Date" Data Field's label "Date". This is the label we want to associate with this Data Field.
  3. The other two results do not fall within the Context Label, and are no longer returned.
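The filtering effect of a Context Label can be sketched as follows. Character offsets in a text string stand in for regions on the page, which is a simplification, and the sample text is hypothetical. Grooper's actual handling of the Parent property may differ.

```python
import re


def find_spans(text: str, label: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of every occurrence of label in text."""
    return [found.span() for found in re.finditer(re.escape(label), text)]


def matches_within_context(
    text: str, label: str, context_label: str
) -> list[tuple[int, int]]:
    """Keep only the label matches that fall inside a context label match."""
    contexts = find_spans(text, context_label)
    return [
        (start, end)
        for start, end in find_spans(text, label)
        if any(c_start <= start and end <= c_end for c_start, c_end in contexts)
    ]


# Hypothetical document text with three occurrences of "Date".
text = (
    "Date Page\n09/28/2023 1\n"
    "Order Date 09/01/2023\n"
    "Due Date 10/28/2023"
)
all_matches = find_spans(text, "Date")
kept = matches_within_context(text, "Date", "Date Page")
```

On its own, "Date" matches three times. Requiring the match to sit inside the "Date Page" context leaves only the one tied to the invoice date.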



  1. To collect a Footer label, navigate to the "Footer" tab.
  2. Collect the Data Table's Footer label.
    • In our case, we collected the text "C. Services Borrower Did Shop For" as our Footer label.
  3. Don't forget to "Save" when finished.

That's it! That's all you need to do to establish the table's header and footer. There is no need to collect labels for the Data Columns. Collecting labels for Data Columns is only necessary for the Tabular Layout method.

I will repeat. The Row Match method will only utilize the Data Table's labels. If you collect labels for the Data Columns and you're using the Row Match method, they will do nothing as far as table extraction goes.

Enable Label Sets

The only thing left to do is "tell" the Row Match method you want to use the Header and Footer labels. This is done by enabling the Use Labelset property.

  1. Navigate to the Data Table object in the Node Tree Viewer.
  2. Expand the Row Match sub-properties.
  3. Expand the Options properties.
  4. Change the Use Labelset property from False to True.

Test Results

With the labels collected and the Use Labelset property enabled, our Data Table will properly collect the rows we want from this table.

  1. Upon testing extraction of the selected document folder...
  2. Our Row Extractor collects the desired rows.
  3. Only rows coming after the Header label and before the Footer label are returned.
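The header/footer constraint amounts to keeping only the lines that fall between the two labels. A minimal sketch, assuming one text line per candidate row: the footer text comes from this tutorial, while the header text and the row contents are hypothetical.

```python
def rows_between(lines: list[str], header: str, footer: str) -> list[str]:
    """Return the lines after the header label and before the footer label."""
    inside = False
    rows = []
    for line in lines:
        if not inside:
            # Skip everything up to and including the header line.
            if header in line:
                inside = True
            continue
        if footer in line:
            break
        rows.append(line)
    return rows


HEADER = "B. Services Borrower Did Not Shop For"  # hypothetical Header label
FOOTER = "C. Services Borrower Did Shop For"      # Footer label from this tutorial

lines = [
    "Loan Costs",
    HEADER,
    "01 Appraisal Fee $405.00",
    "02 Credit Report Fee $30.00",
    FOOTER,
    "01 Pest Inspection Fee $120.50",
]
rows = rows_between(lines, HEADER, FOOTER)
```

The last line looks like a valid row, and a Row Extractor alone would match it. The footer label is what keeps it out of this table.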



The Fluid Layout Method

The Fluid Layout table extraction method is designed to switch between the Tabular Layout method and the Row Match method, depending on how a Data Table's labels are configured. So, if you have a varied set of documents where Tabular Layout works well for some Document Types and Row Match works well for other Document Types, you may be able to use Fluid Layout for all of them, avoiding the need for Data Element Overrides.

Label Sets must be collected to use the Fluid Layout method. Each Document Type will use either Tabular Layout or a Row Extractor to collect table data depending on how the labels for a Data Table are collected. Therefore, you cannot utilize the Fluid Layout method without a Labeling Behavior enabled.

The Fluid Layout table extraction method is not only "Label Set aware", it is Label Set dependent.

For example, take these two versions of code descriptions from an EOB form.

Version 1 is clearly a table. It uses the labels "CODE" and "DESCRIPTION" to delineate between each column. The Tabular Layout table extraction method would handily extract this information, returning everything in the "CODE" column to one Data Column and everything in the "DESCRIPTION" column to another.

Version 2 is not exactly a table, but a Data Table could still use the Row Match method to form a table from this list of information. Each bulleted item in the list could be returned as a table row. The code could be filtered into one Data Column and the description could be filtered into another.

You could not use the Tabular Layout method for this "table". There are no column labels present.

  • There is, however, a header label for the whole table "Code", which will be important for the Fluid Layout method.

So, we have a situation where the Tabular Layout or the Row Match method is preferable, depending on the document's layout. Next, we will review how to configure the Fluid Layout table extraction method to target both table structures.

Collect Labels

The first thing you will want to do is collect labels for your Data Table for each document type. How the labels are collected will determine which table extraction method the Fluid Layout method executes.

  • To execute the Tabular Layout method, the Data Table's Data Column Header labels must be collected.
    • Optionally, you may choose to collect a Header and/or Footer label for the Data Table.
  • To execute the Row Match method (also referred to as the Flow Layout), you must collect the Data Table's Header label. You may NOT collect labels for the Data Table's Data Columns.
    • This is how Grooper determines which extraction method is used for each Document Type. If Data Column labels are present, the Tabular Layout configuration is used. If no Data Column labels are present, but the Data Table's Header label is present, the Flow Layout (i.e. Row Match) configuration is used.
    • Optionally, you may choose to collect a Footer label for the Data Table.
  1. We will use this Content Model named "The Fluid Layout Method - Model" for this exercise.
    • Its Labeling Behavior has already been enabled.
  2. We have navigated to the "Labels" tab to start collecting labels.
  3. We have selected this Batch named "The Fluid Layout - Test Batch".
  4. Notice we have two Document Types:
    • "V1 - Tabular Layout" will correspond to the document whose code description is in a proper table with column headers for the "CODE" and "DESCRIPTION" columns.
    • "V2 - Row Match" will correspond to the document whose code description is in a bulleted list.
  5. The two document folders in the Batch have already been assigned the appropriate Document Type.

For Tabular Layout Document Types

The "V1" Document Type will utilize the Fluid Layout method's Tabular Layout configuration. To execute the Tabular Layout configuration, much like executing the Tabular Layout table extraction method in general, Data Column labels must be collected.

  1. We have selected the "V1 - Tabular Layout" document folder in the Batch.
  2. The Header labels for the "Code" and "Description" Data Columns have been collected.
    • CODE for the "Code" Data Column
    • DESCRIPTION for the "Description" Data Column

FYI

Just as is the case with the Tabular Layout table extraction method as a "stand alone" extraction method, when used with Fluid Layout, collecting a Data Table's Header label is optional.

That said, it is still generally considered best practice to collect a row of header labels using the Data Table's Header label, if possible.

  1. Here, we've collected a Header Label for the "Code Remarks" Data Table.

For Flow Layout Document Types

The "V2" Document Type will utilize the Fluid Layout method's Flow Layout configuration. This will utilize the Row Match method to return table data. To execute the Flow Layout configuration, ONLY the Data Table's Header label must be collected.

  1. We have selected the "V2 - Row Match" document folder in the Batch.
  2. The Header label for the "Code Remarks" Data Table has been collected.
  3. DO NOT collect Data Column labels for Document Types that should use the Row Match method to collect data.

Configure Fluid Layout

Now that the labels are collected for our Document Types we can configure the Fluid Layout extraction method for our Data Table.

  1. Select the Data Table in the Node Tree Viewer.
    • We've selected our "Code Remarks" Data Table.
  2. Select the Extract Method property.
  3. Using the dropdown menu, select Fluid Layout.

Expanding the Fluid Layout sub-properties, you can see there are two Layout configurations.

  1. The Tabular Layout configuration will be applied to Document Types whose Data Column labels have been collected.
  2. The Flow Layout configuration will be applied to Document Types whose Data Column labels have NOT been collected, as long as the Data Table's Header label is present.
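The selection rule described in this section can be written as a small decision function. This is a sketch of the rule as stated in this article, not Grooper's source code. The function name and the dictionary representation of collected labels are assumptions.

```python
from typing import Optional


def choose_layout(
    table_header_label: Optional[str], column_labels: dict[str, str]
) -> Optional[str]:
    """Pick the Fluid Layout configuration from the labels collected.

    column_labels maps Data Column names to collected labels; a blank
    string means no label was collected for that column.
    """
    if any(column_labels.values()):
        # Any Data Column label present: use the Tabular Layout configuration.
        return "Tabular Layout"
    if table_header_label:
        # No column labels, but a Data Table Header label: use Flow Layout.
        return "Flow Layout"
    # Nothing collected: neither configuration applies.
    return None


# "V1 - Tabular Layout": column labels collected, table Header label optional.
v1 = choose_layout(None, {"Code": "CODE", "Description": "DESCRIPTION"})

# "V2 - Row Match": only the Data Table's Header label collected.
v2 = choose_layout("Code", {"Code": "", "Description": ""})
```

Because column labels are checked first, collecting even one Data Column label on a "V2"-style Document Type would push it to Tabular Layout. This is why those labels must be left blank there.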

By expanding the Tabular Layout and Flow Layout properties, you can see their property panels are identical to the Tabular Layout and Row Match table extraction methods respectively.

  1. The properties you see here are the same set of properties you configure for the Tabular Layout method.
  2. The properties you see here are the same set of properties you configure for the Row Match method.

All that's left is to configure extraction logic for each of the Layouts.

Configure Flow Layout

The Flow Layout configuration extracts table data using the Row Match method. What do you need in order for Row Match to collect table data? A Row Extractor.

  1. In our Local Resources folder, we already have a Data Type that will return rows properly for our "V2" Document Type.
  2. Using the Row Extractor property, we've referenced the aforementioned Data Type.

For our purposes, that's all we need to do. For the "V2 - Row Match" Document Types, this extractor will properly return each row and collect each column's data. We have no need to configure any of the other Row Match properties.

Configure Tabular Layout

The Tabular Layout configuration extracts table data using the Tabular Layout method. What do you need in order for Tabular Layout to collect table data? At least one Data Column's Value Extractor must be configured in order to detect each row in the table.

  1. We've selected the "Code" Data Column to configure.
  2. For its Value Extractor we've used a Pattern Match extractor.
  3. The Pattern Match extractor's Value Pattern is set to the regex \w+ and its Prefix Pattern is set to the regex \n.
  4. This will return one result for each row of the "CODE" column, effectively detecting all four rows present.
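
To see why this pattern pair yields one hit per row, the sketch below approximates it with Python's `re` module. Grooper's Pattern Match extractor is not Python, so treating the Prefix Pattern as a lookbehind is an assumption made for illustration, and the sample page text is made up.

```python
import re

# Hypothetical page text: a header line followed by four table rows.
page_text = (
    "CODE  REMARKS\n"
    "A1  First remark\n"
    "B2  Second remark\n"
    "C3  Third remark\n"
    "D4  Fourth remark"
)

# Value Pattern \w+ preceded by Prefix Pattern \n, approximated with a lookbehind.
row_anchor = re.compile(r"(?<=\n)\w+")

codes = row_anchor.findall(page_text)
print(codes)  # ['A1', 'B2', 'C3', 'D4']
```

Only the first word after each line break is returned, which is why the header word on the first line of this sample is not matched.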

This is a fairly simple table with only two columns. Just configuring one Data Column's Value Extractor will be sufficient for our needs.

  1. If you need to configure any additional Tabular Layout settings, you can do so by selecting the Data Table in the Node Tree Viewer.
  2. Expand out the Tabular Layout properties and configure them as needed.
    • Again, this is a simple table with simple configuration needs. These default property configurations should be adequate to collect table data for the "V1 - Tabular Layout" Document Types.

Test Extraction

Now that extraction is configured for both the Tabular Layout and Flow Layout, Grooper will switch between the Tabular Layout table extraction method and the Row Match table extraction method, depending on the Document Type.

For the "V1 - Tabular Layout" Document Type, Data Column labels were collected.

Therefore Grooper extracts the table using the Tabular Layout configuration.

For the "V2 - Row Match" Document Type, only the Data Table's Header label was collected, and no Data Column labels were collected.

Therefore, Grooper extracts the table using the Flow Layout configuration (using the Row Match method).

Use Label Sets for Sectional Extraction

There are two Label Set aware Extract Methods for Data Sections.

  1. Transaction Detection
  2. Nested Table

The Transaction Detection method will be most applicable to the majority of use cases wanting to use labels to produce section instances. If you simply want to produce a section starting at a header label and ending at a footer label, the Transaction Detection method is what you want. However, this configuration of Transaction Detection is quite different from how it normally produces sections. We will go over how Transaction Detection establishes section instances both with and without Label Sets.

The Nested Table method is a much more niche section extraction method. It produces section instances using repeating tables, nested within each section. This can be a highly effective way to target sections for certain use cases, such as medical EOB (explanation of benefits) forms.

Label Sets and the Transaction Detection Method

About Transaction Detection

The Transaction Detection section extraction method is useful for semi-structured documents which have multiple sections which are themselves very structured, repeating the same (or at least very similar) field or table data.

For example, take this monthly tax reporting form from the Oklahoma Tax Commission.

There are five sections of information on this document, listed as "A", "B", "C", "D" and "E". Each of these sections collects the exact same set of data:

  1. A "Production Unit Number" assigned to an oil or natural gas well.
  2. A "Purchaser/Producers Report Number"
  3. The "Gross Volume" of oil or natural gas produced
  4. The "Gross Value" dollar amount of oil or natural gas produced
  5. The "Qualifying Tax Rate" ultimately used to calculate the tax due for the well's production.
  6. And so on.

The Transaction Detection method looks for periodic similarity (also referred to as "periodicity") to sub-divide a document into multiple section instances.

For structured information like this, one way you can define where each section starts and stops is just by the patterns in the fields themselves. These values are periodic. They appear at set intervals, marking the boundaries of each section.

For example,

  1. The "Production Unit Number" is always found at the start of the section.
  2. The "Exempt Volume" is always found somewhere in the middle of the section.
  3. The "Petroleum Excise Tax Due" is always found at the end.

The Transaction Detection method detects the periodic patterns in these values to divide up the document into sections, forming one section instance from each periodic pattern of data detected. Part of how the Transaction Detection detects these patterns is by using extractors configured in the Data Section's child Data Field objects. These are called Binding Fields.

Grooper uses the results matched by these Data Fields to detect each periodic section. For example, you might have a "Production Unit Number" Data Field for these section that returns five results, one for each section. Once these five results are established, Grooper will look for other patterns around these results to establish the boundaries of each of the five sections.

The Transaction Detection method also analyzes a document's text line-by-line looking for repeated lines that are highly similar to each other.

For example, each of the yellow highlighted lines is extremely similar. They are essentially identical except for the starting character on each line (either "A", "B", "C", "D" or "E"). This repeated pattern is a good indication that we have a set of repeated (or "periodic") sections of information.

Furthermore, the next lines, highlighted in blue, are also similar as long as you normalize the data a bit. If you replace the specific number with just a "#" symbol, they too are nearly identical.

The Transaction Detection method will further go line-by-line comparing the text on each one to subsequent lines, looking for repeating patterns. Such is the case for the rest of the green highlighted lines. Even accounting for OCR errors, each line is similar enough to detect a pattern. We have 5 sets of very similar lines of text. We have ultimately 5 section instances returned for the Data Section.

Eventually, Grooper will detect a line that does not fit the pattern. The red highlighted line is totally dissimilar from the set of similar lines detected previously. This is where Grooper "knows" to stop. Because it does not fit the periodic pattern, this line marks a stopping point. Its text is left out of the last section instance and, with no further lines matching the detected periodic pattern, no further section instances are returned.

The Transaction Detection method is not going to work well for every use case. It succeeds best where most of the data in the section is numerical in nature.

It's easy to normalize numeric data. Any individual digit can be swapped for a "#" symbol. A currency value on a line of text could be $988,000.00 in one section and $112,433.00 in another, but when comparing the lines for periodic similarity (also referred to as "periodicity"), both can be normalized as "$###,###.##". Lexical data tends to be trickier. How do you normalize a name, for example? How do you differentiate a name from a field label? You can do it with a variety of extraction techniques, but not using this line-by-line approach to determining how similar one line is to another.
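
The normalization idea can be sketched in a few lines of Python. This is not how Grooper computes periodicity internally; `difflib.SequenceMatcher` is used here only as a convenient stand-in similarity measure, and the two sample lines are hypothetical.

```python
import re
from difflib import SequenceMatcher


def normalize(line: str) -> str:
    """Replace every digit with '#' so only the shape of the line remains."""
    return re.sub(r"\d", "#", line)


def similarity(a: str, b: str) -> float:
    """Similarity ratio (0.0 to 1.0) between two normalized lines."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


print(normalize("$988,000.00"))  # $###,###.##
print(normalize("$112,433.00"))  # $###,###.##

# Hypothetical lines of text taken from two sections of the same form.
line_a = "A 10. Gross Volume 1,250 11. Gross Value $988,000.00"
line_b = "B 10. Gross Volume 3,875 11. Gross Value $112,433.00"
print(similarity(line_a, line_b) > 0.9)  # True
```

After normalization the two lines differ only in their leading letter, so they score as nearly identical. A line of free text, such as a name, has no comparable shape to fall back on.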

This is precisely why it's called "Transaction" Detection. It works best with transactional data, which tends to be currency, quantity or otherwise numerical values. Indeed, this method was specifically designed for EOB (Explanation of Benefits) form processing and medical provider payment automation in general.

FYI

What does this have to do with Labeling Behavior and Label Sets?

We're getting there. Ultimately, Transaction Detection is "Label Set aware" and can take advantage of collected Header and Footer labels for a Data Section object. However, collecting labels for the Data Section will quite dramatically change how Transaction Detection works.

It is best to understand how this sectioning method works without Label Sets before we delve into how it works with them.

Configuring Transaction Detection with Binding Fields

Without utilizing Label Sets, the Transaction Detection sectioning method must assign at least one Binding Field in order to detect the periodic similarity among lines of text in a document, ultimately forming the Data Section's section instances.

  1. For this example, we will end up configuring the "Production Info" Data Section of this Data Model.
  2. We will utilize the "Production Unit Number" as the Binding Field.
  3. This Data Field utilizes a simple Pattern Match extractor for its Value Extractor assignment.
    • Which returns the production unit numbers on the document using a simple pattern \d{3}-\d{6}-\d-\d{3}
  4. Importantly, notice this returns five result candidates (when testing extraction at the Data Field level in the Node Tree).
    • This will be important because we want to end up creating five section instances. If you expect to return five section instances, your Binding Field's extractor (or Binding Fields' extractors, if using more than one) will need to return five results.
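
A quick way to sanity-check a Binding Field pattern outside of Grooper is to count its matches against sample text. The sketch below uses Python's `re` module with the pattern quoted above. The document text and the production unit numbers in it are invented for illustration.

```python
import re

# The Value Pattern quoted in the steps above.
unit_number = re.compile(r"\d{3}-\d{6}-\d-\d{3}")

# Hypothetical production unit numbers, one per section "A" through "E".
sample_numbers = [
    "017-123456-0-000",
    "017-123457-0-000",
    "051-000321-1-002",
    "051-000322-1-002",
    "093-765432-0-001",
]
document_text = "\n".join(
    f"{letter} 8. Production Unit Number {number}"
    for letter, number in zip("ABCDE", sample_numbers)
)

candidates = unit_number.findall(document_text)
print(len(candidates))  # 5
```

If the count does not equal the number of sections you expect, the pattern is either too loose or too strict to serve as a Binding Field on its own.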

Next, we will configure the "Production Info" Data Section to create section instances using the Transaction Detection method.

  1. Select the Data Section in the Node Tree.
  2. Using the Extract Method property, select Transaction Detection.
  3. Select the Binding Fields property.
  4. Using the dropdown menu, select which Data Fields in the Data Section should be used as Binding Fields by checking the box next to the Data Field.
    • Here, we have selected the "Production Unit Number" Data Field.

For this example, all we need to do is assign this single Data Field as a Binding Field. There is enough similarity between the repeating sections that nothing more is required. For more complicated situations, you may need multiple Binding Fields. Just be sure all Binding Fields are present in each section. Do not use "optional" Data Fields as Binding Fields.

The Transaction Detection method will then go through the line-by-line comparison process around the Binding Fields to detect periodic similarities to produce section instances.

  1. How Grooper goes about detecting these periodic patterns is controlled by the Periodicity Detection set of properties.
  2. In our case, five section instances were established, one for each result from the "Production Unit Number" Data Field's Value Extractor.

FYI

  1. If you need to troubleshoot the Transaction Detection method's results, the "Diagnostics" tab can give you additional information as to how Grooper detected these repeating patterns in the document's text data.
  2. You will find the following reports for the Data Section:
    • Execution Log
    • Preprocessed Document
    • Labels
    • Periodicity Matrix

Configuring Transaction Detection with Label Sets

Now that we understand the basics of the Transaction Detection method, we can look at how this sectioning method interacts with the Labeling Behavior. Its behavior is wildly different if a Header label is collected for the Data Section. Assuming you can collect a Header label for the Data Section, it is so different that a Binding Field is not even necessary to produce the section instances.

Establishing the section instances is almost as simple as...

  1. Start the section instance at the Header label.
  2. Stop the section instance at the next Header label (or Footer label)
  3. Repeat for every Header label found on the document.
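
The three steps above can be modeled as a simple line grouping routine. This Python sketch is an illustration of the idea only, not Grooper's code: `split_sections` is a hypothetical helper, the sample lines are invented, and leaving the footer line out of the section is an assumption.

```python
from typing import List, Optional


def split_sections(lines: List[str], header: str,
                   footer: Optional[str] = None) -> List[List[str]]:
    """Start a section at each header line; end it before the next header or footer."""
    sections: List[List[str]] = []
    current: Optional[List[str]] = None
    for line in lines:
        if header in line:
            if current is not None:
                sections.append(current)  # close the previous section
            current = [line]              # the header line opens a new one
        elif footer is not None and footer in line:
            if current is not None:
                sections.append(current)  # the footer closes the open section
            current = None
        elif current is not None:
            current.append(line)          # ordinary line inside a section
    if current is not None:
        sections.append(current)
    return sections


lines = [
    "Form title",
    "A 8. Production Unit Number 9. Purchasers/Producers Report Number",
    "  017-123456-0-000   12345",
    "B 8. Production Unit Number 9. Purchasers/Producers Report Number",
    "  017-123457-0-000   12346",
    "Grand total",
]
sections = split_sections(lines, header="8. Production Unit Number",
                          footer="Grand total")
print(len(sections))  # 2
```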

For example, we have collected a Header label for the "Production Info" Data Section here.

  1. To add the label, we've selected the Content Model in the Node Tree.
  2. We've navigated to the "Labels" tab.
  3. We've selected the document in the Batch classified as the "OTC Form 300" Document Type.
    • In other words, this is the Label Set for the "OTC Form 300" Document Type.
  4. We've selected the Data Section in the Data Model.
  5. For the Header label, we've captured the first line of field labels.
    • 8. Production Unit Number 9. Purchasers/Producers Report Number 10. Gross Volume 11. Gross Value
  6. Notice we have five hits for this label, one at the start of each section.

Next, we will configure the "Production Info" Data Section to create section instances using the Transaction Detection method.

  1. Select the Data Section in the Node Tree.
  2. Using the Extract Method property, select Transaction Detection.
  3. Notice no Binding Fields are selected.
  4. But we still get the five section instances returned!
    • In fact, for this example, no further configuration was required other than collecting the Data Section's Header label and setting the Extract Method to Transaction Detection.
  5. The section instance starts on the line containing the Header label.
  6. And it ends on the line before the next Header label.
    • Then the second section instance starts at the second header and so on.

Label Sets and the Nested Table Method


The Nested Table Data Section Extract Method was specifically designed for a particular combination of sectional and tabular data found on medical EOB (Explanation of Benefits) forms, although it may have applications in other areas. These documents will often be broken up into sections of claims with information like a claim number and a patient's name, followed by a table of information pertaining to services rendered, and ending with some kind of footer, usually a total line adding up amounts in the table.

One way you can often identify where these claim sections start and stop is by the tables themselves. Essentially you'll have one table per claim. Or in Grooper's case, one Data Table instance per Data Section instance.

The Nested Table sectioning method takes advantage of these kinds of structures, utilizing a Document Type's Label Set to do so.

The Nested Table method has two hard requirements:

  1. The Data Section must have a child Data Table object.
    • Furthermore, the Nested Table Data Section Extract Method was designed to work best when its child Data Table uses the Tabular Layout Extract Method.
    • While it is possible for this Data Table to use any tabular extraction method, the Tabular Layout method is preferred.
  2. The section must have a "footer" able to be captured as the Data Section's Footer label.

Set Data Section's Extract Method to Nested Table

The Nested Table method is a little different in that it is a sectional extraction method but also involves tabular data. Ultimately, both a Data Section object and a Data Table object are required for it to work. However, it is primarily a method of breaking up a document into multiple sections for data extraction purposes. As such, it is a Data Section extraction method.

  1. Select a Data Section in the Node Tree.
  2. Set the Extract Method property to Nested Table.
  3. Using the Table property select the child Data Table to be used to establish the repeated sections in the document.
  4. In our case, we've selected the highlighted child Data Table named "Service Line Info".
    • This Data Table's extraction results, combined with the Data Section's Footer label we will collect, will form the multiple section instances for the Data Section.
    • The Data Table must be a direct child of the Data Section.

Configure the Data Table

The Data Table should be configured to collect all table rows for the full document. When configuring the Data Table and testing its results, just ensure the table accurately extracts the full document as a single table. The Data Section (using the Nested Table method) will take care of breaking out the rows into the appropriate sections.

It is considered best practice for the child Data Table to use the Tabular Layout method.

The Nested Table method was designed specifically with a Data Table using the Tabular Layout table extraction method in mind. While it is technically possible to use other table extraction methods, you will achieve the best results when the Data Table uses the Tabular Layout method.

  1. Select the Data Section's child Data Table.
    • If you have more than one child Data Table, make sure you select the Data Table referenced by the Nested Table method's Table property.
  2. Assign the Data Table's Extract Method and configure its extraction.
    • In our case we have enabled the Tabular Layout method, having already collected labels for the Data Table and its child Data Columns for the "Astro Insurance" Document Type.
  3. Press the "Test Extraction" button to ensure the Data Table collects all rows on the document.
  4. Note, we get one big table since we're testing on the Data Table object of our Data Model's hierarchy.
  5. Even though we technically have three distinct tables on the document in three sections, one for each patient.
    • This is good for now. This is what we want to verify, that we get the correct data populated for every table row. Sectioning out the document and placing each table in its own section will be performed by the parent Data Section using the Nested Table method.

FYI The general guidance for testing the child Data Table's extraction results is to verify every row on the document is returned as a single table.

However, there are some situations where you may not return every row but still get Nested Table to section the document appropriately and ultimately return each table to each established section.

You may run into this if your table uses a Footer label or extractor. To illustrate this, we added a Footer label for this Data Table, using the text label "TOTALS" at the bottom of each table.

  1. Notice we only return the first three rows for the first table on this document where before we were returning every table row for every table.
  2. That is because the Data Table stopped looking for rows once it encountered the Footer label TOTALS
  3. However, because that footer is present at the end of each table (which will ultimately inform the Nested Table method how to establish sections), when the Data Section executes, creating sections for this document, each subsequent table will actually successfully extract and populate the sections as seen in the subsequent tabs.

To sum up, in general, make sure your Data Table extracts every row for the whole document when testing your Data Table's configuration. However, if you have repeating footers, you may only be able to verify that the first section's table rows populate correctly. You will be able to verify that the rest of the document's tables extract correctly when you execute the Data Section's extraction.

Add the Data Section's Footer Label

In order for the Nested Table method to properly section out the document, you must assign a Footer label in the Document Type's Label Set for the Data Section. This will give Grooper an awareness of where the section should stop (ultimately allowing another section to start). In our case, we can use the text label "Totals". At the end of every table/section there is this "Totals" line totaling up various columns in the table. Since this label is present at the end of every section, we can collect it as the Data Section's Footer label, which the Nested Table method will then use to establish where each section instance ends.

  1. To collect the Footer label, first navigate to the Content Model in the Node Tree.
  2. Select the "Labels" tab.
  3. Select the Data Section object in the Labels collection UI.
  4. Select the Footer tab.
  5. Collect the label.
    • See the Collect Label Sets section of this article if you need more information on how to collect labels.
The Footer label must be collected for the Data Section for the Nested Table method.

Even if the Data Table object collects and uses a Footer label for its own needs, the Data Section must have a Footer label defined as well (even if it's the exact same label).
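
To picture how a repeating footer lets one big table be divided into per-section tables, consider the simplified Python model below. The article does not describe Grooper's internal algorithm, so this is an assumption-laden sketch: `group_rows_by_footer` is a hypothetical helper, and the row texts and line numbers are invented.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple


def group_rows_by_footer(rows: List[Tuple[int, str]],
                         footer_lines: List[int]) -> Dict[int, List[str]]:
    """Assign each table row to the first footer line that follows it.

    rows: (line_number, row_text) pairs extracted as one big table.
    footer_lines: sorted line numbers where the Footer label was found.
    """
    sections: Dict[int, List[str]] = {i: [] for i in range(len(footer_lines))}
    for line_number, text in rows:
        index = bisect_left(footer_lines, line_number)
        if index < len(footer_lines):  # rows after the last footer are ignored
            sections[index].append(text)
    return sections


# Hypothetical rows from a document with three claims and three "Totals" lines.
rows = [(3, "row 1"), (4, "row 2"), (5, "row 3"),
        (10, "row 4"), (11, "row 5"),
        (16, "row 6")]
footer_lines = [6, 12, 17]

grouped = group_rows_by_footer(rows, footer_lines)
print(grouped)
# {0: ['row 1', 'row 2', 'row 3'], 1: ['row 4', 'row 5'], 2: ['row 6']}
```

Each footer closes one section, so the number of footer hits matches the number of section instances in this model.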

Run Global Header Detection

If we were to test extraction at this point, we would see mixed results. The Data Section will correctly produce the three section instances for this document. However, the tabular data will not be collected.

  1. Selecting the Data Section in the Node Tree
  2. and pressing the "Test Extraction" button, we can verify these results.
  3. The Nested Table method populates three section instances, as desired.
  4. The Data Field extraction executes flawlessly.
    • Not only do we have the right number of section instances, but their dimensions are correct as well. They fully encapsulate the appropriate text data.
  5. However, the Data Table does not extract anything at all.

As a side note, the "Inspector" tab can be very helpful when troubleshooting extraction in general, but particularly Data Section and Data Table extraction.

  1. Select the "Inspector" tab.
  2. Expand the "Claim Section" Data Section result to inspect the individual section instance results.
  3. Select one of the section instances.
  4. You can visually see the dimensions of the instance in the "Image View" tab.
    • It is likely hard to see in this image. However, the section will be outlined in a red line.
  5. You can also use the "Text View" tab to see all the text data for the selected section instance.

If you were to select each section instance, you could verify at this point that all three sections were established successfully and the text data for each table is present. It just wasn't extracted. Why not?

This has to do with where the section instances are and where the Data Table and Data Column labels are.

  1. This is where the section instances were determined to be on the document.
    • And this is absolutely correct. This is what we wanted to happen.
  2. However, the Data Table and Data Column labels for this Document Type's Label Set fall outside of each of these sections.

This presents a challenge. The Tabular Layout table extraction method relies on these labels in order to extract the tabular data. As is, Grooper can't "see" outside of the section instances. If only Grooper could look up to the table and column labels, the table would extract with ease.

Luckily, there is a way for the Tabular Layout method to do just that, using the Run Global property.

  1. Navigate to the child Data Table object in the Node Tree.
  2. Expand the Tabular Layout properties.
  3. Expand the Header Detection properties.
  4. Change the Run Global property from False to True.

This will allow Grooper to detect Data Table and Data Column labels outside of the section instances. Perfect for what we're trying to do here.
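
The effect of this switch can be modeled as a change in search scope. The Python sketch below is illustrative only: `find_header` is a hypothetical function, the sample text is invented, and the real property operates on Grooper's label matching rather than plain substring checks.

```python
from typing import Optional


def find_header(label: str, section_text: str, document_text: str,
                run_global: bool) -> Optional[str]:
    """Return where the header label was found: 'section', 'document', or None."""
    if label in section_text:
        return "section"
    if run_global and label in document_text:
        return "document"  # only reachable when searching beyond the section
    return None


# Hypothetical document: the column header row sits above the section instance.
header_row = "CODE  DESCRIPTION  AMOUNT"
section_text = "Claim 1001\n99213  Office visit  120.00\nTOTALS  120.00"
document_text = header_row + "\n" + section_text

print(find_header(header_row, section_text, document_text, run_global=False))  # None
print(find_header(header_row, section_text, document_text, run_global=True))   # document
```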

What's going to happen when we test extraction now? Find out in the next tab!

Test For Success

With the child Data Table now using global header detection (by setting the Run Global property to True), it can look outside the section instances for the column header labels on the full document. Let's see how our sections extract now and if we get any table data populated.

  1. We're navigating back to the parent Data Section to test extraction.
  2. Upon testing extraction, the Nested Table method establishes the three section instances, as we've seen before.
  3. And now, the table data is extracted.

Success! The Run Global property is extremely beneficial when trying to extract table data from multiple sections. Without it, Tabular Layout would not have any way of referring to the column header labels collected in the Label Set. With this property enabled, Tabular Layout can do something very atypical for sectional data extraction. It can look beyond a section instance's text data and refer to the full document (in this case to locate the Data Table and Data Column labels in the Label Sets).

FYI

You can utilize the Run Global property to this effect when using other Data Section Extract Methods as well, not just Nested Table.

However, do note this property is only available/applicable to Data Tables using the Tabular Layout Extract Method.

Bonus Info: Hierarchical Tables and Peer Parent Labels

Additional Information

Label Layout Options

As we've been collecting labels, you may have noticed the Layout property change from Simple to Tabbed or Wrapped. The Layout property determines how the label's text is presented on the document. The Layout can be one of the six following options:

  • Simple
  • Tabbed
  • Substring
  • Boxed
  • Wrapped
  • Ruled

When collecting labels in the Labels tab, Grooper will automatically detect the appropriate label layout. However, there may be some circumstances where you need to manually select the label's layout. Next, we will describe each of these Layout options.

Simple

The Simple Layout is by far the most common. Most fields on a document will utilize this layout. These labels consist of words that do not cross "segment boundaries", meaning the words themselves are not separated by large amounts of whitespace like tabs or by terminal characters like a colon (as a colon character often marks the end of a label).

  1. We will use this Content Model named "Label Layouts - Model" which already has a Labeling Behavior enabled.
  2. We will use these Data Fields to capture various labels and describe their corresponding Layout specifications.
  3. We will use this Batch named "Label Layouts - Test Batch".
  4. For the "Simple" Data Field we captured this portion of text.
  5. Grooper automatically detected its Layout as Simple.
  6. Notice the text here was not returned as a label.
    • Even though the words are the same, "Im a Simple Label", its layout is different. There is a large gap between "Simple" and "Label". The Simple Layout does not permit this. Hence, only the label to the left is returned. Only the label whose words do not cross a segment boundary, like a tab, is returned.
    • For simple fields, this makes sense. You don't expect a single label to cross these segment boundaries. Large gaps in whitespace or terminal characters, like colons, are used to distinguish one label from the next. The Simple Layout takes advantage of this aspect of document structure to toss out labels whose words do cross these boundaries, preventing false positive label matches.

Tabbed

The Tabbed Layout is used for situations where you do want to cross segment boundaries. Think about capturing a table's row of header labels. Often each column's label will be separated by large amounts of whitespace. The Simple Layout would not permit you to capture the table's header but Tabbed will.

  1. For the "Tabbed" Data Field, we captured this portion of text.
    • Notice the large gap between Tabbed and Label. However, the label still matches.
  2. This is because Grooper automatically detected this as a Tabbed Layout.
  3. Notice this text also returned, even though it does not have large whitespace gaps between any of its words.
    • The Tabbed Layout does not expect words to cross segment boundaries; it merely permits them. You can think of this Layout as a more permissive version of Simple. Tabbed doesn't care whether there are segment boundaries between labels, whereas Simple does.
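
The difference between the two layouts can be approximated with regular expressions. This Python sketch is a rough model, not Grooper's matching logic: it assumes a segment boundary is a tab or a run of several spaces, and `label_pattern` and the sample line are hypothetical.

```python
import re


def label_pattern(label: str, layout: str) -> re.Pattern:
    """Build a regex for a label under a simplified Simple or Tabbed layout."""
    words = [re.escape(word) for word in label.split()]
    if layout == "Simple":
        gap = r" "       # exactly one space: words may not cross a boundary
    elif layout == "Tabbed":
        gap = r"[ \t]+"  # any run of spaces or tabs is permitted
    else:
        raise ValueError(f"unsupported layout: {layout}")
    return re.compile(gap.join(words))


# Hypothetical line: the label once with a normal space, once with a wide gap.
line = "Tabbed Label\tTabbed      Label"

print(len(label_pattern("Tabbed Label", "Simple").findall(line)))  # 1
print(len(label_pattern("Tabbed Label", "Tabbed").findall(line)))  # 2
```

The Tabbed pattern matches everything the Simple pattern does plus the widely spaced copy, which mirrors the "more permissive version of Simple" description.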

Substring

The Substring Layout is intended for circumstances where a label is bookended between other portions of text. In other words, it is a "substring" of a larger string of text.

  1. For the "Substring" Data Field, we captured this portion of text.
  2. Notice Grooper automatically detected the layout as Simple.
  3. Furthermore, this label does not match, even though its text Substring Label: is identical.
    • This is a good example of a substring. There are portions of text before and after it (in both cases the word Value). This prevents the label from being returned, using the Simple Layout.

This is a situation where we would want to manually assign the label's Layout, if we want to collect substrings as labels.

  1. To do this, we will select the Layout property.
  2. Using the dropdown menu, we will select Substring.
  3. With the Data Field's label using the Substring Layout, the substring now matches.
  4. Furthermore, notice both the substring and the simple string match, using the Substring Layout.
    • Again, you may think of this as a more permissive version of the Simple Layout.
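
As a very rough mental model, the Simple Layout can be thought of as requiring the label to begin its own text segment, while the Substring Layout accepts the label anywhere inside a segment. The Python sketch below encodes that reading. It is an assumption made for illustration, not Grooper's actual matching rule, and the `matches` function, the boundary definition and the sample lines are all hypothetical.

```python
import re

# Assumption: a segment boundary is a tab or a run of two or more spaces.
SEGMENT_BOUNDARY = re.compile(r"\t| {2,}")


def matches(label: str, line: str, layout: str) -> bool:
    """Rough model of the Simple and Substring layouts (not Grooper code)."""
    segments = SEGMENT_BOUNDARY.split(line)
    if layout == "Simple":
        # The label must begin its segment; leading text blocks the match.
        return any(segment.startswith(label) for segment in segments)
    if layout == "Substring":
        # The label may sit anywhere inside a segment.
        return any(label in segment for segment in segments)
    raise ValueError(f"unsupported layout: {layout}")


label = "Substring Label:"
plain = "Substring Label: Value"           # label starts the text
embedded = "Value Substring Label: Value"  # label is bookended by other text

print(matches(label, plain, "Simple"))        # True
print(matches(label, embedded, "Simple"))     # False
print(matches(label, plain, "Substring"))     # True
print(matches(label, embedded, "Substring"))  # True
```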

Wrapped

The Wrapped Layout will return labels that wrap full lines of text on a document. So, if a label starts on one line, then continues on one or more lines after, this layout will successfully return it. The Wrapped Layout was also useful when we were collecting table labels for the entire header row. For those tables that had column headers on multiple lines, this layout was most appropriate to return the whole row of column headers.

  1. For the "Wrapped" Data Field, we captured this portion of text.
  2. Grooper automatically detected this as a Wrapped label.
    • This is because the label wraps full lines. "This line" occupies one full line on the document, and "wraps" occupies the next.

  1. As a side note, if we manually change this Layout to Simple, the label will still return.
  2. The text matches in this case due to the Vertical Wrap property enabled on the Labeling Behavior, which allows the Simple Layout to capture stacked labels.
    • See the Vertical Wrap article for more information on this property.

  1. However, the Simple Layout will not work to capture this portion of text as a label.
  2. Normally, Grooper would capture this text as a Wrapped label, but we manually assigned it the Simple Layout.

  1. The label will match when the Layout is set to Wrapped.
  2. The reason is that the first line, "So does", is not stacked on top of the second, "this one". However, the first line does wrap to the next.

  1. Furthermore, the Wrapped Layout does permit segment boundaries between text segments, such as the tab space between "about" and "this" in this portion of text.

Ruled

Lines are used on documents to divide fields, sections, table columns or otherwise distinguish between one piece of information and another. Because of this, it is atypical to find a stacked label with a line between the first and second label. The Simple Layout respects this by preventing labels from returning if a horizontal line falls between any portion of the stacked label.

However, there may be rare circumstances where a horizontal line does fall between portions of the stacked label. In that case, you will want to use the Ruled Layout.

Line location information must be present in the Layout Data in order for Grooper to determine if a line is present. A Line Detection or Line Removal command must have been previously executed by an IP Profile during Image Processing or Recognize to obtain this information.
  1. For the "Ruled" Data Field, we captured this portion of text.
  2. Grooper automatically detected this as a Ruled label.
    • This is because there is a horizontal line between the first line of the stacked label, "This line", and the second line, "rules".
  3. The text matches in both cases whether or not there is a horizontal line between any portion of the stacked label.

  1. If, however, we change the Layout to Simple, only the text without the line between the stacked label will return.
  2. Since there is a line present between the first and second line, the label no longer returns.

If you want to use the Ruled Layout option, you must enable the Vertical Wrap property of a Content Model.

1. The Vertical Wrap property is the last configurable property in the Labeling Behavior property panel.

If you attempt to collect a Ruled label without Vertical Wrap enabled or attempt to change a Layout to Ruled without Vertical Wrap enabled, you will get an error message.

•  Again, keeping Vertical Wrap enabled is highly preferable. Only disable this property if absolutely necessary. Keeping Vertical Wrap enabled will avoid errors like this.
•  FYI: In older minor builds of version 2021, this will be an unhandled exception error message, reading "Object reference not set to an instance of an object", as seen in this screenshot.

Boxed

The Boxed Layout is intended to capture labels that wrap inside a box, enclosed in lines on all four sides. You can use this Layout to distinguish between labels that fall inside a box and those that do not when the Vertical Wrap property is disabled.

Line location information must be present in the Layout Data in order for Grooper to determine if a line is present. A Line Detection or Line Removal command must have been previously executed by an IP Profile during Image Processing or Recognize to obtain this information.
  1. For the "Boxed" Data Field, we captured this portion of text.
  2. With the default Labeling Behavior properties, Grooper automatically detected this as a Simple label.
  3. Both the label inside a box, and the label outside a box match.

You can differentiate between a label in a box and one outside a box by disabling the Vertical Wrap property.

  1. Here, we've gone into the Labeling Behavior property grid and set the Vertical Wrap property to Disabled
    • CAUTION! If you do this, vertical wrapping will be disabled for all labels and all label Layouts, not just for the Boxed Layout.

  1. Switching back to the Labels tab, our label now does not match at all.
  2. With Vertical Wrap disabled, the Simple Layout will not be able to match the stacked labels.

  1. However, if we change the Layout to Boxed, we will get a match.
  2. The label wrapped inside a box will match.
  3. The label that is not wrapped inside a box will not match.

If you want to use the Boxed Layout option in the manner described above, you must enable the Constrained Wrap property of a Content Model.

1. The Constrained Wrap property is the next-to-last configurable property in the Labeling Behavior property panel.

If the Constrained Wrap property is disabled and Vertical Wrap is also disabled, you will not be able to return labels inside a box with the Boxed Layout.
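
The interaction between the Boxed Layout, Vertical Wrap, and Constrained Wrap walked through above can be sketched as follows. This is a simplified Python model of the cases shown in this section only, not Grooper's implementation; the function name `stacked_label_matches` and its parameters are hypothetical. The case of a Boxed label with Vertical Wrap enabled is not demonstrated above, so the sketch deliberately refuses to answer it.

```python
def stacked_label_matches(layout: str, in_box: bool,
                          vertical_wrap: bool, constrained_wrap: bool = True) -> bool:
    """Return True if a stacked label would match (hypothetical model)."""
    if layout == "Simple":
        # With Vertical Wrap enabled, Simple matched the label both inside and
        # outside a box. With it disabled, Simple matched neither.
        # Assumption: Constrained Wrap does not change the Simple result.
        return vertical_wrap
    if layout == "Boxed":
        if vertical_wrap:
            # Not demonstrated in this section, so no result is assumed.
            raise NotImplementedError("Boxed with Vertical Wrap enabled is not modeled")
        # With Vertical Wrap disabled, Boxed needs Constrained Wrap enabled
        # and only matches the label enclosed in a box.
        return constrained_wrap and in_box
    raise ValueError(f"Unknown layout: {layout}")


# Vertical Wrap disabled, Constrained Wrap enabled: Boxed tells the two apart.
boxed_inside = stacked_label_matches("Boxed", in_box=True, vertical_wrap=False)
boxed_outside = stacked_label_matches("Boxed", in_box=False, vertical_wrap=False)
```

In this model, `boxed_inside` is a match and `boxed_outside` is not, mirroring the last set of steps above. Disabling both wrap properties leaves the Boxed Layout with no matches at all.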

Data Element Override Utility

Earlier in this article, we talked about using the Labeled Value Extractor Type without configuring its Value Extractor. Again, it is considered best practice to configure its Value Extractor. However, sometimes data is difficult to pattern match. For example, an extractor that returns people or company names can be difficult to craft. These cases are precisely why Label Sets allow a Labeled Value extractor's Value Extractor to be left unconfigured.

To make the best use of this functionality, Data Element Overrides are typically necessary. Indeed, because the Label Set approach is more templated in nature, Data Element Overrides can be a useful tool to fine-tune extraction for one specific Document Type. In this section, we will use the "Purchase Order Number" Data Field of our "Labeling Behavior - Invoices - Model" Content Model to demonstrate this.

Revisiting the Problem

The problem arose due to how the Labeled Value extractor behaves when its Value Extractor is left unconfigured. For some of our invoices, this didn't really present a problem at all.

  1. Here, we have the "Purchase Order Number" Data Field selected in our "Labeling Behavior - Invoices - Model" Content Model.
  2. The Data Field's Value Extractor is set to Labeled Value, as is appropriate to utilize the label collected for each Document Type in their Label Sets.
  3. We have reset the Labeled Value extractor's Value Extractor. It is unconfigured.

For certain document layouts, this approach works just fine.

  1. Here, we have selected a "Rechnung" Document Type folder in the "Labeling Behavior - Invoices - Test Batch" Batch.
  2. Upon testing extraction...
  3. The correct value appropriately extracts.
    • This is due to the special functionality of the Labeled Value extractor when using Label Sets and leaving its Value Extractor unconfigured. The extractor will return text segments to the right of the Data Field's collected label.

However, this will not be the case for all document layouts, notably those whose labels are stacked vertically on top of their corresponding value.

  1. Here, we have selected a "Factura" Document Type folder.
  2. Upon testing extraction...
  3. We do not get the right value.
    • Without a Value Extractor configured, the Labeled Value extractor will return text segments to the right of the label, which, in this case, is not the right text data.
  4. Notice the Maximum Distance settings for this extractor. The extractor will return text results a maximum of 2 inches to the right and 2 inches below.
  5. So, why isn't it returning the actual purchase order number? It certainly falls within 2 inches below the label.
    • Again, this is due to the specialized way Labeled Value works without a Value Extractor configured. It will always return text data to the right of the label if any is present within the Right Maximum Distance setting.

However, we can easily get this extractor to return the actual purchase order number. All we have to do is tell it not to look to the right of the label.

  1. We can expand the Maximum Distance property.
  2. Clear the Right property value (or set it to 0 in).
    • This will ensure the Labeled Value extractor will only return text below the label, ignoring the text to the right of it.
  3. Test extraction.
  4. And we get the value we want.
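
The "right first, then below" behavior described above can be sketched in a few lines of Python. This is a hedged illustration of the logic as this article describes it, not Grooper's actual code: the `Segment` class, the `labeled_value_without_extractor` function, the sample text values, and the choice to return the nearest qualifying segment are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Segment:
    """A text segment positioned relative to a label (hypothetical model)."""
    text: str
    direction: str      # "right" or "below" the label
    distance_in: float  # distance from the label, in inches


def labeled_value_without_extractor(segments: List[Segment],
                                    max_right_in: Optional[float] = 2.0,
                                    max_below_in: Optional[float] = 2.0) -> Optional[str]:
    """Prefer text to the right of the label; fall back to text below it."""

    def nearest(direction: str, limit_in: Optional[float]) -> Optional[str]:
        if not limit_in:
            # A cleared (None) or 0 in limit means "do not look this way".
            return None
        candidates = [s for s in segments
                      if s.direction == direction and s.distance_in <= limit_in]
        if not candidates:
            return None
        # Assumption: the closest qualifying segment is the one returned.
        return min(candidates, key=lambda s: s.distance_in).text

    right = nearest("right", max_right_in)
    if right is not None:
        return right  # text to the right always wins when present
    return nearest("below", max_below_in)


# Hypothetical stacked layout: the PO number sits below the label,
# while unrelated text sits to its right.
segments = [Segment("Net 30", "right", 1.0), Segment("PO-1234", "below", 0.3)]

default_result = labeled_value_without_extractor(segments)
right_cleared = labeled_value_without_extractor(segments, max_right_in=None)
```

With the default distances, this model returns the unrelated text to the right of the label. Clearing the Right distance makes it return the value below the label instead, which is the effect of the change made in the steps above.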

But, what about our documents that did have the purchase order number laid out to the right of the label?

  1. If we go back to our "Rechnung" Document Type folder...
  2. ...and test extraction...
  3. Now, we get a result we don't want.

Data Element configurations are globally applied to all Document Types which inherit them. In our case, all our Document Types inherit the Content Model's Data Model (and its child Data Elements, such as our "Purchase Order Number" Data Field). Therefore, the changes we make to the "Purchase Order Number" Data Field's extractor will affect all documents of all Document Types. It's simply going to execute as we configure it, regardless of which specific Document Type is extracted.

We're really in a situation where we want one Document Type to use one configuration and another Document Type to use a slightly different configuration. This is exactly what "Data Element Overrides" are for.

Data Element Override Basics

Before we get into setting up "Data Element Overrides", we will rewind a bit and set our Labeled Value extractor's Maximum Distance properties back to the default settings.

  1. We've reset the Maximum Distance properties to their default values.
  2. So we're back to square one. We are no longer getting the correct value for this "Factura" Document Type folder.

What we want to do here is change how the Maximum Distance properties of the "Purchase Order Number" Data Field are configured for the "Factura" Document Type, and ONLY for the "Factura" Document Type. "Data Element Overrides" allow us to do this by overriding a Data Element's property settings for a specific Document Type (in our case, the "Purchase Order Number" Data Field for the "Factura" Document Type).

"Data Element Overrides" are configured using the Document Type object to which they will be applied. We will thus configure an override for the "Factura" Document Type.

  1. Select the "Factura" Document Type in the Node Tree Viewer.
  2. Navigate to the "Overrides" tab.
  3. Here you will see all Data Elements the selected Document Type inherits.
    • Important! Overrides are configured in this UI for the Document Type selected in the Node Tree Viewer NOT the Batch Viewer.

  1. Select the Data Element whose properties you wish to override.
    • In our case, we want to change the property configuration of the "Purchase Order Number" Data Field.
  2. Navigate to the "Property Overrides" tab.
  3. What you see here is a duplication of the selected Data Element's property grid.
    • This is how the "Purchase Order Number" Data Field is currently configured. If we navigated back to that object in the Node Tree Viewer, we would see the exact same property configuration as we see here. For example, we can see the Value Extractor property is set to Labeled Value, just as it is on the object itself.

Using the "Property Overrides" UI, any property configuration we edit will only apply to the selected Document Type (in our case, the "Factura" Document Type).

  1. Expand the Labeled Value extractor's sub-properties.
  2. Expand the Maximum Distance property's sub-properties.
  3. Clear the Right property's value (or set it to 0 in).

Now the "Purchase Order Number" Data Field will extract using these settings, only for the "Factura" Document Type.

  1. FYI: Any overridden Data Element will be underlined. Its text will also be blue once we navigate off this Data Element.

  1. If you have a "Factura" Document Type folder selected in the selected Test Batch...
  2. ...and you test extraction for the overridden Data Element...
  3. Now we get extraction results using the overridden property configuration.

  1. Going back to the "Purchase Order Number" Data Field, we can verify the override only affects the "Factura" Document Type.
  2. Selecting our "Rechnung" Document Type folder...
  3. ...and testing extraction...
  4. We get the result we want.
  5. The "Rechnung" Document Type has no overrides configured, and thus uses the Maximum Distance configuration as-is for the Data Field.
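
Conceptually, an override layers a Document Type's property changes on top of the shared Data Element configuration. The Python sketch below illustrates that idea only. It is not how Grooper stores or resolves overrides, and the dictionary keys (`max_right_in`, `max_below_in`, `value_extractor`) and the `effective_config` function are hypothetical names chosen for this example.

```python
# Shared configuration, inherited by every Document Type (hypothetical keys).
base_fields = {
    "Purchase Order Number": {
        "value_extractor": "Labeled Value",
        "max_right_in": 2.0,
        "max_below_in": 2.0,
    }
}

# Per-Document-Type overrides: only the changed properties are recorded.
overrides = {
    "Factura": {
        "Purchase Order Number": {"max_right_in": None},  # Right distance cleared
    }
}


def effective_config(document_type: str, field_name: str) -> dict:
    """Merge a Document Type's overrides over the shared field configuration."""
    config = dict(base_fields[field_name])  # copy, so the shared config is untouched
    config.update(overrides.get(document_type, {}).get(field_name, {}))
    return config


factura_config = effective_config("Factura", "Purchase Order Number")
rechnung_config = effective_config("Rechnung", "Purchase Order Number")
```

Here `factura_config` has its Right distance cleared, while `rechnung_config` is identical to the shared configuration because the "Rechnung" Document Type has no overrides.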

Data Element Overrides can be an effective way of fine-tuning extraction logic specific to an individual Document Type. Because the Label Set approach is more templated in nature, each Document Type corresponds to one specific format, meaning the document's layout will be consistent for each folder classified as that Document Type. Many users will take advantage of this and leverage Data Element Overrides for various fields on various Document Types, especially when utilizing Label Sets.

There is a shortcut to configuring Data Element Overrides using the "Labels" collection UI, which we will demonstrate in the next tab.

Overrides & the Labels UI

In the previous tab, we taught you the normal way to configure Data Element Overrides for a Document Type. You can configure overrides in this manner whether or not you're using a Labeling Behavior in your Content Model. If you are using a Labeling Behavior, there is a shortcut to edit overrides for a Data Element. You can do it directly from the "Labels" tab, using the same UI you use to collect labels.

  1. Here, we have selected the "Labeling Behavior - Invoices - Model" Content Model.
  2. We have also navigated to the "Labels" tab.
  3. We need an override for the "Envoy" Document Type.
  4. Just like the "Factura" Document Type, these invoices present the purchase order number below the label and not to the right.

So, we need an override for the "Purchase Order Number" Data Field for the "Envoy" Document Type, which we can do without leaving the Labels UI.

  1. Select a Batch Folder assigned the Document Type whose overrides you want to edit.
    • In our case, we will keep selected this "Envoy" Document Type folder.
    • Important! Overrides configured in the Labels UI are configured for the Document Type selected in the Batch Viewer. Since you're not manipulating a selected Document Type object in the Node Tree, this is how Grooper "knows" which Document Type's overrides you are editing.
  2. In the "Labels" editor, double-click the name of the Data Element whose property configurations you wish to override.
    • In our case, we want to edit the "Purchase Order Number" Data Field's override. So, we double-click "Purchase Order Number".
  3. This will bring up a window to edit the double-clicked Data Element's overrides.
  4. Just like before, this is a duplication of the Data Element's property grid. Any adjustments you make to the Data Element will execute only for the Document Type selected.
    • For example, we can clear out the Labeled Value extractor's Right Maximum Distance property, forcing the Labeled Value extractor to only "look" for text below the label.
    • FYI: You can override ANY property for ANY Data Element. For example, you could use a completely different extractor type for a Data Field for a specified Document Type.
  5. Press "Ok" to save the override configuration.

  1. Overridden Data Elements will be underlined and appear in blue text in the Labels UI as well.

Furthermore, you can test the override directly from the Labels UI as well. You can actually test extraction for the whole Data Model!

  1. Press the "Test" button to test the Data Model's extraction for the selected document folder in the Batch Viewer (including override settings for any Data Elements).

  1. This will create a "Results" tab. You will be presented with the extraction results for the selected document folder's Data Model.
  2. If there are any Data Elements with overrides, such as our "Purchase Order Number" Data Field, they will extract using the override configuration (just as they will when the document is actually extracted by the Extract activity in a Batch Process).

Version Differences

2021

The Labeling Behavior is brand new functionality in Grooper version 2021. Prior to this version, some of its functionality could be approximated by other objects and their properties (for example, a Data Type using the Key-Value Pair collation is at least in some ways similar to how the Labeled Value Extractor Type works). However, creating label sets using Document Types and implementing them as described above was not available prior to version 2021.