2021:Labeling Behavior (Behavior): Difference between revisions

Revision as of 14:45, 1 April 2021

2021

This article is in development for the upcoming version of Grooper, Grooper 2021. Labeling Behavior is a new Content Type Behavior option in 2021. This information is incomplete and/or may change by the time of release.

The Labeling Behavior is a Content Type Behavior designed to collect and utilize a document's field labels in a variety of ways. This includes functionality for classification and data extraction.

The Labeling Behavior functionality allows Grooper users to quickly onboard new Document Types for structured and semi-structured forms, utilizing labels as a thumbprint for classification and data extraction purposes. Once the Labeling Behavior is enabled, labels are identified and collected using the "Labels" tab of Document Types. These "Label Sets" can then be used for the following purposes:

Document classification - Using the Labelset-Based Classification Method
Field based data extraction - Using the Labeled Value Extractor Type
Tabular data extraction - Using a Data Table object's Tabular Layout Extract Method
Sectional data extraction - Using a Data Section object's Transaction Detection Extract Method

About

Labels serve an important function on documents. They give the reader critical context to understand where data is located and what it means. How do you know the difference between the date on an invoice document indicating when the invoice was sent and the date indicating when you should pay the invoice? It's the labels. The labels are what distinguish one type of date from another. For example, "Invoice Date" marks the date the invoice was sent and "Due Date" marks the date you need to pay by.

Labels can be a way of classifying documents as well. What does one individual label tell you about a document? Well, maybe not much. However, if you take them all together, they can tell you quite a bit about the kind of document you're looking at. For example, a W-4 employee withholding form is going to use different labels than an employee healthcare enrollment form. These are two very different documents collecting very different information. The labels used to collect this information are thus different as well.

Furthermore, you can even tell the difference between two very closely related documents using labels. For example, two different invoices from two different vendors may share some similarity in the labels they use to detail information. But there will be some differences as well. These differences can be useful identifiers to distinguish one from the other. Put all together, labels can act as a thumbprint Grooper can use to classify a document as one Document Type or another.

Even though these two invoices share some labels (highlighted in blue), there are others that are unique to each one (highlighted in yellow). This awareness of how one kind of invoice from one vendor uses labels differently from another can give you a method of classifying these documents using their label sets.
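The "thumbprint" idea can be illustrated with a minimal Python sketch. This is not Grooper's Labelset-Based Classification Method; it simply scores each Document Type by the fraction of its collected labels found in the document text, assuming exact, case-insensitive matching. The function names, label sets, and sample text are hypothetical.

```python
def score_label_sets(document_text, label_sets):
    """Return {document_type: fraction of its labels found in the text}."""
    text = document_text.lower()
    scores = {}
    for doc_type, labels in label_sets.items():
        hits = sum(1 for label in labels if label.lower() in text)
        scores[doc_type] = hits / len(labels) if labels else 0.0
    return scores


def classify(document_text, label_sets):
    """Pick the Document Type whose label set is best represented in the text."""
    scores = score_label_sets(document_text, label_sets)
    return max(scores, key=scores.get)


# Hypothetical label sets: some labels are shared, others are unique.
LABEL_SETS = {
    "Factura": ["Invoice Number", "PO Number", "Remit To:", "Amount due"],
    "Lasku": ["Invoice Number", "PO Number:", "Remit To:", "Total"],
}

sample_text = "Invoice Number 1001  PO Number 77  Remit To: Acme  Amount due 50.00"
print(classify(sample_text, LABEL_SETS))
```

The shared labels contribute to both scores, so it is the labels unique to each Document Type that separate the two candidates.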

The Labeling Behavior is built on these concepts, collecting and utilizing labels for Document Types in a Content Model for classification and data extraction purposes.

As a Behavior, the Labeling Behavior is enabled on a Content Type object in Grooper.

⚠	While you can enable Labeling Behavior on any Content Type, in almost all cases, you will want to enable this Behavior on the Content Model.

Here, we have selected a Content Model in the Node Tree.
To add a Behavior, select the Behaviors property and press the ellipsis button at the end.
This will bring up a dialogue window to add various behaviors to the Content Model, including the Labeling Behavior.
Add the Labeling Behavior using the "Add" button.
Select Labeling Behavior from the listed options.

Once added, you will see a Labeling Behavior item added to the Behaviors list.
Selecting the Labeling Behavior in the list, you will see property configuration options in the right panel.
- The configuration options in the property panel pertain to fuzzy matching collected labels as well as constrained and vertical wrapping capabilities to target stacked labels.
- By default, Grooper presumes you will want to use some fuzzy matching and enable constrained and vertical wrapping. These defaults work well for most use cases. However, you can adjust these properties here as needed.
Press the "OK" button to finish adding the Labeling Behavior and exit this window.
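To give a feel for why fuzzy matching helps with collected labels, here is a small Python sketch. It is only an assumption-laden stand-in: this article does not describe Grooper's fuzzy matching algorithm, so the sketch uses the standard library's `difflib.SequenceMatcher` and a hypothetical `threshold` parameter to tolerate a single OCR error in a label.

```python
from difflib import SequenceMatcher


def fuzzy_find(label, text, threshold=0.85):
    """Return (best_window, similarity); best_window is None if below threshold."""
    words = text.split()
    size = len(label.split())
    best, best_ratio = None, 0.0
    # Slide a window with the same word count as the label across the text.
    for i in range(len(words) - size + 1):
        window = " ".join(words[i:i + size])
        ratio = SequenceMatcher(None, label.lower(), window.lower()).ratio()
        if ratio > best_ratio:
            best, best_ratio = window, ratio
    if best_ratio >= threshold:
        return best, best_ratio
    return None, best_ratio


# Hypothetical OCR output where "Number" was misread as "Nunber".
ocr_text = "Invoice Nunber 1001 Due Date 04/30/2021"
print(fuzzy_find("Invoice Number", ocr_text))
print(fuzzy_find("Amount due", ocr_text))
```

An exact match would miss the misread label entirely, while a similarity threshold still locates it and rejects unrelated text.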

Once the Labeling Behavior is enabled, the next big step is collecting label sets for the various Document Types in your Content Model.

With the Labeling Behavior enabled, you will now see a "Labels" tab present for the Content Model.
- This tab is also now present for each individual Document Type as well.
Label sets are collected in this tab for each Document Type in the Content Model.

Each Document Type has its own set of labels used to define information on the document. For example, the "Factura" Document Type in this Content Model uses the label "PO Number" to call out the purchase order number on this invoice document. A different Document Type, corresponding to a different invoice format, might use a different label such as "Purchase Order Number" or "PO #".

Ultimately, this is the data we want to collect using the Content Model's Data Model.
We use the "Labels" tab to collect labels corresponding to the various Data Elements (Data Fields, Data Tables, and Data Sections) of the Data Model.
- This provides a user interface to enter a label identifying the value you wish to collect for the Data Elements.
For example, the label "PO Number" identifies the purchase order number for this invoice.
Therefore, the label "PO Number" is collected for the "Purchase Order Number" Data Field in the Data Model.

For more information on collecting label sets for the Document Types in your Content Model see the How To section of this article.

Once label sets are collected for each Document Type, they can be used for classification and data extraction purposes.

For example, labels were used in this case to:

Classify the document, assigning it the "Factura" Document Type.
Extract all the Data Fields seen here, collecting field based data from the document.
Extract the "Line Items" Data Table, collecting the tabular data seen here.

For more information on how to use labels for these purposes, see the How To section of this article.

How To

Collect Label Sets


Navigate to the Labels UI

Collecting labels for the Document Types in your Content Model will be the first thing you want to do after enabling the Labeling Behavior. Labels for each Data Element in the Document Type's Data Model are defined using the "Labels" tab of the Content Model.

Navigate to the "Labels" tab of the Content Model.
With a Batch selected in the "Batch Selector" window panel, select a document folder.
Press the "Set Type..." button to set the Document Type whose labels you wish to collect.
This will bring up the "Set Content Type" window.
From this window, select the Document Type for the selected document folder whose labels you wish to collect.
- In this case, this document is an invoice from "Factura Technology Corp". We have selected the "Factura" Document Type.
Press "OK" to finish.

FYI

If you haven't added a Document Type for the selected document folder yet, you can use the "Create Type" button instead to both create a new Document Type and set it.

Upon setting the Document Type, the document folder is assigned the selected Document Type.
- Or in other words, this document is now classified as a "Factura" document.
Upon setting a Document Type, that Document Type's Data Model and its child Data Elements will appear in the label collection UI.
- Labels are primarily collected as they correspond to Data Elements in a Data Model. However, we will see how to add custom labels that don't correlate to a Data Element as well by the end of this tutorial. Custom labels are often used as additional features for document classification.

Collect Field Labels

Now that this document has been classified (assigned a Document Type from our Content Model), we can collect labels for its Document Type. This can be done in one of two ways:

Lassoing text in the "Document Viewer"
Typing them in manually.

‼	Going forward, this tutorial presumes you have obtained machine readable text from these documents, either OCR'd text or native text, via the Recognize activity.

Generally the quickest way is by simply lassoing the label in the "Document Viewer".

Select the Data Element whose label you wish to collect.
- Here, we are selecting the "Invoice Number" Data Field.
Press the "Select Region" button.
With your cursor, lasso around the text label on the document.

Upon lassoing the label in the Document Viewer, the OCR'd or native text behind the selected region will be used to populate the Data Element's label.
- At this point, the label for the "Invoice Number" Data Field is now "Invoice Number" because that's the text data we selected. Whatever text characters you lasso with your cursor will be assigned as the label.
Notice this label also now appears in the "Header" tab below. That's because we had the Header tab selected when we lassoed the label.
- The text collected here ("Invoice Number") is the Header label for the "Invoice Number" Data Field.
- We'll talk about the difference between Header, Footer, and Static labels later. This will be important when using labels for data extraction purposes.

If you choose, you may also manually enter a label for a Data Element by simply typing it into the text box.

Here we've selected the "Purchase Order Number" Data Field and entered "PO Number".
This will correspond to the label "PO Number" on the document itself.

⚠	Whether lassoing the text using the Document Viewer or manually typing into the textbox, you may collect a maximum of one Header label and one Footer label (and one Static label where available) per Data Element per Document Type.
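One way to picture this constraint is as a data structure with a single slot per label kind. The Python sketch below is purely illustrative: the class and method names are hypothetical and do not reflect how Grooper stores label sets internally.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ElementLabels:
    """Labels for one Data Element on one Document Type (at most one of each kind)."""
    header: Optional[str] = None
    footer: Optional[str] = None
    static: Optional[str] = None


@dataclass
class LabelSet:
    """All collected labels for a single Document Type, keyed by Data Element name."""
    document_type: str
    elements: Dict[str, ElementLabels] = field(default_factory=dict)

    def set_header(self, element_name: str, label: str) -> None:
        # Assigning again overwrites: there is only one Header slot per element.
        self.elements.setdefault(element_name, ElementLabels()).header = label


factura = LabelSet("Factura")
factura.set_header("Invoice Number", "Invoice Number")
factura.set_header("Purchase Order Number", "PO Number")
print(factura.elements["Purchase Order Number"])
```

Because each Document Type owns its own `LabelSet`, a different invoice format can map the same Data Field to a different label such as "PO #".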

Upon entering the label into the text box, you'll see the label in the Header tab, just like we saw when we collected a label by lassoing the text on the Document Viewer.
Notice as well, there is a green checkmark next to the "Header" tab (and the box below is highlighted green).
- This means the text label is matching something on the document. If it did not, you would see a red "X" next to the Header tab and the box below would be highlighted red.
Also note, since this label is being returned on this document, we can verify it in the Document Viewer. The selected Data Field ("Purchase Order Number") and its text label are highlighted green on the document, indicating 1) it was successfully located on the document and 2) where it was located.

Continue lassoing or manually entering labels until all are collected.
Next, we will focus on collecting labels from tables and table columns (the Data Table and Data Column elements in a Data Model). The process is essentially the same, but bears some extra explanation.

Collect Table and Column Labels

Table and column labels can also be used for tabular data extraction by setting a Data Table object to use the Tabular Layout Extract Method. When collecting labels for this method of table extraction, keep in mind you need to collect both the full row of column header labels and each individual column header label. The full row of column header labels is collected as the Data Table object's label. Each individual column header label is collected as the corresponding Data Column object's label. This may seem like duplicated effort, but both are critical for the Tabular Layout Extract Method to map the table's structure and ultimately collect the table's data.

To collect the Data Table's label, select the Data Table object in the Labels tab.
- Here, we've selected the Data Table named "Line Items".
Lasso the entire header row for the table.
- You may notice there are more columns on this table than we are collecting. On the document, the table has six columns, but we're only collecting four: the "Quantity", "Description", "Unit Price", and "Line Total" Data Columns. Generally, you should collect the whole row of column headers, even if there are extra columns whose data you are not collecting.
Next, collect each child Data Column's header label.
- Here, we've selected the "Quantity" Data Column.
Lasso the individual column header for the selected Data Column.
- Here, that is the stacked label "Qty. Ord.".
Continue collecting labels for the remaining Data Columns.
- We have four Data Columns for this Data Table. Therefore, we collect four header labels from the document.

Auto Map Labels

As you add labels for each Document Type, you may find some documents have labels in common. For example, there are only so many ways to label an invoice number. It might be "Invoice Number", "Invoice No", "Invoice #" or even just "Invoice". Some invoices are going to use one label, others another.

When collecting labels for multiple Document Types you can use the "Auto Map" feature to automatically add labels you've previously collected on another Document Type.

So far, we've only collected labels for one Document Type, the "Factura" Document Type.
Now, we're collecting labels for the "Lasku" Document Type.
Press the "Auto Map" button to automatically assign previously collected labels.

Grooper will search the document's text for labels matching those previously collected on other Document Types.

For example, we collected the label "Remit To:" for the "Remit Address" Data Field for the "Factura" Document Type. The "Auto Map" feature found a match for this label on the document and assigned the "Lasku" Document Type's "Remit Address" Data Field the same label.

If a match is not found, the Data Element's label is left blank.

For example, the label for the "Invoice Amount" Data Field for the "Factura" Document Type was "Amount due".
This label was nowhere to be found on this document. The invoice amount is labeled "Total" on the "Lasku" documents. So, the label is left blank for you to collect.

As you keep collecting labels for more and more Document Types, the Auto Map feature will pick up more and more labels, allowing you to quickly onboard new Document Types.

Be aware, you may still need to validate the auto mapped values and make adjustments.

For example, the label "Date" is very generic.
This label does actually correspond to the invoice date on the "Lasku" Document Type in this case.
However, it could label some other date on another Document Type. Even on this document, the label "Date" is returning the "Date" portion of "Ship Date" and other instances where "Date" is found in the text.
- As a side note, there are ways to make simple labels like "Date" more specific to the data they pertain to using "Custom Labels". More on that in the next tab.
You can also make minor adjustments to the mapped labels.
- The mapped label for the "Purchase Order Number" Data Field was "PO Number" (as it was collected for the "Factura" Document Type), but it is more specifically "PO Number:" on the "Lasku" documents. We can just add the colon at the end of the label manually.
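The Auto Map behavior described above can be sketched in a few lines of Python. This is a simplified illustration rather than Grooper's implementation: it assumes exact substring matching (fuzzy matching is ignored here), and the function name, label dictionary, and sample text are hypothetical.

```python
def auto_map(document_text, data_elements, known_label_sets):
    """Map each Data Element to the first previously collected label found in the text."""
    mapped = {}
    for element in data_elements:
        mapped[element] = ""  # left blank when nothing matches
        for label_set in known_label_sets:
            label = label_set.get(element)
            if label and label in document_text:
                mapped[element] = label
                break
    return mapped


# Labels previously collected for the "Factura" Document Type.
factura_labels = {
    "Remit Address": "Remit To:",
    "Invoice Amount": "Amount due",
    "Purchase Order Number": "PO Number",
}
# Hypothetical text from a "Lasku" document.
lasku_text = "Remit To: Lasku Oy   PO Number: 4471   Total 120.00"

result = auto_map(lasku_text, list(factura_labels), [factura_labels])
print(result)
```

In this sketch "Invoice Amount" stays blank because "Amount due" never appears, and "PO Number" is mapped without its trailing colon, which mirrors the manual adjustments described above.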

Collect Custom Labels

It's important to keep in mind labels are collected for corresponding Data Elements in a Data Model. You collect one label per Data Element (Data Field, Data Section, Data Table or Data Column). What if you want to collect a label that is distinct from a Data Element, one that doesn't necessarily have to do with a value collected by your Data Model? And why would you even want to?

That's what "Custom Labels" are for. Custom labels serve two primary functions:

Providing additional labels for classification purposes.
Providing context labels when a Data Element's label matches multiple points on a document

Custom Labels may only be added to Data Model, Data Section or Data Table objects' labels. Put another way, any Data Element in the Data Model's hierarchy that can have child Data Elements can have custom labels.

When used for classification purposes, custom labels are typically added to the Data Model itself.

First select the Data Element in the Data Model's hierarchy to which you wish to add the label.
- In our case, we're selecting the Data Model itself.
Right-click either the "Header" or "Footer" tab.
Press the "Add Custom Label..." button.
The following dialogue box will appear.
You may enter a name for the custom label, or use the default "Custom ##" naming convention.
Press the "OK" button when finished.

This will add a new label tab, named whatever you named it in the previous step.
- Here, we kept with the default "Custom 01" name.
- Notice the red "X" next to the name "Custom 01" as well. This indicates the label is not matching anything on the document. Currently the label is "Custom 01", which doesn't appear anywhere on the document. We need to change that by collecting a new label.
Collect the custom label by either lassoing the text using the Document Viewer or manually typing in the label.
- For example, the word "Invoice" might be a useful label for classification purposes. This label isn't used to collect anything in our Data Model, but might be helpful to identify this and other invoices from the Factura Technology Corp as "Factura" Document Types. Collecting the label "Invoice" as a Custom Label will allow us to use it as a feature of this Document Type for classification.

You may add more Custom Labels to the selected Data Element by repeating the process described above.

Right-click any of the label tabs.
Add a new label with the "Add Custom Label..." button.

Custom Labels as Context Labels

Some labels are more specific than others. The label "Invoice Date" is more specific than the label "Date". If you see the label "Invoice Date" you know the date you're looking at is the date the invoice was generated. The label "Date" may refer to the invoice's generation date or it could be part of another label like "Due Date". However, some invoice formats will label the invoice date as simply "Date".

For example, the label "Date" on this "Factura" Document Type does indeed correspond to the invoice date for the "Invoice Date" Data Field.
However, this label pops up as part of other labels too, such as the "Date" in "Due Date" or "Order Date".

This can present a challenge for data extraction. The possibilities for false-positive results tend to crop up the more generic the label used to identify a desired value. There are three separate date values identified by the word "Date" (in full or in part) on this document.

This is the second reason Custom Labels are typically added for a Document Type: to provide extra context for generic labels, especially when they produce multiple results on a document, leading to false-positive data extraction.

There are two steps to adding and using a Custom Label for this purpose:

Add the Custom Label.
Marry the Custom Label with the Data Element's label.

We will refer to this type of a Custom Label as a "Context Label" from here out.

The only "trick" to this is adding the Context Label to the appropriate level of the Data Model's hierarchy. Remember, a Custom Label may only be added to a Data Model, Data Section or Data Table object. We cannot add a Custom Label to a Data Field, such as the "Invoice Number" Data Field. To add a Context Label a Data Field can use, we must add the Custom Label to its direct parent Data Element. In the case of the "Invoice Date" Data Field its direct parent Data Element is the Data Model itself. Right-click the "Header" or "Footer" tab and select "Add Custom Label..." to add the Custom Label.
The Custom Label we added was "Date Page". This will provide the simple label "Date" some extra context. Which of the three results for the label "Date" do we want to accept? The one falling within this zone.
Now that we've added the label, we need to marry the Custom Label with the Data Field it's giving extra context to. This is done with the Parent property of a Data Field label. In our case, the Custom Label provides extra context for the "Invoice Date" Data Field's label. We've selected the "Invoice Date" Data Field. Select the Parent property. Note: This property is only present for Data Field and Data Column labels. Using the drop down list, select the Custom Label you wish to use for the Context Label.
Notice that with this Context Label added, we only return a single result for the "Invoice Date" Data Field's label "Date". This is the label we want to associate with this Data Field. The other two results do not fall within the Context Label and are no longer returned.
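The filtering effect of a Context Label can be sketched as a simple geometric test. The Python below assumes that "falling within" the Context Label means a label hit's bounding box lies inside the region matched by the Custom Label; that interpretation, the `Box` type, and all coordinates are illustrative assumptions, not Grooper internals.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, other: "Box") -> bool:
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


def filter_by_context(matches: List[Box], context: Box) -> List[Box]:
    """Keep only the label matches located inside the Context Label's region."""
    return [m for m in matches if context.contains(m)]


# Hypothetical page coordinates for three hits on the label "Date".
date_hits = [
    Box(400, 50, 440, 62),    # "Date" inside the "Date Page" header
    Box(130, 200, 170, 212),  # "Date" in "Due Date"
    Box(330, 200, 370, 212),  # "Date" in "Order Date"
]
date_page = Box(400, 50, 500, 62)  # region matched by the Custom Label "Date Page"

kept = filter_by_context(date_hits, date_page)
print(kept)
```

Only the first hit survives, which corresponds to the single result returned for the "Invoice Date" Data Field once its Parent property points at the Context Label.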

Use Label Sets for Classification

Use Label Sets for Field Based Extraction

Use Label Sets for Tabular Extraction


Label Sets and Tabular Layout

Many tables label the columns so the reader knows what the data in that column corresponds to. How do you know the unit price for an item on an invoice? Typically, that item is in a table and one of the columns of that table is labeled "Unit Price" or something similar. Once you read the labels for each column (also called "column headers"), you the reader know where the table begins (below the column headers) and can identify the data in each row (by understanding what the column headers refer to).

This is also the basic idea behind the Tabular Layout Extraction Method. It too utilizes column header labels to "read" tables on documents, or at least uses them as step one in modeling the table's structure so that Grooper can extract data from each cell in the table.

Furthermore, with the Tabular Layout method, label sets collected using a Labeling Behavior can also be used to extract data from tables on documents. In this case, the labels collected for the Data Column children of a Data Table are utilized to help model the table's structure.

Once the column header locations are established, the next requirement is a way to understand how many rows are in the table. This is done by configuring at least one Data Column's Value Extractor property. Generally, there is at least one column in a table that is always present for every row in the table. If you can use an extractor to locate that data below its corresponding column header, that gives you a way of finding each row in the table.
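The two ideas so far (column headers fix the columns, one always-present column finds the rows) can be sketched in Python. This is a rough illustration under stated assumptions, not the Tabular Layout implementation: words are hypothetical `(text, x, y)` tuples, each column starts at its header's x position, every row is a single line, and a regular expression stands in for the Data Column's Value Extractor.

```python
import re
from collections import defaultdict


def extract_table(words, headers, anchor_column, anchor_pattern):
    """words: (text, x, y) tuples; headers: {column_name: (x_start, y)}."""
    columns = sorted(headers.items(), key=lambda item: item[1][0])
    header_y = max(y for _, (_, y) in columns)

    def column_for(x):
        # A word belongs to the right-most column whose header starts at or before it.
        name = None
        for col_name, (x_start, _) in columns:
            if x >= x_start:
                name = col_name
        return name

    # Group the words below the header row into lines, then into cells.
    lines = defaultdict(lambda: defaultdict(list))
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        col = column_for(x)
        if y > header_y and col is not None:
            lines[y][col].append(text)

    rows = []
    for y in sorted(lines):
        cells = {col: " ".join(parts) for col, parts in lines[y].items()}
        # A line counts as a table row only if the anchor column's value matches.
        if re.fullmatch(anchor_pattern, cells.get(anchor_column, "")):
            rows.append(cells)
    return rows


headers = {"Quantity": (50, 100), "Description": (120, 100),
           "Unit Price": (400, 100), "Line Total": (500, 100)}
words = [
    ("2", 55, 120), ("Widget", 120, 120), ("Large", 170, 120),
    ("4.00", 400, 120), ("8.00", 500, 120),
    ("1", 55, 140), ("Gasket", 120, 140), ("2.50", 400, 140), ("2.50", 500, 140),
    ("Thank", 120, 180), ("you", 165, 180),  # text below the table, not a row
]
rows = extract_table(words, headers, "Quantity", r"\d+")
print(rows)
```

The trailing "Thank you" line is dropped because it has no value in the anchor column, which is the role the configured Value Extractor plays in finding each row.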

And last there are a few other considerations you might need to make. Is every row in the table a single line or are the rows "multiline"? Do you need to clean up the data the Tabular Layout initially extracts for a column by normalizing it with an extractor? Do you need to establish a table "footer" to limit the number of rows extracted?

This tutorial will cover the basic configuration of the Tabular Layout Extraction Method using collected Label Sets and address a few of these considerations.

Collect Labels

Use Label Sets for Sectional Extraction

Additional Information

Include information in this section on the following topics if not able to flesh it out in the About or How To sections. And probably this section will be helpful even if you do talk about it earlier. There's no space in Design Studio to detail this information in a help panel.

Header, Footer, and Static Labels

Custom Labels

Layout Options

Version Differences

2021

The Labeling Behavior is brand new functionality in Grooper version 2021. Prior to this version, some of its functionality could be approximated by other objects and their properties (for example, a Data Type using the Key-Value Pair collation is in some ways similar to how the Labeled Value Extractor Type works). However, the creation of label sets using Document Types, and their implementation as described above, was not available prior to version 2021.
