Difference between revisions of "Labeling Behavior - 2021"

From Grooper Wiki

Revision as of 10:46, 4 May 2021

2021

This article is in development for the upcoming version of Grooper, Grooper 2021. Labeling Behavior is a new Content Type Behavior option in 2021. This information is incomplete and/or may change by the time of release.

The Labeling Behavior is a Content Type Behavior designed to collect and utilize a document's field labels in a variety of ways. This includes functionality for classification and data extraction.

The Labeling Behavior functionality allows Grooper users to quickly onboard new Document Types for structured and semi-structured forms, utilizing labels as a thumbprint for classification and data extraction purposes. Once the Labeling Behavior is enabled, labels are identified and collected using the "Labels" tab of Document Types. These "Label Sets" can then be used for the following purposes:

  • Document classification - Using the Labelset-Based Classification Method
  • Field based data extraction - Using the Labeled Value Extractor Type
  • Tabular data extraction - Using a Data Table object's Tabular Layout Extract Method
  • Sectional data extraction - Using a Data Section object's Transaction Detection Extract Method

FYI: The Labeling Behavior and its functionality discussed in this article are often referred to as "Label Set Behavior" or simply "Label Sets".


About


You may download and import the file below into your own Grooper environment (version 2021). This contains the Batch(es) with the example document(s) discussed in this article and the Content Model(s) configured according to the How To section's instructions.

Labeling-behavior-about-01.png

Labels serve an important function on documents. They give the reader critical context to understand where data is located and what it means. How do you know the difference between the date on an invoice document indicating when the invoice was sent and the date indicating when you should pay the invoice? It's the labels. The labels are what distinguish one type of date from another. For example, "Invoice Date" labels the date the invoice was sent and "Due Date" labels the date you need to pay by.

Labels can be a way of classifying documents as well. What does one individual label tell you about a document? Well, maybe not much. However, if you take them all together, they can tell you quite a bit about the kind of document you're looking at. For example, a W-4 employee withholding form is going to use different labels than an employee healthcare enrollment form. These are two very different documents collecting very different information. The labels used to collect this information are thus different as well.

Furthermore, you can even tell the difference between two very closely related documents using labels as well. For example, two different invoices from two different vendors may share some similarity in the labels they use to detail information. But there will be some differences as well. These differences can be useful identifiers to distinguish one from the other. Put all together, labels can act as a thumbprint Grooper can use to classify a document as one Document Type or another.

Even though these two invoices share some labels (highlighted in blue), there are others that are unique to each one (highlighted in yellow). This awareness of how one kind of invoice from one vendor uses labels differently from another can give you a method of classifying these documents using their label sets.

Labeling-behavior-about-02.png
Labeling-behavior-about-03.png


The Labeling Behavior is built on these concepts, collecting and utilizing labels for Document Types in a Content Model for classification and data extraction purposes.

As a Behavior, the Labeling Behavior is enabled on a Content Type object in Grooper.

While you can enable Labeling Behavior on any Content Type, in almost all cases, you will want to enable this Behavior on the Content Model.

Typically, you want to collect and use label sets for multiple Document Types in the Content Model, not just one Document Type individually. Enabling the Behavior on the Content Model will enable the Labeling Behavior for all child Document Types, allowing you to collect and utilize labels for all Document Types.

  1. Here, we have selected a Content Model in the Node Tree.
  2. To add a Behavior, select the Behaviors property and press the ellipsis button at the end.
  3. This will bring up a dialogue window to add various behaviors to the Content Model, including the Labeling Behavior.
  4. Add the Labeling Behavior using the "Add" button.
  5. Select Labeling Behavior from the listed options.

Labeling-behavior-about-04.png

  1. Once added, you will see a Labeling Behavior item added to the Behaviors list.
  2. Selecting the Labeling Behavior in the list, you will see property configuration options in the right panel.
    • The configuration options in the property panel pertain to fuzzy matching collected labels as well as constrained and vertical wrapping capabilities to target stacked labels.
    • By default, Grooper presumes you will want to use some fuzzy matching and enable constrained and vertical wrapping. These defaults work well for most use cases. However, you can adjust these properties here as needed.
  3. Press the "OK" button to finish adding the Labeling Behavior and exit this window.

Labeling-behavior-about-05.png

Once the Labeling Behavior is enabled, the next big step is collecting label sets for the various Document Types in your Content Model.

  1. With the Labeling Behavior enabled, you will now see a "Labels" tab present for the Content Model.
    • This tab is also now present for each individual Document Type as well.
  2. Label sets are collected in this tab for each Document Type in the Content Model.

Each Document Type has its own set of labels used to define information on the document. For example, the "Factura" Document Type in this Content Model uses the label "PO Number" to call out the purchase order number on this invoice document. A different Document Type, corresponding to a different invoice format, might use a different label such as "Purchase Order Number" or "PO #".

  1. Ultimately, this is the data we want to collect using the Content Model's Data Model.
  2. We use the "Labels" tab to collect labels corresponding to the various Data Elements (Data Fields, Data Tables, and Data Sections) of the Data Model.
    • This provides a user interface to enter a label identifying the value you wish to collect for the Data Elements.
  3. For example, the label "PO Number" identifies the purchase order number for this invoice.
  4. Therefore, the label "PO Number" is collected for the "Purchase Order Number" Data Field in the Data Model.
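
One way to picture a collected label set is as a mapping from each Data Element to the label that identifies it, kept separately for each Document Type. The Python sketch below is only an illustration of that idea, using labels mentioned in this article. The dictionary layout and the `label_for` helper are assumptions made for the example and do not describe how Grooper stores label sets.

```python
from typing import Dict, Optional

# Illustrative only: a plain dictionary standing in for collected label sets.
# Keys are Document Types; each maps Data Element names to the collected label.
label_sets: Dict[str, Dict[str, str]] = {
    "Factura": {
        "Invoice Number": "Invoice Number",
        "Purchase Order Number": "PO Number",
        "Invoice Amount": "Amount due",
        "Remit Address": "Remit To:",
    },
    "Lasku": {
        "Invoice Date": "Date",
        "Purchase Order Number": "PO Number:",
        "Invoice Amount": "Total",
        "Remit Address": "Remit To:",
    },
}

def label_for(document_type: str, data_element: str) -> Optional[str]:
    """Return the label collected for a Data Element, or None if none was collected."""
    return label_sets.get(document_type, {}).get(data_element)

print(label_for("Factura", "Purchase Order Number"))  # PO Number
print(label_for("Lasku", "Invoice Amount"))           # Total
```

The same Data Element ("Invoice Amount") is identified by a different label on each Document Type, which is exactly the variation a label set is meant to capture.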

For more information on collecting label sets for the Document Types in your Content Model see the How To section of this article.

Labeling-behavior-about-06.png

Once label sets are collected for each Document Type, they can be used for classification and data extraction purposes.

For example, labels were used in this case to:

  1. Classify the document, assigning it the "Factura" Document Type.
  2. Extract all the Data Fields seen here, collecting field based data from the document.
  3. Extract the "Line Items" Data Table, collecting the tabular data seen here.

For more information on how to use labels for these purposes, see the How To section of this article.

Labeling-behavior-about-07.png

How To

The Labeling Behavior (often referred to as "Label Set Behavior" or just "Label Sets") is well suited for structured and semi-structured document sets. Label Sets are particularly useful for situations where you have multiple variations for one kind of document or another. While the information you want to extract from the document set may be the same from variation to variation, how the data is laid out and labeled may be very different from one variation of the document to another. Label Sets allow you to quickly onboard new Document Types to capture new form structures.

We will use invoices for the document set in the following tutorials.

In a perfect world, you'd create a Content Model with a single "Invoice" Document Type whose Data Model would successfully extract all Data Elements for all invoices from all vendors every time no matter what.

This is often not the case. You may find you need to add multiple Document Types to account for variations of an invoice from multiple vendors. Label Sets give you a method of quickly adding Document Types to model new variations. In our case, we will presume we need to create one Document Type for each vendor.

We will start with five Document Types for invoices from five vendors.

  • Factura
  • Lasku
  • Envoy
  • Rechnung
  • Arve

Labeling-behavior-how-to-doc-set-01.png


You may download and import the file below into your own Grooper environment (version 2021). This contains the Batch(es) with the example document(s) discussed in this tutorial and the Content Model(s) configured according to the instructions.

Collect Label Sets

Navigate to the Labels UI

Collecting labels for the Document Types in your Content Model will be the first thing you want to do after enabling the Labeling Behavior. Labels for each Data Element in the Document Type's Data Model are defined using the "Labels" tab of the Content Model.

  1. Navigate to the "Labels" tab of the Content Model.
  2. With a Batch selected in the "Batch Selector" window panel, select a document folder.
  3. Press the "Set Type..." button to set the Document Type whose labels you wish to collect.
  4. This will bring up the "Set Content Type" window.
  5. From this window, select the Document Type for the selected document folder whose labels you wish to collect.
    • In this case, this document is an invoice from "Factura Technology Corp". We have selected the "Factura" Document Type.
  6. Press "OK" to finish.
FYI: If you haven't added a Document Type for the selected document folder yet, you can use the "Create Type" button instead to both create a new Document Type and set it.

Labeling-behavior-about-08.png

  1. Upon setting the Document Type, the document folder is assigned the selected Document Type.
    • Or in other words, this document is now classified as a "Factura" document.
  2. Upon setting a Document Type, that Document Type's Data Model and its child Data Elements will appear in the label collection UI.
    • Labels are primarily collected as they correspond to Data Elements in a Data Model. However, we will see how to add custom labels that don't correlate to a Data Element as well by the end of this tutorial. Custom labels are often used as additional features for document classification.

Labeling-behavior-about-09.png


Collect Field Labels

Now that this document has been classified (assigned a Document Type from our Content Model), we can collect labels for its Document Type. This can be done in one of three ways:

  1. Lassoing the label in the "Document Viewer".
  2. Double-clicking the label in the Document Viewer.
  3. Typing the label in manually.
Going forward, this tutorial presumes you have obtained machine readable text from these documents, either OCR'd text or native text, via the Recognize activity.

Generally the quickest way is by simply lassoing the label in the "Document Viewer".

  1. Select the Data Element whose label you wish to collect.
    • Here, we are selecting the "Invoice Number" Data Field.
  2. Press the "Select Region" button.
  3. With your cursor, lasso around the text label on the document.

Labeling-behavior-about-10.png

  1. Upon lassoing the label in the Document Viewer, the OCR'd or native text behind the selected region will be used to populate the Data Element's label.
    • At this point, the label for the "Invoice Number" Data Field is now "Invoice Number" because that's the text data we selected. Whatever text characters you lasso with your cursor will be assigned as the label.
  2. Notice this label also now appears in the Header tab below. That's because we had the Header tab selected when we lassoed the label.
    • The text collected here ("Invoice Number") is the Header label for the "Invoice Number" Data Field.
    • We'll talk about the difference between Header, Footer, and Static labels at later points. This will be important when using labels for data extraction purposes.
    • However, think of the Header label as the "main" label identifying the Data Element's value. The Header label indicates where the data "starts" (If you're a human reading the document, you start looking for the corresponding data after you find and read the Header label. The same is true for Grooper).

Labeling-behavior-about-11.png

If you choose, you may also manually enter a label for a Data Element by simply typing it into the text box.

  1. Here we've selected the "Purchase Order Number" Data Field and entered "PO Number".
  2. This will correspond to the label "PO Number" on the document itself.
Whether lassoing the text using the Document Viewer or manually typing into the textbox, you may collect a maximum of one Header label and one Footer label per Data Element (and one Static label for Data Fields) per Document Type.

Labeling-behavior-about-12.png

  1. Upon entering the label into the text box, you'll see the label in the Header tab, just like we saw when we collected a label by lassoing the text on the Document Viewer.
  2. Notice as well, there is a green checkmark next to the Header tab (and the box below is highlighted green).
    • This means the text label is matching something on the document. If it did not, you would see a red "X" next to the Header tab and the box below would be highlighted red.
  3. Also note, since this label is being returned on this document, we can verify it in the Document Viewer. The selected Data Field ("Purchase Order Number") and its text label are highlighted green on the document, indicating 1) it was successfully located on the document and 2) where it was located.

Labeling-behavior-about-13.png

  1. Continue lassoing or manually entering labels until all are collected.
  2. Next, we will focus on collecting labels from tables and table columns (the Data Table and Data Column elements in a Data Model). The process is essentially the same, but bears some extra explanation.

Labeling-behavior-about-14.png


Collect Table and Column Labels

Table and column labels can be used for tabular data extraction as well, by setting a Data Table object to use the Tabular Layout Extract Method.

When collecting labels for this method of table extraction, keep in mind you must collect the individual column headers, and may optionally collect the full row of column header labels as well.

While it is optional, it is generally regarded as best practice to capture the full row of column header labels. This will generally increase the accuracy of your column label extraction. We will do both in this tutorial.

  1. We will collect the full row of column header labels for the Data Table object's label.
  2. We will collect each individual column header label for each individual Data Column object's label.

This may seem like you are duplicating your efforts, but it is often critical to do both in order for the Tabular Layout Extract Method to map the table's structure and ultimately collect the table's data.

  • In particular if you are dealing with OCR text data containing inaccurate character recognition data, establishing the full header row for the table will boost the fuzzy matching capabilities of the Labeling Behavior.
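
To see why a longer label can help fuzzy matching, consider a generic string similarity ratio. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in. Grooper's actual fuzzy matching algorithm and thresholds are not described in this article, and the header text and OCR misread shown here are hypothetical. The point is only that the same single-character OCR error costs a short column header proportionally more similarity than it costs the full header row.

```python
from difflib import SequenceMatcher

def similarity(expected: str, ocr_text: str) -> float:
    """Generic similarity ratio between 0 and 1 (a stand-in, not Grooper's algorithm)."""
    return SequenceMatcher(None, expected, ocr_text).ratio()

# Hypothetical labels and OCR output with one misread character ("t" read as "l").
column_label = "Quantity"
column_ocr = "Quanlity"

row_label = "Quantity Description Unit Price Line Total"
row_ocr = "Quanlity Description Unit Price Line Total"

column_score = similarity(column_label, column_ocr)
row_score = similarity(row_label, row_ocr)

print(f"single column header: {column_score:.2f}")
print(f"full header row:      {row_score:.2f}")
```

Under this measure the full header row scores closer to 1 than the single column header does, so it is more likely to clear a given similarity threshold despite the OCR error.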

Labeling-behavior-about-15.png

  1. To collect the Data Table's label, select the Data Table object in the Labels tab.
    • Here, we've selected the Data Table named "Line Items".
  2. Lasso the entire header row for the table.
    • You may notice there are more columns on this table than we are collecting. As it is on the document, the table has six columns. But we're only collecting four, the "Quantity", "Description", "Unit Price", and "Line Total" Data Columns.
    • Generally, you should collect the whole row of column headers, even if there are extra columns whose data you are not collecting.

Labeling-behavior-about-16.png

  1. Next, collect each child Data Column's header label.
    • Here, we've selected the "Quantity" Data Column.
  2. Lasso the individual column header for the selected Data Column.
    • Here, the stacked label, "Qty. Ord.".

Labeling-behavior-about-17.png

  1. Continue collecting labels for the remaining Data Columns.
  2. We have four Data Columns for this Data Table. Therefore, we collect four header labels from the document.

Labeling-behavior-about-18.png


Auto Map Labels

As you add labels for each Document Type, you may find some documents have labels in common. For example, there are only so many ways to label an invoice number. It might be "Invoice Number", "Invoice No", "Invoice #" or even just "Invoice". Some invoices are going to use one label, others another.

When collecting labels for multiple Document Types you can use the "Auto Map" feature to automatically add labels you've previously collected on another Document Type.

  1. So far, we've only collected labels for one Document Type, the "Factura" Document Type.
  2. Now, we're collecting labels for the "Lasku" Document Type.
  3. Press the "Auto Map" button to automatically assign previously collected labels.

Labeling-behavior-about-19.png

Grooper will search the document's text for labels matching those previously collected on other Document Types.

  1. For example, we collected the label "Remit To:" for the "Remit Address" Data Field for the "Factura" Document Type. The "Auto Map" feature found a match for this label on the document and assigned the "Lasku" Document Type's "Remit Address" Data Field the same label.

If a match is not found, the Data Element's label is left blank.

  1. For example, the label for the "Invoice Amount" Data Field for the "Factura" Document Type was "Amount due".
  2. This label was nowhere to be found on this document. The invoice amount is labeled "Total" on the "Lasku" documents. So, the label is left blank for you to collect.

As you keep collecting labels for more and more Document Types, the Auto Map feature will pick up more and more labels, allowing you to quickly onboard new Document Types.
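
The Auto Map behavior described above can be pictured as looking up previously collected labels in the new document's text. The following Python sketch is an illustration under stated assumptions, not Grooper's implementation: it uses exact, case-insensitive substring matching (Grooper may apply fuzzy matching instead), and the `auto_map` function name and the sample "Lasku" text are hypothetical.

```python
from typing import Dict, Optional

def auto_map(known_labels: Dict[str, str], document_text: str) -> Dict[str, Optional[str]]:
    """Propose a label for each Data Element by reusing labels collected elsewhere.

    known_labels maps a Data Element name to a label collected on another
    Document Type. A label is reused only if it appears in the new document's
    text; otherwise the Data Element's label is left blank (None).
    """
    text = document_text.lower()
    return {
        element: (label if label.lower() in text else None)
        for element, label in known_labels.items()
    }

# Labels collected for the "Factura" Document Type in this tutorial.
factura_labels = {
    "Remit Address": "Remit To:",
    "Invoice Amount": "Amount due",
    "Purchase Order Number": "PO Number",
}

# Hypothetical text from a "Lasku" invoice.
lasku_text = "Remit To: Lasku Oy  PO Number: 4471  Date 05/04/2021  Total 120.00"

mapped = auto_map(factura_labels, lasku_text)
print(mapped)
```

In this sketch "Remit To:" and "PO Number" are carried over, while "Amount due" is not found and is left blank for the user to collect.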

Labeling-behavior-about-20.png

Be aware, you may still need to validate the auto mapped values and make adjustments.

  1. For example, the label "Date" is very generic.
  2. This label does actually correspond to the invoice date on the "Lasku" Document Type in this case.
  3. However, that could label some other date on another Document Type. Even on this document, the label "Date" is returning the "Date" portion of "Ship Date" and other instances where "Date" is found in the text.
    • As a side note, there are ways to make simple labels like "Date" more specific to the data they pertain to using "Custom Labels". More on that in the next tab.
  4. You can also make minor adjustments to the mapped labels.
    • The mapped label for the "Purchase Order Number" Data Field was "PO Number" (as it was collected for the "Factura" Document Type), but it is more specifically "PO Number:" on the "Lasku" documents. We can just add the colon at the end of the label manually.

Labeling-behavior-about-21.png


Collect Custom Labels

It's important to keep in mind labels are collected for corresponding Data Elements in a Data Model. You collect one label per Data Element (Data Field, Data Section, Data Table or Data Column). What if you want to collect a label that is distinct from a Data Element, one that doesn't necessarily have to do with a value collected by your Data Model? And why would you even want to?

That's what "Custom Labels" are for. Custom labels serve two primary functions:

  1. Providing additional labels for classification purposes.
  2. Providing context labels when a Data Element's label matches multiple points on a document.

Custom Labels may only be added to Data Model, Data Section or Data Table objects' labels. Put another way, any Data Element in the Data Model's hierarchy that can have child Data Elements can have custom labels.

When used for classification purposes, custom labels are typically added to the Data Model itself.

  1. First select the Data Element in the Data Model's hierarchy to which you wish to add the label.
    • In our case, we're selecting the Data Model itself.
  2. Right-click either the Header or Footer tab.
  3. Press the "Add Custom Label..." button.
  4. The following dialogue box will appear.
  5. You may enter a name for the custom label, or use the default "Custom ##" naming convention.
  6. Press the "OK" button when finished.

Labeling-behavior-about-22.png

  1. This will add a new label tab, named whatever you named it in the previous step.
    • Here, we kept with the default "Custom 01" name.
    • Notice the red "X" next to the name "Custom 01" as well. This indicates the label is not matching anything on the document. Currently the label is "Custom 01", which doesn't appear anywhere on the document. We need to change that by collecting a new label.
  2. Collect the custom label by either lassoing the text using the Document Viewer or manually typing in the label.
    • For example, the word "Invoice" might be a useful label for classification purposes. This label isn't used to collect anything in our Data Model, but might be helpful to identify this and other invoices from the Factura Technology Corp as "Factura" Document Types. Collecting the label "Invoice" as a Custom Label will allow us to use it as a feature of this Document Type for classification.

Labeling-behavior-about-23.png

You may add more Custom Labels to the selected Data Element by repeating the process described above.

  1. Right-click any of the label tabs.
  2. Add a new label with the "Add Custom Label..." button.

Labeling-behavior-about-24.png

Custom Labels as Context Labels

Some labels are more specific than others. The label "Invoice Date" is more specific than the label "Date". If you see the label "Invoice Date" you know the date you're looking at is the date the invoice was generated. The label "Date" may refer to the invoice's generation date or it could be part of another label like "Due Date". However, some invoice formats will label the invoice date as simply "Date".

  1. For example, the label "Date" on this "Factura" Document Type does indeed correspond to the invoice date for the "Invoice Date" Data Field.
  2. However, this label pops up as part of other labels too, such as the "Date" in "Due Date" or "Order Date".

This can present a challenge for data extraction. The more generic the label used to identify a desired value, the more likely it is to produce false-positive results. There are three separate date values identified by the word "Date" (in full or in part) on this document.

Labeling-behavior-about-25.png

This is the second reason Custom Labels are typically added for a Document Type: to provide extra context for generic labels, especially when they produce multiple results on a document, leading to false-positive data extraction.

There are two steps to adding and using a Custom Label for this purpose:

  1. Add the Custom Label.
  2. Marry the Custom Label with the Data Element's label.

We will refer to this type of a Custom Label as a "Context Label" from here out.

The only "trick" to this is adding the Context Label to the appropriate level of the Data Model's hierarchy.

Remember, a Custom Label may only be added to a Data Model, Data Section or Data Table object. We cannot add a Custom Label to a Data Field, such as the "Invoice Number" Data Field.

To add a Context Label a Data Field can use, we must add the Custom Label to its direct parent Data Element.

  1. In the case of the "Invoice Date" Data Field, its direct parent Data Element is the Data Model itself.
  2. Right-click the "Header" or "Footer" tab and select "Add Custom Label..." to add the Custom Label.

Labeling-behavior-about-26.png

  1. The Custom Label we added was "Date Page".
  2. This will provide the simple label "Date" some extra context.
    • Which of the three results for the label "Date" do we want to accept? The one falling within this zone.

Labeling-behavior-about-27.png

Now that we've added the label, we need to marry the Custom Label with the Data Field it's giving extra context to. This is done with the Parent property of a Data Field label.

  1. In our case, the Custom Label provides extra context for the "Invoice Date" Data Field's label. We've selected the "Invoice Date" Data Field.
  2. Select the Parent property.
    • Note: This property is only present for Data Field and Data Column labels.
  3. Using the drop down list, select the Custom Label you wish to use for the Context Label.

Labeling-behavior-about-28.png

  1. Notice with this Context Label added...
  2. ...We only return a single result for the "Invoice Date" Data Field's label "Date". This is the label we want to associate with this Data Field.
  3. The other two results do not fall within the Context Label, and are no longer returned.
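
The effect of a Context Label can be sketched as a filter: of all the places a generic label matches, keep only those that fall inside a match of the Context Label. The Python example below is a simplified illustration. It uses character offsets in a flattened string as a stand-in for regions on the page, matches text exactly, and uses hypothetical sample text and function names. It is not a description of how Grooper performs this internally.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets in the document text

def find_spans(label: str, text: str) -> List[Span]:
    """Return the span of every exact occurrence of label in text."""
    return [m.span() for m in re.finditer(re.escape(label), text)]

def within_context(label: str, context_label: str, text: str) -> List[Span]:
    """Keep only the label matches that fall inside a match of the context label."""
    contexts = find_spans(context_label, text)
    return [
        (start, end)
        for start, end in find_spans(label, text)
        if any(c_start <= start and end <= c_end for c_start, c_end in contexts)
    ]

# Hypothetical flattened text for the example invoice.
text = "Date Page 05/04/2021 1 Order Date 04/28/2021 Due Date 06/03/2021"

all_matches = find_spans("Date", text)
kept = within_context("Date", "Date Page", text)

print(len(all_matches))  # 3
print(kept)              # [(0, 4)]
```

Without the context, "Date" matches three times. With the "Date Page" context applied, only the match inside that label survives.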

Labeling-behavior-about-29.png


Use Label Sets for Classification

About Labelset-Based Classification

Label Sets can be used for classifying documents using the Labelset-Based Classification Method. For structured and semi-structured forms, labels end up being a way of identifying a document. Without the field data entered, the labels are really what define the document. You know what kind of document you're looking at based on what kind of information is presented and, in the case of Labelset-Based classification, how that data is labeled. Even when those labels are very similar from one variant to the next, they end up being a thumbprint of that variant. For example, you might use Labelset-Based classification to create Document Types for different variations of invoices from different vendors. The information presented on each variant from each vendor will be more or less the same, and some labels will be used by many vendors (such as "Invoice Number"). However, if there is enough variation in the set of labels, you can easily differentiate an invoice from one vendor versus another based on the variation in labels alone.

Take these four "documents". Each one is collecting the same information:

  • A person's name
  • Their social security number
  • Their birthday
  • Their phone number
  • Their address

So we might have five Data Fields in our Data Model, one for each piece of information. We'd also collect one label for each Data Field as well.

While the data we want from these documents is the same, there is some variation in the labels used for each different document type. We can distinguish these four documents from each other by classifying them using the Labelset-Based Classification Method. This is done by measuring the similarity between the collected label sets for each Document Type.


How is Document Type "B" different from Document Type "A"?

  • It uses the label SSN: instead of Social Security Number:.

How is Document Type "C" different from Document Type "A"?

  • It uses the labels SSN: instead of Social Security Number: and DOB: instead of Date of Birth:.

How is Document Type "D" different from Document Type "A"?

  • It uses the labels SSN: instead of Social Security Number:, DOB: instead of Date of Birth:, and Phone #: instead of Phone Number:.

Labeling-behavior-classification-01.png

Labeling-behavior-classification-02.png

Labeling-behavior-classification-03.png

Labeling-behavior-classification-04.png

Using the Labelset-Based Classification Method, unclassified documents are classified by assigning the document the Document Type whose labels are most similar. The basic concept is that "similarity" is determined by how many labels are shared between the unclassified document and the label sets collected for the Document Types in your Content Model. The unclassified document is assigned the Document Type with the highest degree of similarity between matched labels and the Document Types' label sets.

The similarity calculation is very straightforward. Grooper searches for labels collected for every Document Type and measures the total character difference between all the labels matched on the document.

If each of these five labels is collected for each Document Type's Label Set, you'd have the following character totals for the set.

  • Document Type "A" - 63 total label characters.
  • Document Type "B" - 44 total label characters.
  • Document Type "C" - 34 total label characters.
  • Document Type "D" - 29 total label characters.

How similar is Document Type "A" to Document Type "B"?

  • "A" uses the label Social Security Number: instead of SSN:
  • However, there is a match for the remaining four labels.
  • The remaining four labels, Name:, Date of Birth:, Phone Number: and Address:, comprise 40 characters.
  • The similarity score is the number of matched label characters divided by the total characters in the Document Type's label set.
    • 40 matched label characters / 44 total label characters = 0.9091
    • "A" is roughly 91% similar to "B"

How similar is Document Type "A" to Document Type "C"?

  • "A" uses the label Social Security Number: instead of SSN: and Date of Birth: instead of DOB:
  • However, there is a match for the remaining three labels.
  • The remaining three labels, Name:, Phone Number: and Address:, comprise 26 characters.
  • The similarity score is the number of matched label characters divided by the total characters in the Document Type's label set.
    • 26 matched label characters / 34 total label characters = 0.7647
    • "A" is roughly 76% similar to "C"

How similar is Document Type "A" to Document Type "D"?

  • Figure out what labels from "A" match "D", and do the math.
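
The arithmetic above can be reproduced with a few lines of Python. This is a sketch of the calculation as described in this article, assuming labels either match exactly or not at all (any fuzzy matching Grooper applies is ignored here). The `similarity` function name is introduced for the example only.

```python
from typing import Dict, List

# Label sets for the four example Document Types described above.
label_sets: Dict[str, List[str]] = {
    "A": ["Name:", "Social Security Number:", "Date of Birth:", "Phone Number:", "Address:"],
    "B": ["Name:", "SSN:", "Date of Birth:", "Phone Number:", "Address:"],
    "C": ["Name:", "SSN:", "DOB:", "Phone Number:", "Address:"],
    "D": ["Name:", "SSN:", "DOB:", "Phone #:", "Address:"],
}

def similarity(document_labels: List[str], type_labels: List[str]) -> float:
    """Matched label characters divided by total characters in the type's label set."""
    total = sum(len(label) for label in type_labels)
    matched = sum(len(label) for label in type_labels if label in document_labels)
    return matched / total

for name in ("B", "C", "D"):
    score = similarity(label_sets["A"], label_sets[name])
    print(f'"A" vs "{name}": {score:.4f}')
# "A" vs "B": 0.9091
# "A" vs "C": 0.7647
# "A" vs "D": 0.4483
```

For "D", only Name: and Address: match (13 characters) out of 29 total label characters, giving roughly 45%.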

Labeling-behavior-classification-01-01.png

Labeling-behavior-classification-02-01.png

Labeling-behavior-classification-03-01.png

Labeling-behavior-classification-04-01.png

If we ran one of these "documents" into Grooper, we can see these results very clearly.

  1. The document shares all five labels in common with the "A" Document Type.
  2. Grooper searches for labels matching the label sets for all Document Types in the Content Model and creates a similarity score for each one.
    • You can see the math described above play out here. Matching all labels in the "A" Document Type's label set, the document is considered 100% similar. Less so for the other Document Types because, while they share some labels (like Name:), some are different (like Social Security Number: versus SSN:).
  3. Upon classification, the document folder is assigned the Document Type with the highest similarity score.
    • In this case the "A" Document Type.
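
The classification step itself can be sketched as scoring the document's text against every Document Type's label set and taking the highest score. The Python example below is a minimal illustration assuming exact substring matching on a hypothetical document text. The `score` and `classify` names are introduced for the example and are not part of Grooper.

```python
from typing import Dict, List, Tuple

LABEL_SETS: Dict[str, List[str]] = {
    "A": ["Name:", "Social Security Number:", "Date of Birth:", "Phone Number:", "Address:"],
    "B": ["Name:", "SSN:", "Date of Birth:", "Phone Number:", "Address:"],
    "C": ["Name:", "SSN:", "DOB:", "Phone Number:", "Address:"],
    "D": ["Name:", "SSN:", "DOB:", "Phone #:", "Address:"],
}

def score(document_text: str, type_labels: List[str]) -> float:
    """Share of the label set's characters belonging to labels found in the text."""
    total = sum(len(label) for label in type_labels)
    matched = sum(len(label) for label in type_labels if label in document_text)
    return matched / total

def classify(document_text: str, label_sets: Dict[str, List[str]]) -> Tuple[str, Dict[str, float]]:
    """Return the best-scoring Document Type along with every type's score."""
    scores = {name: score(document_text, labels) for name, labels in label_sets.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical text of a document that uses the "A" labels.
document_text = (
    "Name: Jane Doe\n"
    "Social Security Number: 000-00-0000\n"
    "Date of Birth: 01/01/1990\n"
    "Phone Number: 555-0100\n"
    "Address: 1 Main St\n"
)

best, scores = classify(document_text, LABEL_SETS)
print(best)  # A
for name, value in scores.items():
    print(f"{name}: {value:.0%}")
```

The document matches every label in the "A" label set, so "A" scores 100% and is assigned, while "B", "C" and "D" score progressively lower.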

Labeling-behavior-classification-how to-01.png

Configuring Labelset-Based Classification

Next, we will walk through the steps required to enable and configure the Labelset-Based Classification Method, using our example set of invoice documents.

The basic steps are as follows:

  1. Set the Content Model's Classification Method property to Labelset-Based
  2. Collect labels for each Document Type
  3. Test classification
  4. Reconfigure, updating existing Document Types' Label Sets and adding new Document Types as needed.

Assign the Labelset-Based Classification Method

Once you've figured out you want to use Label Sets to classify your documents, you need to tell your Content Model that's what you want to do! This is done by setting the Content Model's Classification Method property to Labelset-Based.

  1. Select a Content Model in the Node Tree.
    • We've selected the "Labeling Behavior - Invoices" Content Model we've been working with in this How To section of the article.
  2. Select the Classification Method property.
  3. Using the dropdown menu, select Labelset-Based from the list of options.

Next, we will collect labels for each Document Type in the Content Model.

  1. Note we've already added a Labeling Behavior to the Behaviors property.
    • It doesn't matter whether you add a Labeling Behavior and/or collect labels before or after selecting Labelset-Based for the Classification Method.
    • However, you will need to add the Labeling Behavior at some point in order to collect label sets for the Document Types and ultimately use the Labelset-Based method for document classification. Visit the tutorial above if you're unsure how to add the Labeling Behavior to the Content Model.

Labeling-behavior-classification-how to-02.png

Click me to return to the top

Collect Labels

See the above how to (Collect Label Sets) for a full explanation of how to collect labels for Document Types in a Content Model. The rest of this tutorial will presume you have general familiarity with collecting labels.

  1. Switch to the "Labels" tab.
  2. Collect labels for each Data Element in the Document Type's Data Model.
  3. Collect labels for each Document Type in the Content Model.

Labeling-behavior-classification-how to-03.png

Beta Bug Alert!

Table headers are often very useful for Labelset-Based classification, and it generally is the case you want to use them as a classification feature. Currently, if you want to use a Data Table object's labels for classification, you must set the Data Table's Minimum Row Count property to at least "1". This is a known issue in the current version of Grooper and likely will change.

If you find Data Table and/or Data Column labels are not included in determining document similarity during classification, do the following:

  1. Navigate to the Data Table object in the Node Tree.
  2. Expand the Row Count Range property.
  3. Select the Minimum property.
  4. Enter 1.

If you have multiple Data Table objects in your Data Model, you will need to repeat these steps for each one.

Labeling-behavior-classification-how to-26.png

Click me to return to the top

Test Classification

In general, regardless of the Classification Method used, one of three things is going to happen to Batch Folders in a Batch during classification.

  1. The folder will be assigned the correct Document Type.
  2. The folder will be assigned the wrong Document Type.
  3. The folder will be assigned no Document Type at all.

The Labelset-Based method is no different. If all folders are classified correctly, that's great. However, testing is all about ensuring this is the case and figuring out where and why problems arise when folders are classified wrong or not classified at all.

We will look at a couple examples of how classification can go wrong using the Labelset-Based method, why that is the case, and what to do about it.

FYI

The example Batch in the rest of this tutorial is purposefully small to illustrate a few key points. In the real world, you will want to test using a much larger batch with several examples of each Document Type.

  1. The easiest way to test classification (for the Labelset-Based method or any other) is with the "Classification Testing" tab of the Content Model.
  2. Select a test Batch with the "Batch Selector" dropdown window.
  3. Select a Batch Folder and press the "Classify" button to classify a single document folder.
  4. Select a Batch Folder and press the "Classify All" button to classify all document folders in the Batch.

Labeling-behavior-classification-how to-04.png

Now we just need to evaluate the success or failure of our classification. Let's look at a few documents in our Batch before detailing what we will do to resolve any classification errors.

  1. This is a complete success!
    • The Batch Folder has been assigned the "Factura" Document Type.
  2. It indeed should have been classified so. It is an invoice from the Factura Technology Corp.
  3. Its similarity score is 100% similar to the "Factura" Document Type.
    • This means a match has been found for all labels in the "Factura" Document Type's label set.

Labeling-behavior-classification-how to-05.png

  1. This is a mitigated success.
    • The Batch Folder has been assigned the "Envoy" Document Type.
  2. It indeed should have been classified so. It is an invoice from Envoy Imaging Systems.
  3. However, it's a mitigated success in that its similarity score is only 85%.
    • That means only about 85% of the "Envoy" Document Type's label set (measured in label characters) was matched on this document.
  4. In this case, this is due to poor OCR data. While some labels may be present on the document, their OCR data is too garbled to match the label in the label set.
    • For example, the label Invoice was not matched because the text was OCR'd as "nvoice".
    • But a win is a win! Part of the reason Labelset-Based can be an effective classification method is you can miss a few labels due to poor OCR and still end up classifying the document appropriately. It is the set as a whole which determines similarity. As long as the document is more similar to the correct Document Type than any of the other Document Types, Grooper has made the right classification decision.

Labeling-behavior-classification-how to-06.png

  1. This is a mitigated failure.
    • The Batch Folder should have been assigned the "Envoy" Document Type but it was unclassified.
  2. This is due to its similarity to the "Envoy" Document Type's Label Set falling below 60%.
    • 60% is the default Minimum Similarity for this Content Model. If a Batch Folder fails to achieve a similarity score above 60%, it will remain unclassified, as is the case here.
    • But that's so close! It just fell short in terms of similarity between matched labels and the "Envoy" Document Type's Label Set.
  3. In this case, several of the labels for the Data Elements of our Data Model are smudged on the document. OCR was unable to return these portions of the document. Therefore, the labels were not matched.
  4. Remember we collect one label per Data Element. However, there's all kinds of labels on this document for data we don't necessarily care about. Do we have a Data Field for the "Salesperson ID" field on this invoice? No, it's not data we're choosing to collect.
    • But just because we don't have a Data Field for it doesn't mean it's not a useful label for classification. We will look at how to create custom labels for classification purposes in the next section, Common Problems and Solutions.
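The decision rule at work in these examples can be sketched as follows. This is a minimal illustration, not Grooper's implementation: the function name `classify` and the scores are hypothetical, and treating a score exactly equal to the Minimum Similarity as passing is an assumption.

```python
def classify(scores, minimum_similarity=0.60):
    """Return the best-scoring Document Type, or None if it misses the minimum."""
    if not scores:
        return None
    best_type = max(scores, key=scores.get)
    # Assumption: a score equal to the minimum counts as passing.
    return best_type if scores[best_type] >= minimum_similarity else None

# Hypothetical similarity scores for two document folders.
print(classify({"Envoy": 0.85, "Factura": 0.41, "Rechnung": 0.38}))  # Envoy
print(classify({"Envoy": 0.59, "Factura": 0.40, "Rechnung": 0.37}))  # None
```

The second folder stays unclassified even though "Envoy" is clearly its closest match, which is the "mitigated failure" situation described above.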

Labeling-behavior-classification-how to-07.png

  1. This is also a mitigated failure.
    • The Batch Folder should have been assigned a "Stuff and Things" Document Type but it was unclassified.
  2. This is a variation of an invoice from the vendor "Stuff and Things".
  3. You may notice the "Stuff and Things" Document Type does not appear at any similarity in our similarity list.
  4. That's because there isn't a "Stuff and Things" Document Type yet. We need to add one and collect labels for it.
    • This is fairly common with a Labelset-Based approach to classification (and indeed the use of Label Sets in general). It often has its most utility in situations where you have a lot of variants of one particular kind of document. The general idea is to use Label Sets to distinguish between the variants by creating one Document Type for each variant, each with their own unique Label Set.
    • Such is the case with invoices. There's lots of different invoice formats, often unique to each vendor. When you get one in a Batch you haven't seen before, you will need to add a new Document Type to account for the new variant. However, as we will see in the next section, onboarding new Document Types with Label Sets is relatively quick and painless.

Labeling-behavior-classification-how to-08.png

  1. This is a more severe version of the failure seen in the previous example.
    • The Batch Folder should have been assigned a "Standard" Document Type but it was assigned the wrong Document Type, the "Rechnung" Document Type.
  2. However, we don't have a "Standard" Document Type yet. Just like the previous example, we will need to add one and collect labels for it.
  3. The only thing we will need to watch out for is making sure that, once we do add a Document Type for the invoices from Standard Products, these documents score higher against it than against the "Rechnung" Document Type and so are assigned the "Standard" Document Type.

Labeling-behavior-classification-how to-09.png

  1. This is a complete failure.
    • The Batch Folder should have been assigned the "Envoy" Document Type but it was unclassified.
  2. The document is of poor enough quality to get near unusable OCR results.
  3. This resulted in a paltry similarity score of 23%.

What can we do about this?

Sometimes you have to know when to stop. Will it be worth it to reconfigure your Content Model and Label Sets to force Grooper to classify this document in one way or another? Probably not. This is more likely than not an extreme outlier, not representative of the larger document set. It may be easier to kick this document (and other outliers) out to human review, especially if reconfiguring the Content Model is going to negatively impact results in other ways.

You have to know when to leave well enough alone. Outliers like this are a good example of when to do just that.

Labeling-behavior-classification-how to-10.png

Click me to return to the top

Common Problems and Solutions

Custom Labels to Boost Similarity

  1. In the above tutorial, we saw this document failed to classify correctly.
    • The Batch Folder should have been assigned the "Envoy" Document Type but it was unclassified.
  2. This is due to its similarity to the "Envoy" Document Type's Label Set falling below 60%.
    • 60% is the default Minimum Similarity for this Content Model. If a Batch Folder fails to achieve a similarity score above 60%, it will remain unclassified, as is the case here.
    • But that's so close! It just fell short in terms of similarity between matched labels and the "Envoy" Document Type's Label Set.
  3. In this case, several of the labels for the Data Elements of our Data Model are smudged on the document. OCR was unable to return these portions of the document. Therefore, the labels were not matched.
  4. Remember we collect one label per Data Element. However, there's all kinds of labels on this document for data we don't necessarily care about. Do we have a Data Field for the "Salesperson ID" field on this invoice? No, it's not data we're choosing to collect.

Just because we don't have a Data Field for it doesn't mean it's not a useful label for classification. Even though we don't need to extract the salesperson's identification number, the fact that label "Salesperson ID" is present on these invoices could be important. It's another feature that makes up the "Envoy" Document Type. We just need a way of telling Grooper to use this label for classification, even though we can ignore it when it comes time to extract data from these documents.

That is one of the reasons for adding custom labels to a Document Type's Label Set.

Labeling-behavior-classification-how to-07.png

  1. To add a custom label, first navigate to the "Labels" tab of the Content Model.
  2. Either:
    1. Select a document folder in the Batch selector of the desired Document Type.
    2. Or select a document folder and use the "Set Type..." button to assign it the desired Document Type.
      • In our case we want to add a custom label to the "Envoy" Document Type. We have selected the document folder in the Batch and assigned it the "Envoy" Document Type.
  3. Select a Data Element from the Data Model to which you wish to add the custom label.
    • Most commonly, when adding a custom label for classification purposes, you'll just want to add it to the Data Model root itself, as we've selected here.
  4. Right-click one of the label tabs, "Header" or "Footer".
    • It doesn't matter which one, you just need to right click any label tab.
  5. Select "Add Custom Label..."

Labeling-behavior-classification-how to-11.png

  1. This will bring up the following "Add Custom Label" window.
  2. Name the custom label whatever you like.
  3. Press the "OK" button to add the label.

Labeling-behavior-classification-how to-12.png

  1. Adding the custom label will add a new label tab named whatever you named it.
    • In this case "Salesperson ID".
  2. Using the text editor, collect the label (either typing it in or lassoing or double clicking it in the document viewer).
    • In this case, Salesperson ID
    • FYI: Grooper will automatically enter whatever you title the custom label. So, in our case, all we actually did was change the custom label's name to "Salesperson ID" and the label Salesperson ID was automatically populated.
  3. This will add the label "Salesperson ID" to the "Envoy" Document Type's Label Set.

Now that this label is in the Label Set, it will be considered a label during classification. The label's there. It's part of the document, whether we're extracting the value or not. We "tell" Grooper labels like these should be considered features for classification by creating custom labels.

FYI You can add as many custom labels as you want.

Indeed, you may want multiple custom labels, adding more label features that distinguish one Document Type from another. To add multiple custom labels, just repeat the process described above, right-clicking the label tabs and adding a new custom label for each label you want to collect.

Labeling-behavior-classification-how to-13.png

When we re-classify this Batch, we will see some different results.

  1. Navigate to the "Classification Testing" tab to test classification with the custom label added.
  2. Press the "Classify All" button to classify all document folders in the Batch.
  3. Notice this document now classifies correctly as an "Envoy" Document Type!
  4. Before we added the custom label, this only achieved a similarity score of 59%, falling short of the 60% minimum similarity threshold. Now, it scores a 63% similarity.
    • With another label added to the Label Set, there's more context to what comprises this Document Type.
    • And that's with just one custom label added. There are tons more labels we could collect as custom labels on the document, likely further increasing the similarity score.
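Why does one extra label raise the score? If the custom label is found on the document, it adds the same number of characters to both the matched count and the total, which always moves a score below 100% upward. The sketch below uses hypothetical character counts (not taken from the Envoy invoice), so it is not meant to reproduce the exact 59% and 63% figures above.

```python
def similarity_from_counts(matched_chars, total_chars):
    """Matched label characters divided by total label set characters."""
    return matched_chars / total_chars

# Hypothetical character counts for a Label Set before the custom label is added.
matched, total = 118, 200
custom_label = "Salesperson ID"  # assumed to be found on the document

before = similarity_from_counts(matched, total)
after = similarity_from_counts(matched + len(custom_label),
                               total + len(custom_label))
print(f"before: {before:.0%}  after: {after:.0%}")
```

The boost only applies when the custom label is actually matched. A custom label that OCR fails to find would add to the total without adding to the matched count, lowering the score instead.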

Labeling-behavior-classification-how to-14.png

Adding New Document Types

The Labelset-Based classification method makes some assumptions about your document processing approach. It shines with structured and semi-structured forms. Labels, more or less, "stay put" on these kinds of documents. You'll see the same field labels over and over again even though the field values will change from document to document. This presumes your Document Types will be very regular (or rigid, with one Label Set very specifically corresponding to one Document Type). If you encounter a new form or variant of an existing form, you likely will need to account for it with a new Document Type.

  1. Such is the case for this document we encountered in the previous tutorial.
  2. The document is unclassified because it doesn't match any of the Label Sets for the existing Document Types.
    • More specifically, its similarity score to the existing Document Types does not meet the 60% minimum similarity threshold for this Content Model.
  3. This should be a "Stuff and Things" Document Type, but we don't have one yet. We need to add it and collect its Label Set to correctly classify the document.

Luckily, the process of adding new Document Types and defining their label sets is quick and painless and actually can become easier the more Document Types you add to the Content Model.

Labeling-behavior-classification-how to-15.png

You can do the whole thing in the "Labels" tab of the Content Model.

  1. Navigate to the "Labels" tab in the Content Model.
  2. Select the unclassified document folder for which you want to create a new Document Type.
  3. Press the "Create Type..." button.
  4. This will pop up the following window to add a Document Type.
  5. Name the Document Type whatever you like.
    • In our case we named it "Stuff and Things", for the invoice from the very real company, Stuff and Things, that sells stuff, as well as things.
  6. Press the "OK" button to finish and add the Document Type.

Labeling-behavior-classification-how to-16.png

  1. This will add the Document Type to the Content Model.
  2. It will also assign the Document Type to the selected document folder in the test Batch.
  3. Collect labels for the document as discussed in the Collect Labels section of this article.

That's it! You've added a new Document Type and collected its Label Set.

  • Keep in mind, as you add new Document Types to the Content Model you will want to perform regression testing to ensure your classification model is still accurate.

Labeling-behavior-classification-how to-23.png

As you keep adding more and more Document Types to the Content Model, you will inevitably keep adding more and more labels for the Data Elements in your Data Model. Eventually, you will come across a new document variant that shares a lot of similarity with an already existing Document Type.

  1. Such was the case with these three documents. They were confidently classified as "Rechnung" Document Types.
  2. Their similarity is 85% - 87%.
  3. However, these aren't invoices from the vendor Rechnung, they are from the vendor Standard Products.
    • They simply share a lot of the same labels. Interestingly, this "problem" is actually going to end up making our job even easier when adding the new Document Type.

Labeling-behavior-classification-how to-17.png

This is where the label auto-map functionality comes in handy.

  1. Add the new Document Type
  2. Assign the right document folder (whose labels you want to collect) the new Document Type.
  3. Press the "Auto-Map" button.

Grooper will search for matching labels already collected in the Label Sets of other Document Types.

  1. In this case, there was some kind of matching label from another Document Type for nearly every Data Element in the Data Model.
  2. The only thing we have to do now is review the auto-mapped labels, collect any that were not mapped, and re-collect or edit any labels that are not accurate.
    • For example, this header label for the "Line Items" Data Table is not quite right. It's red and not green because there's another column header label in the Standard invoices' line items table.
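Conceptually, auto-mapping reuses labels that other Document Types have already collected. The sketch below is one way to picture that idea and is not Grooper's actual algorithm: the function `auto_map`, the label sets, and the page text are all hypothetical, and matching is a plain case-insensitive substring search.

```python
def auto_map(document_text, other_label_sets):
    """Suggest a label per Data Element from labels other Document Types use."""
    suggestions = {}
    text = document_text.lower()
    for label_set in other_label_sets.values():
        for element, label in label_set.items():
            # Keep the first previously collected label found on this document.
            if element not in suggestions and label.lower() in text:
                suggestions[element] = label
    return suggestions

# Hypothetical label sets and OCR text, for illustration only.
existing = {
    "Rechnung": {"Invoice Number": "Invoice No.", "Invoice Date": "Date", "Tax": "Sales Tax"},
    "Factura": {"Invoice Number": "Invoice Number", "Invoice Date": "Invoice Date"},
}
page_text = "Standard Products  Invoice No. 1001  Date 02/26/2014  Amount Due 100.00"
print(auto_map(page_text, existing))
```

Data Elements with no matching label anywhere (the "Tax" element in this made-up page) are simply left out, which mirrors the review step: anything not mapped still has to be collected by hand.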

Labeling-behavior-classification-how to-18.png

  1. Upon collecting the full header label for this table, everything matches!
    • That's it! We've added a new Document Type and were able to auto-map all labels except one with the press of a button.
  2. We can now press "Save", test our classification, and see if these documents classify correctly.

Labeling-behavior-classification-how to-19.png

  1. We will test out our new Document Type using the "Classification Testing" tab.
  2. Press the "Classify All" button.
  3. With the Document Type for the Standard invoices added and its Label Set collected, these three document folders are now classified correctly.

Labeling-behavior-classification-how to-20.png

Volatile Labels

Sometimes, you will collect a label you do not want to use for classification purposes. Most often, this is because the label may or may not be present depending on the document.

For example, some of these invoices from Standard Products have the sales tax totaled on the document. However, some do not.

This is called a "Volatile" label. Its presence on a document is unpredictable. Sometimes it's there. Sometimes it's not. It's an optional piece of information. However, because it's optional (or "volatile") we don't actually want to include this as a label for classification. It's going to decrease the similarity score for documents that do not contain the label.

Labeling-behavior-classification-how to-21.png

  1. For example, the selected document here does not have the tax listed on the document.
  2. Since that label is not present, its similarity is lower than if it were present.
    • It drops from 100% to 98% in this case. Now, this may not be a critical drop in similarity for this case, but very well could be for others depending on their OCR quality or presence of multiple volatile labels.

Labeling-behavior-classification-how to-22.png

You can indicate these kinds of labels are "volatile" and should not be considered for classification. Whether it's there or not, Grooper will not include it as a feature to measure the similarity between an unclassified document and the Document Type.

  1. To do this, navigate to the "Labels" tab of the Content Model.
  2. Select the Data Element whose label you wish to turn volatile.
    • In our case, we wish to make the "Tax" Data Field's label volatile. As we've seen, sometimes it's present on the document and sometimes it's not.
  3. Change the Volatile property from False to True.

Labeling-behavior-classification-how to-24.png

  1. Now, when we classify this document folder...
  2. ...even though the sales tax label is not present on the document...
  3. ...its similarity is 100%!
    • With the label Tax set as a volatile label, it is no longer considered during the similarity calculation. With it missing from the document, it no longer negatively impacts the similarity score.
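The effect of the Volatile property can be sketched by leaving volatile labels out of both the matched count and the total. This is an illustration only: the function, the label names, and exact string matching are assumptions, so the numbers will not match the 98% figure shown earlier.

```python
def similarity(found_labels, label_set):
    """label_set maps each label to True if it is marked volatile."""
    # Volatile labels are ignored entirely when scoring.
    scored = [label for label, volatile in label_set.items() if not volatile]
    matched = sum(len(label) for label in scored if label in found_labels)
    total = sum(len(label) for label in scored)
    return matched / total if total else 0.0

# Hypothetical labels; this invoice has no tax line.
found = {"Invoice Number", "Invoice Date", "Amount Due"}
strict = {"Invoice Number": False, "Invoice Date": False, "Amount Due": False, "Tax": False}
relaxed = dict(strict, Tax=True)  # same set, with "Tax" marked volatile

print(round(similarity(found, strict), 2), similarity(found, relaxed))
```

With "Tax" scored like any other label, its absence pulls the score below 100%. Once it is marked volatile, the missing label no longer counts against the document.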

Labeling-behavior-classification-how to-25.png


Use Label Sets for Field Based Extraction

Label Sets and the Labeled Value Extractor Type

Intro to The Labeled Value Extractor

For most static field based extraction, the Labeling Behavior leverages the Labeled Value Extractor Type. Let's first briefly examine how Labeled Value works outside of the Labeling Behavior functionality.

As the name implies, Labeled Value extractor is designed to return labeled values. A common feature of structured forms is to divide information across a series of fields. But it's not as if you just have a bunch of data randomly strewn throughout the document. Typically, the field's value will be identified by some kind of label. These labels provide the critical context to what the data refers to.

Labeled Value relies on the spatial relationship between the label and the value. Most often labels and their corresponding values are aligned in one of two ways.

1. The value will be to the right of the label.

Value-reader-extractor-types-08.png

2. The value will be below the label.

Value-reader-extractor-types-07.png

Labeled Value uses two extractors itself, one to find the label and another for the value. If the two extractors' results are aligned horizontally or vertically within a certain amount of space (according to how the Labeled Value extractor is configured), the value's result is returned.

  1. For example, we could configure this "Invoice Number" Data Field to utilize the Labeled Value extractor to return the invoice number on the document.
    • Keep in mind this is the "hard" way of doing things. As we will see, the Labeling Behavior will make this process easier.
  2. We've set the Value Extractor to Labeled Value
  3. The label is returned by the Label Extractor
    • Here, set to a Pattern Match extractor using the regex pattern Invoice Number
  4. The value is returned by the Value Extractor
    • Here, set to a Pattern Match extractor using the regex pattern [A-Z]{2}[0-9]{6}
  5. The Maximum Distance property is used to determine alignment relationship between the label and the value as well as the maximum distance between the label and value.
    • The default settings are used here, indicating the value can be aligned horizontally, up to 2 inches from the right of the label, or it can be aligned vertically, up to 2 inches below the label.
  6. Upon execution, the Label Extractor first finds the label, then looks to see if anything matching the Value Extractor is located according to its layout configuration.
    • Sure enough, there is a result, "IN165798".
  7. The Value Extractor's result is collected for the Data Field upon running the Extract activity.
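The label-then-value logic in the steps above can be sketched in Python. This is a simplified model, not Grooper's implementation. The `Segment` class, the coordinates, the alignment `tolerance`, measuring distance from the label's left edge, and checking to the right before checking below are all assumptions made for illustration. Only the two regex patterns and the 2 inch limits come from the example.

```python
import re
from dataclasses import dataclass

@dataclass
class Segment:
    text: str
    x: float  # left edge, in inches
    y: float  # top edge, in inches

def labeled_value(segments, label_pattern, value_pattern,
                  max_right=2.0, max_below=2.0, tolerance=0.1):
    """Return the first value found to the right of, or below, a matching label."""
    labels = [s for s in segments if re.fullmatch(label_pattern, s.text)]
    values = [s for s in segments if re.fullmatch(value_pattern, s.text)]
    for label in labels:
        # Same row, within max_right inches to the right of the label.
        right = [v for v in values
                 if abs(v.y - label.y) <= tolerance and 0 < v.x - label.x <= max_right]
        # Same column, within max_below inches below the label.
        below = [v for v in values
                 if abs(v.x - label.x) <= tolerance and 0 < v.y - label.y <= max_below]
        candidates = sorted(right, key=lambda v: v.x) or sorted(below, key=lambda v: v.y)
        if candidates:
            return candidates[0].text
    return None

# Hypothetical page layout for illustration.
page = [
    Segment("Invoice Number", 5.0, 1.0),
    Segment("IN165798", 6.5, 1.0),
    Segment("Date", 5.0, 1.4),
]
print(labeled_value(page, r"Invoice Number", r"[A-Z]{2}[0-9]{6}"))  # IN165798
```

Nothing is returned unless both halves agree: a label match with no suitably placed value, or a value with no nearby label, yields no result.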

Labeling-behavior-how-to-field-extraction-01.png

However, the Labeled Value extractor's set up is a little different when combining it with the Labeling Behavior. The end result is a simpler configuration, utilizing collected labels for the Label Extractor.

Label Sets and Labeled Value

Since this Content Model utilizes the Labeling Behavior, at least part of the setup described in the previous tab was unnecessary. If you've collected a label for the Data Field and that Data Field's Value Extractor is set to Labeled Value, there is no need to configure a Label Extractor. Instead, Grooper will pass through the collected label to the Labeled Value extractor.

  1. For example, we've already collected a label for the "Invoice Number" Data Field for the "Factura" Document Type.
  2. The label Invoice Number is returned on the document for the label identifying the document's invoice number.

Labeling-behavior-how-to-field-extraction-02.png

  1. With the label collected, the set up for this "Invoice Number" Data Field will be much simpler.
  2. Notice the Value Extractor has been set to Labeled Value.
  3. The Label Extractor and Value Extractor are unconfigured (or "blank").
  4. However, upon testing extraction, the invoice number is collected.
    • All that was required in this case was to collect the label and set the Data Field's Value Extractor property to Labeled Value. Magic!
    • Not magic. Label sets.
  5. With Labeling Behavior enabled and a label collected for the "Invoice Number" Data Field, the Labeled Value extractor's Label Extractor looks for a match for the collected label.
    • In this case Invoice Number.
  6. Furthermore, with Labeling Behavior enabled and a collected label utilized as the Label Extractor, the Labeled Value extractor's Value Extractor will still return a value even if left unconfigured.
    • It will look for the nearest simple segment according to the layout settings (the Maximum Distance and Maximum Noise properties).
    • The result "IN165796" is indeed the nearest simple segment and the desired result. So, there is technically nothing else we need to do. However, situations are rarely this simple and straightforward. There are some other considerations we should keep in mind.

Labeling-behavior-how-to-field-extraction-03.png

While you can get a result without configuring the Labeled Value extractor's Value Extractor, that doesn't mean you should.

It is considered best practice to always configure the Value Extractor.

Best Practice Considerations

Why is it best practice to always configure the Labeled Value extractor's Value Extractor? The short answer is to increase the accuracy of your data extraction. A simple segment could be anything. If you know the data you're trying to extract has a certain pattern to it, you should target that data according to its pattern. Dates, for example, follow a few different patterns. Maybe it's "07/20/1969" or "07-20-69" or "July 20, 1969", but you know it's a date because it has a specific syntax or pattern to it. To increase the accuracy of your extraction, you should configure the Value Extractor with an extractor that returns the kind of data you're attempting to return.
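As a rough illustration of targeting data by its pattern, the Python sketch below builds one regular expression covering the three date styles mentioned above. The pattern is an assumption written for this example, not one shipped with Grooper, and it checks only the shape of the text (it would accept an impossible date such as "13/45/1969").

```python
import re

MONTHS = (r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|"
          r"Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)")
NUMERIC = r"\d{1,2}[/-]\d{1,2}[/-](?:\d{4}|\d{2})"   # 07/20/1969 or 07-20-69
WRITTEN = MONTHS + r"\.? \d{1,2}, \d{4}"               # July 20, 1969
DATE = re.compile(r"\b(?:" + NUMERIC + "|" + WRITTEN + r")\b")

# A generic word such as "Page" is rejected; date-shaped text is accepted.
for text in ("07/20/1969", "07-20-69", "July 20, 1969", "Page"):
    print(text, bool(DATE.fullmatch(text)))
```

A value extractor built on a pattern like this can only ever return date-shaped text, so an unrelated word sitting next to the label is never mistaken for the value.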

We can see fairly quickly why leaving the Labeled Value extractor's Value Extractor unconfigured is not ideal.

  1. All the Data Fields in this Data Section have collected labels and are using the Labeled Value extractor.
    • Except the "Vendor Name" Data Field. Ignore this Data Field for the time being.
  2. We only get a few accurate results.
    • Without its Value Extractor configured, the Labeled Value extractor is going to grab whatever segment it can get. While it can be what you want, it is not necessarily what you want.
      • The Value Extractor will allow you to target more specifically what you want to return.
    • Furthermore, while the "Sales Tax" and "Invoice Amount" results may look accurate, they too are not. There are some OCR errors. The extracted segments "0,00" and "54.594.00" should be returned as "0.00" and "54,594.00".
      • The Value Extractor will also allow you to utilize Fuzzy RegEx, Lexicon lookups, output formatting, Data Type Collation methods and other extractor functionalities to manipulate, format, and filter results.
  3. For example, the "Date" Data Field returns the segment "Page" to the right of the label Date where it should be returning the date below it, "Feb 26, 2014".
    • If we were instead to configure the Labeled Value extractor's Value Extractor to only return dates, we'd get the more specific result we want and not the generic segment we don't.
    • FYI: When the Value Extractor property is left unconfigured in this manner, the Labeled Value extractor follows a "horizontal then vertical" order of operations. If both a Right Maximum Distance and a Bottom Maximum Distance are configured, it will look for results to the right of the label (aligned horizontally) before looking for results below the label (aligned vertically).
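The "Date" example can be reproduced with a small sketch of the "horizontal then vertical" search. Everything here is hypothetical: the segment positions, the `nearest_value` function, the alignment tolerance, and the simple date pattern. It only illustrates why an unconfigured search returns "Page" while a pattern-restricted search returns the date.

```python
import re

# Hypothetical segments as (text, x, y), with positions in inches.
page = [("Date", 5.0, 1.0), ("Page", 6.0, 1.0), ("Feb 26, 2014", 5.0, 1.3)]

def nearest_value(segments, label, value_pattern=None, max_dist=2.0, tol=0.1):
    """Look to the right of the label first, then below it."""
    _, lx, ly = next(s for s in segments if s[0] == label)

    def accepted(text):
        # With no pattern configured, any segment is an acceptable value.
        return value_pattern is None or re.fullmatch(value_pattern, text) is not None

    right = [s for s in segments
             if abs(s[2] - ly) <= tol and 0 < s[1] - lx <= max_dist and accepted(s[0])]
    below = [s for s in segments
             if abs(s[1] - lx) <= tol and 0 < s[2] - ly <= max_dist and accepted(s[0])]
    ordered = sorted(right, key=lambda s: s[1]) or sorted(below, key=lambda s: s[2])
    return ordered[0][0] if ordered else None

print(nearest_value(page, "Date"))                                   # Page
print(nearest_value(page, "Date", r"[A-Z][a-z]{2} \d{1,2}, \d{4}"))  # Feb 26, 2014
```

Restricting the candidates to date-shaped text removes "Page" from the horizontal search, so the search falls through to the value below the label.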

Labeling-behavior-how-to-field-extraction-05.png

  1. If we reconfigure this "Invoice Date" Data Field slightly we will get a much more accurate result.
  2. We've kept the Data Field's Value Extractor set to Labeled Value.
  3. The only thing we've changed is we've set the Labeled Value extractor's Value Extractor to a Reference extractor pointing to a Data Type returning dates.
  4. Upon testing extraction, we can see now the Data Field collects the value we want, the invoice's date "02/26/2014".
  5. By configuring the Labeled Value extractor's Value Extractor, it's no longer looking for just simple segments next to the label. So, the word "Page" is no longer returned. Instead, it's looking for results matching the Value Extractor's results.
    • This increases the specificity of what the Labeled Value returns. Increased specificity yields increased accuracy.
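As a rough illustration of why a configured Value Extractor is more specific, the sketch below filters candidate segments near the label with a date pattern and formats the hit. The pattern, the `first_date` helper, and the candidate list are assumptions made for this example. In Grooper the filtering is done by the referenced Data Type, not by this code.

```python
import re
from datetime import datetime

# Hypothetical pattern for dates written like "Feb 26, 2014".
DATE_PATTERN = re.compile(r"[A-Z][a-z]{2} \d{1,2}, \d{4}")


def first_date(candidates):
    """Return the first candidate that is a date, formatted as MM/DD/YYYY."""
    for text in candidates:
        match = DATE_PATTERN.fullmatch(text)
        if match:
            parsed = datetime.strptime(match.group(), "%b %d, %Y")
            return parsed.strftime("%m/%d/%Y")
    return None


# Segments found near the label, in the order they are encountered.
print(first_date(["Page", "Feb 26, 2014"]))  # 02/26/2014
```

The generic segment "Page" never matches the date pattern, so it can no longer be returned.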

Labeling-behavior-how-to-field-extraction-06.png

Configuring the Labeled Value extractor's Value Extractor also gives you the myriad of functionalities available to extractors. For example, Fuzzy RegEx is one of the main ways Grooper gets around poor OCR data at the time of extraction. When the text data is just a couple characters off of the extractor's regex pattern, Fuzzy RegEx can not only match the imperfect data but "swap" the wrong characters for the right ones, effectively cleansing your result.

  1. Take the "Invoice Amount" Data Field for example.
  2. Here, the Data Field's Value Extractor is set to Labeled Value.
  3. And, the Labeled Value extractor's Value Extractor is left unconfigured.
  4. The Labeled Value extractor first locates the collected label Amount Due and without a configured Value Extractor returns the nearest text segment (according to the Maximum Distance settings).
  5. This is almost the result we want.
    • It's the "right" result in that, yes, that is the text segment that corresponds to the invoice amount due for this invoice.
    • But it's very much the wrong result in that the OCR text data is inaccurate. "54.594.00" is not a valid currency value. It should be "54,594.00", with the first period replaced by a comma.

Labeling-behavior-how-to-field-extraction-07.png

However, that's just a single character off from being the right result. We could build an extractor to return currency values looking to make fuzzy swaps like this, both matching text that is slightly off and reformatting the result to match a valid currency format. If we used that extractor as the Labeled Value extractor's Value Extractor it would not only find the segment but also reformat the result, swapping the mis-OCR'd period for what it should be, a comma.

And we've done just that.

  1. Here, we've set the Labeled Value extractor's Value Extractor to reference a Data Type returning fuzzy matched currency values.
  2. The Value Extractor matches the text we want, below the label Amount Due
  3. And since the referenced extractor uses Fuzzy RegEx the returned result is now a valid currency value.
    • The result is now "54,594.00" instead of "54.594.00". The first period was swapped for a comma.
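The effect of this cleanup can be approximated outside Grooper with a short sketch. It is a simplified stand-in, not Fuzzy RegEx: rather than scoring character swaps, it accepts either separator in every position and then rewrites the result in a standard currency format. The pattern and the `clean_currency` name are hypothetical.

```python
import re

# Accept "." or "," in any separator position, since OCR may confuse them.
LOOSE_CURRENCY = re.compile(r"\d{1,3}(?:[.,]\d{3})*[.,]\d{2}")


def clean_currency(text):
    """Reformat a loosely matched amount as #,###.## or return None."""
    if not LOOSE_CURRENCY.fullmatch(text):
        return None
    digits = re.sub(r"[.,]", "", text)       # drop every separator
    whole, cents = digits[:-2], digits[-2:]  # last two digits are cents
    return f"{int(whole):,}.{cents}"


print(clean_currency("54.594.00"))  # 54,594.00
print(clean_currency("0,00"))       # 0.00
```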

Labeling-behavior-how-to-field-extraction-08.png

Additional Considerations When Using Labeled Value with Label Sets

Custom Labels to Exclude Results

Continuing the discussion of an unconfigured Labeled Value Value Extractor from the tutorial above, let's examine the results of the "Purchase Order Number" Data Field.

  1. We've selected the "Purchase Order Number" Data Field in the Node Tree.
  2. The Data Field's Value Extractor property is set to Labeled Value.
  3. It currently does not have the Labeled Value extractor's Value Extractor configured.
  4. Left unconfigured, we get an undesirable result, a rather large text segment "Order Date Customer No. Salesperson Order No. Ship Via".

This is obviously not what we want. We want the purchase order number listed below it. Ultimately, we will follow best practice and configure the Labeled Value extractor's Value Extractor property.

However, before we do, this gives us an opportunity to demonstrate some additional functionality of the Labeling Behavior.

This data "Order Date Customer No. Salesperson Order No. Ship Via" is itself comprised of labels pointing to various values on the document. Even though we haven't set up Data Fields in this Data Model to capture the values they point to, we know this is data we don't want. In general, you don't want to use Grooper to extract labels, you want to extract values.

Labeling-behavior-how-to-field-extraction-09.png

What's happening here is that Grooper returns all the text on this single line until it locates another collected label in this Document Type's label set. In this case, the label Terms was collected for the "Payment Terms" Data Field. None of the text between the label PO Number and the label Terms has been collected in the label set. So, the Labeled Value extractor returns all the text between the "Purchase Order Number" Data Field's label (PO Number) and the next encountered label (Terms), resulting in "Order Date Customer No. Salesperson Order No. Ship Via".

This is very specific functionality to the Labeled Value extractor and its interaction with label sets. It will only behave this way if you:

  1. Are using the Labeling Behavior and the Data Field's Value Extractor is set to Labeled Value.
  2. Have collected other labels on the same line as the Data Field's label.
  3. Have not configured the Labeled Value extractor's Value Extractor.

This may be clearer if we add a Custom Label to the label set.

  1. Here, we've added a Custom Label Salesperson to the "Purchase Order Number" Data Field's parent Data Section's labels.
  2. Be aware, the Custom Label must be added to the Data Field's parent Data Element's labels in order for this to work. This will be either a Data Section if it is a child of a Data Section or the Data Model itself if it is not.
    • In this case, the "Purchase Order Number" is a child of the "Static Fields" Data Section. This is why we added the Custom Label to the Data Section's labels and not the Data Model.
  3. Now we have both a Salesperson label and a Terms label for this Document Type's label set.

Labeling-behavior-how-to-field-extraction-10.png

  1. Now, examine the difference in the "Purchase Order Number" Data Field's extraction result.
  2. It stops at the Custom Label we added, Salesperson
  3. Ultimately, returning everything between the Data Field's label PO Number and the next label to the right Salesperson
    • In other words, "Order Date Customer No."

FYI

Keep in mind this is very specific functionality to the Labeled Value extractor and its interaction with label sets. It will only behave this way if you:

  1. Are using the Labeling Behavior and the Data Field's Value Extractor is set to Labeled Value.
  2. Have collected other labels on the same line as the Data Field's label.
  3. Have not configured the Labeled Value extractor's Value Extractor.

Labeling-behavior-how-to-field-extraction-11.png

If we were to go one step further and add an Order Date Custom Label, we wouldn't get any result returned at all!

When there is no text between the Data Field's label and the next label in the label set, the Labeled Value extractor returns nothing at all.
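The three outcomes in this walkthrough can be reproduced with a small sketch of the "return text up to the next collected label" idea. The function name and the reconstructed line of text are assumptions for illustration. Grooper works on positioned text, not on a flat string.

```python
def text_until_next_label(line, field_label, collected_labels):
    """Return the text after field_label, cut at the next collected label."""
    start = line.find(field_label)
    if start < 0:
        return None
    rest = line[start + len(field_label):]
    cut = len(rest)
    for label in collected_labels:
        if label == field_label:
            continue
        pos = rest.find(label)
        if pos >= 0:
            cut = min(cut, pos)  # stop at the earliest label encountered
    return rest[:cut].strip()


# Assumed reading of the header line on the example invoice.
line = "PO Number Order Date Customer No. Salesperson Order No. Ship Via Terms"

print(text_until_next_label(line, "PO Number", {"PO Number", "Terms"}))
print(text_until_next_label(line, "PO Number",
                            {"PO Number", "Terms", "Salesperson"}))
print(repr(text_until_next_label(
    line, "PO Number", {"PO Number", "Terms", "Salesperson", "Order Date"})))
```

Each added Custom Label moves the cut point closer to the field's own label, until nothing is left to return.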

FYI

One last time, for emphasis...

Keep in mind this is very specific functionality to the Labeled Value extractor and its interaction with label sets. It will only behave this way if you:

  1. Are using the Labeling Behavior and the Data Field's Value Extractor is set to Labeled Value.
  2. Have collected other labels on the same line as the Data Field's label.
  3. Have not configured the Labeled Value extractor's Value Extractor.

Labeling-behavior-how-to-field-extraction-12.png

HOWEVER, this was not the right solution for this problem.

This was only an educational exercise to make you aware of how labels in a label set interact with the Labeled Value extractor when its Value Extractor is left unconfigured.


We should have followed our best practice advice and configured the Labeled Value extractor's Value Extractor. We did not really have to go through the trouble of adding a bunch of Custom Labels. With the Labeled Value extractor's Value Extractor configured, it's going to ignore this whole business of finding a nearby segment or returning text on a line up to the next label in a label set and more specifically return the data you want to target.

  1. Here, we have the Labeled Value extractor's Value Extractor configured to reference a Data Type returning various purchase order number formats.
  2. Even without adding all the extra Custom Labels, we get what we want. The "Purchase Order Number" Data Field collects the purchase order number on the document, "PO009845", upon testing extraction.

Labeling-behavior-how-to-field-extraction-13.png

Maximum Noise

The Maximum Noise property of the Labeled Value extractor controls the maximum number of "noise characters" allowed in the "bounding-region" of a label-value pair.

Now, what does that mean? Let's look at an example, using the "Remit Address" Data Field of our example Data Model.

  1. We've selected the "Remit Address" Data Field.
  2. The Data Field's Value Extractor is set to Labeled Value.
  3. The Labeled Value extractor's Label Extractor is left unconfigured.
    • The extractor will use the collected label for this Data Field for each Document Type.
  4. The Labeled Value extractor's Value Extractor is configured to reference a Data Type returning all addresses for this document set.
    • We've followed best practice here and assigned a Value Extractor. There's nothing wrong with the referenced Data Type (named "VAL - Address"). It returns the street address and city, state, zip code line for all addresses on these invoices.
  5. What we should get upon extracting the document is this:
    91 Vahlen Plaza
    Reston, VA 20191
  6. However, upon testing extraction, no result returns.

What gives? It has to do with these "noise characters" mentioned above.

Labeling-behavior-how-to-field-extraction-14.png

Noise characters are any letters and digits falling within the bounding region defined by a label-value pair. For our example, the bounding region looks like this.

  1. The label, highlighted in blue, is established by the Labeled Value extractor's Label Extractor result.
  2. The value, highlighted in green, is established by the Labeled Value extractor's Value Extractor result.
  3. The bounding region, highlighted in yellow, is the smallest rectangle which can enclose both the label and the value.
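Geometrically, the smallest rectangle enclosing both the label and the value is the union of their two boxes. A minimal sketch, with a hypothetical `Rect` type and made-up coordinates:

```python
from typing import NamedTuple


class Rect(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def bounding_region(label: Rect, value: Rect) -> Rect:
    """Smallest rectangle that encloses both the label and the value."""
    return Rect(
        left=min(label.left, value.left),
        top=min(label.top, value.top),
        right=max(label.right, value.right),
        bottom=max(label.bottom, value.bottom),
    )


label_box = Rect(1.0, 1.0, 2.0, 1.2)   # hypothetical label location
value_box = Rect(1.0, 1.6, 2.5, 2.0)   # hypothetical value location
print(bounding_region(label_box, value_box))
```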

Labeling-behavior-how-to-field-extraction-15.png

The noise characters are any letters or numbers within this rectangle other than the label or the value.

The highlighted characters in the image would be the noise characters for our example.

The Maximum Noise property allows you to configure how many of these non-label and non-value characters may exist in the bounding region.

You don't typically expect to find a bunch of text between a label and a value. The Maximum Noise property acts as an additional filter to avoid returning results too far away from the label. Where the Maximum Distance filters out results that are physically farther than a set distance from the label, the Maximum Noise filters out results that have too much text between them and the label. With the default of 5, there can be a maximum of 5 letters or digits between the label and value.

However, in our case, we have more than 5. We have 15 ("FacturaTechnolo").

  • Note: Our case assumes we only want to capture the street address and the city, state, zip line, not the receiver's name.

Labeling-behavior-how-to-field-extraction-16.png

FYI

Noise characters are only letters and digits.

Spaces, punctuation marks, and control characters are NOT considered noise characters, even if present in the bounding region.

With this in mind, all we need to do to the "Remit Address" Data Field to successfully collect the result at time of extraction is increase the number of allowable noise characters.

  1. Here, we've upped the Maximum Noise property to 25.
  2. Upon extraction, the Labeled Value counts the number of noise characters in the bounding region between the label and the value.
  3. If the number of noise characters does not exceed the Maximum Noise property's value, the result is returned.
    • 15 is less than 25. Therefore, the result is returned.
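The noise check can be sketched in a few lines. The function names are hypothetical, and treating a count exactly equal to the limit as acceptable is an assumption based on the word "maximum". The stray text is the portion of the receiver's name that falls inside the bounding region in our example.

```python
def count_noise(stray_text):
    """Count letters and digits only; spaces and punctuation are ignored."""
    return sum(1 for ch in stray_text if ch.isalnum())


def passes_noise_filter(stray_text, maximum_noise=5):
    # Assumption: a count equal to the limit is still allowed.
    return count_noise(stray_text) <= maximum_noise


stray = "Factura Technolo"  # text in the region that is neither label nor value
print(count_noise(stray))                            # 15
print(passes_noise_filter(stray))                    # False with the default of 5
print(passes_noise_filter(stray, maximum_noise=25))  # True
```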

Labeling-behavior-how-to-field-extraction-17.png

Footer Labels

For Data Field objects, you can collect both a "Header Label" as well as a "Footer Label". As we've seen the Header Label is the text label for whatever field you're trying to extract. Essentially, the text label marks the beginning of the field's content.

The Footer Label is an optional label used to mark the end of the field's content. The Footer Label is useful when leaving the Labeled Value extractor's Value Extractor unconfigured. While it is still always considered best practice to configure the Labeled Value extractor's Value Extractor, there are certain types of data that are difficult to match with regular expressions, such as a person's name. In these situations where you must run the Labeled Value extractor without a Value Extractor, a Footer Label can often aid you in throwing out false positive or "junk" data.

The following example is manufactured to demonstrate this concept. Let's say we're using Label Sets to extract the "Settlement Agent".

We would create a Data Field and collect a label for Settlement Agent. We would then set that Data Field's Value Extractor to Labeled Value.

In the case of this document, we would get the result we wanted. The Labeled Value extractor's Label Extractor would match the collected label (stroked in blue). If left unconfigured, its Value Extractor would return the nearest segment to that label's location (according to its layout settings and operation discussed previously in this tutorial). This is exactly what we want, the highlighted name "Jourdain Meardon".

Labeling-behavior-how-to-field-extraction-footer-01.png

However, what if that value is not present on another document? Such is the case in this image.

In that case, the extractor is still going to look for the nearest segment. Depending on the layout settings, you might return "Seller" or you might return "File #". Both of those are segments. However, they are both the wrong result. The correct value in this case is nothing at all.

Labeling-behavior-how-to-field-extraction-footer-02.png

With a Footer Label, we can change how the Labeled Value operates when its Value Extractor is left unconfigured.

If we collect Seller for the "Settlement Agent" Data Field's Footer Label (stroked in red), we will restrict Labeled Value to only return text between the Header and Footer Labels (highlighted in yellow). With no text falling between the header and footer, the false positives will not return. In fact, no value will return at all!

Labeling-behavior-how-to-field-extraction-footer-03.png

Here, we've tested extraction with only the Header Label assigned for the "Settlement Agent" Data Field.

  1. The Data Field's Value Extractor is set to Labeled Value
  2. The Label Extractor is unconfigured.
  3. Left unconfigured, it matches the collected label on the document: "Settlement Agent"
  4. The Value Extractor is unconfigured.
  5. Left unconfigured, it matches the first characters horizontally aligned with the label, up to the Maximum Distance set (in this case the default of 2in).
  6. This returns "Sell".

This is junk data. There is no settlement agent listed on the document. No value should be returned.

Labeling-behavior-how-to-field-extraction-footer-04.png

We will add a Footer label to prevent this junk data from returning.

  1. Navigate to the Content Model in the Node Tree.
  2. Select the "Labels" tab.
  3. Select the Data Element to which you wish to add a Footer Label.
  4. Select the "Footer" tab.
  5. Collect the label as discussed in the Collect Label Sets section of this article.
    • In this case, the label Seller.

If present on the document, we expect the settlement agent's name to be between the label Settlement Agent (the Header Label) and Seller (the Footer Label).

Labeling-behavior-how-to-field-extraction-footer-05.png

When we test extraction for the "Settlement Agent" Data Field now, we get very different results.

  1. With a Footer Label added, and the Labeled Value extractor's Value Extractor unconfigured...
  2. ...the extractor will only return text between the Header Label and the Footer Label.
    • In our case, only text between Settlement Agent and Seller.
  3. With no text falling between the Header and Footer Labels, nothing is returned.
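A minimal sketch of the Header/Footer restriction follows. The flat text strings are assumed stand-ins for the two example documents, and returning `None` when either label is missing is an assumption, not documented Grooper behavior.

```python
def text_between_labels(text, header, footer):
    """Return only the text that falls between the header and footer labels."""
    start = text.find(header)
    if start < 0:
        return None          # header label not found
    start += len(header)
    end = text.find(footer, start)
    if end < 0:
        return None          # footer label not found (assumed outcome)
    return text[start:end].strip()


with_agent = "Settlement Agent Jourdain Meardon Seller"
without_agent = "Settlement Agent Seller File #"

print(text_between_labels(with_agent, "Settlement Agent", "Seller"))
print(repr(text_between_labels(without_agent, "Settlement Agent", "Seller")))
```

On the second document the result is an empty string, so the junk fragment "Sell" is never produced.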

Labeling-behavior-how-to-field-extraction-footer-06.png

This is very specific functionality to the Labeled Value extractor and its interaction with Label Sets. It will only behave this way if you:

  1. Are using the Labeling Behavior and the Data Field's Value Extractor is set to Labeled Value.
  2. Have collected both a Header Label and a Footer Label for the Data Field.
  3. Have not configured the Labeled Value extractor's Label Extractor or Value Extractor.

Using Static Labels for Data Field Extraction

Collecting Static Labels

The Data Field elements have a unique label option, the Static label. This label option is useful for situations where the label itself is what you want to extract.

  1. For example, we have a Data Field in this Content Model's Data Model to collect the vendor's name for the invoice.
    • However, there isn't necessarily a label for the vendor's name like there is for other data points on the document.
  2. The purchase order number has a label, "PO Number", pointing to that data, "PO0009845".
  3. However, there is no such label for this invoice document's vendor name, "Factura Technology Corp".
    • But that's the data we want. The name itself. If these invoices are always classified as "Factura" Document Types, they're always going to have this text, "Factura Technology Corp". That's the vendor's name, and that's the data we want.

Labeling-behavior-how-to-field-extraction-18.png

What we really want to do is collect a piece of information that is the same for every single document of one Document Type. We expect the vendor's name "Factura Technology Corp" to be present for every document assigned the "Factura" Document Type during classification. Furthermore, we always expect it to be "Factura Technology Corp" and not something else.

Therefore, the vendor's name is "static" for the Document Type. It's present on every document of that Document Type, and it has the same value on every one. You know what else is static on structured and semi-structured forms? Labels! In this case, the label "Factura Technology Corp" is itself the value we want to return.

This is what a Static label is for.

  1. To add a Static label, select a Data Field in the Document Type's Data Model.
    • Here, we've selected a "Factura" document in the Batch and have selected the "Vendor Name" Data Field.
  2. Select the Static tab.
    • FYI: Only the Data Field Data Element has the option for a Static label.
  3. Collect the label you wish to collect for the Data Field.
    • Using one of the three label collection methods: 1) Type it into the text editor. 2) Lasso the label on the Document Viewer with the "Select Region" button. 3) Double-click the label segment on the Document Viewer with the "Select" button.

Labeling-behavior-how-to-field-extraction-19.png

Returning the Static Label

Now that the Static label is collected, how does Grooper know to return it during extraction when the Extract activity runs? The short answer is the Labeled Value extractor type will do this for us.

With "Factura Technology Corp" collected as a Static label, and the "Vendor Name" Data Field configured to utilize the Labeled Value extractor, it will return the Static label itself as the result.

  1. Here, we have the "Vendor Name" Data Field selected in the Node Tree.
  2. The Data Field's Value Extractor property is set to use the Labeled Value extractor type.
  3. The Labeled Value extractor's Label Extractor and Value Extractor are both unconfigured.
  4. With this Labeled Value configuration, and a Static label collected for this Data Field, the Static label is itself what the extractor is looking for on the document.
  5. If present, it will be returned and collected at time of extraction when the Extract activity runs.

Labeling-behavior-how-to-field-extraction-20.png

Label Sets and the Label Match Extractor Type

About Label Match

The Label Match extractor is extremely similar to the List Match extractor in that it matches one or more items in a defined list. However, it is designed specifically to work with the Labeling Behavior functionality. It will use the fuzzy extraction and vertical and constrained wrapping settings defined on the Content Model if a Labeling Behavior is enabled. This way, you can have a single, unified set of fuzzy match settings for multiple extractors. Rather than configuring these settings, including the confidence score threshold and fuzzy weighting, for multiple extractors, you can configure them just once when enabling the Labeling Behavior and all Label Match extractors will use them.

  • For more information on fuzzy extraction, visit the Fuzzy RegEx article.

For the Label Match extractor to return a result, two conditions must be met.

  1. The document folder must be classified.
    • In other words, it must have a Document Type assigned to it.
  2. That Document Type must have a Labeling Behavior enabled.
    • Either on the Document Type or, more typically, its parent Content Model.

Label Match Example

  1. In this example, a Value Reader is configured to return a small list of field labels on an invoice, using the Label Match Extractor Type
  2. Label Match is selected as the Extractor Type
  3. The list is entered in the Local Entries editor (just like you do with the List Match extractor).
    • Or, you can reference a Lexicon of list items using the "Properties" tab.
  4. The Prefix and Suffix Patterns are entered here.
    • ^|[^\w] is the default Prefix Pattern.
    • $|[^\w] is the default Suffix Pattern.
  5. The document we have selected is classified as an "Invoice" Document Type.
  6. This is a Document Type in the Content Model with the Labeling Behavior enabled.
  7. Upon execution, notice some results are returned with a confidence below 100%.
    • This is due to the fuzzy matching settings configured from the Labeling Behavior. The Label Similarity property was set to 90%. Any items in the list with a fuzzy matching similarity score above 90% are returned. Any falling below 90% (for example the list item CALLER:) are not.
    • Note this means changing the Labeling Behavior settings will impact ALL Label Match extractors for the Content Model's Document Types.
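To see what the default Prefix and Suffix Patterns do, the sketch below wraps a list entry between them using Python's `re` module. This covers only the exact-match part. It does not attempt the fuzzy matching or similarity scoring described above, and the `find_entry` helper and sample text are hypothetical.

```python
import re

PREFIX = r"^|[^\w]"   # start of text, or any non-word character
SUFFIX = r"$|[^\w]"   # end of text, or any non-word character


def find_entry(entry, text):
    """Find a list entry only where it stands apart from surrounding words."""
    pattern = re.compile(f"(?:{PREFIX})({re.escape(entry)})(?:{SUFFIX})")
    return [m.group(1) for m in pattern.finditer(text)]


# "Total" inside "Totals" is rejected by the suffix; the standalone one matches.
print(find_entry("Total", "Totals Total:"))
# "PO" at the start matches; "PO" inside "PO123" is rejected by the suffix.
print(find_entry("PO", "PO Number: PO123"))
```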

Value-reader-extractor-types-label match-02-v2.png

Where are these Labeling Behavior settings again?

  1. The Content Model selected here has a Labeling Behavior enabled.
  2. Labeling Behavior is enabled using the Behaviors property...
  3. ...and added using the collection editor seen here, as discussed earlier in this article.
  4. The Label Match extractor will use all the fuzzy extraction and text wrapping settings defined here.

Value-reader-extractor-types-label match-01.png

Use Label Sets for Tabular Extraction

Label Sets and the Tabular Layout Method

Label Sets and Tabular Layout

Many tables label the columns so the reader knows what the data in that column corresponds to. How do you know the unit price for an item on an invoice? Typically, that item is in a table and one of the columns of that table is labeled "Unit Price" or something similar. Once you read the labels for each column (also called "column headers"), you the reader know where the table begins (below the column headers) and can identify the data in each row (by understanding what the column headers refer to).

This is also the basic idea behind the Tabular Layout Extraction Method. It too utilizes column header labels to "read" tables on documents, or at least as step one in modeling the table's structure so that Grooper can extract data from each cell in the table.

Furthermore, using the Tabular Layout method, collected label sets using a Labeling Behavior can also be used to extract data from tables on documents. In this case, the labels collected for the Data Column children of a Data Table are utilized to help model the table's structure.

Once the column header locations are established, the next requirement is a way to understand how many rows are in the table. This is done by configuring at least one Data Column's Value Extractor property. Generally, there is at least one column in a table that is always present for every row in the table. If you can use an extractor to locate that data below its corresponding column header, that gives you a way of finding each row in the table.
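The row-finding idea can be sketched as follows: every value that matches the extractor's pattern beneath the anchoring column header marks one row. The `Word` type, the coordinates, the fixed column width, and the digits-only quantity pattern are all assumptions for illustration and are not taken from Grooper.

```python
import re
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    left: float   # hypothetical page coordinates
    top: float


def find_row_tops(words: List[Word], header: Word,
                  column_width: float, value_pattern: str) -> List[float]:
    """Return the vertical position of each row found under the header."""
    pattern = re.compile(value_pattern)
    tops = [w.top for w in words
            if w.top > header.top                                # below header
            and header.left <= w.left <= header.left + column_width
            and pattern.fullmatch(w.text)]                       # valid value
    return sorted(tops)


header = Word("Quantity", 0.5, 3.0)
words = [
    Word("2", 0.6, 3.3), Word("Widget", 2.0, 3.3),
    Word("10", 0.6, 3.6), Word("Gadget", 2.0, 3.6),
    Word("Total", 0.6, 4.0),   # under the column, but not a quantity
]
print(find_row_tops(words, header, column_width=1.0, value_pattern=r"\d+"))
```

Two quantities are found, so two rows are detected. The word "Total" sits under the same column but does not match the pattern, so it does not create a row.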

And last there are a few other considerations you might need to make. Is every row in the table a single line or are the rows "multiline"? Do you need to clean up the data the Tabular Layout initially extracts for a column by normalizing it with an extractor? Do you need to establish a table "footer" to limit the number of rows extracted?

This tutorial will cover the basic configuration of the Tabular Layout Extraction Method using collected Label Sets and address a few of these considerations.

The basic steps will be as follows:

  1. Collect labels.
    • At minimum, you must collect a header label for each Data Column child in the Data Table. We will also discuss the benefits of collecting a label for the full header row.
  2. Assign a Value Extractor for at least one Data Column.
    • We always expect to find a quantity for each line item in the invoice. There's always a "Quantity" column. This data is also present on every row. This will provide the information necessary to find each row in the table.
    • We will also discuss why you might configure the Value Extractor property on additional Data Columns as well.
  3. Set the Data Table object's Extract Method property to Tabular Layout
  4. Test to ensure the table's data is collected.

In a perfect world, you're done at that point. As you can see in this example, we've populated a table. Data is collected for all four Data Columns for each row on the document.

However, the world is rarely perfect. We will discuss some further configuration considerations to help you get the most out of this table extraction method in the "Additional Considerations" section below.

Vertical-wrap-about-06.png

Collect Labels

See the above how to (Collect Label Sets) for a full explanation of how to collect labels for Document Types in a Content Model. The following tutorial will presume you have general familiarity with collecting labels.

As far as strict requirements for collecting labels for tabular data extraction go, you must at minimum collect a label for each Data Column you wish to extract.

For this "Stuff and Things" Document Type, one column header label has been collected for each of the four Data Column children of the "Line Items" Data Table.

  1. The label Quantity for the "Quantity" Data Column
  2. The label Description for the "Description" Data Column
  3. The label Unit Price for the "Unit Price" Data Column
  4. The label Total for the "Line Total" Data Column

Labeling-behavior-about-30.png

You may optionally collect a label for the entire row of column header labels. This label is collected for the parent Data Table object's label.

  1. The label Quantity Item Serial Number Description Unit Price Total for the "Line Items" Data Table

It is generally considered best practice to capture a header row label for the Data Table. But if it's optional, why do it? What is the benefit of this label?

Labeling-behavior-how-to-table-extraction-03.png

The answer has to do with imperfect OCR text data and Fuzzy RegEx. Fuzzy RegEx provides a way for regular expression patterns to match in Grooper when the text data doesn't strictly match the pattern. The regex pattern Grooper and the character string "Gro0per" differ by just a single character. An OCR engine misreading an "o" character for a zero is not uncommon by any means, but a standard regex pattern of Grooper will not match the string "Gro0per". The pattern expects there to be an "o" where there is a zero.

Using Fuzzy RegEx instead of regular regex, Grooper will evaluate the difference between the regex pattern and the string. If it's similar enough (if it falls within a percentage similarity threshold) Grooper will return it as a match.

  • FYI "similarity" may also be referred to as "confidence" when evaluating (or scoring) fuzzy match results. Grooper is more or less confident the result matches the regex pattern based on the fuzzy regex similarity between the pattern and the imperfect text data. A similarity of 90% and a confidence score of 90% are functionally the same thing (One could argue there is a difference between these two terms when Fuzzy Match Weightings come into play, but that's a whole different topic. And you may encounter Grooper users who use the terms "similarity" and "confidence" interchangeably regardless. Visit the Fuzzy RegEx article if you would like to learn more).

So how does this apply to the Data Table's column header row label? The short answer is it provides a way to increase the accuracy of Data Column column header labels by "boosting" the similarity of the label to imperfect OCR results.

  1. For example, examine the collected label for the "Description" Data Column.
    • Notice the label Description is highlighted red. The label doesn't match the text on the document.
  2. This is due to imperfect OCR results.
    • The label should read "Description" but OCR made some missteps and recognized that segment as "DescripUon".
    • The "ti" in "Description" was recognized as a capital "U". This means "DescripUon" is two characters different from "Description", or roughly 82% similar. The Labeling Behavior's similarity threshold is set to 90% for this Content Model. 82% is less than 90%. So, the result is thrown out.
      • FYI, this threshold is configured when the Labeling Behavior is added using the Behaviors property of a Content Model. The Label Similarity property is set to 90% by default, but can be adjusted at any time by configuring the Labeling Behavior item in the Behaviors list.

As we will see, capturing the full row of column header labels will boost the similarity, allowing the label to match without altering the Labeling Behavior's fuzzy match settings.

Labeling-behavior-how-to-table-extraction-04.png

First, notice what's happened when we lassoed the row of column header labels.

  1. Some of the labels are off. "oty." should read "Qty." and "DescripUon" should read "Description".
  2. It's because that's what's in the document's text. When you lasso a label, it's going to grab whatever OCR text data was generated from the Recognize activity (or native text for digital documents).
  3. And, our "Description" Data Column's label still isn't matching.
    • But keep your eye on the birdie.

Labeling-behavior-how-to-table-extraction-05.png

  1. Notice what happens when we spell-correct the lassoed label, typing "Qty." instead of "oty." and "Description" instead of "DescripUon".
  2. Now the label matches. MAGIC!

Not magic. Just math.

The Data Table's column header row label is much much longer than a single Data Column's column header label. There are just more characters in "Qty. Qty. Item Number Description Unit Price Extended Price\r\nOrd. Shp." than "Description" (70 vs 11). Where the "Description" Data Column's label is roughly 82% similar to the text data (9 out of 11 characters), the "Line Item" Data Table's label, comprised of the whole row of column labels, is roughly 96% similar to the text data (67 out of 70 characters).
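The arithmetic can be checked with a short sketch that scores similarity as one minus the edit distance divided by the label length. This scoring formula is an assumption that happens to reproduce the figures above. Grooper's actual fuzzy scoring, including any Fuzzy Match Weightings, may differ. The sketch also assumes the first "Qty." is the one misread as "oty.".

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(label, ocr_text):
    # Assumed scoring: share of the label's characters that did not need editing.
    return 1 - edit_distance(label, ocr_text) / len(label)


column_label = "Description"
ocr_column = "DescripUon"
row_label = ("Qty. Qty. Item Number Description Unit Price "
             "Extended Price\r\nOrd. Shp.")
ocr_row = ("oty. Qty. Item Number DescripUon Unit Price "
           "Extended Price\r\nOrd. Shp.")

print(len(column_label), len(row_label))                # 11 70
print(round(similarity(column_label, ocr_column) * 100))  # 82
print(round(similarity(row_label, ocr_row) * 100))        # 96
```

The same three bad characters cost 18 points of similarity on the short label and only 4 on the long one, which is why the row label clears a 90% threshold and the column label alone does not.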

Utilizing a Data Table label allows you to hijack the whole row's similarity score when a single Data Column's label falls below the similarity threshold. If the label can be matched as part of the larger whole, its confidence score is much higher than it would be by itself. The Data Table's larger label of the full row of column labels gives extra context to the "Description" Data Column's label, providing more information about what is and is not an appropriate match.

So why is it considered best practice to capture a label for the Data Table? OCR errors are unpredictable. The set of examples you worked with when architecting this solution may have been fairly clean with good OCR reads. That may not always be the case. Capturing a Data Table label for the column label row will act as a safety net to avoid unforeseen problems in the future.

Labeling-behavior-how-to-table-extraction-06.png

Assign a Data Column's Value Extractor

Step 1 is done. We've collected labels for the "Line Item" Data Table and its Data Columns for each Document Type in this Content Model. Step 2 is configuring and assigning a Value Extractor for at least one Data Column.

Why is this necessary? Think about what we've done so far. We've collected labels for the Data Columns. Grooper now has a way to figure out where the columns are on the document. But what does it know about the rows?

Rows come under columns. We know that much. So, Grooper at least knows to look for rows underneath the collected Data Column labels. But that's about it. It doesn't know the size of each row. It doesn't know the spacing between the rows. Probably most importantly, it doesn't know how many rows there are. Tables tend to be dynamic. They may have 3 rows on one document and 300 on the next. Grooper needs a way of detecting this.

Indeed, if we were to test extraction with just labels collected, we would not get any result whatsoever.

  1. FYI: You can test data extraction directly from the Labels UI using the "Test" button.
  2. This will create a new "Results" tab, showing you a preview of the results the Extract activity collects from the selected document folder, as defined by its Document Type's Data Model.
  3. As you can see, we get no extraction results for the "Line Item" Data Table.

Labeling-behavior-how-to-table-extraction-06.png

This is why we need a Data Column's Value Extractor property configured, to give the Extract activity an awareness of the rows beneath the column labels.

The key thing to keep in mind is that this data must be present on every row. You'll want to pick a column whose data is always present for every row, where it would be considered invalid if the information wasn't in that cell for a given row.

In our case, we will choose the "Quantity" Data Column. We always expect there to be a quantity listed for the line item on the invoice, even if that quantity is just "1".

  1. We will select the "Quantity" Data Column in the Node Tree.
  2. We will configure the Value Extractor to return the numerical quantity listed for every line item on every row of the table.
    • We will keep this fairly simple for demonstration purposes, using a Pattern Match extractor.

Labeling-behavior-how-to-table-extraction-07.png

This is the pattern we will use for the "Quantity" Data Column's Value Extractor.

  1. The regex is a fairly simple pattern to match generic quantities.
    • It'll match one to three digits, optionally followed by a decimal point and zero to four more digits. The match must also have a space character before and after it.
  2. As you can see, we get two results below the "Quantity" label. We should then get two rows when this table extracts.
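The exact pattern appears only in the screenshot, so the following is a minimal sketch of a regex that fits the description above, not the pattern used in this example. The name `QUANTITY` and the sample text are hypothetical. It is written in Python's `re` syntax and uses lookarounds for the surrounding spaces so that two numbers separated by a single space can both match.

```python
import re

# Assumed pattern: one to three digits, an optional decimal point followed by
# zero to four digits, with a space immediately before and after the match.
QUANTITY = re.compile(r"(?<= )\d{1,3}(?:\.\d{0,4})?(?= )")

# Hypothetical line of text data, for illustration only.
sample = "Ord 2 1.5 WIDGET-1000 12.25 end"

matches = QUANTITY.findall(sample)
print(matches)
```

Here "2", "1.5" and "12.25" match, while the "1000" in "WIDGET-1000" does not, because it has more than three digits and is not preceded by a space. A pattern this generic will also hit plenty of other numbers on a real page, which is why the position of each result matters.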

We get a bunch of other hits as well. This is a very generic extractor matching very generic numerical data.

  1. Will this result present a problem? Will we get an extra row for its result?
    • No. That result is above the label collected for the Data Column. The Tabular Layout method presumes rows are below column labels. Any result above them will be ignored.
  2. What about results like these? Will they present a problem?
    • The short answer is no. This result is misaligned with the "Quantity" Data Column's header. It's too far to the right to be considered "under" it and will be ignored as a candidate to produce a row.
    • That said, when you are building your own Data Column extractors, do pay more attention to results below the column header row. They have the most potential to produce false positive results, producing erroneous rows.
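The two rules above (ignore results above the column label, ignore results that are not under it) can be pictured with a small geometric sketch. This is only an illustration under assumptions: the `Box` type, the `is_row_candidate` function, the coordinates, and the overlap test are all hypothetical and do not come from Grooper itself.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned rectangle in page coordinates; y grows downward."""
    left: float
    top: float
    right: float
    bottom: float


def is_row_candidate(header: Box, hit: Box) -> bool:
    """Keep a hit only if it sits below the header and overlaps it horizontally."""
    below = hit.top >= header.bottom
    overlaps = hit.left < header.right and hit.right > header.left
    return below and overlaps


# Hypothetical position of the "Quantity" column label.
quantity_header = Box(left=100, top=300, right=140, bottom=315)

# Hypothetical Value Extractor results found elsewhere on the page.
hits = {
    "above the header": Box(110, 120, 125, 135),
    "too far right": Box(400, 340, 420, 355),
    "first line item": Box(105, 340, 120, 355),
    "second line item": Box(105, 370, 120, 385),
}

rows = [name for name, box in hits.items() if is_row_candidate(quantity_header, box)]
print(rows)
```

Only the two hits beneath the label survive, so the table would get two rows. Any stray number that happens to fall below and in line with the header would survive too, which is the false positive case to watch for.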

Labeling-behavior-how-to-table-extraction-08.png

For fairly simple table structures we now have the two things the Tabular Layout method needs to extract data:

  1. Collected labels for the Data Column labels (and optionally the whole row of column labels for the Data Table)
  2. At least one Data Column with its Value Extractor configured.

Now, all we need to do is tell the Data Table object we want to use the Tabular Layout method. We do this by setting its Extract Method property to Tabular Layout.

Set Extract Method to Tabular Layout and Test

A Data Table's extraction method is set using the Extract Method property. To enable the Tabular Layout method, do the following.

  1. Select a Data Table object in your Data Model.
    • Here, we've selected the "Line Items" Data Table.
  2. Select the Extract Method property.
  3. Using the dropdown menu, select Tabular Layout.

Labeling-behavior-how-to-table-extraction-09.png

Now, let's test out what we have and see what we get!

  1. For the selected document folder in the "Batch Viewer" window...
  2. Press the "Test Extraction" button.
    • Side note: We've seen before that we can test extraction using the "Labels" tab of a Content Model or Document Type when Labeling Behavior is enabled. The only real difference is that here we're testing extraction for the specific Data Element selected in the Node Tree, in this case the "Line Items" Data Table. The "Test" button in the "Labels" tab tests extraction for the entire Data Model and all its child Data Elements. Feel free to test extraction at either location. The end result is the same: we're verifying extraction results.
  3. The results show up in the "Data Element Preview" window.

For the Tabular Layout method, the Data Table is populated using primarily two pieces of information.

  1. The location and width of the Data Column header labels.
    • This determines the width of the cells for each column.
    • Side note: The width of the column cells is actually determined differently depending on if the table has lines. If the table has lines (as it does in this example) and those lines were previously detected via a Line Detection (or Line Removal) IP Command, the cell width will be expanded to the boundaries of the lines. Table lines give human readers an indicator of where the data "lives" (or is contained). If it's in the box, it belongs to the column. If it's out of the box, it belongs to a different column.
  2. The number of rows as determined by the Data Columns whose Value Extractor property is configured.
    • One row is established for each result the Value Extractor returns.

Labeling-behavior-how-to-table-extraction-10.png

With these pieces of information, the Tabular Layout method can start to determine the table's structure. If you know where the columns are and how big they are, and you know how many rows there are, you pretty much know what the table looks like.

This allows Grooper to create data instances for each cell in the table.

  1. Once the Tabular Layout method establishes the boundaries of each cell, Grooper "knows" where the table data is located on the page.
  2. The text data (either OCR'd text or native digital text obtained from the Recognize activity) is extracted from each cell instance, populating the Data Table and collecting these results when the Extract activity runs.
    • This describes only the most basic configuration. More advanced configuration techniques can adjust the size of the cell instances and/or extract data for each cell. Some of these will be discussed in the "Additional Tabular Layout Considerations" section below.
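To make the idea concrete, here is a rough sketch of turning column boundaries and row positions into cells and then filling each cell with the text that falls inside it. It is not how Grooper implements the Tabular Layout method. The `Word` type, the `build_table` function, the coordinates, and especially the rule that a row runs from its own top to the top of the next row are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One piece of text data with the center point of its position on the page."""
    text: str
    x: float
    y: float


def build_table(columns, row_tops, table_bottom, words):
    """Return one dict per row, mapping each column name to the text in its cell.

    columns: {name: (left, right)} taken from the column labels (or table lines).
    row_tops: sorted top edge of each row, one per Value Extractor result.
    """
    # Assumption: each row ends where the next one starts (or at the table bottom).
    row_bounds = list(zip(row_tops, row_tops[1:] + [table_bottom]))
    table = []
    for top, bottom in row_bounds:
        row = {}
        for name, (left, right) in columns.items():
            inside = [w for w in words if left <= w.x < right and top <= w.y < bottom]
            inside.sort(key=lambda w: (w.y, w.x))  # reading order within the cell
            row[name] = " ".join(w.text for w in inside)
        table.append(row)
    return table


# Hypothetical layout: two columns and two rows.
columns = {"Quantity": (100, 160), "Description": (160, 400)}
words = [
    Word("2", 110, 347), Word("Blue", 180, 347), Word("widget", 220, 347),
    Word("10", 110, 377), Word("Red", 180, 377), Word("gasket", 215, 377),
]

table = build_table(columns, row_tops=[340, 370], table_bottom=400, words=words)
for row in table:
    print(row)
```

Each row found by the Value Extractor becomes one dictionary, and each column label contributes one key, which mirrors the two pieces of information the method relies on.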

Labeling-behavior-how-to-table-extraction-11.png

Additional Tabular Layout Considerations

Multiline Rows

Footer Labels

Data Column Value Extractors

Label Sets and the Row Match Method

Label Sets and the Fluid Layout Method

Use Label Sets for Sectional Extraction

Label Sets and the Transaction Detection Method

Label Sets and the Nested Table Method

Additional Information

Label Layout Options

Footer Labels

Setting Data Element Overrides in the Labels UI

Version Differences

2021

The Labeling Behavior is brand new functionality in Grooper version 2021. Prior to this version, some of its functionality could be approximated with other objects and their properties (for example, a Data Type using the Key-Value Pair collation is in some ways similar to how the Labeled Value Extractor Type works). However, creating label sets using Document Types and implementing them as described above was not available prior to version 2021.